2008-12-11 15:53:28

by Ingo Molnar

Subject: [patch] Performance Counters for Linux, v3


This is v3 of our performance counters subsystem implementation. It can
be accessed at:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

(or via http://people.redhat.com/mingo/tip.git/README )

We've made a number of bigger enhancements in the -v3 release:

- The introduction of new "software" performance counters:
PERF_COUNT_CPU_CLOCK and PERF_COUNT_TASK_CLOCK. (Page-fault,
context-switch and block-read event counters are planned as well.)

These sw-counters, besides being useful to applications and being nice
generalizations of the performance counter concept, are also helpful
in porting performance counters to new architectures: the software
counters will work fine without any PMU. Applications can thus
standardize on the availability of _some_ performance counters on all
Linux systems all the time, regardless of current PMU support status.
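
For illustration (a hypothetical fragment, not part of the patch - it
reuses the structure, syscall number and setup from the full user-space
sketch near the end of this mail), a task-clock counter could be opened
like this even on a PMU-less machine:

        hw_event.type = -2;     /* PERF_COUNT_TASK_CLOCK */
        fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);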

- The introduction of counter groups: counters can now be grouped
together when created. Such counter groups are scheduled atomically
and can have their events read out with precise (and atomic)
multi-dimensional timestamps as well.

Counter groups are a natural extension of the current single
counters; the members still act as individual counters as well.

[ It's a bit like task or tty groups - loosely coupled counters with a
strong self-identity. The grouping can be arbitrary - there can be
multiple counter groups per task - mixed with single counters as
well. The concept works for CPU/systemwide counters as well. ]
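
A hypothetical sketch of creating a two-counter group (not from the
patch; it reuses the boilerplate from the full user-space sketch near
the end of this mail, and assumes that passing the group leader's fd as
group_fd attaches the new counter to that group, with PERF_RECORD_GROUP
on the leader requesting group read-out):

        struct perf_counter_hw_event hw_event;
        int leader_fd, misses_fd;

        memset(&hw_event, 0, sizeof(hw_event));

        hw_event.type = 2;              /* PERF_COUNT_CACHE_REFERENCES */
        hw_event.record_type = 2;       /* PERF_RECORD_GROUP on the leader */
        leader_fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);

        hw_event.type = 3;              /* PERF_COUNT_CACHE_MISSES */
        hw_event.record_type = 0;       /* PERF_RECORD_SIMPLE for the sibling */
        misses_fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, leader_fd);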

- The addition of a low-level counter hw driver framework that allows
asymmetric counter implementations. The sw counters now use this
facility.

- The syscall API has been streamlined significantly - see further below
for details. The event type has been widened to 64 bits for powerpc's
needs, and a few reserved bits have been introduced.

- The ability to turn all counters of a task on/off via a single system
call. This is useful to applications that self-profile and/or want to
do runtime filtering of which functions to profile. (There's also a
"hw_event.disabled" bit in the API to create counters in a disabled
state straight away - useful to powerpc for example - but this code is
not fully complete yet. It's the next entry on our TODO list :-)
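
A hypothetical self-profiling fragment using the new prctl interface
(the PR_TASK_PERF_COUNTERS_* values below are the ones this patch adds
to include/linux/prctl.h; the helper function is made up for the
example):

        #include <sys/prctl.h>

        #define PR_TASK_PERF_COUNTERS_DISABLE   31
        #define PR_TASK_PERF_COUNTERS_ENABLE    32

        /* ... counters have already been opened for this task ... */

        prctl(PR_TASK_PERF_COUNTERS_DISABLE, 0, 0, 0, 0);
        do_work_we_do_not_want_to_profile();    /* hypothetical helper */
        prctl(PR_TASK_PERF_COUNTERS_ENABLE, 0, 0, 0, 0);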

- [ lots of other updates, fixes and cleanups. ]

New KernelTop features:

http://redhat.com/~mingo/perfcounters/kerneltop.c

- The ability to count multiple event sources at once and combine
them into the same histogram. For example, to create a
cache-misses versus cache-references histogram, just append event ids
like this:

$ ./kerneltop -e 3 -c 5000 -e 2

to get output like this:

------------------------------------------------------------------------------
KernelTop: 1601 irqs/sec [NMI, cache-misses/cache-refs], (all, 16 CPUs)
------------------------------------------------------------------------------

weight RIP kernel function
______ ________________ _______________

85.00 - ffffffff804fc96d : ip_local_deliver
30.50 - ffffffff804cedfa : skb_copy_and_csum_dev
27.11 - ffffffff804ceeb7 : skb_push
27.00 - ffffffff805106a8 : tcp_established_options
20.35 - ffffffff804e5675 : eth_type_trans
19.00 - ffffffff8028a4e8 : zone_statistics
18.40 - ffffffff804d9256 : dst_release
18.07 - ffffffff804fc1cc : ip_rcv_finish
16.00 - ffffffff8050022b : __ip_local_out
15.69 - ffffffff804fc774 : ip_local_deliver_finish
14.41 - ffffffff804cfc87 : skb_release_head_state
14.00 - ffffffff804cbdf0 : sock_alloc_send_skb
10.00 - ffffffff8027d788 : find_get_page
9.71 - ffffffff8050084f : ip_queue_xmit
8.00 - ffffffff802217d5 : read_hpet
6.50 - ffffffff8050d999 : tcp_prune_queue
3.59 - ffffffff80503209 : __inet_lookup_established
2.16 - ffffffff802861ec : put_page
2.00 - ffffffff80222554 : physflat_send_IPI_mask

- The -g 1 option to put all counters into a counter group.

- One-shot profiling.

- Various other updates.

- NOTE: pick up the latest version of kerneltop.c if you want to try out
the v3 kernel side.

See "kerneltop --help" for all the options:

KernelTop Options (up to 4 event types can be specified):

 -e EID    --event_id=EID      # event type ID                     [default: 0]
                 0: CPU cycles
                 1: instructions
                 2: cache accesses
                 3: cache misses
                 4: branch instructions
                 5: branch prediction misses
               < 0: raw CPU events

 -c CNT    --count=CNT         # event period to sample

 -C CPU    --cpu=CPU           # CPU (-1 for all)                  [default: -1]
 -p PID    --pid=PID           # PID of sampled task (-1 for all)  [default: -1]

 -d delay  --delay=<seconds>   # sampling/display delay            [default: 2]
 -x path   --vmlinux=<path>    # the vmlinux binary, for -s use:
 -s symbol --symbol=<symbol>   # function to be shown annotated, one-shot

The new syscall API looks as follows. There's a single system
call which creates counters - VFS ops are used after that to operate on
the counters. The API details:

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
        /*
         * Common hardware events, generalized by the kernel:
         */
        PERF_COUNT_CYCLES               =  0,
        PERF_COUNT_INSTRUCTIONS         =  1,
        PERF_COUNT_CACHE_REFERENCES     =  2,
        PERF_COUNT_CACHE_MISSES         =  3,
        PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
        PERF_COUNT_BRANCH_MISSES        =  5,

        /*
         * Special "software" counters provided by the kernel, even if
         * the hardware does not support performance counters. These
         * counters measure various physical and sw events of the
         * kernel (and allow the profiling of them as well):
         */
        PERF_COUNT_CPU_CLOCK            = -1,
        PERF_COUNT_TASK_CLOCK           = -2,
        /*
         * Future software events:
         */
        /* PERF_COUNT_PAGE_FAULTS       = -3,
           PERF_COUNT_CONTEXT_SWITCHES  = -4, */
};

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
        PERF_RECORD_SIMPLE              = 0,
        PERF_RECORD_IRQ                 = 1,
        PERF_RECORD_GROUP               = 2,
};

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
        s64                     type;

        u64                     irq_period;
        u32                     record_type;

        u32                     disabled     :  1, /* off by default */
                                nmi          :  1, /* NMI sampling   */
                                raw          :  1, /* raw event type */
                                __reserved_1 : 29;

        u64                     __reserved_2;
};

asmlinkage int
sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
                      pid_t pid, int cpu, int group_fd);
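
For illustration only (this program is not part of the patch): a minimal
user-space sketch of the call sequence. It assumes the x86-64 syscall
number 295 added by this patch, mirrors the ABI structure above with
stdint types, and assumes that a group_fd of -1 creates a standalone
counter:

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        #define __NR_perf_counter_open 295      /* x86-64, added by this patch */

        /* local mirror of the ABI structure above */
        struct perf_counter_hw_event {
                int64_t         type;
                uint64_t        irq_period;
                uint32_t        record_type;
                uint32_t        disabled     :  1,
                                nmi          :  1,
                                raw          :  1,
                                __reserved_1 : 29;
                uint64_t        __reserved_2;
        };

        int main(void)
        {
                struct perf_counter_hw_event hw_event;
                uint64_t count;
                int fd;

                memset(&hw_event, 0, sizeof(hw_event));
                hw_event.type = 0;              /* PERF_COUNT_CYCLES */

                /* pid 0: current task, cpu -1: all CPUs, group_fd -1: no group */
                fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
                if (fd < 0) {
                        perror("perf_counter_open");
                        return 1;
                }

                /* ... run the workload to be measured ... */

                if (read(fd, &count, sizeof(count)) == sizeof(count))
                        printf("cycles: %llu\n", (unsigned long long)count);

                close(fd);
                return 0;
        }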


Thanks,

Ingo, Thomas

------------------>
Ingo Molnar (16):
performance counters: documentation
performance counters: x86 support
x86, perfcounters: read out MSR_CORE_PERF_GLOBAL_STATUS with counters disabled
perfcounters: select ANON_INODES
perfcounters, x86: simplify disable/enable of counters
perfcounters, x86: clean up debug code
perfcounters: consolidate global-disable codepaths
perf counters: restructure the API
perf counters: add support for group counters
perf counters: group counter, fixes
perf counters: hw driver API
perf counters: implement PERF_COUNT_CPU_CLOCK
perf counters: consolidate hw_perf save/restore APIs
perf counters: implement PERF_COUNT_TASK_CLOCK
perf counters: add prctl interface to disable/enable counters
perf counters: clean up state transitions

Thomas Gleixner (4):
performance counters: core code
perf counters: protect them against CSTATE transitions
perf counters: clean up 'raw' type API
perf counters: expand use of counter->event


Documentation/perf-counters.txt | 104 ++
arch/x86/Kconfig | 1 +
arch/x86/ia32/ia32entry.S | 3 +-
arch/x86/include/asm/hardirq_32.h | 1 +
arch/x86/include/asm/hw_irq.h | 2 +
arch/x86/include/asm/intel_arch_perfmon.h | 34 +-
arch/x86/include/asm/irq_vectors.h | 5 +
arch/x86/include/asm/mach-default/entry_arch.h | 5 +
arch/x86/include/asm/pda.h | 1 +
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 3 +-
arch/x86/kernel/apic.c | 2 +
arch/x86/kernel/cpu/Makefile | 12 +-
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/cpu/perf_counter.c | 563 +++++++++++
arch/x86/kernel/entry_64.S | 5 +
arch/x86/kernel/irq.c | 5 +
arch/x86/kernel/irqinit_32.c | 3 +
arch/x86/kernel/irqinit_64.c | 5 +
arch/x86/kernel/signal.c | 7 +-
arch/x86/kernel/syscall_table_32.S | 1 +
drivers/acpi/processor_idle.c | 8 +
drivers/char/sysrq.c | 2 +
include/linux/perf_counter.h | 244 +++++
include/linux/prctl.h | 3 +
include/linux/sched.h | 9 +
include/linux/syscalls.h | 8 +
init/Kconfig | 30 +
kernel/Makefile | 1 +
kernel/fork.c | 1 +
kernel/perf_counter.c | 1266 ++++++++++++++++++++++++
kernel/sched.c | 24 +
kernel/sys.c | 7 +
kernel/sys_ni.c | 3 +
35 files changed, 2354 insertions(+), 21 deletions(-)
create mode 100644 Documentation/perf-counters.txt
create mode 100644 arch/x86/kernel/cpu/perf_counter.c
create mode 100644 include/linux/perf_counter.h
create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..19033a0
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cache misses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int
+ perf_counter_open(u32 hw_event_type,
+ u32 hw_event_period,
+ u32 record_type,
+ pid_t pid,
+ int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+ enum hw_event_types {
+ PERF_COUNT_CYCLES,
+ PERF_COUNT_INSTRUCTIONS,
+ PERF_COUNT_CACHE_REFERENCES,
+ PERF_COUNT_CACHE_MISSES,
+ PERF_COUNT_BRANCH_INSTRUCTIONS,
+ PERF_COUNT_BRANCH_MISSES,
+ };
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+ specific and are enumerated via /sys on a per CPU basis. Raw hw event
+ types can be passed in as negative numbers. For example, to count
+ "External bus cycles while bus lock signal asserted" events on Intel
+ Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+ enum perf_record_type {
+ PERF_RECORD_SIMPLE,
+ PERF_RECORD_IRQ,
+ };
+
+a "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide an IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4d4cb7..f2fdc18 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -643,6 +643,7 @@ config X86_UP_IOAPIC
config X86_LOCAL_APIC
def_bool y
depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+ select HAVE_PERF_COUNTERS

config X86_IO_APIC
def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..3c14ed0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
.quad compat_sys_signalfd4
.quad sys_eventfd2
.quad sys_epoll_create1
- .quad sys_dup3 /* 330 */
+ .quad sys_dup3 /* 330 */
.quad sys_pipe2
.quad sys_inotify_init1
+ .quad sys_perf_counter_open
ia32_syscall_end:
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index 5ca135e..b3e475d 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int apic_perf_irqs; /* arch dependent */
unsigned int irq0_irqs;
unsigned int irq_resched_count;
unsigned int irq_call_count;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 8de644b..aa93e53 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
/* Interrupt handlers registered during init_IRQ */
extern void apic_timer_interrupt(void);
extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
index fa0fd06..71598a9 100644
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ b/arch/x86/include/asm/intel_arch_perfmon.h
@@ -1,22 +1,24 @@
#ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
#define _ASM_X86_INTEL_ARCH_PERFMON_H

-#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
+#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1 0xc2

-#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1 0x187
+#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1 0x187

-#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)
+#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)

-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL (0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL 0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 0
#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
- (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+ (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6

union cpuid10_eax {
struct {
@@ -28,4 +30,12 @@ union cpuid10_eax {
unsigned int full;
};

+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void) { }
+static inline void perf_counters_lapic_init(int nmi) { }
+#endif
+
#endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 0005adb..b8d277f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
#define LOCAL_TIMER_VECTOR 0xef

/*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR 0xee
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
* a much simpler SMP time architecture:
*/
#ifdef CONFIG_X86_LOCAL_APIC
+
BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)

+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
#ifdef CONFIG_X86_MCE_P4THERMAL
BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
#endif
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..90a8d9d 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
short isidle;
struct mm_struct *active_mm;
unsigned apic_timer_irqs;
+ unsigned apic_perf_irqs;
unsigned irq0_irqs;
unsigned irq_resched_count;
unsigned irq_call_count;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e44d379..810bf26 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,6 +80,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS 11 /* notify perf counter work */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -103,6 +104,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS (1 << TIF_PERF_COUNTERS)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+ (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_perf_counter_open 333

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
__SYSCALL(__NR_pipe2, sys_pipe2)
#define __NR_inotify_init1 294
__SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open 295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 16f9487..8ab8c18 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
#include <linux/dmi.h>
#include <linux/dmar.h>

+#include <asm/intel_arch_perfmon.h>
#include <asm/atomic.h>
#include <asm/smp.h>
#include <asm/mtrr.h>
@@ -1147,6 +1148,7 @@ void __cpuinit setup_local_APIC(void)
apic_write(APIC_ESR, 0);
}
#endif
+ perf_counters_lapic_init(0);

preempt_disable();

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82ec607..89e5336 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
#
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
#

obj-y := intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64) += centaur_64.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_X86_MCE) += mcheck/
-obj-$(CONFIG_MTRR) += mtrr/
-obj-$(CONFIG_CPU_FREQ) += cpufreq/
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE) += mcheck/
+obj-$(CONFIG_MTRR) += mtrr/
+obj-$(CONFIG_CPU_FREQ) += cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o

quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..4461011 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
#include <asm/pat.h>
#include <asm/asm.h>
#include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
#else
vgetcpu_set_mode();
#endif
+ init_hw_perf_counters();
}

void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..4854cca
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,563 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/module.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_hw_counters __read_mostly;
+static u32 perf_counter_mask __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_HW_COUNTERS 8
+
+struct cpu_hw_counters {
+ struct perf_counter *counters[MAX_HW_COUNTERS];
+ unsigned long used[BITS_TO_LONGS(MAX_HW_COUNTERS)];
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+const int intel_perfmon_event_map[] =
+{
+ [PERF_COUNT_CYCLES] = 0x003c,
+ [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
+ [PERF_COUNT_CACHE_MISSES] = 0x412e,
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
+ [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+static int __hw_perf_counter_init(struct perf_counter *counter)
+{
+ struct perf_counter_hw_event *hw_event = &counter->hw_event;
+ struct hw_perf_counter *hwc = &counter->hw;
+
+ if (unlikely(!perf_counters_initialized))
+ return -EINVAL;
+
+ /*
+ * Count user events, and generate PMC IRQs:
+ * (keep 'enabled' bit clear for now)
+ */
+ hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+ /*
+ * If privileged enough, count OS events too, and allow
+ * NMI events as well:
+ */
+ hwc->nmi = 0;
+ if (capable(CAP_SYS_ADMIN)) {
+ hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (hw_event->nmi)
+ hwc->nmi = 1;
+ }
+
+ hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+ hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+
+ hwc->irq_period = hw_event->irq_period;
+ /*
+ * Intel PMCs cannot be accessed sanely above 32 bit width,
+ * so we install an artificial 1<<31 period regardless of
+ * the generic counter period:
+ */
+ if (!hwc->irq_period)
+ hwc->irq_period = 0x7FFFFFFF;
+
+ hwc->next_count = -(s32)hwc->irq_period;
+
+ /*
+ * Raw event type provide the config in the event structure
+ */
+ if (hw_event->raw) {
+ hwc->config |= hw_event->type;
+ } else {
+ if (hw_event->type >= max_intel_perfmon_events)
+ return -EINVAL;
+ /*
+ * The generic map:
+ */
+ hwc->config |= intel_perfmon_event_map[hw_event->type];
+ }
+ counter->wakeup_pending = 0;
+
+ return 0;
+}
+
+void hw_perf_enable_all(void)
+{
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, perf_counter_mask, 0);
+}
+
+void hw_perf_restore(u64 ctrl)
+{
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, ctrl, 0);
+}
+EXPORT_SYMBOL_GPL(hw_perf_restore);
+
+u64 hw_perf_save_disable(void)
+{
+ u64 ctrl;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+ return ctrl;
+}
+EXPORT_SYMBOL_GPL(hw_perf_save_disable);
+
+static inline void
+__x86_perf_counter_disable(struct hw_perf_counter *hwc, unsigned int idx)
+{
+ wrmsr(hwc->config_base + idx, hwc->config, 0);
+}
+
+static DEFINE_PER_CPU(u64, prev_next_count[MAX_HW_COUNTERS]);
+
+static void __hw_perf_counter_set_period(struct hw_perf_counter *hwc, int idx)
+{
+ per_cpu(prev_next_count[idx], smp_processor_id()) = hwc->next_count;
+
+ wrmsr(hwc->counter_base + idx, hwc->next_count, 0);
+}
+
+static void __x86_perf_counter_enable(struct hw_perf_counter *hwc, int idx)
+{
+ wrmsr(hwc->config_base + idx,
+ hwc->config | ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+}
+
+static void x86_perf_counter_enable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+
+ /* Try to get the previous counter again */
+ if (test_and_set_bit(idx, cpuc->used)) {
+ idx = find_first_zero_bit(cpuc->used, nr_hw_counters);
+ set_bit(idx, cpuc->used);
+ hwc->idx = idx;
+ }
+
+ perf_counters_lapic_init(hwc->nmi);
+
+ __x86_perf_counter_disable(hwc, idx);
+
+ cpuc->counters[idx] = counter;
+
+ __hw_perf_counter_set_period(hwc, idx);
+ __x86_perf_counter_enable(hwc, idx);
+}
+
+static void __hw_perf_save_counter(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ s64 raw = -1;
+ s64 delta;
+
+ /*
+ * Get the raw hw counter value:
+ */
+ rdmsrl(hwc->counter_base + idx, raw);
+
+ /*
+ * Rebase it to zero (it started counting at -irq_period),
+ * to see the delta since ->prev_count:
+ */
+ delta = (s64)hwc->irq_period + (s64)(s32)raw;
+
+ atomic64_counter_set(counter, hwc->prev_count + delta);
+
+ /*
+ * Adjust the ->prev_count offset - if we went beyond
+ * irq_period of units, then we got an IRQ and the counter
+ * was set back to -irq_period:
+ */
+ while (delta >= (s64)hwc->irq_period) {
+ hwc->prev_count += hwc->irq_period;
+ delta -= (s64)hwc->irq_period;
+ }
+
+ /*
+ * Calculate the next raw counter value we'll write into
+ * the counter at the next sched-in time:
+ */
+ delta -= (s64)hwc->irq_period;
+
+ hwc->next_count = (s32)delta;
+}
+
+void perf_counter_print_debug(void)
+{
+ u64 ctrl, status, overflow, pmc_ctrl, pmc_count, next_count;
+ int cpu, idx;
+
+ if (!nr_hw_counters)
+ return;
+
+ local_irq_disable();
+
+ cpu = smp_processor_id();
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
+
+ printk(KERN_INFO "\n");
+ printk(KERN_INFO "CPU#%d: ctrl: %016llx\n", cpu, ctrl);
+ printk(KERN_INFO "CPU#%d: status: %016llx\n", cpu, status);
+ printk(KERN_INFO "CPU#%d: overflow: %016llx\n", cpu, overflow);
+
+ for (idx = 0; idx < nr_hw_counters; idx++) {
+ rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+ rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + idx, pmc_count);
+
+ next_count = per_cpu(prev_next_count[idx], cpu);
+
+ printk(KERN_INFO "CPU#%d: PMC%d ctrl: %016llx\n",
+ cpu, idx, pmc_ctrl);
+ printk(KERN_INFO "CPU#%d: PMC%d count: %016llx\n",
+ cpu, idx, pmc_count);
+ printk(KERN_INFO "CPU#%d: PMC%d next: %016llx\n",
+ cpu, idx, next_count);
+ }
+ local_irq_enable();
+}
+
+static void x86_perf_counter_disable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned int idx = hwc->idx;
+
+ __x86_perf_counter_disable(hwc, idx);
+
+ clear_bit(idx, cpuc->used);
+ cpuc->counters[idx] = NULL;
+ __hw_perf_save_counter(counter, hwc, idx);
+}
+
+static void x86_perf_counter_read(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned long addr = hwc->counter_base + hwc->idx;
+ s64 offs, val = -1LL;
+ s32 val32;
+
+ /* Careful: NMI might modify the counter offset */
+ do {
+ offs = hwc->prev_count;
+ rdmsrl(addr, val);
+ } while (offs != hwc->prev_count);
+
+ val32 = (s32) val;
+ val = (s64)hwc->irq_period + (s64)val32;
+ atomic64_counter_set(counter, hwc->prev_count + val);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+ struct perf_data *irqdata = counter->irqdata;
+
+ if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+ irqdata->overrun++;
+ } else {
+ u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+ *p = data;
+ irqdata->len += sizeof(u64);
+ }
+}
+
+/*
+ * NMI-safe enable method:
+ */
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+ u64 pmc_ctrl;
+
+ rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+
+ __hw_perf_save_counter(counter, hwc, idx);
+ __hw_perf_counter_set_period(hwc, idx);
+
+ if (pmc_ctrl & ARCH_PERFMON_EVENTSEL0_ENABLE)
+ __x86_perf_counter_enable(hwc, idx);
+}
+
+static void
+perf_handle_group(struct perf_counter *sibling, u64 *status, u64 *overflown)
+{
+ struct perf_counter *counter, *group_leader = sibling->group_leader;
+ int bit;
+
+ /*
+ * Store the counter's own timestamp first:
+ */
+ perf_store_irq_data(sibling, sibling->hw_event.type);
+ perf_store_irq_data(sibling, atomic64_counter_read(sibling));
+
+ /*
+ * Then store sibling timestamps (if any):
+ */
+ list_for_each_entry(counter, &group_leader->sibling_list, list_entry) {
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+ /*
+ * When counter was not in the overflow mask, we have to
+ * read it from hardware. We read it as well, when it
+ * has not been read yet and clear the bit in the
+ * status mask.
+ */
+ bit = counter->hw.idx;
+ if (!test_bit(bit, (unsigned long *) overflown) ||
+ test_bit(bit, (unsigned long *) status)) {
+ clear_bit(bit, (unsigned long *) status);
+ perf_save_and_restart(counter);
+ }
+ }
+ perf_store_irq_data(sibling, counter->hw_event.type);
+ perf_store_irq_data(sibling, atomic64_counter_read(counter));
+ }
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+ int bit, cpu = smp_processor_id();
+ u64 ack, status, saved_global;
+ struct cpu_hw_counters *cpuc;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, saved_global);
+
+ /* Disable counters globally */
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+ ack_APIC_irq();
+
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (!status)
+ goto out;
+
+again:
+ ack = status;
+ for_each_bit(bit, (unsigned long *) &status, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ clear_bit(bit, (unsigned long *) &status);
+ if (!counter)
+ continue;
+
+ perf_save_and_restart(counter);
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ continue;
+ case PERF_RECORD_IRQ:
+ perf_store_irq_data(counter, instruction_pointer(regs));
+ break;
+ case PERF_RECORD_GROUP:
+ perf_handle_group(counter, &status, &ack);
+ break;
+ }
+ /*
+ * From NMI context we cannot call into the scheduler to
+ * do a task wakeup - but we mark these counters as
+ * wakeup_pending and initiate a wakeup callback:
+ */
+ if (nmi) {
+ counter->wakeup_pending = 1;
+ set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+ } else {
+ wake_up(&counter->waitq);
+ }
+ }
+
+ wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack, 0);
+
+ /*
+ * Repeat if there is more work to be done:
+ */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (status)
+ goto again;
+out:
+ /*
+ * Restore - do not reenable when global enable is off:
+ */
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, saved_global, 0);
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+ irq_enter();
+#ifdef CONFIG_X86_64
+ add_pda(apic_perf_irqs, 1);
+#else
+ per_cpu(irq_stat, smp_processor_id()).apic_perf_irqs++;
+#endif
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ __smp_perf_counter_interrupt(regs, 0);
+
+ irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+ struct cpu_hw_counters *cpuc;
+ unsigned long flags;
+ int bit, cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ for_each_bit(bit, cpuc->used, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ if (!counter)
+ continue;
+
+ if (counter->wakeup_pending) {
+ counter->wakeup_pending = 0;
+ wake_up(&counter->waitq);
+ }
+ }
+
+ local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+ u32 apic_val;
+
+ if (!perf_counters_initialized)
+ return;
+ /*
+ * Enable the performance counter vector in the APIC LVT:
+ */
+ apic_val = apic_read(APIC_LVTERR);
+
+ apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+ if (nmi)
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ else
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+ unsigned long cmd, void *__args)
+{
+ struct die_args *args = __args;
+ struct pt_regs *regs;
+
+ if (likely(cmd != DIE_NMI_IPI))
+ return NOTIFY_DONE;
+
+ regs = args->regs;
+
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ __smp_perf_counter_interrupt(regs, 1);
+
+ return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+ .notifier_call = perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+ union cpuid10_eax eax;
+ unsigned int unused;
+ unsigned int ebx;
+
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+ return;
+
+ /*
+ * Check whether the Architectural PerfMon supports
+ * Branch Misses Retired Event or not.
+ */
+ cpuid(10, &(eax.full), &ebx, &unused, &unused);
+ if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+ return;
+
+ printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+ printk(KERN_INFO "... version: %d\n", eax.split.version_id);
+ printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+ nr_hw_counters = eax.split.num_counters;
+ if (nr_hw_counters > MAX_HW_COUNTERS) {
+ nr_hw_counters = MAX_HW_COUNTERS;
+ WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+ nr_hw_counters, MAX_HW_COUNTERS);
+ }
+ perf_counter_mask = (1 << nr_hw_counters) - 1;
+ perf_max_counters = nr_hw_counters;
+
+ printk(KERN_INFO "... bit_width: %d\n", eax.split.bit_width);
+ printk(KERN_INFO "... mask_length: %d\n", eax.split.mask_length);
+
+ perf_counters_lapic_init(0);
+ register_die_notifier(&perf_counter_nmi_notifier);
+
+ perf_counters_initialized = true;
+}
+
+static const struct hw_perf_counter_ops x86_perf_counter_ops = {
+ .hw_perf_counter_enable = x86_perf_counter_enable,
+ .hw_perf_counter_disable = x86_perf_counter_disable,
+ .hw_perf_counter_read = x86_perf_counter_read,
+};
+
+const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ int err;
+
+ err = __hw_perf_counter_init(counter);
+ if (err)
+ return NULL;
+
+ return &x86_perf_counter_ops;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3194636..fc013cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -984,6 +984,11 @@ apicinterrupt ERROR_APIC_VECTOR \
apicinterrupt SPURIOUS_APIC_VECTOR \
spurious_interrupt smp_spurious_interrupt

+#ifdef CONFIG_PERF_COUNTERS
+apicinterrupt LOCAL_PERF_VECTOR \
+ perf_counter_interrupt smp_perf_counter_interrupt
+#endif
+
/*
* Exception entry points.
*/
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d1d4dc5..d92bc71 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct seq_file *p)
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_printf(p, " Local timer interrupts\n");
+ seq_printf(p, "CNT: ");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+ seq_printf(p, " Performance counter interrupts\n");
#endif
#ifdef CONFIG_SMP
seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)

#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
+ sum += irq_stats(cpu)->apic_perf_irqs;
#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 607db63..6a33b5e 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index 8670b3c..91d785c 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -138,6 +138,11 @@ static void __init apic_intr_init(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+ /* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
}

void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b1cc6da..dee553c 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -6,7 +6,7 @@
* 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes
* 2000-2002 x86-64 support by Andi Kleen
*/
-
+#include <linux/perf_counter.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/smp.h>
@@ -891,6 +891,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
tracehook_notify_resume(regs);
}

+ if (thread_info_flags & _TIF_PERF_COUNTERS) {
+ clear_thread_flag(TIF_PERF_COUNTERS);
+ perf_counter_notify(regs);
+ }
+
#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
#endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..496726d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_perf_counter_open
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5f8d746..a3e66a3 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -270,8 +270,11 @@ static atomic_t c3_cpu_count;
/* Common C-state entry for C2, C3, .. */
static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
{
+ u64 perf_flags;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ perf_flags = hw_perf_save_disable();
if (cstate->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cstate);
@@ -284,6 +287,7 @@ static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(perf_flags);
start_critical_timings();
}
#endif /* !CONFIG_CPU_IDLE */
@@ -1425,8 +1429,11 @@ static inline void acpi_idle_update_bm_rld(struct acpi_processor *pr,
*/
static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
{
+ u64 pctrl;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ pctrl = hw_perf_save_disable();
if (cx->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cx);
@@ -1441,6 +1448,7 @@ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(pctrl);
start_critical_timings();
}

diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index ce0d9da..52146c2 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
#include <linux/kbd_kern.h>
#include <linux/proc_fs.h>
#include <linux/quotaops.h>
+#include <linux/perf_counter.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
+ perf_counter_print_debug();
}
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..8cb095f
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,244 @@
+/*
+ * Performance counters:
+ *
+ * Copyright(C) 2008, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ * Data type definitions, declarations, prototypes.
+ *
+ * Started by: Thomas Gleixner and Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * User-space ABI bits:
+ */
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ /*
+ * Future software events:
+ */
+ /* PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4, */
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ __reserved_1 : 29;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Kernel-internal data types:
+ */
+
+/**
+ * struct hw_perf_counter - performance counter hardware details:
+ */
+struct hw_perf_counter {
+ u64 config;
+ unsigned long config_base;
+ unsigned long counter_base;
+ int nmi;
+ unsigned int idx;
+ u64 prev_count;
+ u64 irq_period;
+ s32 next_count;
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN 2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+ int len;
+ int rd_idx;
+ int overrun;
+ u8 data[PERF_DATA_BUFLEN];
+};
+
+struct perf_counter;
+
+/**
+ * struct hw_perf_counter_ops - performance counter hw ops
+ */
+struct hw_perf_counter_ops {
+ void (*hw_perf_counter_enable) (struct perf_counter *counter);
+ void (*hw_perf_counter_disable) (struct perf_counter *counter);
+ void (*hw_perf_counter_read) (struct perf_counter *counter);
+};
+
+/**
+ * enum perf_counter_active_state - the states of a counter
+ */
+enum perf_counter_active_state {
+ PERF_COUNTER_STATE_OFF = -1,
+ PERF_COUNTER_STATE_INACTIVE = 0,
+ PERF_COUNTER_STATE_ACTIVE = 1,
+};
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+ struct list_head list_entry;
+ struct list_head sibling_list;
+ struct perf_counter *group_leader;
+ const struct hw_perf_counter_ops *hw_ops;
+
+ enum perf_counter_active_state state;
+#if BITS_PER_LONG == 64
+ atomic64_t count;
+#else
+ atomic_t count32[2];
+#endif
+ struct perf_counter_hw_event hw_event;
+ struct hw_perf_counter hw;
+
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * Protect attach/detach:
+ */
+ struct mutex mutex;
+
+ int oncpu;
+ int cpu;
+
+ /* read() / irq related data */
+ wait_queue_head_t waitq;
+ /* optional: for NMIs */
+ int wakeup_pending;
+ struct perf_data *irqdata;
+ struct perf_data *usrdata;
+ struct perf_data data[2];
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+ /*
+ * Protect the list of counters:
+ */
+ spinlock_t lock;
+
+ struct list_head counter_list;
+ int nr_counters;
+ int nr_active;
+ struct task_struct *task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+ struct perf_counter_context ctx;
+ struct perf_counter_context *task_ctx;
+ int active_oncpu;
+ int max_pertask;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter);
+
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *task);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+extern u64 hw_perf_save_disable(void);
+extern void hw_perf_restore(u64 ctrl);
+extern void atomic64_counter_set(struct perf_counter *counter, u64 val64);
+extern u64 atomic64_counter_read(struct perf_counter *counter);
+extern int perf_counter_task_disable(void);
+extern int perf_counter_task_enable(void);
+
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu) { }
+static inline void perf_counter_init_task(struct task_struct *task) { }
+static inline void perf_counter_notify(struct pt_regs *regs) { }
+static inline void perf_counter_print_debug(void) { }
+static inline void hw_perf_restore(u64 ctrl) { }
+static inline u64 hw_perf_save_disable(void) { return 0; }
+static inline int perf_counter_task_disable(void) { return -EINVAL; }
+static inline int perf_counter_task_enable(void) { return -EINVAL; }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 48d887e..b00df4c 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -85,4 +85,7 @@
#define PR_SET_TIMERSLACK 29
#define PR_GET_TIMERSLACK 30

+#define PR_TASK_PERF_COUNTERS_DISABLE 31
+#define PR_TASK_PERF_COUNTERS_ENABLE 32
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..4c53027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/perf_counter.h>
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
@@ -1326,6 +1327,7 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
+ struct perf_counter_context perf_counter_ctx;
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
@@ -2285,6 +2287,13 @@ static inline void inc_syscw(struct task_struct *tsk)
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);
+
+
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..a549678 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct perf_counter_hw_event;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -624,4 +625,11 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+
+asmlinkage int sys_perf_counter_open(
+
+ struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid,
+ int cpu,
+ int group_fd);
#endif
diff --git a/init/Kconfig b/init/Kconfig
index f763762..7d147a3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -732,6 +732,36 @@ config AIO
by some high performance threaded applications. Disabling
this option saves about 7k.

+config HAVE_PERF_COUNTERS
+ bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+ bool "Kernel Performance Counters"
+ depends on HAVE_PERF_COUNTERS
+ default y
+ select ANON_INODES
+ help
+ Enable kernel support for performance counter hardware.
+
+ Performance counters are special hardware registers available
+ on most modern CPUs. These registers count the number of certain
+ types of hw events: such as instructions executed, cachemisses
+ suffered, or branches mis-predicted - without slowing down the
+ kernel or applications. These registers can also trigger interrupts
+ when a threshold number of events have passed - and can thus be
+ used to profile the code that runs on that CPU.
+
+ The Linux Performance Counter subsystem provides an abstraction of
+ these hardware capabilities, available via a system call. It
+ provides per task and per CPU counters, and it provides event
+ capabilities on top of those.
+
+ Say Y if unsure.
+
+endmenu
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 19fad00..1f184a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..441fadf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,6 +975,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

rt_mutex_init_task(p);
+ perf_counter_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..559130b
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,1266 @@
+/*
+ * Performance counter core code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+extern __weak const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ return ERR_PTR(-EINVAL);
+}
+
+u64 __weak hw_perf_save_disable(void) { return 0; }
+void __weak hw_perf_restore(u64 ctrl) { }
+void __weak hw_perf_counter_setup(void) { }
+
+#if BITS_PER_LONG == 64
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 64 bit version - no complications.
+ */
+static inline u64 perf_counter_read_safe(struct perf_counter *counter)
+{
+ return (u64) atomic64_read(&counter->count);
+}
+
+void atomic64_counter_set(struct perf_counter *counter, u64 val)
+{
+ atomic64_set(&counter->count, val);
+}
+
+u64 atomic64_counter_read(struct perf_counter *counter)
+{
+ return atomic64_read(&counter->count);
+}
+
+#else
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 32 bit version.
+ */
+static u64 perf_counter_read_safe(struct perf_counter *counter)
+{
+ u32 cntl, cnth;
+
+ local_irq_disable();
+ do {
+ cnth = atomic_read(&counter->count32[1]);
+ cntl = atomic_read(&counter->count32[0]);
+ } while (cnth != atomic_read(&counter->count32[1]));
+
+ local_irq_enable();
+
+ return cntl | ((u64) cnth) << 32;
+}
+
+void atomic64_counter_set(struct perf_counter *counter, u64 val64)
+{
+ u32 *val32 = (void *)&val64;
+
+ atomic_set(counter->count32 + 0, *(val32 + 0));
+ atomic_set(counter->count32 + 1, *(val32 + 1));
+}
+
+u64 atomic64_counter_read(struct perf_counter *counter)
+{
+ return atomic_read(counter->count32 + 0) |
+ (u64) atomic_read(counter->count32 + 1) << 32;
+}
+
+#endif
+
+static void
+list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *group_leader = counter->group_leader;
+
+ /*
+ * Depending on whether it is a standalone or sibling counter,
+ * add it straight to the context's counter list, or to the group
+ * leader's sibling list:
+ */
+ if (counter->group_leader == counter)
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ else
+ list_add_tail(&counter->list_entry, &group_leader->sibling_list);
+}
+
+static void
+list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *sibling, *tmp;
+
+ list_del_init(&counter->list_entry);
+
+ /*
+ * If this was a group counter with sibling counters then
+ * upgrade the siblings to singleton counters by adding them
+ * to the context list directly:
+ */
+ list_for_each_entry_safe(sibling, tmp,
+ &counter->sibling_list, list_entry) {
+
+ list_del_init(&sibling->list_entry);
+ list_add_tail(&sibling->list_entry, &ctx->counter_list);
+ WARN_ON_ONCE(!sibling->group_leader);
+ WARN_ON_ONCE(sibling->group_leader == sibling);
+ sibling->group_leader = sibling;
+ }
+}
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_counter_remove_from_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ u64 perf_flags;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock(&ctx->lock);
+
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ counter->hw_ops->hw_perf_counter_disable(counter);
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ ctx->nr_active--;
+ cpuctx->active_oncpu--;
+ counter->task = NULL;
+ }
+ ctx->nr_counters--;
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+ list_del_counter(counter, ctx);
+ hw_perf_restore(perf_flags);
+
+ if (!ctx->task) {
+ /*
+ * Allow more per task counters with respect to the
+ * reservation:
+ */
+ cpuctx->max_pertask =
+ min(perf_max_counters - ctx->nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ }
+
+ spin_unlock(&ctx->lock);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_counter_remove_from_context(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Per cpu counters are removed via an smp call and
+		 * the removal is always successful.
+ */
+ smp_call_function_single(counter->cpu,
+ __perf_counter_remove_from_context,
+ counter, 1);
+ return;
+ }
+
+retry:
+ task_oncpu_function_call(task, __perf_counter_remove_from_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && !list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can remove the counter safely if the call above did not
+ * succeed.
+ */
+ if (!list_empty(&counter->list_entry)) {
+ ctx->nr_counters--;
+ list_del_counter(counter, ctx);
+ counter->task = NULL;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ int cpu = smp_processor_id();
+ u64 perf_flags;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+ list_add_counter(counter, ctx);
+ hw_perf_restore(perf_flags);
+
+ ctx->nr_counters++;
+
+ if (cpuctx->active_oncpu < perf_max_counters) {
+ counter->hw_ops->hw_perf_counter_enable(counter);
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu;
+ ctx->nr_active++;
+ cpuctx->active_oncpu++;
+ }
+
+ if (!ctx->task && cpuctx->max_pertask)
+ cpuctx->max_pertask--;
+
+ spin_unlock(&ctx->lock);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+ struct perf_counter *counter,
+ int cpu)
+{
+ struct task_struct *task = ctx->task;
+
+ counter->ctx = ctx;
+ if (!task) {
+ /*
+ * Per cpu counters are installed via an smp call and
+		 * the install is always successful.
+ */
+ smp_call_function_single(cpu, __perf_install_in_context,
+ counter, 1);
+ return;
+ }
+
+ counter->task = task;
+retry:
+ task_oncpu_function_call(task, __perf_install_in_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+	 * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can add the counter safely if the call above did not
+ * succeed.
+ */
+ if (list_empty(&counter->list_entry)) {
+ list_add_counter(counter, ctx);
+ ctx->nr_counters++;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+static void
+counter_sched_out(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+ return;
+
+ counter->hw_ops->hw_perf_counter_disable(counter);
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ counter->oncpu = -1;
+
+ cpuctx->active_oncpu--;
+ ctx->nr_active--;
+}
+
+static void
+group_sched_out(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ struct perf_counter *counter;
+
+ counter_sched_out(group_counter, cpuctx, ctx);
+
+ /*
+ * Schedule out siblings (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+ counter_sched_out(counter, cpuctx, ctx);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!cpuctx->task_ctx))
+ return;
+
+ spin_lock(&ctx->lock);
+ if (ctx->nr_active) {
+ list_for_each_entry(counter, &ctx->counter_list, list_entry)
+ group_sched_out(counter, cpuctx, ctx);
+ }
+ spin_unlock(&ctx->lock);
+ cpuctx->task_ctx = NULL;
+}
+
+static void
+counter_sched_in(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ if (counter->state == PERF_COUNTER_STATE_OFF)
+ return;
+
+ counter->hw_ops->hw_perf_counter_enable(counter);
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu; /* TODO: put 'cpu' into cpuctx->cpu */
+
+ cpuctx->active_oncpu++;
+ ctx->nr_active++;
+}
+
+static void
+group_sched_in(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ struct perf_counter *counter;
+
+ counter_sched_in(group_counter, cpuctx, ctx, cpu);
+
+ /*
+ * Schedule in siblings as one group (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+ counter_sched_in(counter, cpuctx, ctx, cpu);
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ spin_lock(&ctx->lock);
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (ctx->nr_active == cpuctx->max_pertask)
+ break;
+
+ /*
+ * Listen to the 'cpu' scheduling filter constraint
+ * of counters:
+ */
+ if (counter->cpu != -1 && counter->cpu != cpu)
+ continue;
+
+ group_sched_in(counter, cpuctx, ctx, cpu);
+ }
+ spin_unlock(&ctx->lock);
+
+ cpuctx->task_ctx = ctx;
+}
+
+int perf_counter_task_disable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ local_irq_disable();
+ cpu = smp_processor_id();
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Disable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ WARN_ON_ONCE(counter->state == PERF_COUNTER_STATE_ACTIVE);
+ counter->state = PERF_COUNTER_STATE_OFF;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ local_irq_enable();
+
+ return 0;
+}
+
+int perf_counter_task_enable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ local_irq_disable();
+ cpu = smp_processor_id();
+
+ spin_lock(&ctx->lock);
+
+ /*
+	 * Enable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->state != PERF_COUNTER_STATE_OFF)
+ continue;
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+
+ local_irq_enable();
+
+ return 0;
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Rotate the first entry last (works just fine for group counters too):
+ */
+ perf_flags = hw_perf_save_disable();
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ list_del(&counter->list_entry);
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ break;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+ struct task_struct *task)
+{
+ spin_lock_init(&ctx->lock);
+ INIT_LIST_HEAD(&ctx->counter_list);
+ ctx->nr_counters = 0;
+ ctx->task = task;
+}
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *task)
+{
+ __perf_counter_init_context(&task->perf_counter_ctx, task);
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+ struct perf_counter *counter = info;
+
+ counter->hw_ops->hw_perf_counter_read(counter);
+}
+
+static u64 perf_counter_read(struct perf_counter *counter)
+{
+ /*
+ * If counter is enabled and currently active on a CPU, update the
+ * value in the counter structure:
+ */
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ smp_call_function_single(counter->oncpu,
+ __hw_perf_counter_read, counter, 1);
+ }
+
+ return perf_counter_read_safe(counter);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task) {
+ if (cpuctx->task_ctx != ctx)
+ return;
+ spin_lock(&ctx->lock);
+ }
+
+ /* Change the pointer NMI safe */
+ atomic_long_set((atomic_long_t *)&counter->irqdata,
+ (unsigned long) counter->usrdata);
+ counter->usrdata = oldirqdata;
+
+ if (ctx->task)
+ spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ smp_call_function_single(counter->cpu,
+ __perf_switch_irq_data,
+ counter, 1);
+ return counter->usrdata;
+ }
+
+retry:
+ spin_lock_irq(&ctx->lock);
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+ counter->irqdata = counter->usrdata;
+ counter->usrdata = oldirqdata;
+ spin_unlock_irq(&ctx->lock);
+ return oldirqdata;
+ }
+ spin_unlock_irq(&ctx->lock);
+ task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+ /* Might have failed, because task was scheduled out */
+ if (counter->irqdata == oldirqdata)
+ goto retry;
+
+ return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+ if (ctx->task)
+ put_task_struct(ctx->task);
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * If cpu is not a wildcard then this is a percpu counter:
+ */
+ if (cpu != -1) {
+ /* Must be root to operate on a CPU counter: */
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EACCES);
+
+ if (cpu < 0 || cpu > num_possible_cpus())
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * We could be clever and allow to attach a counter to an
+ * offline CPU and activate it when the CPU comes up, but
+ * that's for later.
+ */
+ if (!cpu_isset(cpu, cpu_online_map))
+ return ERR_PTR(-ENODEV);
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ ctx = &cpuctx->ctx;
+
+ WARN_ON_ONCE(ctx->task);
+ return ctx;
+ }
+
+ rcu_read_lock();
+ if (!pid)
+ task = current;
+ else
+ task = find_task_by_vpid(pid);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ if (!task)
+ return ERR_PTR(-ESRCH);
+
+ ctx = &task->perf_counter_ctx;
+ ctx->task = task;
+
+ /* Reuse ptrace permission checks for now. */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ put_context(ctx);
+ return ERR_PTR(-EACCES);
+ }
+
+ return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+ struct perf_counter *counter = file->private_data;
+ struct perf_counter_context *ctx = counter->ctx;
+
+ file->private_data = NULL;
+
+ mutex_lock(&counter->mutex);
+
+ perf_counter_remove_from_context(counter);
+ put_context(ctx);
+
+ mutex_unlock(&counter->mutex);
+
+ kfree(counter);
+
+ return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+ u64 cntval;
+
+ if (count != sizeof(cntval))
+ return -EINVAL;
+
+ mutex_lock(&counter->mutex);
+ cntval = perf_counter_read(counter);
+ mutex_unlock(&counter->mutex);
+
+ return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+ if (!usrdata->len)
+ return 0;
+
+ count = min(count, (size_t)usrdata->len);
+ if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+ return -EFAULT;
+
+ /* Adjust the counters */
+ usrdata->len -= count;
+ if (!usrdata->len)
+ usrdata->rd_idx = 0;
+ else
+ usrdata->rd_idx += count;
+
+ return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter *counter,
+ char __user *buf,
+ size_t count,
+ int nonblocking)
+{
+ struct perf_data *irqdata, *usrdata;
+ DECLARE_WAITQUEUE(wait, current);
+ ssize_t res;
+
+ irqdata = counter->irqdata;
+ usrdata = counter->usrdata;
+
+ if (usrdata->len + irqdata->len >= count)
+ goto read_pending;
+
+ if (nonblocking)
+ return -EAGAIN;
+
+ spin_lock_irq(&counter->waitq.lock);
+ __add_wait_queue(&counter->waitq, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (usrdata->len + irqdata->len >= count)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ spin_unlock_irq(&counter->waitq.lock);
+ schedule();
+ spin_lock_irq(&counter->waitq.lock);
+ }
+ __remove_wait_queue(&counter->waitq, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&counter->waitq.lock);
+
+ if (usrdata->len + irqdata->len < count)
+ return -ERESTARTSYS;
+read_pending:
+ mutex_lock(&counter->mutex);
+
+ /* Drain pending data first: */
+ res = perf_copy_usrdata(usrdata, buf, count);
+ if (res < 0 || res == count)
+ goto out;
+
+ /* Switch irq buffer: */
+ usrdata = perf_switch_irq_data(counter);
+ if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+ if (!res)
+ res = -EFAULT;
+ } else {
+ res = count;
+ }
+out:
+ mutex_unlock(&counter->mutex);
+
+ return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ struct perf_counter *counter = file->private_data;
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ return perf_read_hw(counter, buf, count);
+
+ case PERF_RECORD_IRQ:
+ case PERF_RECORD_GROUP:
+ return perf_read_irq_data(counter, buf, count,
+ file->f_flags & O_NONBLOCK);
+ }
+ return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+ struct perf_counter *counter = file->private_data;
+ unsigned int events = 0;
+ unsigned long flags;
+
+ poll_wait(file, &counter->waitq, wait);
+
+ spin_lock_irqsave(&counter->waitq.lock, flags);
+ if (counter->usrdata->len || counter->irqdata->len)
+ events |= POLLIN;
+ spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+ return events;
+}
+
+static const struct file_operations perf_fops = {
+ .release = perf_release,
+ .read = perf_read,
+ .poll = perf_poll,
+};
+
+static void cpu_clock_perf_counter_enable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_disable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_read(struct perf_counter *counter)
+{
+ int cpu = raw_smp_processor_id();
+
+ atomic64_counter_set(counter, cpu_clock(cpu));
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
+ .hw_perf_counter_enable = cpu_clock_perf_counter_enable,
+ .hw_perf_counter_disable = cpu_clock_perf_counter_disable,
+ .hw_perf_counter_read = cpu_clock_perf_counter_read,
+};
+
+static void task_clock_perf_counter_enable(struct perf_counter *counter)
+{
+}
+
+static void task_clock_perf_counter_disable(struct perf_counter *counter)
+{
+}
+
+static void task_clock_perf_counter_read(struct perf_counter *counter)
+{
+ atomic64_counter_set(counter, current->se.sum_exec_runtime);
+}
+
+static const struct hw_perf_counter_ops perf_ops_task_clock = {
+ .hw_perf_counter_enable = task_clock_perf_counter_enable,
+ .hw_perf_counter_disable = task_clock_perf_counter_disable,
+ .hw_perf_counter_read = task_clock_perf_counter_read,
+};
+
+static const struct hw_perf_counter_ops *
+sw_perf_counter_init(struct perf_counter *counter)
+{
+ const struct hw_perf_counter_ops *hw_ops = NULL;
+
+ switch (counter->hw_event.type) {
+ case PERF_COUNT_CPU_CLOCK:
+ hw_ops = &perf_ops_cpu_clock;
+ break;
+ case PERF_COUNT_TASK_CLOCK:
+ hw_ops = &perf_ops_task_clock;
+ break;
+ default:
+ break;
+ }
+ return hw_ops;
+}
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(struct perf_counter_hw_event *hw_event,
+ int cpu,
+ struct perf_counter *group_leader)
+{
+ const struct hw_perf_counter_ops *hw_ops;
+ struct perf_counter *counter;
+
+ counter = kzalloc(sizeof(*counter), GFP_KERNEL);
+ if (!counter)
+ return NULL;
+
+ /*
+ * Single counters are their own group leaders, with an
+ * empty sibling list:
+ */
+ if (!group_leader)
+ group_leader = counter;
+
+ mutex_init(&counter->mutex);
+ INIT_LIST_HEAD(&counter->list_entry);
+ INIT_LIST_HEAD(&counter->sibling_list);
+ init_waitqueue_head(&counter->waitq);
+
+ counter->irqdata = &counter->data[0];
+ counter->usrdata = &counter->data[1];
+ counter->cpu = cpu;
+ counter->hw_event = *hw_event;
+ counter->wakeup_pending = 0;
+ counter->group_leader = group_leader;
+ counter->hw_ops = NULL;
+
+ hw_ops = NULL;
+ if (!hw_event->raw && hw_event->type < 0)
+ hw_ops = sw_perf_counter_init(counter);
+ if (!hw_ops) {
+ hw_ops = hw_perf_counter_init(counter);
+ }
+
+ if (!hw_ops) {
+ kfree(counter);
+ return NULL;
+ }
+ counter->hw_ops = hw_ops;
+
+ return counter;
+}
+
+/**
+ * sys_perf_task_open - open a performance counter, associate it to a task/cpu
+ *
+ * @hw_event_uptr: event type attributes for monitoring/sampling
+ * @pid: target pid
+ * @cpu: target cpu
+ * @group_fd: group leader counter fd
+ */
+asmlinkage int
+sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid, int cpu, int group_fd)
+{
+ struct perf_counter *counter, *group_leader;
+ struct perf_counter_hw_event hw_event;
+ struct perf_counter_context *ctx;
+ struct file *group_file = NULL;
+ int fput_needed = 0;
+ int ret;
+
+ if (copy_from_user(&hw_event, hw_event_uptr, sizeof(hw_event)) != 0)
+ return -EFAULT;
+
+ /*
+ * Get the target context (task or percpu):
+ */
+ ctx = find_get_context(pid, cpu);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ /*
+ * Look up the group leader (we will attach this counter to it):
+ */
+ group_leader = NULL;
+ if (group_fd != -1) {
+ ret = -EINVAL;
+ group_file = fget_light(group_fd, &fput_needed);
+ if (!group_file)
+ goto err_put_context;
+ if (group_file->f_op != &perf_fops)
+ goto err_put_context;
+
+ group_leader = group_file->private_data;
+ /*
+ * Do not allow a recursive hierarchy (this new sibling
+ * becoming part of another group-sibling):
+ */
+ if (group_leader->group_leader != group_leader)
+ goto err_put_context;
+ /*
+ * Do not allow to attach to a group in a different
+ * task or CPU context:
+ */
+ if (group_leader->ctx != ctx)
+ goto err_put_context;
+ }
+
+ ret = -EINVAL;
+ counter = perf_counter_alloc(&hw_event, cpu, group_leader);
+ if (!counter)
+ goto err_put_context;
+
+ perf_install_in_context(ctx, counter, cpu);
+
+ ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+ if (ret < 0)
+ goto err_remove_free_put_context;
+
+out_fput:
+ fput_light(group_file, fput_needed);
+
+ return ret;
+
+err_remove_free_put_context:
+ mutex_lock(&counter->mutex);
+ perf_counter_remove_from_context(counter);
+ mutex_unlock(&counter->mutex);
+ kfree(counter);
+
+err_put_context:
+ put_context(ctx);
+
+ goto out_fput;
+}
+
+static void __cpuinit perf_counter_init_cpu(int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ __perf_counter_init_context(&cpuctx->ctx, NULL);
+
+ mutex_lock(&perf_resource_mutex);
+ cpuctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+ mutex_unlock(&perf_resource_mutex);
+
+ hw_perf_counter_setup();
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_counter_exit_cpu(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+ struct perf_counter *counter, *tmp;
+
+ list_for_each_entry_safe(counter, tmp, &ctx->counter_list, list_entry)
+ __perf_counter_remove_from_context(counter);
+
+}
+static void perf_counter_exit_cpu(int cpu)
+{
+ smp_call_function_single(cpu, __perf_counter_exit_cpu, NULL, 1);
+}
+#else
+static inline void perf_counter_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ switch (action) {
+
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ perf_counter_init_cpu(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ perf_counter_exit_cpu(cpu);
+ break;
+
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+ .notifier_call = perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+ perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_cpu_notifier(&perf_cpu_nb);
+
+ return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+ const char *buf,
+ size_t count)
+{
+ struct perf_cpu_context *cpuctx;
+ unsigned long val;
+ int err, cpu, mpt;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > perf_max_counters)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_reserved_percpu = val;
+ for_each_online_cpu(cpu) {
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ spin_lock_irq(&cpuctx->ctx.lock);
+ mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ cpuctx->max_pertask = mpt;
+ spin_unlock_irq(&cpuctx->ctx.lock);
+ }
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+ unsigned long val;
+ int err;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > 1)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_overcommit = val;
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+ reserve_percpu,
+ 0644,
+ perf_show_reserve_percpu,
+ perf_set_reserve_percpu
+ );
+
+static SYSDEV_CLASS_ATTR(
+ overcommit,
+ 0644,
+ perf_show_overcommit,
+ perf_set_overcommit
+ );
+
+static struct attribute *perfclass_attrs[] = {
+ &attr_reserve_percpu.attr,
+ &attr_overcommit.attr,
+ NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+ .attrs = perfclass_attrs,
+ .name = "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+ return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+ &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
diff --git a/kernel/sched.c b/kernel/sched.c
index b7480fb..254d56d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2212,6 +2212,27 @@ static int sched_balance_self(int cpu, int flag)

#endif /* CONFIG_SMP */

+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p: the task to evaluate
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ */
+void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info)
+{
+ int cpu;
+
+ preempt_disable();
+ cpu = task_cpu(p);
+ if (task_curr(p))
+ smp_call_function_single(cpu, func, info, 1);
+ preempt_enable();
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2534,6 +2555,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
fire_sched_out_preempt_notifiers(prev, next);
+ perf_counter_task_sched_out(prev, cpu_of(rq));
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
}
@@ -2574,6 +2596,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
finish_arch_switch(prev);
+ perf_counter_task_sched_in(current, cpu_of(rq));
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
if (current->sched_class->post_schedule)
@@ -4296,6 +4319,7 @@ void scheduler_tick(void)
rq->idle_at_tick = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
+ perf_counter_task_tick(curr, cpu);
}

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
diff --git a/kernel/sys.c b/kernel/sys.c
index 31deba8..0f66633 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -14,6 +14,7 @@
#include <linux/prctl.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/perf_counter.h>
#include <linux/resource.h>
#include <linux/kernel.h>
#include <linux/kexec.h>
@@ -1716,6 +1717,12 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
case PR_SET_TSC:
error = SET_TSC_CTL(arg2);
break;
+ case PR_TASK_PERF_COUNTERS_DISABLE:
+ error = perf_counter_task_disable();
+ break;
+ case PR_TASK_PERF_COUNTERS_ENABLE:
+ error = perf_counter_task_enable();
+ break;
case PR_GET_TIMERSLACK:
error = current->timer_slack_ns;
break;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..4be8bbc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
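
For completeness, here is a minimal user-space example of opening and reading
a single counter (illustration only, not part of the patch - the syscall
number below is a placeholder and the struct merely mirrors the kernel's
perf_counter_hw_event definition):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

struct perf_counter_hw_event {
	int64_t		type;			/* PERF_COUNT_* or raw event type */
	uint64_t	irq_period;
	uint32_t	record_type;		/* 0 == PERF_RECORD_SIMPLE */
	uint32_t	disabled     :  1,
			nmi          :  1,
			raw          :  1,
			__reserved_1 : 29;
	uint64_t	__reserved_2;
};

#define PERF_COUNT_INSTRUCTIONS		1
#define __NR_perf_counter_open		333	/* placeholder, arch specific */

int main(void)
{
	struct perf_counter_hw_event hw_event = {
		.type = PERF_COUNT_INSTRUCTIONS,
	};
	uint64_t count;
	int fd;

	/* pid == 0: current task, cpu == -1: all CPUs, group_fd == -1: standalone */
	fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
	if (fd < 0)
		return 1;

	/* ... workload to be measured ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("instructions: %llu\n", (unsigned long long)count);

	close(fd);
	return 0;
}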


2008-12-11 18:21:56

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


Can someone tell me which performance counter implementation is likely to
get merged into the Kernel?

I have at least 60 machines that I do regular performance counter work on.
They involve Pentium Pro, Pentium II, 32-bit Athlon, 64-bit Athlon,
Pentium 4, Pentium D, Core, Core2, Atom, MIPS R12k, Niagara T1,
and PPC/Playstation 3.

Perfmon3 works for all of those 60 machines. This new proposal works on
2 out of the 60.

Who is going to add support for all of those machines? I've spent a lot
of developer time getting perfmon going for all of those configurations.
But why should I help out with this new inferior proposal? It could all
be another waste of time.

So I'd like someone to commit to some performance monitoring architecture.
Otherwise we're going to waste thousands of hours of developer time around
the world. It's all pointless.

Also, my primary method of using counters is total aggregate count for a
single user-space process. So I use perfmon's pfmon tool to run an entire
long-running program, gathering full stats only at the very end. pfmon can
do this with pretty much zero overhead (I have lots of data and a few
publications using this method). Can this new infrastructure do this? I
find the documentation/tools support to be very incomplete.

One comment on the patch.


> + /*
> + * Common hardware events, generalized by the kernel:
> + */
> + PERF_COUNT_CYCLES = 0,
> + PERF_COUNT_INSTRUCTIONS = 1,
> + PERF_COUNT_CACHE_REFERENCES = 2,
> + PERF_COUNT_CACHE_MISSES = 3,
> + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> + PERF_COUNT_BRANCH_MISSES = 5,

Many machines do not support these counts. For example, Niagara T1 does
not have a CYCLES count. And good luck if you think you can easily come
up with something meaningful for the various kind of CACHE_MISSES on the
Pentium 4. Also, the Pentium D has various flavors of retired instruction
count with slightly different semantics. This kind of abstraction should
be done in userspace.

Vince

2008-12-11 18:35:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Thu, 11 Dec 2008 16:52:30 +0100
Ingo Molnar <[email protected]> wrote:

> To: [email protected]
> Cc: Thomas Gleixner <[email protected]>, Andrew Morton <[email protected]>, Stephane Eranian <[email protected]>, Eric Dumazet <[email protected]>, Robert Richter <[email protected]>, Arjan van de Veen <[email protected]>, Peter Anvin <[email protected]>, Peter Zijlstra <[email protected]>, Paul Mackerras <[email protected]>, "David S. Miller" <[email protected]>

Please copy [email protected] on all this. That is where
the real-world people who use these facilities on a regular basis hang out.

2008-12-11 19:11:21

by Tony Luck

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

> /*
> * Special "software" counters provided by the kernel, even if
> * the hardware does not support performance counters. These
> * counters measure various physical and sw events of the
> * kernel (and allow the profiling of them as well):
> */
> PERF_COUNT_CPU_CLOCK = -1,
> PERF_COUNT_TASK_CLOCK = -2,
> /*
> * Future software events:
> */
> /* PERF_COUNT_PAGE_FAULTS = -3,
> PERF_COUNT_CONTEXT_SWITCHES = -4, */

...
> +[ Note: more hw_event_types are supported as well, but they are CPU
> + specific and are enumerated via /sys on a per CPU basis. Raw hw event
> + types can be passed in as negative numbers. For example, to count
> + "External bus cycles while bus lock signal asserted" events on Intel
> + Core CPUs, pass in a -0x4064 event type value. ]

It looks like you have an overlap here. You are using some negative numbers
to denote your special software events, but also as "raw" hardware events.
What if these conflict?

-Tony

2008-12-11 19:50:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* Tony Luck <[email protected]> wrote:

> > /*
> > * Special "software" counters provided by the kernel, even if
> > * the hardware does not support performance counters. These
> > * counters measure various physical and sw events of the
> > * kernel (and allow the profiling of them as well):
> > */
> > PERF_COUNT_CPU_CLOCK = -1,
> > PERF_COUNT_TASK_CLOCK = -2,
> > /*
> > * Future software events:
> > */
> > /* PERF_COUNT_PAGE_FAULTS = -3,
> > PERF_COUNT_CONTEXT_SWITCHES = -4, */
>
> ...
> > +[ Note: more hw_event_types are supported as well, but they are CPU
> > + specific and are enumerated via /sys on a per CPU basis. Raw hw event
> > + types can be passed in as negative numbers. For example, to count
> > + "External bus cycles while bus lock signal asserted" events on Intel
> > + Core CPUs, pass in a -0x4064 event type value. ]
>
> It looks like you have an overlap here. You are using some negative
> numbers to denote your special software events, but also as "raw"
> hardware events. What if these conflict?

that's an old comment, not a bug in the code - thx for pointing it out, i
just fixed the comments - see the commit below.

Raw events are now done without using up negative numbers, they are done
via:

struct perf_counter_hw_event {
s64 type;

u64 irq_period;
u32 record_type;

u32 disabled : 1, /* off by default */
nmi : 1, /* NMI sampling */
raw : 1, /* raw event type */
__reserved_1 : 29;

u64 __reserved_2;
};

if the hw_event.raw bit is set to 1, then the hw_event.type is fully
'raw'. The default is for raw to be 0. So negative numbers can be used
for sw events, positive numbers for hw events. Both can be extended
gradually, without arbitrary limits introduced.
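
i.e. something like this (sketch only - 0x4064 is the Intel Core example
event from the docs, PERF_COUNT_CPU_CLOCK is -1):

	struct perf_counter_hw_event hw_event = { 0 };

	/* raw, CPU-specific event type: */
	hw_event.raw  = 1;
	hw_event.type = 0x4064;

	/* or a kernel-provided sw event - raw stays 0, type is negative: */
	hw_event.raw  = 0;
	hw_event.type = PERF_COUNT_CPU_CLOCK;	/* == -1 */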

Ingo

------------------------->
>From 447557ac7ce120306b4a31d6003faef39cb1bf14 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Thu, 11 Dec 2008 20:40:18 +0100
Subject: [PATCH] perf counters: update docs

Impact: update docs

Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/perf-counters.txt | 107 +++++++++++++++++++++++++++------------
1 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
index 19033a0..fddd321 100644
--- a/Documentation/perf-counters.txt
+++ b/Documentation/perf-counters.txt
@@ -10,8 +10,8 @@ trigger interrupts when a threshold number of events have passed - and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
-hardware capabilities. It provides per task and per CPU counters, and
-it provides event capabilities on top of those.
+hardware capabilities. It provides per task and per CPU counters, counter
+groups, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.
@@ -19,12 +19,8 @@ There's one file descriptor per virtual counter used.
The special file descriptor is opened via the perf_counter_open()
system call:

- int
- perf_counter_open(u32 hw_event_type,
- u32 hw_event_period,
- u32 record_type,
- pid_t pid,
- int cpu);
+ int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
+ pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
@@ -33,39 +29,78 @@ can be used to set the blocking mode, etc.
Multiple counters can be kept open at a time, and the counters
can be poll()ed.

-When creating a new counter fd, 'hw_event_type' is one of:
-
- enum hw_event_types {
- PERF_COUNT_CYCLES,
- PERF_COUNT_INSTRUCTIONS,
- PERF_COUNT_CACHE_REFERENCES,
- PERF_COUNT_CACHE_MISSES,
- PERF_COUNT_BRANCH_INSTRUCTIONS,
- PERF_COUNT_BRANCH_MISSES,
- };
+When creating a new counter fd, 'perf_counter_hw_event' is:
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ __reserved_1 : 29;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ /*
+ * Future software events:
+ */
+ /* PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4, */
+};

These are standardized types of events that work uniformly on all CPUs
that implements Performance Counters support under Linux. If a CPU is
not able to count branch-misses, then the system call will return
-EINVAL.

-[ Note: more hw_event_types are supported as well, but they are CPU
- specific and are enumerated via /sys on a per CPU basis. Raw hw event
- types can be passed in as negative numbers. For example, to count
- "External bus cycles while bus lock signal asserted" events on Intel
- Core CPUs, pass in a -0x4064 event type value. ]
-
-The parameter 'hw_event_period' is the number of events before waking up
-a read() that is blocked on a counter fd. Zero value means a non-blocking
-counter.
+More hw_event_types are supported as well, but they are CPU
+specific and are enumerated via /sys on a per CPU basis. Raw hw event
+types can be passed in under hw_event.type if hw_event.raw is 1.
+For example, to count "External bus cycles while bus lock signal asserted"
+events on Intel Core CPUs, pass in a 0x4064 event type value and set
+hw_event.raw to 1.

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

- enum perf_record_type {
- PERF_RECORD_SIMPLE,
- PERF_RECORD_IRQ,
- };
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};

a "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
@@ -76,6 +111,10 @@ the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.

+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
The 'pid' parameter allows the counter to be specific to a task:

pid == 0: if the pid parameter is zero, the counter is attached to the
@@ -92,7 +131,7 @@ CPU:
cpu >= 0: the counter is restricted to a specific CPU
cpu == -1: the counter counts on all CPUs

-Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
@@ -102,3 +141,7 @@ their own tasks.
A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.

+Group counters are created by passing in a group_fd of another counter.
+Groups are scheduled at once and can be used with PERF_RECORD_GROUP
+to record multi-dimensional timestamps.
+

2008-12-11 22:05:22

by William Cohen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

I was taking a look at the proposed performance monitoring and kerneltop.c. I
noticed that http://redhat.com/~mingo/perfcounters/kerneltop.c doesn't work
with the v3 version. I didn't see a more recent version available, so I made
some modifications to allow it to work with the v3 kernel (with the
attached). However, I assume somewhere there is an updated version of kerneltop.c

The Documentation/perf-counters.txt doesn't describe how the group_fd is used.
Found that -1 used to indicate not connected to any other fd.
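
Roughly this pattern, as far as I could tell (sketch only - perf_counter_open()
here is just a thin wrapper around the raw syscall, events filled in as in the
patch):

	/* group_fd == -1: a standalone counter / group leader */
	int leader  = perf_counter_open(&cycles_event, /* pid */ 0, /* cpu */ -1, -1);

	/* passing the leader's fd as group_fd makes this counter a sibling */
	int sibling = perf_counter_open(&insns_event, 0, -1, leader);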

-Will


Attachments:
v3.diff (2.36 kB)

2008-12-12 06:22:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* Andrew Morton <[email protected]> wrote:

> On Thu, 11 Dec 2008 16:52:30 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > To: [email protected]
> > Cc: Thomas Gleixner <[email protected]>, Andrew Morton <[email protected]>, Stephane Eranian <[email protected]>, Eric Dumazet <[email protected]>, Robert Richter <[email protected]>, Arjan van de Veen <[email protected]>, Peter Anvin <[email protected]>, Peter Zijlstra <[email protected]>, Paul Mackerras <[email protected]>, "David S. Miller" <[email protected]>
>
> Please copy [email protected] on all this. That is
> where the real-world people who use these facilities on a regular basis
> hang out.

Sure, we'll do that for v4.

The reason we kept posting this to lkml initially was because there is a
visible detachment of this community from kernel developers. And that is
at least in part because this stuff has never been made interesting
enough to kernel developers. I dont remember a _single_ perfmon-generated
profile (be that user-space or kernel-space) in my mailbox before - and
optimizing the kernel is supposed to be one of the most important aspects
of performance tuning.

That's why we concentrate on making this useful and interesting to kernel
developers too via KernelTop, that's why we made the BTS/[PEBS] hardware
tracer available via an ftrace plugin, etc.

Furthermore, kernel developers tend to be quite good at co-designing,
influencing [and flaming ;-) ] such APIs at the early prototype stages,
so the main early technical feedback we were looking for on the kernel
side structure was lkml. But the wider community is not ignored either,
of course - with v4 it might be useful already for wider circulation.

Ingo

2008-12-12 08:26:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:

> I have at least 60 machines that I do regular performance counter work on.
> They involve Pentium Pro, Pentium II, 32-bit Athlon, 64-bit Athlon,
> Pentium 4, Pentium D, Core, Core2, Atom, MIPS R12k, Niagara T1,
> and PPC/Playstation 3.

Good.

> Perfmon3 works for all of those 60 machines. This new proposal works on a
> 2 out of the 60.

s/works/is implemented/

> Who is going to add support for all of those machines? I've spent a lot
> of developer time getting prefmon going for all of those configurations.
> But why should I help out with this new inferior proposal? It could all
> be another waste of time.

So much for constructive criticism.. have you tried taking the design to
its limits, if so, where do you see problems?

I read the above as: I invested a lot of time in something of dubious
status (out of tree patch), and now expect it to be merged because I
have invested in it.

> Also, my primary method of using counters is total aggregate count for a
> single user-space process.

Process, as in single thread, or multi-threaded? I'll assume
single-thread.

> Can this new infrastructure to this?

Yes, afaict it can.

You can group counters in v3, a read out of such a group will be an
atomic read out and provide vectored output that contains all the data
in one stream.
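
Roughly (untested sketch - perf_counter_open() being a thin wrapper around
the new syscall, hw_event filled in as per the patch):

	/* pid > 0, cpu == -1: per task counter that follows the target task */
	struct perf_counter_hw_event ev = { .type = PERF_COUNT_INSTRUCTIONS };
	int fd = perf_counter_open(&ev, target_pid, -1, -1);

	/* ... let the program run ... */

	/* PERF_RECORD_SIMPLE: read() hands back the aggregate u64 count */
	uint64_t total;
	read(fd, &total, sizeof(total));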

> I find the documentation/tools support to be very incomplete.

Gosh, what does one expect from something that is hardly a week old..

> One comment on the patch.
>
> > + /*
> > + * Common hardware events, generalized by the kernel:
> > + */
> > + PERF_COUNT_CYCLES = 0,
> > + PERF_COUNT_INSTRUCTIONS = 1,
> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > + PERF_COUNT_CACHE_MISSES = 3,
> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > + PERF_COUNT_BRANCH_MISSES = 5,
>
> Many machines do not support these counts. For example, Niagara T1 does
> not have a CYCLES count. And good luck if you think you can easily come
> up with something meaningful for the various kind of CACHE_MISSES on the
> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> count with slightly different semantics. This kind of abstraction should
> be done in userspace.

I'll argue to disagree, sure such events might not be supported by any
particular hardware implementation - but the fact that PAPI gives a list
of 'common' events means that they are, well, common. So unifying them
between those archs that do implement them seems like a sane choice, no?

For those archs that do not support it, it will just fail to open. No
harm done.

The proposal allows for you to specify raw hardware events, so you can
just totally ignore this part of the abstraction.
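
E.g. (sketch, perf_counter_open() being a thin syscall wrapper):

	/* try the generic event first */
	fd = perf_counter_open(&branch_misses_event, 0, -1, -1);
	if (fd < 0 && errno == EINVAL) {
		/* this CPU can't count it generically; fall back to a raw,
		   CPU-specific event type (hw_event.raw = 1) */
	}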

2008-12-12 08:29:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:

> struct perf_counter_hw_event {
> s64 type;
>
> u64 irq_period;
> u32 record_type;
>
> u32 disabled : 1, /* off by default */
> nmi : 1, /* NMI sampling */
> raw : 1, /* raw event type */
> __reserved_1 : 29;
>
> u64 __reserved_2;
> };
>
> if the hw_event.raw bit is set to 1, then the hw_event.type is fully
> 'raw'. The default is for raw to be 0. So negative numbers can be used
> for sw events, positive numbers for hw events. Both can be extended
> gradually, without arbitrarily limits introduced.

On that, I still don't think its a good idea to use bitfields in an ABI.
The C std is just not strict enough on them, and I guess that is the
reason this would be the first such usage.
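
An explicit flags word would nail the layout down, whereas bitfield order and
packing are implementation defined. Something like (names made up, just to
illustrate the alternative):

	struct perf_counter_hw_event {
		__s64	type;

		__u64	irq_period;
		__u32	record_type;

		__u32	flags;		/* explicit bits instead of bitfields */

		__u64	__reserved_1;
	};

	#define PERF_HW_EVENT_DISABLED	(1U << 0)	/* off by default */
	#define PERF_HW_EVENT_NMI	(1U << 1)	/* NMI sampling   */
	#define PERF_HW_EVENT_RAW	(1U << 2)	/* raw event type */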

2008-12-12 08:35:55

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
>> > + /*
>> > + * Common hardware events, generalized by the kernel:
>> > + */
>> > + PERF_COUNT_CYCLES = 0,
>> > + PERF_COUNT_INSTRUCTIONS = 1,
>> > + PERF_COUNT_CACHE_REFERENCES = 2,
>> > + PERF_COUNT_CACHE_MISSES = 3,
>> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
>> > + PERF_COUNT_BRANCH_MISSES = 5,
>>
>> Many machines do not support these counts. For example, Niagara T1 does
>> not have a CYCLES count. And good luck if you think you can easily come
>> up with something meaningful for the various kind of CACHE_MISSES on the
>> Pentium 4. Also, the Pentium D has various flavors of retired instruction
>> count with slightly different semantics. This kind of abstraction should
>> be done in userspace.
>
> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?
>
> For those archs that do not support it, it will just fail to open. No
> harm done.
>
> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.
>
I believe the cache related events do not belong in here. There is no definition
for them. You don't know what cache miss level, what kind of access. You cannot
do this even on Intel Core processors.

2008-12-12 08:51:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> Peter,
>
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> >> > + /*
> >> > + * Common hardware events, generalized by the kernel:
> >> > + */
> >> > + PERF_COUNT_CYCLES = 0,
> >> > + PERF_COUNT_INSTRUCTIONS = 1,
> >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> >> > + PERF_COUNT_CACHE_MISSES = 3,
> >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> >> > + PERF_COUNT_BRANCH_MISSES = 5,
> >>
> >> Many machines do not support these counts. For example, Niagara T1 does
> >> not have a CYCLES count. And good luck if you think you can easily come
> >> up with something meaningful for the various kind of CACHE_MISSES on the
> >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> >> count with slightly different semantics. This kind of abstraction should
> >> be done in userspace.
> >
> > I'll argue to disagree, sure such events might not be supported by any
> > particular hardware implementation - but the fact that PAPI gives a list
> > of 'common' events means that they are, well, common. So unifying them
> > between those archs that do implement them seems like a sane choice, no?
> >
> > For those archs that do not support it, it will just fail to open. No
> > harm done.
> >
> > The proposal allows for you to specify raw hardware events, so you can
> > just totally ignore this part of the abstraction.
> >
> I believe the cache related events do not belong in here. There is no definition
> for them. You don't know what cache miss level, what kind of access. You cannot
> do this even on Intel Core processors.

I might agree with that, perhaps we should model this to the common list
PAPI specifies?

2008-12-12 08:55:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:
>
> > struct perf_counter_hw_event {
> > s64 type;
> >
> > u64 irq_period;
> > u32 record_type;
> >
> > u32 disabled : 1, /* off by default */
> > nmi : 1, /* NMI sampling */
> > raw : 1, /* raw event type */
> > __reserved_1 : 29;
> >
> > u64 __reserved_2;
> > };
> >
> > if the hw_event.raw bit is set to 1, then the hw_event.type is fully
> > 'raw'. The default is for raw to be 0. So negative numbers can be used
> > for sw events, positive numbers for hw events. Both can be extended
> > gradually, without arbitrarily limits introduced.
>
> On that, I still don't think its a good idea to use bitfields in an
> ABI. The C std is just not strict enough on them, and I guess that is
> the reason this would be the first such usage.

I dont feel strongly about this, we could certainly change it.

But these are system calls which have per platform bit order anyway - is
it really an issue? I'd agree that it would be bad for any sort of
persistent or otherwise cross-platform data such as filesystems, network
protocol bits, etc.

We use bitfields in a couple of system calls ABIs already, for example in
PPP:

if_ppp.h-/* For PPPIOCGL2TPSTATS */
if_ppp.h-struct pppol2tp_ioc_stats {
if_ppp.h- __u16 tunnel_id; /* redundant */
if_ppp.h- __u16 session_id; /* if zero, get tunnel stats */
if_ppp.h: __u32 using_ipsec:1; /* valid only for session_id ==

Ingo

2008-12-12 09:00:13

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>

>> Perfmon3 works for all of those 60 machines. This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/
>
>> Who is going to add support for all of those machines? I've spent a lot
>> of developer time getting prefmon going for all of those configurations.
>> But why should I help out with this new inferior proposal? It could all
>> be another waste of time.
>
> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?
>
People have pointed out problems, but you keep forgetting to answer them.

For instance, people have pointed out that your design necessarily implies
pulling into the kernel the event table for all PMU models out there. This
is not just data, this is also complex algorithms to assign events to counters.
The constraints between events can be very tricky to solve. If you get this
wrong, this leads to silent errors, and that is really bad.

Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
the complexity of this. Paul pointed out earlier the complexity on Power.
I can relate to the complexity on Itanium (I implemented all the code in
the user level libpfm for them). Read the Itanium PMU description and I
hope you'll understand.

Event constraints are not going away anytime soon, quite the contrary.

Furthermore, event tables are not always correct. In fact, they are always
bogus. Event semantics vary between steppings. New events show up, others
get removed. Constraints are discovered later on.

If you have all of that in the kernel, it means you'll have to generate a
kernel patch each time. Even if that can be encapsulated into a kernel
module, you will still have problems.

Furthermore, Linux commercial distribution release cycles do not align well
with new processor releases. I can boot my RHEL5 kernel on a Nehalem system
and it would be nice not to have to wait for a new kernel update to get the
full Nehalem PMU event table, so I can program more than the basic 6
architected events of Intel X86.

I know the argument about the fact that you'll have a patch within 24h on
kernel.org. The problem is that no end-user runs a kernel.org kernel,
nobody. Changing the kernel is not an option for many end-users, it may
even require re-certifications for many customers.

I believe many people would like to see how you plan on addressing those issues.

2008-12-12 09:01:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > Peter,
> >
> > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > >> > + /*
> > >> > + * Common hardware events, generalized by the kernel:
> > >> > + */
> > >> > + PERF_COUNT_CYCLES = 0,
> > >> > + PERF_COUNT_INSTRUCTIONS = 1,
> > >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > >> > + PERF_COUNT_CACHE_MISSES = 3,
> > >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > >> > + PERF_COUNT_BRANCH_MISSES = 5,
> > >>
> > >> Many machines do not support these counts. For example, Niagara T1 does
> > >> not have a CYCLES count. And good luck if you think you can easily come
> > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> > >> count with slightly different semantics. This kind of abstraction should
> > >> be done in userspace.
> > >
> > > I'll argue to disagree, sure such events might not be supported by any
> > > particular hardware implementation - but the fact that PAPI gives a list
> > > of 'common' events means that they are, well, common. So unifying them
> > > between those archs that do implement them seems like a sane choice, no?
> > >
> > > For those archs that do not support it, it will just fail to open. No
> > > harm done.
> > >
> > > The proposal allows for you to specify raw hardware events, so you can
> > > just totally ignore this part of the abstraction.
> > >
> > I believe the cache related events do not belong in here. There is no definition
> > for them. You don't know what cache miss level, what kind of access. You cannot
> > do this even on Intel Core processors.
>
> I might agree with that, perhaps we should model this to the common list
> PAPI specifies?

http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html

Has a lot of cache events.

And I can see the use of a set without the L[123] in there, which would
signify either all levels or the lack of more specific knowledge. Like with
PAPI, it's perfectly fine to not support these common events on a
particular hardware platform.

2008-12-12 09:08:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > > Peter,
> > >
> > > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > > >> > + /*
> > > >> > + * Common hardware events, generalized by the kernel:
> > > >> > + */
> > > >> > + PERF_COUNT_CYCLES = 0,
> > > >> > + PERF_COUNT_INSTRUCTIONS = 1,
> > > >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > > >> > + PERF_COUNT_CACHE_MISSES = 3,
> > > >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > > >> > + PERF_COUNT_BRANCH_MISSES = 5,
> > > >>
> > > >> Many machines do not support these counts. For example, Niagara T1 does
> > > >> not have a CYCLES count. And good luck if you think you can easily come
> > > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > > >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> > > >> count with slightly different semantics. This kind of abstraction should
> > > >> be done in userspace.
> > > >
> > > > I'll argue to disagree, sure such events might not be supported by any
> > > > particular hardware implementation - but the fact that PAPI gives a list
> > > > of 'common' events means that they are, well, common. So unifying them
> > > > between those archs that do implement them seems like a sane choice, no?
> > > >
> > > > For those archs that do not support it, it will just fail to open. No
> > > > harm done.
> > > >
> > > > The proposal allows for you to specify raw hardware events, so you can
> > > > just totally ignore this part of the abstraction.
> > > >
> > > I believe the cache related events do not belong in here. There is no definition
> > > for them. You don't know what cache miss level, what kind of access. You cannot
> > > do this even on Intel Core processors.
> >
> > I might agree with that, perhaps we should model this to the common list
> > PAPI specifies?
>
> http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html
>
> Has a lot of cache events.
>
> And I can see the use of a set without the L[123] in there, which would
> signify either all or the lack of more specific knowledge. Like with
> PAPI its perfectly fine to not support these common events on a
> particular hardware platform.

yes, exactly.

A PAPI wrapper on top of this code might even opt to never use any of the
generic types, because it can be well aware of all the CPU types and
their exact event mappings to raw types, and can use those directly.

Different apps like KernelTop might opt to utilize the generic types.

A kernel is all about providing intelligent, generalized access to hw
resources.

Ingo

2008-12-12 09:25:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> Peter,
>
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> >
>
> >> Perfmon3 works for all of those 60 machines. This new proposal works on a
> >> 2 out of the 60.
> >
> > s/works/is implemented/
> >
> >> Who is going to add support for all of those machines? I've spent a lot
> >> of developer time getting prefmon going for all of those configurations.
> >> But why should I help out with this new inferior proposal? It could all
> >> be another waste of time.
> >
> > So much for constructive critisism.. have you tried taking the design to
> > its limits, if so, where do you see problems?
> >
> People have pointed out problems, but you keep forgetting to answer them.

I thought some of that (and surely more to follow) has been
incorporated.

> For instance, people have pointed out that your design necessarily implies
> pulling into the kernel the event table for all PMU models out there. This
> is not just data, this is also complex algorithms to assign events to counters.
> The constraints between events can be very tricky to solve. If you get this
> wrong, this leads to silent errors, and that is really bad.

(well, it's not my design - I'm just trying to see how far we can push it
out of sheer curiosity)

This has to be done anyway, and getting it wrong in userspace is just as
bad, no?

The _ONLY_ technical argument I've seen to do this in userspace is that
these tables and text segments are unswappable in-kernel - which doesn't
count too heavily in my book.

> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> the complexity of this. Paul pointed out earlier the complexity on Power.
> I can relate to the complexity on Itanium (I implemented all the code in
> the user level libpfm for them). Read the Itanium PMU description and I
> hope you'll understand.

Again, I appreciate the fact that multi-dimensional constraint solving
isn't easy. But any which way we turn this thing, it still needs to be
done.

> Events constraints are not going away anytime soon, quite the contrary.
>
> Furthermore, event tables are not always correct. In fact, they are
> always bogus.
> Event semantics varies between steppings. New events shows up, others
> get removed.
> Constraints are discovered later on.
>
> If you have all of that in the kernel, it means you'll have to
> generate a kernel patch each
> time. Even if that can be encapsulated into a kernel module, you will
> still have problems.

How is updating a kernel module (esp one that only contains constraint
tables) more difficult than upgrading a user-space library? That just
doesn't make sense.

> Furthermore, Linux commercial distribution release cycles do not
> align well with new processor
> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> be nice not to have to
> wait for a new kernel update to get the full Nehalem PMU event table,
> so I can program more than
> the basic 6 architected events of Intel X86.

Talking with my community hat on, that is an artificial problem created
by distributions, tell them to fix it.

All it requires is a new kernel module that describes the new chip,
surely that can be shipped as easily as a new library.

> I know the argument about the fact that you'll have a patch with 24h
> on kernel.org. The problem
> is that no end-user runs a kernel.org kernel, nobody. Changing the
> kernel is not an option for
> many end-users, it may even require re-certifications for many customers.
>
> I believe many people would like to see how you plan on addressing those issues.

You're talking to LKML here - we don't care about stuff older than -git
(well, only a little, but not much more beyond n-1).

What we do care about is technical arguments, and last time I checked,
hardware resource scheduling was an OS level job.

But if the PMU control is critical to the enterprise deployment of
$customer, then he would have to re-certify on the library update too.

If its only development phase stuff, then the deployment machines won't
even load the module so there'd be no problem anyway.

Subject: Re: [patch] Performance Counters for Linux, v3

On 12.12.08 10:23:54, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> > For instance, people have pointed out that your design necessarily implies
> > pulling into the kernel the event table for all PMU models out there. This
> > is not just data, this is also complex algorithms to assign events to counters.
> > The constraints between events can be very tricky to solve. If you get this
> > wrong, this leads to silent errors, and that is really bad.
>
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
>
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no?
>
> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.

But there are also no arguments against implementing it in userspace.

> > Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> > the complexity of this. Paul pointed out earlier the complexity on Power.
> > I can relate to the complexity on Itanium (I implemented all the code in
> > the user level libpfm for them). Read the Itanium PMU description and I
> > hope you'll understand.
>
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.

I agree with Stephane. There are already many different PMU
descriptions depending on family, model and stepping, and with *every*
new cpu revision you will get one more update. Implementing this in
the kernel would require kernel updates where otherwise no changes
would be necessary.

If you look at current pmu implementations, there are tons of
description files and code you don't want to have in the kernel.

Also, a profiling tool that needs a certain pmu feature would then depend
on its kernel implementation. (Actually, it is impossible to have
100% implementation coverage.) If the pmu could be programmed from
userspace, the tool could provide the feature itself.

> > Events constraints are not going away anytime soon, quite the contrary.
> >
> > Furthermore, event tables are not always correct. In fact, they are
> > always bogus.
> > Event semantics varies between steppings. New events shows up, others
> > get removed.
> > Constraints are discovered later on.
> >
> > If you have all of that in the kernel, it means you'll have to
> > generate a kernel patch each
> > time. Even if that can be encapsulated into a kernel module, you will
> > still have problems.
>
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.

At least this would require a kernel with modules enabled.

> > Furthermore, Linux commercial distribution release cycles do not
> > align well with new processor
> > releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> > be nice not to have to
> > wait for a new kernel update to get the full Nehalem PMU event table,
> > so I can program more than
> > the basic 6 architected events of Intel X86.
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.

It does not make sense to close our eyes to reality. There are systems
where it is not possible to update the kernel frequently. You probably
have one running yourself.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2008-12-12 10:59:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> I agree with Stephane. There are already many different PMU
> descriptions depending on family, model and steppping and with *every*
> new cpu revision you will get one more update. Implementing this in
> the kernel would require kernel updates where otherwise no changes
> would be necessary.

Please stop the bullshit. You have to update _something_. It makes a
lot of sense to update the thing you need to update anyway for new
hardware support, and not some piece of junk library like libperfmon.

> > Talking with my community hat on, that is an artificial problem created
> > by distributions, tell them to fix it.
>
> It does not make sense to close the eyes to reality. There are systems
> where it is not possible to update the kernel frequently. Probably you
> have one running yourself.

Of course it is. And on many of my systems it's much easier to update a
kernel than a library. A kernel I can build myself; for libraries I'm
more or less reliant on the distro or hacking fugly rpm or debian
packaging bits.

Having HW support in the kernel is a lot easier than in weird libraries.

Subject: Re: [patch] Performance Counters for Linux, v3

On 12.12.08 05:59:38, Christoph Hellwig wrote:
> On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> > I agree with Stephane. There are already many different PMU
> > descriptions depending on family, model and steppping and with *every*
> > new cpu revision you will get one more update. Implementing this in
> > the kernel would require kernel updates where otherwise no changes
> > would be necessary.
>
> Please stop the Bullshit. You have to update _something_. It makes a
> lot of sense to update the thing you need to udpate anyway for new
> hardware support, and not some piece of junk library like libperfmon.

New hardware does not always mean implementing new hardware
support. Sometimes it is sufficient to simply program the same
registers in another way. Why change the kernel for this?

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2008-12-12 13:41:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra <[email protected]> writes:
> On that, I still don't think its a good idea to use bitfields in an ABI.
> The C std is just not strict enough on them,

If you constrain yourself to a single architecture, in practice C
bitfield standards are quite good: e.g. on Linux/x86 it is "everyone
implements what gcc does" (and on linux/ppc "what ppc gcc does").
And the syscall ABI is certainly restricted to one architecture.

-Andi

--
[email protected]

2008-12-12 16:46:43

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:

>>Furthermore, Linux commercial distribution release cycles do not
>>align well with new processor
>>releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>>be nice not to have to
>>wait for a new kernel update to get the full Nehalem PMU event table,
>>so I can program more than
>>the basic 6 architected events of Intel X86.
>
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
>
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.

I have to confess that I haven't had a chance to look at the code. Is
the current proposal set up in such a way as to support loading a module
and having the new description picked up automatically?


>>Changing the
>>kernel is not an option for
>>many end-users, it may even require re-certifications for many customers.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.

Here I agree.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.

It may not have any basis in fact, but in practice it seems like kernel
changes are considered more risky than userspace changes.

As you say though, it's not likely that most production systems would be
running performance monitoring code, so this may only be an issue for
development machines.


Chris

2008-12-12 17:12:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 18:03 +0100, Samuel Thibault wrote:
> Peter Zijlstra, le Fri 12 Dec 2008 09:25:45 +0100, a écrit :
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > > Also, my primary method of using counters is total aggregate count for a
> > > single user-space process.
> >
> > Process, as in single thread, or multi-threaded? I'll assume
> > single-thread.
>
> BTW, just to make sure it is taken into account (I haven't followed the
> thread up to here, just saw a "pid_t" somwhere that alarmed me): for our
> uses, we _do_ need per-kernelthread counters.

Yes, counters are per task - not sure about the exact interface thingy
though - I guess it should be tid_t, but glibc does something a bit weird
there or something.

2008-12-12 17:13:38

by Samuel Thibault

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra, le Fri 12 Dec 2008 09:25:45 +0100, a écrit :
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > Also, my primary method of using counters is total aggregate count for a
> > single user-space process.
>
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

BTW, just to make sure it is taken into account (I haven't followed the
thread up to here, just saw a "pid_t" somewhere that alarmed me): for our
uses, we _do_ need per-kernelthread counters.

Samuel

2008-12-12 17:46:51

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 10:23 AM, Peter Zijlstra <[email protected]> wrote:
>> For instance, people have pointed out that your design necessarily implies
>> pulling into the kernel the event table for all PMU models out there. This
>> is not just data, this is also complex algorithms to assign events to counters.
>> The constraints between events can be very tricky to solve. If you get this
>> wrong, this leads to silent errors, and that is really bad.
>
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
>
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no?
>
Not as bad. If a library is bad, then just don't use the library. In fact,
I know tools which do not even need a library. What is important is that
there is a way to avoid the problem. If the kernel controls this, then
there is no way out.

To remain in your world, look at the Pentium 4 (Netburst) PMU
description, and you'll see that things are already very complicated there.


> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.
>
>> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
>> the complexity of this. Paul pointed out earlier the complexity on Power.
>> I can relate to the complexity on Itanium (I implemented all the code in
>> the user level libpfm for them). Read the Itanium PMU description and I
>> hope you'll understand.
>
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.
>

Yes, but you have lots of ways of doing this at the user level. For all I know,
you could even hardcode the (register, value) pairs in your tool if you
know what you are doing. And don't discount the fact that advanced tools
know what they are doing very precisely.

>> Events constraints are not going away anytime soon, quite the contrary.
>>
>> Furthermore, event tables are not always correct. In fact, they are
>> always bogus.
>> Event semantics varies between steppings. New events shows up, others
>> get removed.
>> Constraints are discovered later on.
>>
>> If you have all of that in the kernel, it means you'll have to
>> generate a kernel patch each
>> time. Even if that can be encapsulated into a kernel module, you will
>> still have problems.
>
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.
>
Go ask end-users what they think of that?

You don't even need a library. All of this could be integrated into the tool.
New processor, just go download the updated version of the tool.
No kernel changes.

>> Furthermore, Linux commercial distribution release cycles do not
>> align well with new processor
>> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>> be nice not to have to
>> wait for a new kernel update to get the full Nehalem PMU event table,
>> so I can program more than
>> the basic 6 architected events of Intel X86.
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
>
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.
>

No, because you need tons of versions of that module based on kernel
versions. People do not recompile kernel modules.

>> I know the argument about the fact that you'll have a patch with 24h
>> on kernel.org. The problem
>> is that no end-user runs a kernel.org kernel, nobody. Changing the
>> kernel is not an option for
>> many end-users, it may even require re-certifications for many customers.
>>
>> I believe many people would like to see how you plan on addressing those issues.
>
> You're talking to LKML here - we don't care about stuff older than -git
> (well, only a little, but not much more beyond n-1).
>
That is why you don't always understand the issues of users, unfortunately.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.
>
Yes, if you get it wrong, applications are screwed.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.
>
No, they just download a new version of the tool.

> If its only development phase stuff, then the deployment machines won't
> even load the module so there'd be no problem anyway.
>
This is not just development stuff anymore.

2008-12-12 18:05:26

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

Given the level of abstraction you are using for the API, and given your
argument that the kernel can do the HW resource scheduling better than
anybody else, what happens in the following test case:

- 2-way system (cpu0, cpu1)

- on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
Event E1 can only be measured on counter C1.

- on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1

- the scheduler decides to migrate P1 onto CPU1. You now have a
conflict on C1.

How is this managed?

2008-12-12 18:14:19

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


On Fri, 12 Dec 2008, Peter Zijlstra wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>
>> Perfmon3 works for all of those 60 machines. This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/

Once you "implement" the new solution for all the machines I listed, it's
going to be just as bad, if not worse, than current perfmon3.

> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?

I have a currently working solution in perfmon3.
I need a pretty strong reason to abandon that.

> I read the above as: I invested a lot of time in something of dubious
> statue (out of tree patch), and now expect it to be merged because I
> have invested in it.

perfmon has been around for years. It's even been in the kernel (in
Itanium form) for years. The perfmon patchset has been posted numerous
times for review to the linux-kernel list. It's not like perfmon was some
sort of secret project sprung on the world last-minute.

I know the way the Linux kernel development works. If some other
performance monitoring implementation does get merged, I will cope and
move on. I'm just trying to help avoid a costly mistake.

>> Also, my primary method of using counters is total aggregate count for a
>> single user-space process.
>
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

No. Multi-thread too.

> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?

No.

I do not use PAPI. PAPI only supports a small subset of counters.

What is needed is a tool for accessing _all_ performance counters on
various machines.

What is _not_ needed is pushing PAPI into kernel space.

> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.

If you can do raw events, then that's enough. There's no need to put some
sort of abstraction level into the kernel. That way lies madness if
you've ever looked at any code that tries to do it.

As others have suggested, check out the P4 PMU documentation.

Vince

2008-12-12 19:46:26

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

stephane eranian wrote:

> What happens in the following test case:
>
> - 2-way system (cpu0, cpu1)
>
> - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> Event E1 can only be measured on counter C1.
>
> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>
> - the scheduler decides to migrate P1 onto CPU1. You now have a
> conflict on C1.
>
> How is this managed?

Prevent the load balancer from moving P1 onto cpu1?

Chris

2008-12-13 11:18:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> In fact, I know tools which do not even need a library.

By your own saying, the problem solved by libperfmon is a hard problem
(and I fully understand that).

Now you say there is software out there that doesn't use libperfmon,
that means they'll have to duplicate that functionality.

And only commercial software has a clear gain by wastefully duplicating
that effort. This means there is an active commercial interest to not
make perfmon the best technical solution there is, which is contrary to
the very thing Linux is about.

What is worse, you defend that:

> Go ask end-users what they think of that?
>
> You don't even need a library. All of this could be integrated into the tool.
> New processor, just go download the updated version of the tool.

No! What people want is their problem fixed - no matter how. That is one
of the powers of FOSS, you can fix your problems in any way suitable.

Would it not be much better if those folks duped into using a binary
only product only had to upgrade their FOSS kernel, instead of possibly
forking over more $$$ for an upgrade?

You have just irrevocably proven to me this needs to go into the kernel,
as the design of perfmon is little more than a GPL circumvention device
- independent of whether you are aware of that or not.

For that I hereby fully NAK perfmon

Nacked-by: Peter Zijlstra <[email protected]>


Subject: Re: [patch] Performance Counters for Linux, v3

On Sat, 13 Dec 2008, Peter Zijlstra wrote:
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

As long as it uses some sort of "module plugin" approach, perhaps coupled to
the firmware loader system to avoid wasting a ton of space with tables for
processors other than the one you need... you could just move all of the
hardware-related parts of perfmon lib into the kernel.

That would close the doors to non-gpl badness.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2008-12-13 17:45:01

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

I don't think you understand what libpfm actually does and therefore
you rush to the wrong conclusion.

At its core, libpfm does NOT know anything about the perfmon kernel API.

I think you missed that, unfortunately.

It is a helper library which helps tool writers solve the event -> code ->
counter assignment problem. That's it. It does not make any perfmon syscall
at ALL to do that. Proof is that people have been using it on Windows, and
I can also use it on MacOS.

Looking at your proposal, you think you won't need such a library and that
the kernel is going to do all this for you. Let's go back to your kerneltop
program:

KernelTop Options (up to 4 event types can be specified):

-e EID --event_id=EID # event type ID [default: 0]
0: CPU cycles
1: instructions
2: cache accesses
3: cache misses
4: branch instructions
5: branch prediction misses
< 0: raw CPU events

Looks like I can do:

$ kerneltop --event_id=-0x510088

You think users are going to come up with 0x510088 out of the blue?

I want to say:

$ kerneltop --event_id=BR_INST_EXEC --plm=user

Where do you think they are going to get that from?

The kernel or a helper user library?

Do not denigrate other people's software without understanding what it does.
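
To make the point concrete, here is a minimal sketch of the kind of mapping
table a helper library (or the tool itself) has to carry. This is purely
illustrative: the single entry just reuses the BR_INST_EXEC / 0x510088
example above, and a real table would be per CPU model and much, much larger.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: maps a symbolic event name to the raw PMU encoding
 * a tool would hand to the kernel.  The one entry reuses the example
 * above; real tables are per CPU model and far larger. */
struct event_desc {
        const char *name;
        uint64_t    raw_code;   /* raw event encoding passed to the kernel */
};

static const struct event_desc event_table[] = {
        { "BR_INST_EXEC", 0x510088 },
};

static int lookup_event(const char *name, uint64_t *code)
{
        size_t i;

        for (i = 0; i < sizeof(event_table) / sizeof(event_table[0]); i++) {
                if (strcmp(event_table[i].name, name) == 0) {
                        *code = event_table[i].raw_code;
                        return 0;
                }
        }
        return -1;      /* unknown event name */
}

int main(void)
{
        uint64_t code;

        if (lookup_event("BR_INST_EXEC", &code) == 0)
                printf("raw event code: 0x%llx\n", (unsigned long long)code);
        return 0;
}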


On Sat, Dec 13, 2008 at 12:17 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
>> In fact, I know tools which do not even need a library.
>
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
>
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
>
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
>
> What is worse, you defend that:
>
>> Go ask end-users what they think of that?
>>
>> You don't even need a library. All of this could be integrated into the tool.
>> New processor, just go download the updated version of the tool.
>
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
>
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
>
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.
>
> For that I hereby fully NAK perfmon
>
> Nacked-by: Peter Zijlstra <[email protected]>
>
>
>
>

2008-12-14 01:02:59

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra writes:

> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > In fact, I know tools which do not even need a library.
>
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
>
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
>
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
>
> What is worse, you defend that:
>
> > Go ask end-users what they think of that?
> >
> > You don't even need a library. All of this could be integrated into the tool.
> > New processor, just go download the updated version of the tool.
>
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
>
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
>
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

I'm sorry, but that is a pretty silly argument.

By that logic, the kernel module loader should include an in-kernel
copy of gcc and binutils, and the fact that it doesn't proves that the
module loader is little more than a GPL circumvention device -
independent of whether you are aware of that or not. 8-)

Paul.

2008-12-14 14:51:03

by Andi Kleen

[permalink] [raw]
Subject: Re: Performance counter API review was [patch] Performance Counters for Linux, v3

Ingo Molnar <[email protected]> writes:

Here are some comments from my (mostly x86) perspective on the interface.
I'm focusing on the interface only, not the code.

- There was a lot of discussion about counter assignment. But an event
actually needs much more meta data than just the counter assignments.
For example here's an event-set out of the upcoming Core i7 oprofile
events file:

event:0xC3 counters:0,1,2,3 um:machine_clears minimum:6000 name:machine_clears : Counts the cycles machine clear is asserted.

and the associated sub unit masks:

name:machine_clears type:bitmask default:0x01
0x01 cycles Counts the cycles machine clear is asserted
0x02 mem_order Counts the number of machine clears due to memory order conflicts
0x04 smc Counts the number of times that a program writes to a code section
0x10 fusion_assist Counts the number of macro-fusion assists


As you can see there is a lot of meta data in there and to my knowledge
none of it is really optional. For example without the name and the description
it's pretty much impossible to use the event (in fact even with description
it is often hard enough to figure out what it means). I think every
non-trivial perfctr user front end will need a way to query name and
description. Where should they be stored?

Then the minimum overflow period is needed (see below).

Counter assignment is needed as discussed earlier: there are some events
that can only go to specific counters, and then there are complications
like fixed event counters and uncore events in separate registers.

Then there is the concept of unit_masks, which define the sub-events.
Right now the single event number does not specify how unit masks
are specified. Unit masks are also complicated because they are
sometimes masks (you can OR them up) and sometimes enumerations (you
can't). To make good use of them the software needs to know the difference.

So these all need to be somewhere. I assume the right place is
not the kernel. I don't think it would be a good idea to duplicate
all of this in every application. So some user space library is needed anyway.
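
To illustrate, here is a rough sketch of the per-event record such a user
space library would end up carrying. The field names are made up for this
sketch; the values just mirror the machine_clears example above.

#include <stdint.h>

/* Sketch only: one possible shape for the per-event meta data.  Field
 * names are invented for illustration; the values mirror the Core i7
 * machine_clears example above. */
struct unit_mask_bit {
        uint8_t     value;            /* e.g. 0x01, 0x02, 0x04, 0x10 */
        const char *name;             /* "cycles", "mem_order", ... */
        const char *description;
};

struct event_meta {
        uint16_t    event_code;       /* 0xC3 */
        uint32_t    counter_mask;     /* which counters can host it */
        uint32_t    min_period;       /* minimum overflow period */
        const char *name;
        const char *description;
        int         umask_is_bitmask; /* bitmask (can be ORed) vs. enumeration */
        const struct unit_mask_bit *umasks;
        unsigned int nr_umasks;
};

static const struct unit_mask_bit machine_clears_umasks[] = {
        { 0x01, "cycles",        "Counts the cycles machine clear is asserted" },
        { 0x02, "mem_order",     "Machine clears due to memory order conflicts" },
        { 0x04, "smc",           "Writes to a code section" },
        { 0x10, "fusion_assist", "Macro-fusion assists" },
};

static const struct event_meta machine_clears = {
        .event_code       = 0xC3,
        .counter_mask     = 0xf,      /* counters 0-3 */
        .min_period       = 6000,
        .name             = "machine_clears",
        .description      = "Counts the cycles machine clear is asserted.",
        .umask_is_bitmask = 1,
        .umasks           = machine_clears_umasks,
        .nr_umasks        = 4,
};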

- All the event meta data should ideally be stored in a single place,
otherwise there is a risk of it getting out of sync. Events are relatively
often updated (even during a CPU life-cycle, when an event is found
to be buggy), so a smooth upgrade procedure is crucial.

- There doesn't seem to be a way to enforce minimum overflow periods.
It's also pretty easy to hang a system by programming too short an
overflow period for a commonly encountered event. For example,
if you program a counter to trigger an NMI every hundred cycles
then the system will not do much useful work anymore.

This might even be a security hazard because the interface is available
to non-root. Solving that one would actually argue for putting at least
some knowledge into the kernel, or always enforcing a minimum safe period.

The minimum safe period has the problem that it might break some
useful tracing setups on low frequency events, where it can be quite
useful to sample on every single event. But on a common event
that's a really bad idea. So it probably needs per-event information.

Hard problem. oprofile avoids it by only allowing root to configure events.

[btw i'm not sure perfmon3 has solved that one either]
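
Just to sketch the "always enforce a minimum safe period" variant (the
structure and the irq_period field name below are assumptions for
illustration, not the posted API):

/* Sketch only: clamp a user-supplied overflow period to a safety floor.
 * The struct and the irq_period field name are assumed here for
 * illustration; the floor value is arbitrary.  As noted above, a single
 * global floor hurts low frequency events, so per-event minimums (like
 * the minimum: field in the oprofile event files) would be better. */
struct hw_event_sketch {
        unsigned long long irq_period;  /* requested events per overflow */
        /* ... */
};

#define MIN_SAFE_IRQ_PERIOD 10000ULL    /* arbitrary floor for this sketch */

static void clamp_irq_period(struct hw_event_sketch *hw_event)
{
        if (hw_event->irq_period && hw_event->irq_period < MIN_SAFE_IRQ_PERIOD)
                hw_event->irq_period = MIN_SAFE_IRQ_PERIOD;
}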

- Split of event into event and unit mask
On x86 events consist of an event number and a unit mask (which
can sometimes be an enumeration, not a mask). It's unclear
right now how the unit mask is specified in the perfctr structure.
While both could be encoded in the type field, that would be clumsy,
requiring special macros. So it likely needs a separate field.
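
For reference, this is roughly how the event number, unit mask and the flag
bits mentioned further down pack into an Intel architectural event-select
register - this is only the hardware encoding, and says nothing about how
the syscall should expose the pieces:

#include <stdint.h>

/* Hardware background only: the Intel architectural PERFEVTSEL layout.
 * This shows why event number and unit mask are separate things, and
 * where the edge/invert/ring-level bits discussed below live. */
static uint64_t build_perfevtsel(uint8_t event, uint8_t umask,
                                 int usr, int os, int edge, int inv,
                                 uint8_t cmask)
{
        uint64_t val = 0;

        val |= event;                     /* bits  7:0   event select          */
        val |= (uint64_t)umask << 8;      /* bits 15:8   unit mask              */
        if (usr)  val |= 1ULL << 16;      /* bit  16     count in ring 3        */
        if (os)   val |= 1ULL << 17;      /* bit  17     count in ring 0        */
        if (edge) val |= 1ULL << 18;      /* bit  18     edge detect            */
        val |= 1ULL << 20;                /* bit  20     APIC interrupt enable  */
        /* bit 21 is the AnyThread bit in arch perfmon v3, not set here */
        val |= 1ULL << 22;                /* bit  22     enable counter         */
        if (inv)  val |= 1ULL << 23;      /* bit  23     invert counter mask    */
        val |= (uint64_t)cmask << 24;     /* bits 31:24  counter mask (cmask)   */

        return val;
}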

- PEBS/Debug Store

Intel/x86 has support for letting the CPU directly log events into a memory
ring buffer with some additional information like register contents. At
first look this could be supported with additional record types. One
issue there is that the record layout is not architectural and varies
with different CPUs. Getting a nice general API out of that might be tricky.
Would each new CPU need a new record type?

Processing PEBS records is also moderately performance critical
(and they can be quite big), so it would be a good idea to have some way
to process them copy-less.

Another issue is that you need to specify the buffer size/overflow threshold
somewhere. Right now there is no way in the API to do that (and the
existing syscall has already quite a lot of arguments). So PEBS would
likely need a new syscall?

- Additional bits. x86 has some more flag bits in the perfctr
registers, like edge triggering or counter inversion. Right now there
doesn't seem to be any way to specify those in the syscall. There are
some events (especially when multiple events are counted together)
which can only be counted by setting those bits. This likely needs to be
controlled by the application.

I suppose adding new fields to perf_counter_hw_event would be possible.

- It's unclear to me why the API has a special NMI mode. To me it looks
like, if NMIs are implemented, they should be the default way.
Or rather, if you have NMI events, why would you ever not use them?
The only exception I can think of would be if the system is known
to have NMI problems in the BIOS, like some ThinkPads. In that case
it shouldn't be per syscall/user controlled though, but some global
root-only knob (ideally set automatically).

- Global tracing. Right now there seem to be two modes: per task and
per CPU. But a common variant is global tracing of all CPUs. While this
could in theory be done right now by attaching to each CPU,
this has the problem that it doesn't interact very well with CPU
hot plug. The application would need to poll for additional/lost
CPUs somehow and then re-attach to them (or detach). This would
likely be quite clumsy and slow. It would be better if the kernel supported
that directly.

An alternative here is to do nothing and keep oprofile for that job
(which it doesn't do that badly).

- Ring 3 vs ring 0.
x86 supports counting only user space or only kernel space. Right
now there is no way to specify that in the syscall interface.
I suppose adding a new field to perf_counter_hw_event would be possible.

- SMT support
Sometimes you want to count events caused by both SMT siblings.
For example this is useful when measuring a multi-threaded
application that uses both threads and you want to see the
shared cache events of both.
In arch perfmon v3 there is a new perfctr "AnyThread" bit
that controls this. It needs to be exposed.

- In general the SMT and shared resource semantics seem to be a
bit unclear recently. Some clarification of that would be good.
What happens when the resource is not available? What are
the reservation semantics?

- Uncore monitoring
Nehalem has some additional performance counters in the Uncore
which count specific uncore events. They have slightly different
semantics and additional registers (like an opcode filter).
It's unclear how they would be programmed in this API.

Also the shared resource problem applies. An uncore is shared
by multiple cores/threads on a socket. Neither a CPU number nor
a pid is particularly useful to address it.

- RDPMC self monitoring
x86 supports reading performance counters from user space
using the RDPMC instruction. I find that rather useful
as a replacement for RDTSC because it allows counting
real cycles using one of the fixed performance counters.

One problem is that it needs to be explicitly enabled and also
controlled, because it always exposes information from
all performance counters (which could be an information
leak). So ideally it needs to cooperate with the kernel
and allow setting up suitable counters for its own use, and also
make sure that counters do not leak information on context
switch. There should be some way in the API to specify that.
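
For reference, user space self-monitoring boils down to something like the
snippet below. Which counter index to pass is exactly the information the
kernel would have to hand out, and the read faults unless CR4.PCE has been
enabled, which is the cooperation problem described above.

#include <stdint.h>

/* Read a performance counter from user space.  counter selects the PMC;
 * fixed-function counters are selected with bit 30 set in the index.
 * This faults unless the kernel has enabled CR4.PCE for user space. */
static inline uint64_t read_pmc(uint32_t counter)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
}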

-Andi

--
[email protected]

2008-12-14 22:38:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* Paul Mackerras <[email protected]> wrote:

> Peter Zijlstra writes:
>
> > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > In fact, I know tools which do not even need a library.
> >
> > By your own saying, the problem solved by libperfmon is a hard problem
> > (and I fully understand that).
> >
> > Now you say there is software out there that doesn't use libperfmon,
> > that means they'll have to duplicate that functionality.
> >
> > And only commercial software has a clear gain by wastefully duplicating
> > that effort. This means there is an active commercial interest to not
> > make perfmon the best technical solution there is, which is contrary to
> > the very thing Linux is about.
> >
> > What is worse, you defend that:
> >
> > > Go ask end-users what they think of that?
> > >
> > > You don't even need a library. All of this could be integrated into the tool.
> > > New processor, just go download the updated version of the tool.
> >
> > No! what people want is their problem fixed - no matter how. That is one
> > of the powers of FOSS, you can fix your problems in any way suitable.
> >
> > Would it not be much better if those folks duped into using a binary
> > only product only had to upgrade their FOSS kernel, instead of possibly
> > forking over more $$$ for an upgrade?
> >
> > You have just irrevocably proven to me this needs to go into the kernel,
> > as the design of perfmon is little more than a GPL circumvention device
> > - independent of whether you are aware of that or not.
>
> I'm sorry, but that is a pretty silly argument.
>
> By that logic, the kernel module loader should include an in-kernel copy
> of gcc and binutils, and the fact that it doesn't proves that the module
> loader is little more than a GPL circumvention device - independent of
> whether you are aware of that or not. 8-)

i'm not sure how your example applies: the kernel module loader is not an
application that needs to be updated to new versions of syscalls. Nor is
it a needless duplication of infrastructure - it runs in a completely
different protection domain - just to name one of the key differences.

Applications going to complex raw syscalls and avoiding a neutral hw
infrastructure library that implements a non-trivial job is quite typical
for FOSS-library-shy bin-only apps. The "you cannot infringe what you do
not link to at all" kind of defensive thinking.

Ingo

2008-12-14 23:14:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3


* stephane eranian <[email protected]> wrote:

> Hi,
>
> Given the level of abstractions you are using for the API, and given
> your argument that the kernel can do the HW resource scheduling better
> than anybody else.
>
> What happens in the following test case:
>
> - 2-way system (cpu0, cpu1)
>
> - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> Event E1 can only be measured on counter C1.
>
> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>
> - the scheduler decides to migrate P1 onto CPU1. You now have a
> conflict on C1.
>
> How is this managed?

If there's a single unit of sharable resource [such as an event counter,
or a physical CPU], then there's just three main possibilities: either
user 1 gets it all, or user 2 gets it all, or they share it.

We've implemented the essence of these variants, with sharing the resource
being the sane default, and with the sysadmin also having a configuration
vector to reserve the resource to himself permanently. (There could be
more variations of this.)

What is your point?

Ingo

2008-12-15 00:37:42

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> * stephane eranian <[email protected]> wrote:
>
> > Hi,
> >
> > Given the level of abstractions you are using for the API, and given
> > your argument that the kernel can do the HW resource scheduling better
> > than anybody else.
> >
> > What happens in the following test case:
> >
> > - 2-way system (cpu0, cpu1)
> >
> > - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> > Event E1 can only be measured on counter C1.
> >
> > - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
> >
> > - the scheduler decides to migrate P1 onto CPU1. You now have a
> > conflict on C1.
> >
> > How is this managed?
>
> If there's a single unit of sharable resource [such as an event counter,
> or a physical CPU], then there's just three main possibilities: either
> user 1 gets it all, or user 2 gets it all, or they share it.
>
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)
>
> What is your point?

Note that Stephane said *counting* event E1.

One of the important things about counting (as opposed to sampling) is
that it matters whether or not the event is being counted the whole
time or only part of the time. Thus it puts constraints on counter
scheduling and reporting that don't apply for sampling.

In other words, if I'm counting an event, I want it to be counted all
the time (i.e. whenever the task is executing, for a per-task counter,
or continuously for a per-cpu counter). If that causes conflicts and
the kernel decides not to count the event for part of the time, that
is very much second-best, and I absolutely need to know that that
happened, and also when the kernel started and stopped counting the
event (so I can scale the result to get some idea what the result
would have been if it had been counted the whole time).
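
Concretely, the scaling I have in mind is just the following (a sketch; the
time_enabled / time_running names are made up here, the point is only that
the kernel has to report both values):

#include <stdint.h>

/* Sketch: extrapolate a partially-counted event, given how long the
 * counter was wanted (time_enabled) and how long it was actually on the
 * PMU (time_running).  Names are illustrative, not an API proposal. */
static uint64_t scale_count(uint64_t raw_count,
                            uint64_t time_enabled,
                            uint64_t time_running)
{
        if (time_running == 0)
                return 0;       /* never scheduled: nothing to scale from */

        /* May overflow for very large counts; a real implementation
         * would use 128-bit arithmetic or floating point. */
        return raw_count * time_enabled / time_running;
}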

Now, I haven't digested V4 yet, so you might have already implemented
something like that. Have you? :)

Paul.

2008-12-15 00:50:53

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:
>
> > Peter Zijlstra writes:
> >
> > > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > > In fact, I know tools which do not even need a library.
> > >
> > > By your own saying, the problem solved by libperfmon is a hard problem
> > > (and I fully understand that).
> > >
> > > Now you say there is software out there that doesn't use libperfmon,
> > > that means they'll have to duplicate that functionality.
> > >
> > > And only commercial software has a clear gain by wastefully duplicating
> > > that effort. This means there is an active commercial interest to not
> > > make perfmon the best technical solution there is, which is contrary to
> > > the very thing Linux is about.
> > >
> > > What is worse, you defend that:
> > >
> > > > Go ask end-users what they think of that?
> > > >
> > > > You don't even need a library. All of this could be integrated into the tool.
> > > > New processor, just go download the updated version of the tool.
> > >
> > > No! what people want is their problem fixed - no matter how. That is one
> > > of the powers of FOSS, you can fix your problems in any way suitable.
> > >
> > > Would it not be much better if those folks duped into using a binary
> > > only product only had to upgrade their FOSS kernel, instead of possibly
> > > forking over more $$$ for an upgrade?
> > >
> > > You have just irrevocably proven to me this needs to go into the kernel,
> > > as the design of perfmon is little more than a GPL circumvention device
> > > - independent of whether you are aware of that or not.
> >
> > I'm sorry, but that is a pretty silly argument.
> >
> > By that logic, the kernel module loader should include an in-kernel copy
> > of gcc and binutils, and the fact that it doesn't proves that the module
> > loader is little more than a GPL circumvention device - independent of
> > whether you are aware of that or not. 8-)
>
> i'm not sure how your example applies: the kernel module loader is not an
> application that needs to be updated to new versions of syscalls. Nor is
> it a needless duplication of infrastructure - it runs in a completely
> different protection domain - just to name one of the key differences.

Peter's argument was in essence that since using perfmon3 involves some
userspace computation that can be done by proprietary software instead
of a GPL'd library (libpfm), that makes perfmon3 a GPL-circumvention
device.

I was trying to point out that that argument is silly by applying it
to the kernel module loader. There the userspace component is gcc and
binutils, and the computation they do can be done alternatively by
proprietary software such as icc or xlc. That of itself doesn't make
the module loader a GPL-circumvention device (though it may be for
other reasons).

And if the argument is silly in that case (which it is), it is even
more silly in the case of perfmon3, where what is being computed and
passed to the kernel is just a few register values, not instructions.

> Applications going to complex raw syscalls and avoiding a neutral hw
> infrastructure library that implements a non-trivial job is quite typical
> for FOSS-library-shy bin-only apps. The "you cannot infringe what you do
> not link to at all" kind of defensive thinking.

FOSS is about freedom - we don't force anyone to use our code. If
someone wants to use their own code instead of glibc or libpfm on the
user-space side of the syscall interface, that's fine.

Paul.

2008-12-15 12:58:29

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 1:37 AM, Paul Mackerras <[email protected]> wrote:
> Ingo Molnar writes:
>
>> * stephane eranian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Given the level of abstractions you are using for the API, and given
>> > your argument that the kernel can do the HW resource scheduling better
>> > than anybody else.
>> >
>> > What happens in the following test case:
>> >
>> > - 2-way system (cpu0, cpu1)
>> >
>> > - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>> > Event E1 can only be measured on counter C1.
>> >
>> > - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>> >
>> > - the scheduler decides to migrate P1 onto CPU1. You now have a
>> > conflict on C1.
>> >
>> > How is this managed?
>>
>> If there's a single unit of sharable resource [such as an event counter,
>> or a physical CPU], then there's just three main possibilities: either
>> user 1 gets it all, or user 2 gets it all, or they share it.
>>
>> We've implemented the essence of these variants, with sharing the resource
>> being the sane default, and with the sysadmin also having a configuration
>> vector to reserve the resource to himself permanently. (There could be
>> more variations of this.)
>>
>> What is your point?
>>
Could you explain what you mean by sharing here?

Are you talking about time multiplexing the counter?

2008-12-15 13:03:16

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 1:50 AM, Paul Mackerras <[email protected]> wrote:

> FOSS is about freedom - we don't force anyone to use our code. If
> someone wants to use their own code instead of glibc or libpfm on the
> user-space side of the syscall interface, that's fine.
>
Exactly right!

That was exactly my point when I said, you are free to not use libpfm
in your tool.

2008-12-15 14:42:42

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 1:37 AM, Paul Mackerras <[email protected]> wrote:
> Ingo Molnar writes:
>
>> * stephane eranian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Given the level of abstractions you are using for the API, and given
>> > your argument that the kernel can do the HW resource scheduling better
>> > than anybody else.
>> >
>> > What happens in the following test case:
>> >
>> > - 2-way system (cpu0, cpu1)
>> >
>> > - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>> > Event E1 can only be measured on counter C1.
>> >
>> > - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>> >
>> > - the scheduler decides to migrate P1 onto CPU1. You now have a
>> > conflict on C1.
>> >
>> > How is this managed?
>>
>> If there's a single unit of sharable resource [such as an event counter,
>> or a physical CPU], then there's just three main possibilities: either
>> user 1 gets it all, or user 2 gets it all, or they share it.
>>
>> We've implemented the essence of these variants, with sharing the resource
>> being the sane default, and with the sysadmin also having a configuration
>> vector to reserve the resource to himself permanently. (There could be
>> more variations of this.)
>>
>> What is your point?
>
> Note that Stephane said *counting* event E1.
>
> One of the important things about counting (as opposed to sampling) is
> that it matters whether or not the event is being counted the whole
> time or only part of the time. Thus it puts constraints on counter
> scheduling and reporting that don't apply for sampling.
>
Paul is right.

> In other words, if I'm counting an event, I want it to be counted all
> the time (i.e. whenever the task is executing, for a per-task counter,
> or continuously for a per-cpu counter). If that causes conflicts and
> the kernel decides not to count the event for part of the time, that
> is very much second-best, and I absolutely need to know that that
> happened, and also when the kernel started and stopped counting the
> event (so I can scale the result to get some idea what the result
> would have been if it had been counted the whole time).
>
That is very true.

You cannot multiplex events onto counters without applications knowing.
They need to know how long each 'set' has been active. This is needed
to scale the results. This is especially true for cpu-wide measurements.

2008-12-15 14:50:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <[email protected]> wrote:
> stephane eranian wrote:
>
>> What happens in the following test case:
>>
>> - 2-way system (cpu0, cpu1)
>>
>> - on cpu0, two processes P1, P2, each self-monitoring and counting event
>> E1.
>> Event E1 can only be measured on counter C1.
>>
>> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>> C1
>>
>> - the scheduler decides to migrate P1 onto CPU1. You now have a
>> conflict on C1.
>>
>> How is this managed?
>
> Prevent the load balancer from moving P1 onto cpu1?
>
You don't want to do that.

There was a reason why the scheduler decided to move the task. Now,
because of monitoring, you would change the behavior of the task and the
scheduler. Monitoring should be unintrusive. You want the task/scheduler
to behave as if no monitoring was present, otherwise what is it you are
actually measuring?

Changing or forcing the affinity because of monitoring is also a bad
idea, for the same reason.

2008-12-15 20:58:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 12:13 AM, Ingo Molnar <[email protected]> wrote:
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)
>
Reading the v4 code, it does not appear the sysadmin can specify which
resource to reserve. The current code reserves a number of counters.
This is problematic with hardware where not all counters can measure
everything, or when not all PMU registers are counters.

2008-12-15 22:33:08

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

stephane eranian wrote:
> On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <[email protected]> wrote:
>
>>stephane eranian wrote:
>>
>>
>>>What happens in the following test case:
>>>
>>> - 2-way system (cpu0, cpu1)
>>>
>>> - on cpu0, two processes P1, P2, each self-monitoring and counting event
>>>E1.
>>> Event E1 can only be measured on counter C1.
>>>
>>> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>>>C1
>>>
>>> - the scheduler decides to migrate P1 onto CPU1. You now have a
>>>conflict on C1.
>>>
>>>How is this managed?
>>
>>Prevent the load balancer from moving P1 onto cpu1?
>>
>
> You don't want to do that.
>
> There was a reason why the scheduler decided to move the task.
> Now, because of monitoring, you would change the behavior of the task
> and the scheduler.
> Monitoring should be unintrusive. You want the task/scheduler to
> behave as if no monitoring was present, otherwise what is it you are
> actually measuring?

In a scenario where the system physically cannot gather the desired data
without influencing the behaviour of the program, I see two options:

1) limit the behaviour of the system to ensure that we can gather the
performance monitoring data as specified

2) limit the performance monitoring to minimize any influence on the
program, and report the fact that performance monitoring was limited.

You've indicated that you don't want option 1, so I assume that you
prefer option 2. In the above scenario, how would _you_ handle it?


Chris

2008-12-15 22:54:25

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> If there's a single unit of sharable resource [such as an event counter,
> or a physical CPU], then there's just three main possibilities: either
> user 1 gets it all, or user 2 gets it all, or they share it.
>
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)

Thinking about this a bit more, it seems to me that there is an
unstated assumption that dealing with performance counters is mostly a
scheduling problem - that the hardware resource of a fixed number of
performance counters can be virtualized to provide a larger number of
software counters in much the same way that a fixed number of physical
cpus are virtualized to support a larger number of tasks.

Put another way, your assumption seems to be that software counters
can be transparently time-multiplexed onto the physical counters,
without affecting the end results. In other words, you assume that
time-multiplexing is a reasonable way to implement sharing of hardware
performance counters, and that users shouldn't have to know or care
that their counters are being time-multiplexed. Is that an accurate
statement of your belief?

If it is (and the code you've posted seems to indicate that it is)
then you are going to have unhappy users, because counting part of the
time is not at all the same thing as counting all the time. As just
one example, imagine that the period over which you are counting is
shorter than the counter timeslice period (for example because the
executable you are measuring doesn't run for very long). If you have
N software counters but only M < N hardware counters, then only the
first M software counters will report anything useful, and the
remaining N - M will report zero!
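
A toy model makes this failure mode concrete. The round-robin policy,
timeslice and run length below are invented purely for illustration;
this is not how the posted patches schedule counters.

/*
 * Toy round-robin multiplexing model, illustrative only.
 */
#include <stdio.h>

#define N_SOFTWARE   4        /* counters the user asked for            */
#define M_HARDWARE   2        /* counters the PMU actually has          */
#define TIMESLICE_MS 100      /* rotation period                        */
#define RUNTIME_MS   50       /* program exits before the first rotation */

int main(void)
{
	unsigned long active_ms[N_SOFTWARE] = { 0 };
	int groups = (N_SOFTWARE + M_HARDWARE - 1) / M_HARDWARE;

	for (int t = 0; t < RUNTIME_MS; t++) {
		/* which group of M counters is on the PMU right now? */
		int group = (t / TIMESLICE_MS) % groups;

		for (int i = 0; i < M_HARDWARE; i++) {
			int sw = group * M_HARDWARE + i;
			if (sw < N_SOFTWARE)
				active_ms[sw]++;
		}
	}

	for (int i = 0; i < N_SOFTWARE; i++)
		printf("software counter %d: live for %lu ms\n",
		       i, active_ms[i]);
	/* counters 2 and 3 print 0 ms: they were never scheduled */
	return 0;
}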

Sampling, as opposed to counting, may be more tolerant of
time-multiplexing of counters, particularly for long-running programs,
but even there time-multiplexing will affect the results and users
need to know about it.

It seems to me that this assumption is pretty deeply rooted in the
design of your performance counter subsystem, and I'm not sure at this
point what is the best way to fix it.

Paul.

2008-12-17 07:45:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v3

On Mon, Dec 15, 2008 at 11:32 PM, Chris Friesen <[email protected]> wrote:
> stephane eranian wrote:
>>
>> On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <[email protected]>
>> wrote:
>>
>>> stephane eranian wrote:
>>>
>>>
>>>> What happens in the following test case:
>>>>
>>>> - 2-way system (cpu0, cpu1)
>>>>
>>>> - on cpu0, two processes P1, P2, each self-monitoring and counting
>>>> event
>>>> E1.
>>>> Event E1 can only be measured on counter C1.
>>>>
>>>> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>>>> C1
>>>>
>>>> - the scheduler decides to migrate P1 onto CPU1. You now have a
>>>> conflict on C1.
>>>>
>>>> How is this managed?
>>>
>>> Prevent the load balancer from moving P1 onto cpu1?
>>>
>>
>> You don't want to do that.
>>
>> There was a reason why the scheduler decided to move the task.
>> Now, because of monitoring, you would change the behavior of the task
>> and the scheduler.
>> Monitoring should be unintrusive. You want the task/scheduler to
>> behave as if no monitoring was present, otherwise what is it you are
>> actually measuring?
>
> In a scenario where the system physically cannot gather the desired data
> without influencing the behaviour of the program, I see two options:
>
> 1) limit the behaviour of the system to ensure that we can gather the
> performance monitoring data as specified
>
> 2) limit the performance monitoring to minimize any influence on the
> program, and report the fact that performance monitoring was limited.
>
> You've indicated that you don't want option 1, so I assume that you prefer
> option 2. In the above scenario, how would _you_ handle it?
>
That's right, you have to fail monitoring.

In this particular example, it is okay for the per-thread sessions to
each use C1. Any cpu-wide session trying to access C1 should then fail.
Vice versa, if a cpu-wide session is using C1, then no per-thread
session can access it.

Things can get even more complicated than that, even for per-thread
sessions. Some PMU registers may be shared per core, e.g., on Nehalem
or Pentium 4. Thus, if HT is enabled, you also have to fail per-thread
sessions, as only one of them can grab the shared resource globally.
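
Expressed as code, the admission check implied by these rules might
look like the sketch below. All of the types, fields and helper names
are hypothetical; nothing like this exists in the posted patches.

/*
 * Sketch of the admission rules described above, under assumed data
 * structures.  The session types, the per-core sharing flag and the
 * helper name are all hypothetical.
 */
#include <stdbool.h>

enum session_type { SESSION_PER_THREAD, SESSION_CPU_WIDE };

struct counter_state {
	int  nr_thread_users;   /* per-thread sessions currently using C1   */
	bool cpu_wide_in_use;   /* a cpu-wide session owns C1               */
	bool shared_per_core;   /* e.g. core-shared registers with HT on    */
	bool core_in_use;       /* the shared per-core register is taken    */
};

/* Return true if a new session may use this counter, false => fail it. */
static bool may_use_counter(const struct counter_state *c, enum session_type t)
{
	if (t == SESSION_CPU_WIDE)
		/* cpu-wide needs exclusive access: no users of any kind */
		return !c->cpu_wide_in_use && c->nr_thread_users == 0;

	/* per-thread: excluded by a cpu-wide owner ... */
	if (c->cpu_wide_in_use)
		return false;

	/* ... and, for core-shared registers with HT, by the sibling thread */
	if (c->shared_per_core && c->core_in_use)
		return false;

	return true;
}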