2008-12-14 21:29:41

by Ingo Molnar

Subject: [patch] Performance Counters for Linux, v4

We are pleased to announce the v4 release of our performance counters
subsystem implementation. The kernel changes can be picked up from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

(It is also available in the master branch; a kernel patch is attached
below as well.)

The biggest new feature in this release is the implementation of
"performance counter inheritance" for per-task counters: the ability
to extend performance counters to cover the execution of child tasks
as well, transparently and automatically - following them to other CPUs.

This can be used to monitor a hierarchy of tasks without stopping them
(or impacting them in any observable way), and to extend that monitoring
to all of their child tasks as well.
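
For illustration, here is a minimal user-space sketch of how such an
inherited per-task counter can be requested through the new syscall. The
struct layout, field names and syscall numbers (333 on 32-bit x86, 295 on
x86-64) are taken from the patch below; the stdint-based re-declaration,
the raw syscall() wrapper and the use of -1 for "no counter group" are
illustrative assumptions (there is no glibc stub yet), and this is not
the actual timec code:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors struct perf_counter_hw_event from include/linux/perf_counter.h: */
struct perf_counter_hw_event {
	int64_t		type;			/* e.g. PERF_COUNT_INSTRUCTIONS == 1 */
	uint64_t	irq_period;
	uint32_t	record_type;		/* 0 == PERF_RECORD_SIMPLE */

	uint32_t	disabled     :  1,	/* off by default */
			nmi          :  1,	/* NMI sampling */
			raw          :  1,	/* raw event type */
			inherit      :  1,	/* children inherit it */
			__reserved_1 : 28;

	uint64_t	__reserved_2;
};

#ifndef __NR_perf_counter_open
# define __NR_perf_counter_open 295	/* x86-64; 333 on 32-bit x86 */
#endif

int main(void)
{
	struct perf_counter_hw_event hw_event = {
		.type	 = 1,	/* PERF_COUNT_INSTRUCTIONS */
		.inherit = 1,	/* follow child tasks, too */
	};
	uint64_t count;
	int fd;

	/* pid == 0: current task, cpu == -1: count on all CPUs: */
	fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
	if (fd < 0) {
		perror("perf_counter_open");
		return 1;
	}

	/* ... fork(), exec() and wait for the monitored workload here ... */

	/* A simple counter reads out as a single u64: */
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("instructions (task + children): %llu\n",
		       (unsigned long long)count);

	close(fd);
	return 0;
}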

We've written a new utility: 'timec', which takes advantage of this new
kernel capability:

http://redhat.com/~mingo/perfcounters/timec.c

'timec' works like /usr/bin/time, but it extends the dimension of "time"
with all the metrics that hardware and software performance counters are
able to capture.

For example, a full kernel build's statistics on a 16-way x86 box are:

$ timec -e -5,-4,-3,1,2,3,5 make -j32 bzImage

Performance counter stats for 'make':

   142420.882  task clock ticks (millisecs)

      9951033  pagefaults       (events)
       302628  context switches (events)
        57810  CPU migrations   (events)
 208439082509  instructions     (events)
    657918810  cache references (events)
    120243697  cache misses     (events)
   3134162468  branch misses    (events)

The CPU elapsed time metric is also a lot more accurate and fine-grained
than what /usr/bin/time is able to offer. (The counter measures
nanoseconds; timec displays it with microsecond resolution.)

Or here are the stats for a 'hackbench' run that creates and tears down
a web of 1000 tasks within 3 seconds:

$ timec -e -5,-4,-3,0,1,3,5 ./hackbench 50

Time: 3.073

Performance counter stats for './hackbench':

   23340.852  task clock ticks (millisecs)

       13173  CPU migrations   (events)
       45992  context switches (events)
       80053  pagefaults       (events)
 57653910508  cycles           (events)
 43456427830  instructions     (events)
    93659475  cache misses     (events)
   901367477  branch misses    (events)

These stats include the summed-up counter stats of all the sub-tasks,
propagated back and added to the original counter - without slowing
down the workload. (I could not measure any clearly observable impact
even on context-switch-heavy workloads.)

Another change in v4 is the addition of more types of "software
counters":

PERF_COUNT_CPU_CLOCK = -1,
PERF_COUNT_TASK_CLOCK = -2,
PERF_COUNT_PAGE_FAULTS = -3,
PERF_COUNT_CONTEXT_SWITCHES = -4,
PERF_COUNT_CPU_MIGRATIONS = -5,

The per-task page-fault and context-switch metrics are now implemented,
as is an nr-of-CPU-migrations scheduler metric. These software counters
are available on all CPUs, regardless of hardware perf-counter support.
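
For illustration, a software counter is requested exactly like a hardware
one - only the (negative) hw_event.type value differs. A minimal fragment,
reusing the struct declaration and raw-syscall wrapper from the sketch
above (so not self-contained, and again not actual timec code):

	/* Count context switches of the current task and - via counter
	 * inheritance - of all child tasks it creates: */
	struct perf_counter_hw_event sw_event = {
		.type	 = -4,	/* PERF_COUNT_CONTEXT_SWITCHES */
		.inherit = 1,
	};

	/* pid == 0: current task, cpu == -1: all CPUs, no counter group: */
	int fd = syscall(__NR_perf_counter_open, &sw_event, 0, -1, -1);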

There's an updated KernelTop profiler as well:

http://redhat.com/~mingo/perfcounters/kerneltop.c

There's a new high-pass filter option (--filter) that is useful for
filtering out rare entries from weighted histograms:

------------------------------------------------------------------------------
KernelTop: 9152 irqs/sec [NMI, cache-misses/cache-refs], (all, 16 CPUs)
------------------------------------------------------------------------------

 weight    events        RIP            kernel function
 ______    ______   ________________   _______________

    2.2     17028 - ffffffff802b3f3f : check_bytes_and_report
    1.4      8630 - ffffffff802b466c : init_object
    1.3       718 - ffffffff802b4031 : check_object
    1.2      2050 - ffffffff8037ae10 : copy_user_generic_string!
    1.2      6631 - ffffffff802b3d7e : slab_pad_check
    1.2      1078 - ffffffff8037ea9a : _raw_spin_lock
    1.1       864 - ffffffff802b6250 : kfree
    1.1       948 - ffffffff802b55e6 : new_slab
    1.0      1001 - ffffffff8055b1c2 : unix_stream_recvmsg
    1.0      6266 - ffffffff802b3b20 : on_freelist
    1.0       588 - ffffffff802b596d : __slab_free
    1.0       956 - ffffffff802602a2 : lock_release
    0.8       571 - ffffffff802b3ead : check_slab
    0.5      3353 - ffffffff8025e8c8 : __lock_acquire

A number of bugfixes and small tweaks are also included in this release.

There's a handful of updates to the core perfcounter code as well, see
the Git history for details.

Hardware support is still limited to Core2 and later CPUs - but the
design itself is not limited to these CPU types, and help with more CPU
drivers is welcome :-)

Thanks,

Ingo, Thomas

------------------>

The latest perfcounters/core git tree can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

------------------>
Ingo Molnar (27):
performance counters: documentation
performance counters: x86 support
x86, perfcounters: read out MSR_CORE_PERF_GLOBAL_STATUS with counters disabled
perfcounters: select ANON_INODES
perfcounters, x86: simplify disable/enable of counters
perfcounters, x86: clean up debug code
perfcounters: consolidate global-disable codepaths
perf counters: restructure the API
perf counters: add support for group counters
perf counters: group counter, fixes
perf counters: hw driver API
perf counters: implement PERF_COUNT_CPU_CLOCK
perf counters: consolidate hw_perf save/restore APIs
perf counters: implement PERF_COUNT_TASK_CLOCK
perf counters: add prctl interface to disable/enable counters
perf counters: clean up state transitions
perf counters: update docs
x86: implement atomic64_t on 32-bit
perfcounters: restructure x86 counter math
perfcounters: implement "counter inheritance"
perfcounters: fix task clock counter
perfcounters: add context switch counter
perfcounters: add task migrations counter
perfcounters: add nr-of-faults counter
perfcounters: fix non-intel-perfmon CPUs
perfcounters, x86: fix sw counters on non-PMC CPUs
perfcounters: fix lapic initialization

Thomas Gleixner (4):
performance counters: core code
perf counters: protect them against CSTATE transitions
perf counters: clean up 'raw' type API
perf counters: expand use of counter->event


Documentation/perf-counters.txt | 147 +++
arch/x86/Kconfig | 1 +
arch/x86/ia32/ia32entry.S | 3 +-
arch/x86/include/asm/atomic_32.h | 218 ++++
arch/x86/include/asm/hardirq_32.h | 1 +
arch/x86/include/asm/hw_irq.h | 2 +
arch/x86/include/asm/intel_arch_perfmon.h | 34 +-
arch/x86/include/asm/irq_vectors.h | 5 +
arch/x86/include/asm/mach-default/entry_arch.h | 5 +
arch/x86/include/asm/pda.h | 1 +
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 3 +-
arch/x86/kernel/apic.c | 2 +
arch/x86/kernel/cpu/Makefile | 12 +-
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/cpu/perf_counter.c | 587 +++++++++
arch/x86/kernel/entry_64.S | 5 +
arch/x86/kernel/irq.c | 5 +
arch/x86/kernel/irqinit_32.c | 3 +
arch/x86/kernel/irqinit_64.c | 5 +
arch/x86/kernel/signal.c | 7 +-
arch/x86/kernel/syscall_table_32.S | 1 +
drivers/acpi/processor_idle.c | 8 +
drivers/char/sysrq.c | 2 +
include/linux/perf_counter.h | 257 ++++
include/linux/prctl.h | 3 +
include/linux/sched.h | 12 +-
include/linux/syscalls.h | 8 +
init/Kconfig | 30 +
kernel/Makefile | 1 +
kernel/exit.c | 7 +-
kernel/fork.c | 1 +
kernel/perf_counter.c | 1542 ++++++++++++++++++++++++
kernel/sched.c | 31 +-
kernel/sys.c | 7 +
kernel/sys_ni.c | 3 +
37 files changed, 2939 insertions(+), 27 deletions(-)
create mode 100644 Documentation/perf-counters.txt
create mode 100644 arch/x86/kernel/cpu/perf_counter.c
create mode 100644 include/linux/perf_counter.h
create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..fddd321
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,147 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cachemisses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, counter
+groups, and it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
+ pid_t pid, int cpu, int group_fd);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'perf_counter_hw_event' is:
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ __reserved_1 : 29;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ /*
+ * Future software events:
+ */
+ /* PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4, */
+};
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+More hw_event_types are supported as well, but they are CPU
+specific and are enumerated via /sys on a per CPU basis. Raw hw event
+types can be passed in under hw_event.type if hw_event.raw is 1.
+For example, to count "External bus cycles while bus lock signal asserted"
+events on Intel Core CPUs, pass in a 0x4064 event type value and set
+hw_event.raw to 1.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};
+
+A "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'irq_period' parameter is the number of events before waking up
+a read() that is blocked on a counter fd. A zero value means a non-blocking
+counter.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
+Group counters are created by passing in a group_fd of another counter.
+Groups are scheduled at once and can be used with PERF_RECORD_GROUP
+to record multi-dimensional timestamps.
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4d4cb7..fe94490 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -643,6 +643,7 @@ config X86_UP_IOAPIC
config X86_LOCAL_APIC
def_bool y
depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+ select HAVE_PERF_COUNTERS if (!M386 && !M486)

config X86_IO_APIC
def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..3c14ed0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
.quad compat_sys_signalfd4
.quad sys_eventfd2
.quad sys_epoll_create1
- .quad sys_dup3 /* 330 */
+ .quad sys_dup3 /* 330 */
.quad sys_pipe2
.quad sys_inotify_init1
+ .quad sys_perf_counter_open
ia32_syscall_end:
diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h
index ad5b9f6..9927e01 100644
--- a/arch/x86/include/asm/atomic_32.h
+++ b/arch/x86/include/asm/atomic_32.h
@@ -255,5 +255,223 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u)
#define smp_mb__before_atomic_inc() barrier()
#define smp_mb__after_atomic_inc() barrier()

+/* A 64-bit atomic type */
+
+typedef struct {
+ unsigned long long counter;
+} atomic64_t;
+
+#define ATOMIC64_INIT(val) { (val) }
+
+/**
+ * atomic64_read - read atomic64 variable
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically reads the value of @v.
+ * Doesn't imply a read memory barrier.
+ */
+#define __atomic64_read(ptr) ((ptr)->counter)
+
+static inline unsigned long long
+cmpxchg8b(unsigned long long *ptr, unsigned long long old, unsigned long long new)
+{
+ asm volatile(
+
+ LOCK_PREFIX "cmpxchg8b (%[ptr])\n"
+
+ : "=A" (old)
+
+ : [ptr] "D" (ptr),
+ "A" (old),
+ "b" (ll_low(new)),
+ "c" (ll_high(new))
+
+ : "memory");
+
+ return old;
+}
+
+static inline unsigned long long
+atomic64_cmpxchg(atomic64_t *ptr, unsigned long long old_val,
+ unsigned long long new_val)
+{
+ return cmpxchg8b(&ptr->counter, old_val, new_val);
+}
+
+/**
+ * atomic64_set - set atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ * @new_val: value to assign
+ *
+ * Atomically sets the value of @ptr to @new_val.
+ */
+static inline void atomic64_set(atomic64_t *ptr, unsigned long long new_val)
+{
+ unsigned long long old_val;
+
+ do {
+ old_val = atomic_read(ptr);
+ } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val);
+}
+
+/**
+ * atomic64_read - read atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically reads the value of @ptr and returns it.
+ */
+static inline unsigned long long atomic64_read(atomic64_t *ptr)
+{
+ unsigned long long curr_val;
+
+ do {
+ curr_val = __atomic64_read(ptr);
+ } while (atomic64_cmpxchg(ptr, curr_val, curr_val) != curr_val);
+
+ return curr_val;
+}
+
+/**
+ * atomic64_add_return - add and return
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr and returns @delta + *@ptr
+ */
+static inline unsigned long long
+atomic64_add_return(unsigned long long delta, atomic64_t *ptr)
+{
+ unsigned long long old_val, new_val;
+
+ do {
+ old_val = atomic_read(ptr);
+ new_val = old_val + delta;
+
+ } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val);
+
+ return new_val;
+}
+
+static inline long atomic64_sub_return(unsigned long long delta, atomic64_t *ptr)
+{
+ return atomic64_add_return(-delta, ptr);
+}
+
+static inline long atomic64_inc_return(atomic64_t *ptr)
+{
+ return atomic64_add_return(1, ptr);
+}
+
+static inline long atomic64_dec_return(atomic64_t *ptr)
+{
+ return atomic64_sub_return(1, ptr);
+}
+
+/**
+ * atomic64_add - add integer to atomic64 variable
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr.
+ */
+static inline void atomic64_add(unsigned long long delta, atomic64_t *ptr)
+{
+ atomic64_add_return(delta, ptr);
+}
+
+/**
+ * atomic64_sub - subtract the atomic64 variable
+ * @delta: integer value to subtract
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically subtracts @delta from @ptr.
+ */
+static inline void atomic64_sub(unsigned long long delta, atomic64_t *ptr)
+{
+ atomic64_add(-delta, ptr);
+}
+
+/**
+ * atomic64_sub_and_test - subtract value from variable and test result
+ * @delta: integer value to subtract
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically subtracts @delta from @ptr and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int
+atomic64_sub_and_test(unsigned long long delta, atomic64_t *ptr)
+{
+ unsigned long long old_val = atomic64_sub_return(delta, ptr);
+
+ return old_val == 0;
+}
+
+/**
+ * atomic64_inc - increment atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically increments @ptr by 1.
+ */
+static inline void atomic64_inc(atomic64_t *ptr)
+{
+ atomic64_add(1, ptr);
+}
+
+/**
+ * atomic64_dec - decrement atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically decrements @ptr by 1.
+ */
+static inline void atomic64_dec(atomic64_t *ptr)
+{
+ atomic64_sub(1, ptr);
+}
+
+/**
+ * atomic64_dec_and_test - decrement and test
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically decrements @ptr by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static inline int atomic64_dec_and_test(atomic64_t *ptr)
+{
+ return atomic64_sub_and_test(1, ptr);
+}
+
+/**
+ * atomic64_inc_and_test - increment and test
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically increments @ptr by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic64_inc_and_test(atomic64_t *ptr)
+{
+ return atomic64_sub_and_test(-1, ptr);
+}
+
+/**
+ * atomic64_add_negative - add and test if negative
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static inline int
+atomic64_add_negative(unsigned long long delta, atomic64_t *ptr)
+{
+ long long old_val = atomic64_add_return(delta, ptr);
+
+ return old_val < 0;
+}
+
#include <asm-generic/atomic.h>
#endif /* _ASM_X86_ATOMIC_32_H */
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index cf7954d..7a07897 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int apic_perf_irqs; /* arch dependent */
unsigned int irq0_irqs;
unsigned int irq_resched_count;
unsigned int irq_call_count;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 8de644b..aa93e53 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
/* Interrupt handlers registered during init_IRQ */
extern void apic_timer_interrupt(void);
extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
index fa0fd06..71598a9 100644
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ b/arch/x86/include/asm/intel_arch_perfmon.h
@@ -1,22 +1,24 @@
#ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
#define _ASM_X86_INTEL_ARCH_PERFMON_H

-#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
+#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1 0xc2

-#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1 0x187
+#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1 0x187

-#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)
+#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)

-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL (0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL 0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 0
#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
- (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+ (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6

union cpuid10_eax {
struct {
@@ -28,4 +30,12 @@ union cpuid10_eax {
unsigned int full;
};

+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void) { }
+static inline void perf_counters_lapic_init(int nmi) { }
+#endif
+
#endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 0005adb..b8d277f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
#define LOCAL_TIMER_VECTOR 0xef

/*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR 0xee
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
* a much simpler SMP time architecture:
*/
#ifdef CONFIG_X86_LOCAL_APIC
+
BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)

+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
#ifdef CONFIG_X86_MCE_P4THERMAL
BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
#endif
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..90a8d9d 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
short isidle;
struct mm_struct *active_mm;
unsigned apic_timer_irqs;
+ unsigned apic_perf_irqs;
unsigned irq0_irqs;
unsigned irq_resched_count;
unsigned irq_call_count;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e44d379..810bf26 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,6 +80,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS 11 /* notify perf counter work */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -103,6 +104,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS (1 << TIF_PERF_COUNTERS)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+ (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_perf_counter_open 333

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
__SYSCALL(__NR_pipe2, sys_pipe2)
#define __NR_inotify_init1 294
__SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open 295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 1771dd7..0579ec1 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
#include <linux/dmi.h>
#include <linux/dmar.h>

+#include <asm/intel_arch_perfmon.h>
#include <asm/atomic.h>
#include <asm/smp.h>
#include <asm/mtrr.h>
@@ -1143,6 +1144,7 @@ void __cpuinit setup_local_APIC(void)
apic_write(APIC_ESR, 0);
}
#endif
+ perf_counters_lapic_init(0);

preempt_disable();

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82ec607..89e5336 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
#
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
#

obj-y := intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64) += centaur_64.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_X86_MCE) += mcheck/
-obj-$(CONFIG_MTRR) += mtrr/
-obj-$(CONFIG_CPU_FREQ) += cpufreq/
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE) += mcheck/
+obj-$(CONFIG_MTRR) += mtrr/
+obj-$(CONFIG_CPU_FREQ) += cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o

quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..4461011 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
#include <asm/pat.h>
#include <asm/asm.h>
#include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
#else
vgetcpu_set_mode();
#endif
+ init_hw_perf_counters();
}

void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..8a154bd
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,587 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/module.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_hw_counters __read_mostly;
+static u32 perf_counter_mask __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_HW_COUNTERS 8
+
+struct cpu_hw_counters {
+ struct perf_counter *counters[MAX_HW_COUNTERS];
+ unsigned long used[BITS_TO_LONGS(MAX_HW_COUNTERS)];
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+const int intel_perfmon_event_map[] =
+{
+ [PERF_COUNT_CYCLES] = 0x003c,
+ [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
+ [PERF_COUNT_CACHE_MISSES] = 0x412e,
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
+ [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Propagate counter elapsed time into the generic counter.
+ * Can only be executed on the CPU where the counter is active.
+ * Returns the delta events processed.
+ */
+static void
+x86_perf_counter_update(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ u64 prev_raw_count, new_raw_count, delta;
+
+ WARN_ON_ONCE(counter->state != PERF_COUNTER_STATE_ACTIVE);
+ /*
+ * Careful: an NMI might modify the previous counter value.
+ *
+ * Our tactic to handle this is to first atomically read and
+ * exchange a new raw count - then add that new-prev delta
+ * count to the generic counter atomically:
+ */
+again:
+ prev_raw_count = atomic64_read(&hwc->prev_count);
+ rdmsrl(hwc->counter_base + idx, new_raw_count);
+
+ if (atomic64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count)
+ goto again;
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (counter-)time and add that to the generic counter.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count, so we do that by clipping the delta to 32 bits:
+ */
+ delta = (u64)(u32)((s32)new_raw_count - (s32)prev_raw_count);
+ WARN_ON_ONCE((int)delta < 0);
+
+ atomic64_add(delta, &counter->count);
+ atomic64_sub(delta, &hwc->period_left);
+}
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+static int __hw_perf_counter_init(struct perf_counter *counter)
+{
+ struct perf_counter_hw_event *hw_event = &counter->hw_event;
+ struct hw_perf_counter *hwc = &counter->hw;
+
+ if (unlikely(!perf_counters_initialized))
+ return -EINVAL;
+
+ /*
+ * Count user events, and generate PMC IRQs:
+ * (keep 'enabled' bit clear for now)
+ */
+ hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+ /*
+ * If privileged enough, count OS events too, and allow
+ * NMI events as well:
+ */
+ hwc->nmi = 0;
+ if (capable(CAP_SYS_ADMIN)) {
+ hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (hw_event->nmi)
+ hwc->nmi = 1;
+ }
+
+ hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+ hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+
+ hwc->irq_period = hw_event->irq_period;
+ /*
+ * Intel PMCs cannot be accessed sanely above 32 bit width,
+ * so we install an artificial 1<<31 period regardless of
+ * the generic counter period:
+ */
+ if ((s64)hwc->irq_period <= 0 || hwc->irq_period > 0x7FFFFFFF)
+ hwc->irq_period = 0x7FFFFFFF;
+
+ atomic64_set(&hwc->period_left, hwc->irq_period);
+
+ /*
+ * Raw event types provide the config in the event structure
+ */
+ if (hw_event->raw) {
+ hwc->config |= hw_event->type;
+ } else {
+ if (hw_event->type >= max_intel_perfmon_events)
+ return -EINVAL;
+ /*
+ * The generic map:
+ */
+ hwc->config |= intel_perfmon_event_map[hw_event->type];
+ }
+ counter->wakeup_pending = 0;
+
+ return 0;
+}
+
+void hw_perf_enable_all(void)
+{
+ if (unlikely(!perf_counters_initialized))
+ return;
+
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, perf_counter_mask, 0);
+}
+
+u64 hw_perf_save_disable(void)
+{
+ u64 ctrl;
+
+ if (unlikely(!perf_counters_initialized))
+ return 0;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+
+ return ctrl;
+}
+EXPORT_SYMBOL_GPL(hw_perf_save_disable);
+
+void hw_perf_restore(u64 ctrl)
+{
+ if (unlikely(!perf_counters_initialized))
+ return;
+
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, ctrl, 0);
+}
+EXPORT_SYMBOL_GPL(hw_perf_restore);
+
+static inline void
+__x86_perf_counter_disable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, unsigned int idx)
+{
+ int err;
+
+ err = wrmsr_safe(hwc->config_base + idx, hwc->config, 0);
+ WARN_ON_ONCE(err);
+}
+
+static DEFINE_PER_CPU(u64, prev_left[MAX_HW_COUNTERS]);
+
+/*
+ * Set the next IRQ period, based on the hwc->period_left value.
+ * To be called with the counter disabled in hw:
+ */
+static void
+__hw_perf_counter_set_period(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ s32 left = atomic64_read(&hwc->period_left);
+ s32 period = hwc->irq_period;
+
+ WARN_ON_ONCE(period <= 0);
+
+ /*
+ * If we are way outside a reasonable range then just skip forward:
+ */
+ if (unlikely(left <= -period)) {
+ left = period;
+ atomic64_set(&hwc->period_left, left);
+ }
+
+ if (unlikely(left <= 0)) {
+ left += period;
+ atomic64_set(&hwc->period_left, left);
+ }
+
+ WARN_ON_ONCE(left <= 0);
+
+ per_cpu(prev_left[idx], smp_processor_id()) = left;
+
+ /*
+ * The hw counter starts counting from this counter offset,
+ * mark it to be able to extract future deltas:
+ */
+ atomic64_set(&hwc->prev_count, (u64)(s64)-left);
+
+ wrmsr(hwc->counter_base + idx, -left, 0);
+}
+
+static void
+__x86_perf_counter_enable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ wrmsr(hwc->config_base + idx,
+ hwc->config | ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+}
+
+/*
+ * Find a PMC slot for the freshly enabled / scheduled in counter:
+ */
+static void x86_perf_counter_enable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+
+ /* Try to get the previous counter again */
+ if (test_and_set_bit(idx, cpuc->used)) {
+ idx = find_first_zero_bit(cpuc->used, nr_hw_counters);
+ set_bit(idx, cpuc->used);
+ hwc->idx = idx;
+ }
+
+ perf_counters_lapic_init(hwc->nmi);
+
+ __x86_perf_counter_disable(counter, hwc, idx);
+
+ cpuc->counters[idx] = counter;
+
+ __hw_perf_counter_set_period(counter, hwc, idx);
+ __x86_perf_counter_enable(counter, hwc, idx);
+}
+
+void perf_counter_print_debug(void)
+{
+ u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left;
+ int cpu, idx;
+
+ if (!nr_hw_counters)
+ return;
+
+ local_irq_disable();
+
+ cpu = smp_processor_id();
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
+
+ printk(KERN_INFO "\n");
+ printk(KERN_INFO "CPU#%d: ctrl: %016llx\n", cpu, ctrl);
+ printk(KERN_INFO "CPU#%d: status: %016llx\n", cpu, status);
+ printk(KERN_INFO "CPU#%d: overflow: %016llx\n", cpu, overflow);
+
+ for (idx = 0; idx < nr_hw_counters; idx++) {
+ rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+ rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + idx, pmc_count);
+
+ prev_left = per_cpu(prev_left[idx], cpu);
+
+ printk(KERN_INFO "CPU#%d: PMC%d ctrl: %016llx\n",
+ cpu, idx, pmc_ctrl);
+ printk(KERN_INFO "CPU#%d: PMC%d count: %016llx\n",
+ cpu, idx, pmc_count);
+ printk(KERN_INFO "CPU#%d: PMC%d left: %016llx\n",
+ cpu, idx, prev_left);
+ }
+ local_irq_enable();
+}
+
+static void x86_perf_counter_disable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned int idx = hwc->idx;
+
+ __x86_perf_counter_disable(counter, hwc, idx);
+
+ clear_bit(idx, cpuc->used);
+ cpuc->counters[idx] = NULL;
+
+ /*
+ * Drain the remaining delta count out of a counter
+ * that we are disabling:
+ */
+ x86_perf_counter_update(counter, hwc, idx);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+ struct perf_data *irqdata = counter->irqdata;
+
+ if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+ irqdata->overrun++;
+ } else {
+ u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+ *p = data;
+ irqdata->len += sizeof(u64);
+ }
+}
+
+/*
+ * Save and restart an expired counter. Called by NMI contexts,
+ * so it has to be careful about preempting normal counter ops:
+ */
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+ u64 pmc_ctrl;
+
+ rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+
+ x86_perf_counter_update(counter, hwc, idx);
+ __hw_perf_counter_set_period(counter, hwc, idx);
+
+ if (pmc_ctrl & ARCH_PERFMON_EVENTSEL0_ENABLE)
+ __x86_perf_counter_enable(counter, hwc, idx);
+}
+
+static void
+perf_handle_group(struct perf_counter *sibling, u64 *status, u64 *overflown)
+{
+ struct perf_counter *counter, *group_leader = sibling->group_leader;
+
+ /*
+ * Store sibling timestamps (if any):
+ */
+ list_for_each_entry(counter, &group_leader->sibling_list, list_entry) {
+ x86_perf_counter_update(counter, &counter->hw, counter->hw.idx);
+ perf_store_irq_data(sibling, counter->hw_event.type);
+ perf_store_irq_data(sibling, atomic64_read(&counter->count));
+ }
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+ int bit, cpu = smp_processor_id();
+ u64 ack, status, saved_global;
+ struct cpu_hw_counters *cpuc;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, saved_global);
+
+ /* Disable counters globally */
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+ ack_APIC_irq();
+
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (!status)
+ goto out;
+
+again:
+ ack = status;
+ for_each_bit(bit, (unsigned long *) &status, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ clear_bit(bit, (unsigned long *) &status);
+ if (!counter)
+ continue;
+
+ perf_save_and_restart(counter);
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ continue;
+ case PERF_RECORD_IRQ:
+ perf_store_irq_data(counter, instruction_pointer(regs));
+ break;
+ case PERF_RECORD_GROUP:
+ perf_handle_group(counter, &status, &ack);
+ break;
+ }
+ /*
+ * From NMI context we cannot call into the scheduler to
+ * do a task wakeup - but we mark these counters as
+ * wakeup_pending and initiate a wakeup callback:
+ */
+ if (nmi) {
+ counter->wakeup_pending = 1;
+ set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+ } else {
+ wake_up(&counter->waitq);
+ }
+ }
+
+ wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack, 0);
+
+ /*
+ * Repeat if there is more work to be done:
+ */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (status)
+ goto again;
+out:
+ /*
+ * Restore - do not reenable when global enable is off:
+ */
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, saved_global, 0);
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+ irq_enter();
+ inc_irq_stat(apic_perf_irqs);
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ __smp_perf_counter_interrupt(regs, 0);
+
+ irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+ struct cpu_hw_counters *cpuc;
+ unsigned long flags;
+ int bit, cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ for_each_bit(bit, cpuc->used, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ if (!counter)
+ continue;
+
+ if (counter->wakeup_pending) {
+ counter->wakeup_pending = 0;
+ wake_up(&counter->waitq);
+ }
+ }
+
+ local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+ u32 apic_val;
+
+ if (!perf_counters_initialized)
+ return;
+ /*
+ * Enable the performance counter vector in the APIC LVT:
+ */
+ apic_val = apic_read(APIC_LVTERR);
+
+ apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+ if (nmi)
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ else
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+ unsigned long cmd, void *__args)
+{
+ struct die_args *args = __args;
+ struct pt_regs *regs;
+
+ if (likely(cmd != DIE_NMI_IPI))
+ return NOTIFY_DONE;
+
+ regs = args->regs;
+
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ __smp_perf_counter_interrupt(regs, 1);
+
+ return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+ .notifier_call = perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+ union cpuid10_eax eax;
+ unsigned int unused;
+ unsigned int ebx;
+
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+ return;
+
+ /*
+ * Check whether the Architectural PerfMon supports
+ * Branch Misses Retired Event or not.
+ */
+ cpuid(10, &(eax.full), &ebx, &unused, &unused);
+ if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+ return;
+
+ printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+ printk(KERN_INFO "... version: %d\n", eax.split.version_id);
+ printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+ nr_hw_counters = eax.split.num_counters;
+ if (nr_hw_counters > MAX_HW_COUNTERS) {
+ nr_hw_counters = MAX_HW_COUNTERS;
+ WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+ nr_hw_counters, MAX_HW_COUNTERS);
+ }
+ perf_counter_mask = (1 << nr_hw_counters) - 1;
+ perf_max_counters = nr_hw_counters;
+
+ printk(KERN_INFO "... bit_width: %d\n", eax.split.bit_width);
+ printk(KERN_INFO "... mask_length: %d\n", eax.split.mask_length);
+
+ perf_counters_initialized = true;
+
+ perf_counters_lapic_init(0);
+ register_die_notifier(&perf_counter_nmi_notifier);
+}
+
+static void x86_perf_counter_read(struct perf_counter *counter)
+{
+ x86_perf_counter_update(counter, &counter->hw, counter->hw.idx);
+}
+
+static const struct hw_perf_counter_ops x86_perf_counter_ops = {
+ .hw_perf_counter_enable = x86_perf_counter_enable,
+ .hw_perf_counter_disable = x86_perf_counter_disable,
+ .hw_perf_counter_read = x86_perf_counter_read,
+};
+
+const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ int err;
+
+ err = __hw_perf_counter_init(counter);
+ if (err)
+ return NULL;
+
+ return &x86_perf_counter_ops;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3194636..fc013cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -984,6 +984,11 @@ apicinterrupt ERROR_APIC_VECTOR \
apicinterrupt SPURIOUS_APIC_VECTOR \
spurious_interrupt smp_spurious_interrupt

+#ifdef CONFIG_PERF_COUNTERS
+apicinterrupt LOCAL_PERF_VECTOR \
+ perf_counter_interrupt smp_perf_counter_interrupt
+#endif
+
/*
* Exception entry points.
*/
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d1d4dc5..d92bc71 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct seq_file *p)
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_printf(p, " Local timer interrupts\n");
+ seq_printf(p, "CNT: ");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+ seq_printf(p, " Performance counter interrupts\n");
#endif
#ifdef CONFIG_SMP
seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)

#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
+ sum += irq_stats(cpu)->apic_perf_irqs;
#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 607db63..6a33b5e 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index 8670b3c..91d785c 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -138,6 +138,11 @@ static void __init apic_intr_init(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+ /* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
}

void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b1cc6da..dee553c 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -6,7 +6,7 @@
* 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes
* 2000-2002 x86-64 support by Andi Kleen
*/
-
+#include <linux/perf_counter.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/smp.h>
@@ -891,6 +891,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
tracehook_notify_resume(regs);
}

+ if (thread_info_flags & _TIF_PERF_COUNTERS) {
+ clear_thread_flag(TIF_PERF_COUNTERS);
+ perf_counter_notify(regs);
+ }
+
#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
#endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..496726d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_perf_counter_open
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5f8d746..a3e66a3 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -270,8 +270,11 @@ static atomic_t c3_cpu_count;
/* Common C-state entry for C2, C3, .. */
static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
{
+ u64 perf_flags;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ perf_flags = hw_perf_save_disable();
if (cstate->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cstate);
@@ -284,6 +287,7 @@ static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(perf_flags);
start_critical_timings();
}
#endif /* !CONFIG_CPU_IDLE */
@@ -1425,8 +1429,11 @@ static inline void acpi_idle_update_bm_rld(struct acpi_processor *pr,
*/
static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
{
+ u64 pctrl;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ pctrl = hw_perf_save_disable();
if (cx->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cx);
@@ -1441,6 +1448,7 @@ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(pctrl);
start_critical_timings();
}

diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index ce0d9da..52146c2 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
#include <linux/kbd_kern.h>
#include <linux/proc_fs.h>
#include <linux/quotaops.h>
+#include <linux/perf_counter.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
+ perf_counter_print_debug();
}
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..f30486f
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,257 @@
+/*
+ * Performance counters:
+ *
+ * Copyright(C) 2008, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ * Data type definitions, declarations, prototypes.
+ *
+ * Started by: Thomas Gleixner and Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * User-space ABI bits:
+ */
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ PERF_HW_EVENTS_MAX = 6,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4,
+ PERF_COUNT_CPU_MIGRATIONS = -5,
+
+ PERF_SW_EVENTS_MIN = -6,
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ inherit : 1, /* children inherit it */
+ __reserved_1 : 28;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Kernel-internal data types:
+ */
+
+/**
+ * struct hw_perf_counter - performance counter hardware details:
+ */
+struct hw_perf_counter {
+#ifdef CONFIG_PERF_COUNTERS
+ u64 config;
+ unsigned long config_base;
+ unsigned long counter_base;
+ int nmi;
+ unsigned int idx;
+ atomic64_t prev_count;
+ u64 irq_period;
+ atomic64_t period_left;
+#endif
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN 2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+ int len;
+ int rd_idx;
+ int overrun;
+ u8 data[PERF_DATA_BUFLEN];
+};
+
+struct perf_counter;
+
+/**
+ * struct hw_perf_counter_ops - performance counter hw ops
+ */
+struct hw_perf_counter_ops {
+ void (*hw_perf_counter_enable) (struct perf_counter *counter);
+ void (*hw_perf_counter_disable) (struct perf_counter *counter);
+ void (*hw_perf_counter_read) (struct perf_counter *counter);
+};
+
+/**
+ * enum perf_counter_active_state - the states of a counter
+ */
+enum perf_counter_active_state {
+ PERF_COUNTER_STATE_OFF = -1,
+ PERF_COUNTER_STATE_INACTIVE = 0,
+ PERF_COUNTER_STATE_ACTIVE = 1,
+};
+
+struct file;
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+#ifdef CONFIG_PERF_COUNTERS
+ struct list_head list_entry;
+ struct list_head sibling_list;
+ struct perf_counter *group_leader;
+ const struct hw_perf_counter_ops *hw_ops;
+
+ enum perf_counter_active_state state;
+ atomic64_t count;
+
+ struct perf_counter_hw_event hw_event;
+ struct hw_perf_counter hw;
+
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+ struct file *filp;
+
+ unsigned int nr_inherited;
+ struct perf_counter *parent;
+ /*
+ * Protect attach/detach:
+ */
+ struct mutex mutex;
+
+ int oncpu;
+ int cpu;
+
+ /* read() / irq related data */
+ wait_queue_head_t waitq;
+ /* optional: for NMIs */
+ int wakeup_pending;
+ struct perf_data *irqdata;
+ struct perf_data *usrdata;
+ struct perf_data data[2];
+#endif
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+ /*
+ * Protect the list of counters:
+ */
+ spinlock_t lock;
+
+ struct list_head counter_list;
+ int nr_counters;
+ int nr_active;
+ struct task_struct *task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+ struct perf_counter_context ctx;
+ struct perf_counter_context *task_ctx;
+ int active_oncpu;
+ int max_pertask;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern void
+perf_counter_show(struct perf_counter *counter, char *str, int trace);
+extern const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter);
+
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *child);
+extern void perf_counter_exit_task(struct task_struct *child);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+extern u64 hw_perf_save_disable(void);
+extern void hw_perf_restore(u64 ctrl);
+extern int perf_counter_task_disable(void);
+extern int perf_counter_task_enable(void);
+
+#else
+static inline void
+perf_counter_show(struct perf_counter *counter, char *str, int trace) { }
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu) { }
+static inline void perf_counter_init_task(struct task_struct *child) { }
+static inline void perf_counter_exit_task(struct task_struct *child) { }
+static inline void perf_counter_notify(struct pt_regs *regs) { }
+static inline void perf_counter_print_debug(void) { }
+static inline void hw_perf_restore(u64 ctrl) { }
+static inline u64 hw_perf_save_disable(void) { return 0; }
+static inline int perf_counter_task_disable(void) { return -EINVAL; }
+static inline int perf_counter_task_enable(void) { return -EINVAL; }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 48d887e..b00df4c 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -85,4 +85,7 @@
#define PR_SET_TIMERSLACK 29
#define PR_GET_TIMERSLACK 30

+#define PR_TASK_PERF_COUNTERS_DISABLE 31
+#define PR_TASK_PERF_COUNTERS_ENABLE 32
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..2e15be8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/perf_counter.h>
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
@@ -1013,6 +1014,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;

+ u64 nr_migrations;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
@@ -1028,7 +1031,6 @@ struct sched_entity {
u64 exec_max;
u64 slice_max;

- u64 nr_migrations;
u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
@@ -1326,6 +1328,7 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
+ struct perf_counter_context perf_counter_ctx;
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
@@ -2285,6 +2288,13 @@ static inline void inc_syscw(struct task_struct *tsk)
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);
+
+
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..a549678 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct perf_counter_hw_event;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -624,4 +625,11 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+
+asmlinkage int sys_perf_counter_open(
+
+ struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid,
+ int cpu,
+ int group_fd);
#endif
diff --git a/init/Kconfig b/init/Kconfig
index f763762..7d147a3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -732,6 +732,36 @@ config AIO
by some high performance threaded applications. Disabling
this option saves about 7k.

+config HAVE_PERF_COUNTERS
+ bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+ bool "Kernel Performance Counters"
+ depends on HAVE_PERF_COUNTERS
+ default y
+ select ANON_INODES
+ help
+ Enable kernel support for performance counter hardware.
+
+ Performance counters are special hardware registers available
+ on most modern CPUs. These registers count the number of certain
+ types of hw events, such as instructions executed, cache misses
+ suffered, or branches mis-predicted - without slowing down the
+ kernel or applications. These registers can also trigger interrupts
+ when a threshold number of events has passed - and can thus be
+ used to profile the code that runs on that CPU.
+
+ The Linux Performance Counter subsystem provides an abstraction of
+ these hardware capabilities, available via a system call. It
+ provides per task and per CPU counters, and it provides event
+ capabilities on top of those.
+
+ Say Y if unsure.
+
+endmenu
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 19fad00..1f184a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d8be7e..d336c90 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1093,11 +1093,12 @@ NORET_TYPE void do_exit(long code)
mpol_put(tsk->mempolicy);
tsk->mempolicy = NULL;
#endif
-#ifdef CONFIG_FUTEX
/*
- * This must happen late, after the PID is not
- * hashed anymore:
+ * These must happen late, after the PID is not
+ * hashed anymore, but still at a point that may sleep:
*/
+ perf_counter_exit_task(tsk);
+#ifdef CONFIG_FUTEX
if (unlikely(!list_empty(&tsk->pi_state_list)))
exit_pi_state_list(tsk);
if (unlikely(current->pi_state_cache))
diff --git a/kernel/fork.c b/kernel/fork.c
index 495da2e..e207860 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -978,6 +978,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

rt_mutex_init_task(p);
+ perf_counter_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..539fa82
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,1542 @@
+/*
+ * Performance counter core code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly = 1;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+extern __weak const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ return ERR_PTR(-EINVAL);
+}
+
+u64 __weak hw_perf_save_disable(void) { return 0; }
+void __weak hw_perf_restore(u64 ctrl) { }
+void __weak hw_perf_counter_setup(void) { }
+
+static void
+list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *group_leader = counter->group_leader;
+
+ /*
+ * Depending on whether it is a standalone or sibling counter,
+ * add it straight to the context's counter list, or to the group
+ * leader's sibling list:
+ */
+ if (counter->group_leader == counter)
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ else
+ list_add_tail(&counter->list_entry, &group_leader->sibling_list);
+}
+
+static void
+list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *sibling, *tmp;
+
+ list_del_init(&counter->list_entry);
+
+ /*
+ * If this was a group counter with sibling counters then
+ * upgrade the siblings to singleton counters by adding them
+ * to the context list directly:
+ */
+ list_for_each_entry_safe(sibling, tmp,
+ &counter->sibling_list, list_entry) {
+
+ list_del_init(&sibling->list_entry);
+ list_add_tail(&sibling->list_entry, &ctx->counter_list);
+ sibling->group_leader = sibling;
+ }
+}
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_counter_remove_from_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ unsigned long flags;
+ u64 perf_flags;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ counter->hw_ops->hw_perf_counter_disable(counter);
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ ctx->nr_active--;
+ cpuctx->active_oncpu--;
+ counter->task = NULL;
+ }
+ ctx->nr_counters--;
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+ list_del_counter(counter, ctx);
+ hw_perf_restore(perf_flags);
+
+ if (!ctx->task) {
+ /*
+ * Allow more per task counters with respect to the
+ * reservation:
+ */
+ cpuctx->max_pertask =
+ min(perf_max_counters - ctx->nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ }
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_counter_remove_from_context(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Per cpu counters are removed via an smp call and
+ * the removal is always successful.
+ */
+ smp_call_function_single(counter->cpu,
+ __perf_counter_remove_from_context,
+ counter, 1);
+ return;
+ }
+
+retry:
+ task_oncpu_function_call(task, __perf_counter_remove_from_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && !list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so we
+ * can remove the counter safely if the call above did not
+ * succeed.
+ */
+ if (!list_empty(&counter->list_entry)) {
+ ctx->nr_counters--;
+ list_del_counter(counter, ctx);
+ counter->task = NULL;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ int cpu = smp_processor_id();
+ unsigned long flags;
+ u64 perf_flags;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+ list_add_counter(counter, ctx);
+ hw_perf_restore(perf_flags);
+
+ ctx->nr_counters++;
+
+ if (cpuctx->active_oncpu < perf_max_counters) {
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu;
+ ctx->nr_active++;
+ cpuctx->active_oncpu++;
+ counter->hw_ops->hw_perf_counter_enable(counter);
+ }
+
+ if (!ctx->task && cpuctx->max_pertask)
+ cpuctx->max_pertask--;
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+ struct perf_counter *counter,
+ int cpu)
+{
+ struct task_struct *task = ctx->task;
+
+ counter->ctx = ctx;
+ if (!task) {
+ /*
+ * Per cpu counters are installed via an smp call and
+ * the install is always successful.
+ */
+ smp_call_function_single(cpu, __perf_install_in_context,
+ counter, 1);
+ return;
+ }
+
+ counter->task = task;
+retry:
+ task_oncpu_function_call(task, __perf_install_in_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so we
+ * can add the counter safely if the call above did not
+ * succeed.
+ */
+ if (list_empty(&counter->list_entry)) {
+ list_add_counter(counter, ctx);
+ ctx->nr_counters++;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+static void
+counter_sched_out(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+ return;
+
+ counter->hw_ops->hw_perf_counter_disable(counter);
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ counter->oncpu = -1;
+
+ cpuctx->active_oncpu--;
+ ctx->nr_active--;
+}
+
+static void
+group_sched_out(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ struct perf_counter *counter;
+
+ counter_sched_out(group_counter, cpuctx, ctx);
+
+ /*
+ * Schedule out siblings (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+ counter_sched_out(counter, cpuctx, ctx);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!cpuctx->task_ctx))
+ return;
+
+ spin_lock(&ctx->lock);
+ if (ctx->nr_active) {
+ list_for_each_entry(counter, &ctx->counter_list, list_entry)
+ group_sched_out(counter, cpuctx, ctx);
+ }
+ spin_unlock(&ctx->lock);
+ cpuctx->task_ctx = NULL;
+}
+
+static void
+counter_sched_in(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ if (counter->state == PERF_COUNTER_STATE_OFF)
+ return;
+
+ counter->hw_ops->hw_perf_counter_enable(counter);
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu; /* TODO: put 'cpu' into cpuctx->cpu */
+
+ cpuctx->active_oncpu++;
+ ctx->nr_active++;
+}
+
+static void
+group_sched_in(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ struct perf_counter *counter;
+
+ counter_sched_in(group_counter, cpuctx, ctx, cpu);
+
+ /*
+ * Schedule in siblings as one group (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+ counter_sched_in(counter, cpuctx, ctx, cpu);
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ spin_lock(&ctx->lock);
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (ctx->nr_active == cpuctx->max_pertask)
+ break;
+
+ /*
+ * Listen to the 'cpu' scheduling filter constraint
+ * of counters:
+ */
+ if (counter->cpu != -1 && counter->cpu != cpu)
+ continue;
+
+ group_sched_in(counter, cpuctx, ctx, cpu);
+ }
+ spin_unlock(&ctx->lock);
+
+ cpuctx->task_ctx = ctx;
+}
+
+int perf_counter_task_disable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ local_irq_disable();
+ cpu = smp_processor_id();
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Disable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry)
+ counter->state = PERF_COUNTER_STATE_OFF;
+
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ local_irq_enable();
+
+ return 0;
+}
+
+int perf_counter_task_enable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ local_irq_disable();
+ cpu = smp_processor_id();
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Enable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->state != PERF_COUNTER_STATE_OFF)
+ continue;
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+
+ local_irq_enable();
+
+ return 0;
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ u64 perf_flags;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Rotate the first entry last (works just fine for group counters too):
+ */
+ perf_flags = hw_perf_save_disable();
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ list_del(&counter->list_entry);
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ break;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+ struct perf_counter *counter = info;
+
+ counter->hw_ops->hw_perf_counter_read(counter);
+}
+
+static u64 perf_counter_read(struct perf_counter *counter)
+{
+ /*
+ * If counter is enabled and currently active on a CPU, update the
+ * value in the counter structure:
+ */
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ smp_call_function_single(counter->oncpu,
+ __hw_perf_counter_read, counter, 1);
+ }
+
+ return atomic64_read(&counter->count);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task) {
+ if (cpuctx->task_ctx != ctx)
+ return;
+ spin_lock(&ctx->lock);
+ }
+
+ /* Change the pointer NMI safe */
+ atomic_long_set((atomic_long_t *)&counter->irqdata,
+ (unsigned long) counter->usrdata);
+ counter->usrdata = oldirqdata;
+
+ if (ctx->task)
+ spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ smp_call_function_single(counter->cpu,
+ __perf_switch_irq_data,
+ counter, 1);
+ return counter->usrdata;
+ }
+
+retry:
+ spin_lock_irq(&ctx->lock);
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+ counter->irqdata = counter->usrdata;
+ counter->usrdata = oldirqdata;
+ spin_unlock_irq(&ctx->lock);
+ return oldirqdata;
+ }
+ spin_unlock_irq(&ctx->lock);
+ task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+ /* Might have failed, because task was scheduled out */
+ if (counter->irqdata == oldirqdata)
+ goto retry;
+
+ return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+ if (ctx->task)
+ put_task_struct(ctx->task);
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * If cpu is not a wildcard then this is a percpu counter:
+ */
+ if (cpu != -1) {
+ /* Must be root to operate on a CPU counter: */
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EACCES);
+
+ if (cpu < 0 || cpu > num_possible_cpus())
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * We could be clever and allow attaching a counter to an
+ * offline CPU and activate it when the CPU comes up, but
+ * that's for later.
+ */
+ if (!cpu_isset(cpu, cpu_online_map))
+ return ERR_PTR(-ENODEV);
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ ctx = &cpuctx->ctx;
+
+ return ctx;
+ }
+
+ rcu_read_lock();
+ if (!pid)
+ task = current;
+ else
+ task = find_task_by_vpid(pid);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ if (!task)
+ return ERR_PTR(-ESRCH);
+
+ ctx = &task->perf_counter_ctx;
+ ctx->task = task;
+
+ /* Reuse ptrace permission checks for now. */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ put_context(ctx);
+ return ERR_PTR(-EACCES);
+ }
+
+ return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+ struct perf_counter *counter = file->private_data;
+ struct perf_counter_context *ctx = counter->ctx;
+
+ file->private_data = NULL;
+
+ mutex_lock(&counter->mutex);
+
+ perf_counter_remove_from_context(counter);
+ put_context(ctx);
+
+ mutex_unlock(&counter->mutex);
+
+ kfree(counter);
+
+ return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+ u64 cntval;
+
+ if (count != sizeof(cntval))
+ return -EINVAL;
+
+ mutex_lock(&counter->mutex);
+ cntval = perf_counter_read(counter);
+ mutex_unlock(&counter->mutex);
+
+ return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+ if (!usrdata->len)
+ return 0;
+
+ count = min(count, (size_t)usrdata->len);
+ if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+ return -EFAULT;
+
+ /* Adjust the counters */
+ usrdata->len -= count;
+ if (!usrdata->len)
+ usrdata->rd_idx = 0;
+ else
+ usrdata->rd_idx += count;
+
+ return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter *counter,
+ char __user *buf,
+ size_t count,
+ int nonblocking)
+{
+ struct perf_data *irqdata, *usrdata;
+ DECLARE_WAITQUEUE(wait, current);
+ ssize_t res;
+
+ irqdata = counter->irqdata;
+ usrdata = counter->usrdata;
+
+ if (usrdata->len + irqdata->len >= count)
+ goto read_pending;
+
+ if (nonblocking)
+ return -EAGAIN;
+
+ spin_lock_irq(&counter->waitq.lock);
+ __add_wait_queue(&counter->waitq, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (usrdata->len + irqdata->len >= count)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ spin_unlock_irq(&counter->waitq.lock);
+ schedule();
+ spin_lock_irq(&counter->waitq.lock);
+ }
+ __remove_wait_queue(&counter->waitq, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&counter->waitq.lock);
+
+ if (usrdata->len + irqdata->len < count)
+ return -ERESTARTSYS;
+read_pending:
+ mutex_lock(&counter->mutex);
+
+ /* Drain pending data first: */
+ res = perf_copy_usrdata(usrdata, buf, count);
+ if (res < 0 || res == count)
+ goto out;
+
+ /* Switch irq buffer: */
+ usrdata = perf_switch_irq_data(counter);
+ if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+ if (!res)
+ res = -EFAULT;
+ } else {
+ res = count;
+ }
+out:
+ mutex_unlock(&counter->mutex);
+
+ return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ struct perf_counter *counter = file->private_data;
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ return perf_read_hw(counter, buf, count);
+
+ case PERF_RECORD_IRQ:
+ case PERF_RECORD_GROUP:
+ return perf_read_irq_data(counter, buf, count,
+ file->f_flags & O_NONBLOCK);
+ }
+ return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+ struct perf_counter *counter = file->private_data;
+ unsigned int events = 0;
+ unsigned long flags;
+
+ poll_wait(file, &counter->waitq, wait);
+
+ spin_lock_irqsave(&counter->waitq.lock, flags);
+ if (counter->usrdata->len || counter->irqdata->len)
+ events |= POLLIN;
+ spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+ return events;
+}
+
+static const struct file_operations perf_fops = {
+ .release = perf_release,
+ .read = perf_read,
+ .poll = perf_poll,
+};
+
+static void cpu_clock_perf_counter_enable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_disable(struct perf_counter *counter)
+{
+}
+
+static void cpu_clock_perf_counter_read(struct perf_counter *counter)
+{
+ int cpu = raw_smp_processor_id();
+
+ atomic64_set(&counter->count, cpu_clock(cpu));
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
+ .hw_perf_counter_enable = cpu_clock_perf_counter_enable,
+ .hw_perf_counter_disable = cpu_clock_perf_counter_disable,
+ .hw_perf_counter_read = cpu_clock_perf_counter_read,
+};
+
+static void task_clock_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = current->se.sum_exec_runtime;
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+ if (WARN_ON_ONCE(delta < 0))
+ delta = 0;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void task_clock_perf_counter_read(struct perf_counter *counter)
+{
+ task_clock_perf_counter_update(counter);
+}
+
+static void task_clock_perf_counter_enable(struct perf_counter *counter)
+{
+ atomic64_set(&counter->hw.prev_count, current->se.sum_exec_runtime);
+}
+
+static void task_clock_perf_counter_disable(struct perf_counter *counter)
+{
+ task_clock_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_task_clock = {
+ .hw_perf_counter_enable = task_clock_perf_counter_enable,
+ .hw_perf_counter_disable = task_clock_perf_counter_disable,
+ .hw_perf_counter_read = task_clock_perf_counter_read,
+};
+
+static u64 get_page_faults(void)
+{
+ struct task_struct *curr = current;
+
+ return curr->maj_flt + curr->min_flt;
+}
+
+static void page_faults_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_page_faults();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+ if (WARN_ON_ONCE(delta < 0))
+ delta = 0;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void page_faults_perf_counter_read(struct perf_counter *counter)
+{
+ page_faults_perf_counter_update(counter);
+}
+
+static void page_faults_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * page-faults is a per-task value already,
+ * so we don't have to clear it on switch-in.
+ */
+}
+
+static void page_faults_perf_counter_disable(struct perf_counter *counter)
+{
+ page_faults_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_page_faults = {
+ .hw_perf_counter_enable = page_faults_perf_counter_enable,
+ .hw_perf_counter_disable = page_faults_perf_counter_disable,
+ .hw_perf_counter_read = page_faults_perf_counter_read,
+};
+
+static u64 get_context_switches(void)
+{
+ struct task_struct *curr = current;
+
+ return curr->nvcsw + curr->nivcsw;
+}
+
+static void context_switches_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_context_switches();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+ if (WARN_ON_ONCE(delta < 0))
+ delta = 0;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void context_switches_perf_counter_read(struct perf_counter *counter)
+{
+ context_switches_perf_counter_update(counter);
+}
+
+static void context_switches_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * curr->nvcsw + curr->nivcsw is a per-task value already,
+ * so we don't have to clear it on switch-in.
+ */
+}
+
+static void context_switches_perf_counter_disable(struct perf_counter *counter)
+{
+ context_switches_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_context_switches = {
+ .hw_perf_counter_enable = context_switches_perf_counter_enable,
+ .hw_perf_counter_disable = context_switches_perf_counter_disable,
+ .hw_perf_counter_read = context_switches_perf_counter_read,
+};
+
+static inline u64 get_cpu_migrations(void)
+{
+ return current->se.nr_migrations;
+}
+
+static void cpu_migrations_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_cpu_migrations();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+ if (WARN_ON_ONCE(delta < 0))
+ delta = 0;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void cpu_migrations_perf_counter_read(struct perf_counter *counter)
+{
+ cpu_migrations_perf_counter_update(counter);
+}
+
+static void cpu_migrations_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * se.nr_migrations is a per-task value already,
+ * so we don't have to clear it on switch-in.
+ */
+}
+
+static void cpu_migrations_perf_counter_disable(struct perf_counter *counter)
+{
+ cpu_migrations_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_migrations = {
+ .hw_perf_counter_enable = cpu_migrations_perf_counter_enable,
+ .hw_perf_counter_disable = cpu_migrations_perf_counter_disable,
+ .hw_perf_counter_read = cpu_migrations_perf_counter_read,
+};
+
+static const struct hw_perf_counter_ops *
+sw_perf_counter_init(struct perf_counter *counter)
+{
+ const struct hw_perf_counter_ops *hw_ops = NULL;
+
+ switch (counter->hw_event.type) {
+ case PERF_COUNT_CPU_CLOCK:
+ hw_ops = &perf_ops_cpu_clock;
+ break;
+ case PERF_COUNT_TASK_CLOCK:
+ hw_ops = &perf_ops_task_clock;
+ break;
+ case PERF_COUNT_PAGE_FAULTS:
+ hw_ops = &perf_ops_page_faults;
+ break;
+ case PERF_COUNT_CONTEXT_SWITCHES:
+ hw_ops = &perf_ops_context_switches;
+ break;
+ case PERF_COUNT_CPU_MIGRATIONS:
+ hw_ops = &perf_ops_cpu_migrations;
+ break;
+ default:
+ break;
+ }
+ return hw_ops;
+}
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(struct perf_counter_hw_event *hw_event,
+ int cpu,
+ struct perf_counter *group_leader,
+ gfp_t gfpflags)
+{
+ const struct hw_perf_counter_ops *hw_ops;
+ struct perf_counter *counter;
+
+ counter = kzalloc(sizeof(*counter), gfpflags);
+ if (!counter)
+ return NULL;
+
+ /*
+ * Single counters are their own group leaders, with an
+ * empty sibling list:
+ */
+ if (!group_leader)
+ group_leader = counter;
+
+ mutex_init(&counter->mutex);
+ INIT_LIST_HEAD(&counter->list_entry);
+ INIT_LIST_HEAD(&counter->sibling_list);
+ init_waitqueue_head(&counter->waitq);
+
+ counter->irqdata = &counter->data[0];
+ counter->usrdata = &counter->data[1];
+ counter->cpu = cpu;
+ counter->hw_event = *hw_event;
+ counter->wakeup_pending = 0;
+ counter->group_leader = group_leader;
+ counter->hw_ops = NULL;
+
+ hw_ops = NULL;
+ if (!hw_event->raw && hw_event->type < 0)
+ hw_ops = sw_perf_counter_init(counter);
+ if (!hw_ops)
+ hw_ops = hw_perf_counter_init(counter);
+
+ if (!hw_ops) {
+ kfree(counter);
+ return NULL;
+ }
+ counter->hw_ops = hw_ops;
+
+ return counter;
+}
+
+/**
+ * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
+ *
+ * @hw_event_uptr: event type attributes for monitoring/sampling
+ * @pid: target pid
+ * @cpu: target cpu
+ * @group_fd: group leader counter fd
+ */
+asmlinkage int
+sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid, int cpu, int group_fd)
+{
+ struct perf_counter *counter, *group_leader;
+ struct perf_counter_hw_event hw_event;
+ struct perf_counter_context *ctx;
+ struct file *counter_file = NULL;
+ struct file *group_file = NULL;
+ int fput_needed = 0;
+ int fput_needed2 = 0;
+ int ret;
+
+ if (copy_from_user(&hw_event, hw_event_uptr, sizeof(hw_event)) != 0)
+ return -EFAULT;
+
+ /*
+ * Get the target context (task or percpu):
+ */
+ ctx = find_get_context(pid, cpu);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ /*
+ * Look up the group leader (we will attach this counter to it):
+ */
+ group_leader = NULL;
+ if (group_fd != -1) {
+ ret = -EINVAL;
+ group_file = fget_light(group_fd, &fput_needed);
+ if (!group_file)
+ goto err_put_context;
+ if (group_file->f_op != &perf_fops)
+ goto err_put_context;
+
+ group_leader = group_file->private_data;
+ /*
+ * Do not allow a recursive hierarchy (this new sibling
+ * becoming part of another group-sibling):
+ */
+ if (group_leader->group_leader != group_leader)
+ goto err_put_context;
+ /*
+ * Do not allow to attach to a group in a different
+ * task or CPU context:
+ */
+ if (group_leader->ctx != ctx)
+ goto err_put_context;
+ }
+
+ ret = -EINVAL;
+ counter = perf_counter_alloc(&hw_event, cpu, group_leader, GFP_KERNEL);
+ if (!counter)
+ goto err_put_context;
+
+ ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+ if (ret < 0)
+ goto err_free_put_context;
+
+ counter_file = fget_light(ret, &fput_needed2);
+ if (!counter_file)
+ goto err_free_put_context;
+
+ counter->filp = counter_file;
+ perf_install_in_context(ctx, counter, cpu);
+
+ fput_light(counter_file, fput_needed2);
+
+out_fput:
+ fput_light(group_file, fput_needed);
+
+ return ret;
+
+err_free_put_context:
+ kfree(counter);
+
+err_put_context:
+ put_context(ctx);
+
+ goto out_fput;
+}
+
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+ struct task_struct *task)
+{
+ memset(ctx, 0, sizeof(*ctx));
+ spin_lock_init(&ctx->lock);
+ INIT_LIST_HEAD(&ctx->counter_list);
+ ctx->task = task;
+}
+
+/*
+ * inherit a counter from parent task to child task:
+ */
+static int
+inherit_counter(struct perf_counter *parent_counter,
+ struct task_struct *parent,
+ struct perf_counter_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_counter_context *child_ctx)
+{
+ struct perf_counter *child_counter;
+
+ child_counter = perf_counter_alloc(&parent_counter->hw_event,
+ parent_counter->cpu, NULL,
+ GFP_ATOMIC);
+ if (!child_counter)
+ return -ENOMEM;
+
+ /*
+ * Link it up in the child's context:
+ */
+ child_counter->ctx = child_ctx;
+ child_counter->task = child;
+ list_add_counter(child_counter, child_ctx);
+ child_ctx->nr_counters++;
+
+ child_counter->parent = parent_counter;
+ parent_counter->nr_inherited++;
+ /*
+ * inherit into child's child as well:
+ */
+ child_counter->hw_event.inherit = 1;
+
+ /*
+ * Get a reference to the parent filp - we will fput it
+ * when the child counter exits. This is safe to do because
+ * we are in the parent and we know that the filp still
+ * exists and has a nonzero count:
+ */
+ atomic_long_inc(&parent_counter->filp->f_count);
+
+ return 0;
+}
+
+static void
+__perf_counter_exit_task(struct task_struct *child,
+ struct perf_counter *child_counter,
+ struct perf_counter_context *child_ctx)
+{
+ struct perf_counter *parent_counter;
+ u64 parent_val, child_val;
+ u64 perf_flags;
+
+ /*
+ * Disable and unlink this counter.
+ *
+ * Be careful about zapping the list - IRQ/NMI context
+ * could still be processing it:
+ */
+ local_irq_disable();
+ perf_flags = hw_perf_save_disable();
+
+ if (child_counter->state == PERF_COUNTER_STATE_ACTIVE)
+ child_counter->hw_ops->hw_perf_counter_disable(child_counter);
+ list_del_init(&child_counter->list_entry);
+
+ hw_perf_restore(perf_flags);
+ local_irq_enable();
+
+ parent_counter = child_counter->parent;
+ /*
+ * It can happen that parent exits first, and has counters
+ * that are still around due to the child reference. These
+ * counters need to be zapped - but otherwise linger.
+ */
+ if (!parent_counter)
+ return;
+
+ parent_val = atomic64_read(&parent_counter->count);
+ child_val = atomic64_read(&child_counter->count);
+
+ /*
+ * Add back the child's count to the parent's count:
+ */
+ atomic64_add(child_val, &parent_counter->count);
+
+ fput(parent_counter->filp);
+
+ kfree(child_counter);
+}
+
+/*
+ * When a child task exits, feed back counter values to parent counters.
+ *
+ * Note: we are running in child context, but the PID is not hashed
+ * anymore so new counters will not be added.
+ */
+void perf_counter_exit_task(struct task_struct *child)
+{
+ struct perf_counter *child_counter, *tmp;
+ struct perf_counter_context *child_ctx;
+
+ child_ctx = &child->perf_counter_ctx;
+
+ if (likely(!child_ctx->nr_counters))
+ return;
+
+ list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
+ list_entry)
+ __perf_counter_exit_task(child, child_counter, child_ctx);
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *child)
+{
+ struct perf_counter_context *child_ctx, *parent_ctx;
+ struct perf_counter *counter, *parent_counter;
+ struct task_struct *parent = current;
+ unsigned long flags;
+
+ child_ctx = &child->perf_counter_ctx;
+ parent_ctx = &parent->perf_counter_ctx;
+
+ __perf_counter_init_context(child_ctx, child);
+
+ /*
+ * This is executed from the parent task context, so inherit
+ * counters that have been marked for cloning:
+ */
+
+ if (likely(!parent_ctx->nr_counters))
+ return;
+
+ /*
+ * Lock the parent list. No need to lock the child - not PID
+ * hashed yet and not running, so nobody can access it.
+ */
+ spin_lock_irqsave(&parent_ctx->lock, flags);
+
+ /*
+ * We don't have to disable NMIs - we are only looking at
+ * the list, not manipulating it:
+ */
+ list_for_each_entry(counter, &parent_ctx->counter_list, list_entry) {
+ if (!counter->hw_event.inherit || counter->group_leader != counter)
+ continue;
+
+ /*
+ * Instead of creating recursive hierarchies of counters,
+ * we link inherited counters back to the original parent,
+ * which has a filp for sure, which we use as the reference
+ * count:
+ */
+ parent_counter = counter;
+ if (counter->parent)
+ parent_counter = counter->parent;
+
+ if (inherit_counter(parent_counter, parent,
+ parent_ctx, child, child_ctx))
+ break;
+ }
+
+ spin_unlock_irqrestore(&parent_ctx->lock, flags);
+}
+
+static void __cpuinit perf_counter_init_cpu(int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ __perf_counter_init_context(&cpuctx->ctx, NULL);
+
+ mutex_lock(&perf_resource_mutex);
+ cpuctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+ mutex_unlock(&perf_resource_mutex);
+
+ hw_perf_counter_setup();
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_counter_exit_cpu(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+ struct perf_counter *counter, *tmp;
+
+ list_for_each_entry_safe(counter, tmp, &ctx->counter_list, list_entry)
+ __perf_counter_remove_from_context(counter);
+
+}
+static void perf_counter_exit_cpu(int cpu)
+{
+ smp_call_function_single(cpu, __perf_counter_exit_cpu, NULL, 1);
+}
+#else
+static inline void perf_counter_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ switch (action) {
+
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ perf_counter_init_cpu(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ perf_counter_exit_cpu(cpu);
+ break;
+
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+ .notifier_call = perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+ perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_cpu_notifier(&perf_cpu_nb);
+
+ return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+ const char *buf,
+ size_t count)
+{
+ struct perf_cpu_context *cpuctx;
+ unsigned long val;
+ int err, cpu, mpt;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > perf_max_counters)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_reserved_percpu = val;
+ for_each_online_cpu(cpu) {
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ spin_lock_irq(&cpuctx->ctx.lock);
+ mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ cpuctx->max_pertask = mpt;
+ spin_unlock_irq(&cpuctx->ctx.lock);
+ }
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+ unsigned long val;
+ int err;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > 1)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_overcommit = val;
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+ reserve_percpu,
+ 0644,
+ perf_show_reserve_percpu,
+ perf_set_reserve_percpu
+ );
+
+static SYSDEV_CLASS_ATTR(
+ overcommit,
+ 0644,
+ perf_show_overcommit,
+ perf_set_overcommit
+ );
+
+static struct attribute *perfclass_attrs[] = {
+ &attr_reserve_percpu.attr,
+ &attr_overcommit.attr,
+ NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+ .attrs = perfclass_attrs,
+ .name = "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+ return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+ &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
diff --git a/kernel/sched.c b/kernel/sched.c
index e4bb1dd..382cfdb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1852,12 +1852,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
p->se.sleep_start -= clock_offset;
if (p->se.block_start)
p->se.block_start -= clock_offset;
+#endif
if (old_cpu != new_cpu) {
- schedstat_inc(p, se.nr_migrations);
+ p->se.nr_migrations++;
+#ifdef CONFIG_SCHEDSTATS
if (task_hot(p, old_rq->clock, NULL))
schedstat_inc(p, se.nr_forced2_migrations);
- }
#endif
+ }
p->se.vruntime -= old_cfsrq->min_vruntime -
new_cfsrq->min_vruntime;

@@ -2212,6 +2214,27 @@ static int sched_balance_self(int cpu, int flag)

#endif /* CONFIG_SMP */

+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p: the task to evaluate
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, in which case the function is called directly.
+ */
+void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info)
+{
+ int cpu;
+
+ preempt_disable();
+ cpu = task_cpu(p);
+ if (task_curr(p))
+ smp_call_function_single(cpu, func, info, 1);
+ preempt_enable();
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2354,6 +2377,7 @@ static void __sched_fork(struct task_struct *p)
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
+ p->se.nr_migrations = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;

@@ -2534,6 +2558,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
fire_sched_out_preempt_notifiers(prev, next);
+ perf_counter_task_sched_out(prev, cpu_of(rq));
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
}
@@ -2574,6 +2599,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
finish_arch_switch(prev);
+ perf_counter_task_sched_in(current, cpu_of(rq));
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
if (current->sched_class->post_schedule)
@@ -4296,6 +4322,7 @@ void scheduler_tick(void)
rq->idle_at_tick = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
+ perf_counter_task_tick(curr, cpu);
}

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
diff --git a/kernel/sys.c b/kernel/sys.c
index 31deba8..0f66633 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -14,6 +14,7 @@
#include <linux/prctl.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/perf_counter.h>
#include <linux/resource.h>
#include <linux/kernel.h>
#include <linux/kexec.h>
@@ -1716,6 +1717,12 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
case PR_SET_TSC:
error = SET_TSC_CTL(arg2);
break;
+ case PR_TASK_PERF_COUNTERS_DISABLE:
+ error = perf_counter_task_disable();
+ break;
+ case PR_TASK_PERF_COUNTERS_ENABLE:
+ error = perf_counter_task_enable();
+ break;
case PR_GET_TIMERSLACK:
error = current->timer_slack_ns;
break;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..4be8bbc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
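
For completeness, here is a minimal, untested userspace sketch of opening
a single counter for the current task via the new syscall and reading it
back. There is no glibc wrapper, so the __NR_perf_counter_open number
below is only a placeholder - use whatever number your architecture's
unistd.h gets from the patch above - and the sketch assumes the new
include/linux/perf_counter.h is visible to userspace:

/*
 * Minimal sketch - the syscall number is a placeholder, everything
 * else follows the prototypes in the patch above.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_counter.h>

#ifndef __NR_perf_counter_open
# define __NR_perf_counter_open 333	/* placeholder, arch dependent */
#endif

int main(void)
{
	struct perf_counter_hw_event hw_event;
	unsigned long long count;
	int fd;

	memset(&hw_event, 0, sizeof(hw_event));
	hw_event.type = PERF_COUNT_TASK_CLOCK;	/* software counter */

	/* pid 0: current task, cpu -1: follow the task, group_fd -1: no group */
	fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
	if (fd < 0) {
		perror("perf_counter_open");
		return 1;
	}

	/* ... run the code to be measured ... */

	/* plain counting counter: read() returns a single u64 */
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("task clock: %llu nsecs\n", count);

	close(fd);
	return 0;
}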


2008-12-15 12:11:58

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Ingo Molnar writes:

> We are pleased to announce the v4 release of our performance counters
> subsystem implementation.

Looking at the code, I am wondering what you are planning to do to
support machines that have constraints on what sets of events can be
counted simultaneously. Currently you have the core code calling
counter->hw_ops->hw_perf_counter_enable which can't return an error.
The core expects it to be able to add any counter regardless of what
event it's counting, subject only to a maximum number of counters.
I assume you're going to change that.

I think the core should put together a list of counters and counter
groups that it would like to have on the PMU simultaneously and then
make one call to the arch layer to ask if that is possible. That
could either return success or failure. If it returns failure then
the core needs to ask for something less, or something different. I'm
not sure how the core should choose what to ask for instead, though.
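
To make that concrete, the shape of interface I have in mind is roughly
the sketch below. The function name and calling convention are invented
for illustration - nothing like this exists in the posted patch:

/*
 * Invented sketch: the core hands the arch layer the whole set of
 * counters (and groups) it would like active at the same time, and
 * gets a single yes/no answer, instead of enabling counters one by
 * one through hw_perf_counter_enable().
 */
struct perf_counter;

/* return 0 if the whole set fits on the PMU, -EAGAIN if it does not */
extern int hw_perf_group_schedulable(struct perf_counter **counters,
				     int nr_counters);

The core would then only call counter_sched_in() for each member once
that call succeeds, and would have to ask for a smaller or different
set when it fails.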

Paul.

2008-12-15 12:12:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Ingo Molnar writes:

> For example, a full kernel build's statistics on a 16-way x86 box are:
>
> $ timec -e -5,-4,-3,1,2,3,5 make -j32 bzImage
>
> Performance counter stats for 'make':
>
> 142420.882 task clock ticks (millisecs)
>
> 9951033 pagefaults (events)
> 302628 context switches (events)
> 57810 CPU migrations (events)
> 208439082509 instructions (events)
> 657918810 cache references (events)
> 120243697 cache misses (events)
> 3134162468 branch misses (events)

Does this machine have sufficient hardware counters to count those
four hardware events at the same time? Or were those counters
timeshared onto 1 or 2 hardware counters? If it's the latter, are
those counts from half or a quarter of the total execution?

Paul.

2008-12-15 17:39:53

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Hello

I see a large (2300 instruction) fixed overhead when measuring
retired instruction count using the "timec" command
compared to the "pfmon" tool that comes with perfmon3
(the pfmon tool has essentially no overhead when
doing aggregate counts).

Is this an inherent weakness with the new proposed performance
counter infrastructure?

I wanted to compare perfmon3 against Ingo's proposed
performance counter infrastructure. This is on
a Core2 Q6600 (the only machine I have that supports
Ingo's codebase).

For perfmon3 comparison, it's the same machine running
2.6.27.4 patched with the appropriate full (not stripped-down)
perfmon3 patchset available from perfmon2.sf.net.

All code for these tests can be had from:
http://www.csl.cornell.edu/~vince/projects/perf_counter/

#
# 100 instruction test
#

Testing with a 100 instruction assembly program:

# perfmon3

tasse:~/assembly_tests% pfmon -e INSTRUCTIONS_RETIRED ./100_insns
100 INSTRUCTIONS_RETIRED

# Ingo

tasse:~/assembly_tests% ./timec -e 1 ./100_insns

Performance counter stats for './100_insns':

0.762 task clock ticks (millisecs)

2446 instructions (events)

As we can see, timec overcounts by a lot! Is it 24x, or
a fixed value?


#
# 8 billion instruction comparison
#

# perfmon3


tasse:~/assembly_tests% time pfmon -e INSTRUCTIONS_RETIRED ./8B_insns
8000000440 INSTRUCTIONS_RETIRED
1.77s user 0.00s system 100% cpu 1.771 total

Note that on almost all x86 chips any hardware interrupt that
occurs adds an extra retired instruction to the total count
(some AMD engineers told me this is probably an artifact of the
long pipelines and of how the microcode changes the user/kernel
flag).

So you see that in 1.77s we accumulate 1.77s * 250Hz = 442.5 timer
interrupts, which is roughly the number of extra instructions we see.

(for more info on sources of non-determinism in instruction counting
with performance counters see the paper here:
http://www.csl.cornell.edu/~vince/papers/iiswc08 )


# ingo

tasse:~/assembly_tests% ./timec -e 1 ./8B_insns

Performance counter stats for './8B_insns':

1743.446 task clock ticks (millisecs)

8000002799 instructions (events)


So it turns out the overhead isn't 24x, but is actually
a fixed 2300 or so.

Still, that's overhead perfmon does not have.

Will this be fixed, or is it an inherent limitation of
the new proposal?

Vince

2008-12-15 21:02:42

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Hello

I'm trying a more complicated benchmark and getting even stranger
results.

This is still on the Q6600 machine

The benchmark does a loop, reading some memory. It should have
roughly:
12295 instructions
4096 memory loads
4096 branches

perfmon3 is close on all of these stats, and this is consistent
across runs with a small variation (+/- 3 or so).

The timec program returns 0 (!) for all of the stats except
retired instruction count! And with certain combinations
of counters I get 0 for all counts. No error messages
are printed.

Is this expected behavior?

The test program can be had from:
http://www.csl.cornell.edu/~vince/projects/perf_counter/


Details below:

#
# Perfmon results
#

# First, trying to read all 5 events at once fails, only 4 counters
# avail

tasse:~/assembly_tests% pfmon -e INSTRUCTIONS_RETIRED,BRANCH_INSTRUCTIONS_RETIRED,L1D_ALL_CACHE_REF,MEM_LOAD_RETIRED:L1D_MISS ./read_test
cannot configure events: set0 events incompatible or too many events

# Cache results are close to expected, L1D looks a little high

tasse:~/assembly_tests% pfmon -e INSTRUCTIONS_RETIRED,L1D_ALL_CACHE_REF,MEM_LOAD_RETIRED:L1D_MISS ./read_test
12299 INSTRUCTIONS_RETIRED
4164 L1D_ALL_CACHE_REF
4 MEM_LOAD_RETIRED:L1D_MISS

# Branch results. Close to what they should be, though a bit higher
# than expected.

tasse:~/assembly_tests% pfmon -e INSTRUCTIONS_RETIRED,BRANCH_INSTRUCTIONS_RETIRED,MISPREDICTED_BRANCH_RETIRED ./read_test
12299 INSTRUCTIONS_RETIRED
4102 BRANCH_INSTRUCTIONS_RETIRED
1 MISPREDICTED_BRANCH_RETIRED


#
# performance counter v4
#

# Including all stats gives no errors, but gives no results
# either


tasse:~/assembly_tests% ./timec -e 0 -e 1 -e 2 -e 3 -e 4 -e 5 ./read_test

Performance counter stats for './read_test':

0.716 task clock ticks (millisecs)

85049 cycles (events)
0 instructions (events)
0 cache references (events)
0 cache misses (events)
0 branches (events)
0 branch misses (events)



#
# If I include the cycles count, I consistently get 0
# for all counts???

tasse:~/assembly_tests% ./timec -e 0 -e 1 -e 2 -e 3 ./read_test

Performance counter stats for './read_test':

0.520 task clock ticks (millisecs)

73833 cycles (events)
0 instructions (events)
0 cache references (events)
0 cache misses (events)

#
# If I drop the cycles count, I get an instruction count
# with a value 2300 too high (see previous e-mail)
# And really low cache values.

tasse:~/assembly_tests% ./timec -e 1 -e 2 -e 3 ./read_test

Performance counter stats for './read_test':

0.723 task clock ticks (millisecs)

14644 instructions (events)
8 cache references (events)
0 cache misses (events)


#
# And the branch stats don't work either
#

tasse:~/assembly_tests% ./timec -e 1 -e 4 -e 5 ./read_test

Performance counter stats for './read_test':

0.711 task clock ticks (millisecs)

14643 instructions (events)
0 branches (events)
0 branch misses (events)

2008-12-15 21:42:58

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Vince Weaver writes:

> I see a large (2300 instruction) fixed overhead when measuring
> retired instruction count using the "timec" command
> compared to the "pfmon" tool that comes with perfmon3
> (the pfmon tool has essentially no overhead when
> doing aggregate counts).

Looks like timec will be counting the fork() and execvp() system calls
that are used to run your executable, as well as the executable
itself. The fork() overhead could be removed fairly easily, I think;
the execvp overhead would be hard to get rid of without using ptrace() - and
the use of ptrace was one of the things that Ingo et al. objected to
in perfmon3.
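
In other words timec presumably does something like the outline below
(this is only my guess at the flow, not timec's actual source), so the
counter is already live across the fork and the execvp:

	/* guessed outline of timec's startup - not its actual source */
	struct perf_counter_hw_event hw_event = { 0 };
	unsigned long long count;
	int status, fd;
	pid_t child;

	hw_event.type = 1;		/* e.g. instructions, as with -e 1 */
	hw_event.inherit = 1;		/* follow the child across fork() */

	fd = syscall(__NR_perf_counter_open, &hw_event, 0 /* self */, -1, -1);

	child = fork();			/* counted by the inherited counter */
	if (child == 0) {
		execvp(argv[1], &argv[1]);	/* counted up to the exec */
		_exit(127);
	}
	waitpid(child, &status, 0);

	read(fd, &count, sizeof(count));  /* includes the fork+exec setup */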

Paul.

2008-12-15 22:09:20

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Mon, Dec 15, 2008 at 10:42 PM, Paul Mackerras <[email protected]> wrote:
> Vince Weaver writes:
>
>> I see a large (2300 instruction) fixed overhead when measuring
>> retired instruction count using the "timec" command
>> compared to the "pfmon" tool that comes with perfmon3
>> (the pfmon tool has essentially no overhead when
>> doing aggregate counts).
>
> Looks like timec will be counting the fork() and execvp() system calls
> that are used to run your executable, as well as the executable
> itself. The fork() overhead could be removed fairly easily I think,
> the execvp would be hard to get rid of without using ptrace() - and
> the use of ptrace was one of the things that Ingo et al. objected to
> in perfmon3.
>
Paul, I think your analysis is correct. This is likely what is happening.

Not that timec could not use ptrace() to block the task from executing
its first instruction, but you'd still have a problem because of prctl(ENABLE)
which applies to the current task, not another task, unless I am mistaken.

Prctl() looks odd to me because you have all those supposedly independent
file descriptors to identify events you want to measure, but they are not
used to start/stop. If you are attaching to multiple tasks at the same time,
which you can do with the current API, you may not necessarily want to
start/stop all counters at the same time.

Looks like prctl() is not what we want after all...
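
What I would have expected is a per-counter control on the file
descriptor itself, along the lines of the (purely hypothetical, not in
the posted patch) calls below, so that each counter and each attached
task can be started and stopped independently:

	/* hypothetical - no such ioctls exist in the posted patch */
	ioctl(fd_cycles, PERF_COUNTER_IOC_ENABLE);	/* start only this one */
	ioctl(fd_insns, PERF_COUNTER_IOC_ENABLE);
	...
	ioctl(fd_cycles, PERF_COUNTER_IOC_DISABLE);	/* stop it on its own */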

2008-12-15 22:53:58

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Vince Weaver writes:

> I'm trying a more complicated benchmark and getting even stranger
> results.
>
> This is still on the Q6600 machine
>
> The benchmark does a loop, reading some memory. It should have
> roughly:
> 12295 instructions
> 4096 memory loads
> 4096 branches
>
> perfmon3 is close on all of these stats, and this is consistent
> across runs with a small variation (+/- 3 or so).
>
> The timec program returns 0 (!) for all of the stats except
> retired instruction count! And with certain combinations
> of counters I get 0 for all counts. No error messages
> are printed.
>
> Is this expected behavior?

When you have more software counters than hardware counters, the
kernel will be time-slicing the counters, but because your program
doesn't run for very long, it's possible that only the first set will
ever get to count anything.

That doesn't seem to explain all the 0 values (since the kernel should
be able to use at least 2 hardware counters on your machine, I expect)
but it might be part of the explanation.

Paul.

2008-12-16 12:22:56

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Sun 2008-12-14 22:28:29, Ingo Molnar wrote:
> We are pleased to announce the v4 release of our performance counters
> subsystem implementation. The kernel changes can be picked up from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core
>
> (also in the master branch. There's also a kernel patch attached
> below.)
>
> The biggest new feature in this release is the implementation of
> "performance counter inheritance" for the per task counters: the ability
> to extend performance counters to cover the execution of child tasks
> too, transparently and automatically - following them to other CPUs.
>
> This can be used to monitor a hierarchy of tasks without stopping them
> (or impacting them in any observable way), and extending that monitoring
> to all child tasks as well.
>
> We've written a new utility: 'timec', which takes advantage of this new
> kernel capability:
>
> http://redhat.com/~mingo/perfcounters/timec.c
>
> 'timec' works like /usr/bin/time, but it extends the dimension of "time"
> with all the metrics that hardware and software performance counters are
> able to capture.

Hmm, if I timec some setuid program, what happens?

Performance counters seem like great tool to pull secret keys out of
other processes :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-12-16 12:50:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


* Pavel Machek <[email protected]> wrote:

> Hmm, if I timec some setuid program, what happens?

yes, i already had a quick look at that a few days ago when i implemented
counter inheritance (for different reasons) and couldnt find the cleanest
place to put the exec() flushing into so i procrastinated that a bit :)

> Performance counters seem like great tool to pull secret keys out of
> other processes :-).

if you worry about _that_ angle you also have to:

- turn off the cycle counter

- turn off precise utimes

- plus you have to forbid SMT CPUs as well. On HT a task could
co-schedule with your setuid task and observe its timing
characteristics via its _own_ behavior. (which is impacted by whatever
is running on another SMT/HT thread.)

the real exec() worries are: active, IRQ-driven samples/events. Not possible
yet via the current iteration of counter inheritance (hence my
procrastination) - but it makes sense and that's why i was looking at the
exec() angle.

and that will flush simple counters too, removing your theoretical attack
angle as well.

So how about the patch below?

Ingo

--------------->
Subject: perfcounters: flush on setuid exec
From: Ingo Molnar <[email protected]>
Date: Tue Dec 16 13:40:44 CET 2008

Pavel Machek pointed out that performance counters should be flushed
when crossing protection domains on setuid execution.

Reported-by: Pavel Machek <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
fs/exec.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -33,6 +33,7 @@
#include <linux/string.h>
#include <linux/init.h>
#include <linux/pagemap.h>
+#include <linux/perf_counter.h>
#include <linux/highmem.h>
#include <linux/spinlock.h>
#include <linux/key.h>
@@ -1015,6 +1016,13 @@ int flush_old_exec(struct linux_binprm *
set_dumpable(current->mm, suid_dumpable);
}

+ /*
+ * Flush performance counters when crossing a
+ * security domain:
+ */
+ if (!get_dumpable(current->mm))
+ perf_counter_exit_task(current);
+
/* An exec changes our domain. We are no longer part of the thread
group */

2008-12-16 12:57:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Tue 2008-12-16 13:50:00, Ingo Molnar wrote:
>
> * Pavel Machek <[email protected]> wrote:
>
> > Hmm, if I timec some setuid program, what happens?
>
> yes, i already had a quick look at that a few days ago when i implemented
> counter inheritance (for different reasons) and couldnt find the cleanest
> place to put the exec() flushing into so i procrastinated that a bit :)
>
> > Performance counters seem like great tool to pull secret keys out of
> > other processes :-).
>
> if you worry about _that_ angle you also have to:
>
> - turn off the cycle counter
>
> - turn off precise utimes

Probably good idea, yes.

> - plus you have to forbid SMT CPUs as well. On HT a task could
> co-schedule with your setuid task and observe its timing
> characteristics via its _own_ behavior. (which is impacted by whatever
> is running on another SMT/HT thread.)

Yes, SMT is evil.

> the real exec() worries are: active, IRQ-driven samples/events. Not possible
> yet via the current iteration of counter inheritance (hence my
> procrastination) - but it makes sense and that's why i was looking at the
> exec() angle.
>
> and that will flush simple counters too, removing your theoretical attack
> angle as well.
>
> So how about the patch below?

Thanks!

> Subject: perfcounters: flush on setuid exec
> From: Ingo Molnar <[email protected]>
> Date: Tue Dec 16 13:40:44 CET 2008
>
> Pavel Machek pointed out that performance counters should be flushed
> when crossing protection domains on setuid execution.
>
> Reported-by: Pavel Machek <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Acked-by: Pavel Machek <[email protected]>

> @@ -1015,6 +1016,13 @@ int flush_old_exec(struct linux_binprm *
> set_dumpable(current->mm, suid_dumpable);
> }
>
> + /*
> + * Flush performance counters when crossing a
> + * security domain:
> + */
> + if (!get_dumpable(current->mm))
> + perf_counter_exit_task(current);
> +
> /* An exec changes our domain. We are no longer part of the thread
> group */
>

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-12-16 13:03:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


* Pavel Machek <[email protected]> wrote:

> On Tue 2008-12-16 13:50:00, Ingo Molnar wrote:
> >
> > * Pavel Machek <[email protected]> wrote:
> >
> > > Hmm, if I timec some setuid program, what happens?
> >
> > yes, i already had a quick look at that a few days ago when i implemented
> > counter inheritance (for different reasons) and couldnt find the cleanest
> > place to put the exec() flushing into so i procrastinated that a bit :)
> >
> > > Performance counters seem like great tool to pull secret keys out of
> > > other processes :-).
> >
> > if you worry about _that_ angle you also have to:
> >
> > - turn off the cycle counter
> >
> > - turn off precise utimes
>
> Probably good idea, yes.
>
> > - plus you have to forbid SMT CPUs as well. On HT a task could
> > co-schedule with your setuid task and observe its timing
> > characteristics via its _own_ behavior. (which is impacted by whatever
> > is running on another SMT/HT thread.)
>
> Yes, SMT is evil.

HT got added back to Nehalem, so SMT is coming to you in every future x86
CPU. It brings a serious performance win, so nobody will turn off SMT
threading in practice. If SMT worries you, it needs explicit partitioning
of security-relevant processing to different physical CPUs, via
cgroups/cpusets/etc.

> > the real exec() worries are: active, IRQ-driven samples/events. Not possible
> > yet via the current iteration of counter inheritance (hence my
> > procrastination) - but it makes sense and that's why i was looking at the
> > exec() angle.
> >
> > and that will flush simple counters too, removing your theoretical attack
> > angle as well.
> >
> > So how about the patch below?
>
> Thanks!
>
> > Subject: perfcounters: flush on setuid exec
> > From: Ingo Molnar <[email protected]>
> > Date: Tue Dec 16 13:40:44 CET 2008
> >
> > Pavel Machek pointed out that performance counters should be flushed
> > when crossing protection domains on setuid execution.
> >
> > Reported-by: Pavel Machek <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
>
> Acked-by: Pavel Machek <[email protected]>

find below the final commit, thanks Pavel.

Ingo

------------>
From f65cb45cba63f249458b669aa67069eabc37b2f5 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Tue, 16 Dec 2008 13:40:44 +0100
Subject: [PATCH] perfcounters: flush on setuid exec

Pavel Machek pointed out that performance counters should be flushed
when crossing protection domains on setuid execution.

Reported-by: Pavel Machek <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
fs/exec.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ec5df9a..d5165d8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -33,6 +33,7 @@
#include <linux/string.h>
#include <linux/init.h>
#include <linux/pagemap.h>
+#include <linux/perf_counter.h>
#include <linux/highmem.h>
#include <linux/spinlock.h>
#include <linux/key.h>
@@ -1017,6 +1018,13 @@ int flush_old_exec(struct linux_binprm * bprm)
set_dumpable(current->mm, suid_dumpable);
}

+ /*
+ * Flush performance counters when crossing a
+ * security domain:
+ */
+ if (!get_dumpable(current->mm))
+ perf_counter_exit_task(current);
+
/* An exec changes our domain. We are no longer part of the thread
group */

2008-12-16 13:13:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Tue, 16 Dec 2008 14:03:02 +0100
Ingo Molnar <[email protected]> wrote:

> > > - plus you have to forbid SMT CPUs as well. On HT a task could
> > > co-schedule with your setuid task and observe its timing
> > > characteristics via its _own_ behavior. (which is impacted by
> > > whatever is running on another SMT/HT thread.)
> >
> > Yes, SMT is evil.
>
> HT got added back to Nehalem, so SMT is coming to you in every future
> x86 CPU. It brings a serious performance win, so nobody will turn off
> SMT threading in practice. If SMT worries you, it needs explicit
> partitioning of security-relevant processing to different physical
> CPUs, via cgroups/cpusets/etc.

and/or you use properly implemented crypto code (see Bruce Schneier's
books). The timing "problem" isn't really SMT specific. If you have
improperly implemented crypto (eg crypto code where the code paths and
not just the data payload are key dependent) then on any system with
more than one (logical) processor there is interference that an
attacker can use.

The only possible answer is to use proper implementation; turning off HT
may make you feel good but you go from shoddy crypto for which there are
some internet papers on how to crack it, to shoddy crypto for which the
same papers apply ;)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-12-16 14:23:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Mon, 2008-12-15 at 23:11 +1100, Paul Mackerras wrote:
> Ingo Molnar writes:
>
> > We are pleased to announce the v4 release of our performance counters
> > subsystem implementation.
>
> Looking at the code, I am wondering what you are planning to do to
> support machines that have constraints on what sets of events can be
> counted simultaneously. Currently you have the core code calling
> counter->hw_ops->hw_perf_counter_enable which can't return an error.
> The core expects it to be able to add any counter regardless of what
> event it's counting, subject only to a maximum number of counters.
> I assume you're going to change that.
>
> I think the core should put together a list of counters and counter
> groups that it would like to have on the PMU simultaneously and then
> make one call to the arch layer to ask if that is possible. That
> could either return success or failure. If it returns failure then
> the core needs to ask for something less, or something different. I'm
> not sure how the core should choose what to ask for instead, though.

I think the constraint set should be applied when we add to a group: if,
when we add a counter to the group, the result isn't schedulable
anymore, we should fail the group addition - and thereby the counter
creation.

This would leave us with groups that are always schedulable in an atomic
fashion.

From what I understand the code RRs groups (co-scheduling groups where
possible) (ungrouped counter is a group of one), this means that with
the above addition you'd have the needed control over things.

If you need things to be atomic, create a single group; if you're fine
with RR time-sharing, create multiple.

This seems to leave a hole where multiple monitors collide and create
multiple groups unaware of each-other - could we plug this hole with a
group attribute?
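
A minimal toy of the rule Peter proposes, assuming a made-up PMU with two
fully interchangeable counters; a real PMU driver would also check per-event
register constraints in the same place:

#include <stdbool.h>
#include <stdio.h>

#define NR_HW_COUNTERS 2                 /* toy PMU: two interchangeable counters */

struct group {
        int nr_counters;                 /* counters already in this group */
};

/*
 * Toy constraint check: with fully interchangeable counters the only
 * constraint is the total number of hardware counters.
 */
static bool group_schedulable(const struct group *g, int extra)
{
        return g->nr_counters + extra <= NR_HW_COUNTERS;
}

/* Fail the addition - and thereby the counter creation - if the resulting
 * group could never be scheduled onto the PMU atomically. */
static int group_add_counter(struct group *g)
{
        if (!group_schedulable(g, 1))
                return -1;               /* the kernel would return an errno here */
        g->nr_counters++;
        return 0;
}

int main(void)
{
        struct group g = { 0 };

        for (int i = 0; i < 4; i++)
                printf("add counter %d -> %s\n", i,
                       group_add_counter(&g) == 0 ? "ok" : "rejected");
        return 0;
}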


2008-12-16 14:43:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Tue, 2008-12-16 at 08:42 +1100, Paul Mackerras wrote:
> the execvp would be hard to get rid of without using ptrace() - and
> the use of ptrace was one of the things that Ingo et al. objected to
> in perfmon3.

I don't think using ptrace in this case is a big issue - aside from the
fact that ptrace is crap in that you'd not be able to timec from a
debugger context :-(

The biggest objection to using ptrace was that ptrace was needed
_during_ the execution of the monitored load, thereby distorting the
load.

This case is different in that it would be used to start off the load.

Still it would be good if we could find another (elegant) way to fix
this.

Also, I'm pretty sure the regular 'time' suffers the very same issue and
counts the exec syscall as well - I saw that when I tinkered with the
execve argument code.

Furthermore, I think the output of tools such as time and now timec is most
relevant when compared between runs - that is, the change in values
between runs, not the absolute values as such. At least, that's what I
usually do:

time ./foo

tinker with foo.c

time ./foo

if time2 < time1 :-)
else :-(

2008-12-16 14:46:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


Could people fix this perfctl-devel list thingy to not generate silly
bounces - otherwise I'm going to drop it from the CC on further emails.

---
The reason it is being held:

Too many recipients to the message

2008-12-16 15:55:48

by Martin Cracauer

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

Ingo Molnar wrote on Sun, Dec 14, 2008 at 10:28:29PM +0100:
> We are pleased to announce the v4 release of our performance counters
> subsystem implementation. The kernel changes can be picked up from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core
>
> (also in the master branch. There's also a kernel patch attached
> below.)
>
> The biggest new feature in this release is the implementation of
> "performance counter inheritance" for the per task counters: the ability
> to extend performance counters to cover the execution of child tasks
> too, transparently and automatically - following them to other CPUs.

Does this come with a PAPI frontend or is one planned?

I picked the old perfctr mainly because I need PAPI. Outside-process
measurement doesn't do for my application, I need subdivisions in my
code.

Thanks
Martin
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <[email protected]> http://www.cons.org/cracauer/

2008-12-16 16:50:52

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


> On Tue, 2008-12-16 at 08:42 +1100, Paul Mackerras wrote:
> Furthermore, I think output of tools such as time and now timec are most
> relevant when compared between runs - that is, the change in values
> between runs, not the absolute values as such. At least, that's what I
> usually do:

That doesn't do you any good when comparing results across different
machines, or even different kernels on the same machine.

perfmon shows that good results can be had, even if it's not the cleanest
way in the world. It would be a shame to lose that.

Small micro-benchmarks like this are important. You can't always trust
the performance counters to work, so being able to sanity check them with
exact test-cases is critical. Otherwise you might just be measuring
nonsense.

And while it might be possible to subtract the exec() overhead for something
like retired instructions, it gets a lot more complicated when you have
something like cache bus snoops or branch mispredicts where it's hard to
tell what comes from the program and what is overhead from the monitoring
infrastructure.

Vince

2008-12-16 17:34:24

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


I'm trying to evaluate this new proposal for the kind of workloads I use
performance counters for, and even the simplest tests don't work.

I'm trying to do a simple aggregate count for some benchmarks here using
timec and I'm getting poor results.

Are any of the problems I'm reporting going to be fixed?

In any case, I was testing aggregate counts on a longer running benchmark,
this time equake from the spec2k benchmark suite, still on the q6600.

If I only count retired instructions, I get consistent results:

timec -e 1

119175255369 instructions (events)
119175255561 instructions (events)
119175255383 instructions (events)


however the minute I add another counter, say cycles so I can calculate
CPI/IPC, the results for instructions are suddenly off by 33%.

Needless to say, perfmon can handle reading both cycles and instructions
at the same time.


timec -e 0, -e 1
91758816320 cycles (events)
79428247907 instructions (events)

91849140396 cycles (events)
79449560742 instructions (events)


It gets worse when trying to look at cache statistics:

timec -e 1 -e 2 -e 3

59611457943 instructions (events)
1872499771 cache references (events)
97471971 cache misses (events)

59601907232 instructions (events)
1871766376 cache references (events)
97435199 cache misses (events)

and so on

timec -e1 -e2 -e3 -e4


47671703285 instructions (events)
1498246999 cache references (events)
77838085 cache misses (events)
3394839360 branches (events)

47666131604 instructions (events)
1497069685 cache references (events)
78065325 cache misses (events)
3393244879 branches (events)



So apparently this performance counter infrastructure will always be
useless for trying to get plain aggregate counts? It's the simplest case
to get right, so it makes me wonder about the design of the rest of the
infrastructure.

Vince

2008-12-16 19:48:04

by Corey Ashford

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Vince Weaver wrote:
>
> I'm trying to evaluate this new proposal for the kind of workloads I use
> performance counters for, and even the simplest tests don't work.
>
> I'm trying to do a simple aggregate count for some benchmarks here using
> timec and I'm getting poor results.
>
> Are any of the problems I'm reporting going to be fixed?
>
> In any case, I was testing aggregate counts on a longer running
> benchmark, this time equake from the spec2k benchmark suite, still on
> the q6600.
>
> If I only count retired instructions, I get consistent results:
>
> timec -e 1
>
> 119175255369 instructions (events)
> 119175255561 instructions (events)
> 119175255383 instructions (events)
>
>
> however the minute I add another counter, say cycles so I can calculate
> CPI/IPC, the results for instructions are suddenly off by 33%.
>
> Needless to say, perfmon can handle reading both cycles and instructions
> at the same time.
>
>
> timec -e 0, -e 1
> 91758816320 cycles (events)
> 79428247907 instructions (events)
>
> 91849140396 cycles (events)
> 79449560742 instructions (events)
>
>
> It gets worse when trying to look at cache statistics:
>
> timec -e 1 -e 2 -e 3
>
> 59611457943 instructions (events)
> 1872499771 cache references (events)
> 97471971 cache misses (events)
>
> 59601907232 instructions (events)
> 1871766376 cache references (events)
> 97435199 cache misses (events)
>
> and so on
>
> timec -e1 -e2 -e3 -e4
>
>
> 47671703285 instructions (events)
> 1498246999 cache references (events)
> 77838085 cache misses (events)
> 3394839360 branches (events)
>
> 47666131604 instructions (events)
> 1497069685 cache references (events)
> 78065325 cache misses (events)
> 3393244879 branches (events)
>
>
>
> So apparently this performance counter infrastructure will always be
> useless for trying to get plain aggregate counts? It's the simplest
> case to get right, so it makes me wonder about the design of the rest of
> the infrastructure.
>
> Vince

Your test case demonstrates that scaling is missing from the current
version of Performance Counters for Linux.

When each set of events is scheduled onto a set of hardware event
counters, in order to scale the results properly, a cycles counter needs
to be included in each set as well.

When the counts are read, the counts from each set need to be scaled
by a factor of
(total cycles) / (cycles in that set)

This is something that can be handled by perfmon3 (full) because set
multiplexing is explicitly programmed, not transparent as it is in
Ingo's current code. In perfmon3, the set switching can be determined
by event counter overflow, as well as by time.

Common to both perfmon3 and Ingo's solution is that as more and
more events are scheduled onto the same set of hardware registers, the
accuracy drops and has to be compensated for with longer run times.

Another source of error is that if the sets are rotated across the
hardware at a fixed periodic rate and there's any correlation between
that rate and what's going on in the program being analyzed, the results
will be dubious. Ideally, you'd want to have some sort of pseudo-random
set switching rate to mitigate this sort of problem.

If Ingo could make some sort of provision for including a cycles count
in every set, and then transparently performing the scaling, that would
make it easier to use. As it stands now, I don't think there's any way
to recover the needed scaling information, because you cannot tell what
events are in what sets and how many cycles are associated with each set.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]
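
A sketch of the cycle-based scaling Corey describes: each set carries a
cycles counter, and the raw counts from a set are extrapolated by
(total cycles)/(cycles while that set was on the PMU). The numbers below
are invented round figures, not measurements:

#include <stdio.h>
#include <stdint.h>

/* estimated full-run count = raw * (total cycles) / (cycles in that set) */
static uint64_t scale_count(uint64_t raw, uint64_t total_cycles,
                            uint64_t set_cycles)
{
        if (set_cycles == 0)
                return 0;                /* the set never made it onto the PMU */
        return (uint64_t)((double)raw * (double)total_cycles /
                          (double)set_cycles);
}

int main(void)
{
        uint64_t total_cycles = 90000000000ULL;  /* cycles over the whole run   */
        uint64_t set_cycles   = 60000000000ULL;  /* cycles while this set ran   */
        uint64_t raw_instr    = 80000000000ULL;  /* instructions seen by the set */

        printf("estimated instructions: %llu\n",
               (unsigned long long)scale_count(raw_instr, total_cycles,
                                               set_cycles));
        return 0;
}

With these figures the ~2/3 raw count scales back up to roughly the full-run
total, which is the kind of correction being discussed for Vince's
multiplexed equake runs.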

2008-12-16 19:57:16

by William Cohen

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

Ingo Molnar wrote:
> We are pleased to announce the v4 release of our performance counters
> subsystem implementation. The kernel changes can be picked up from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core
>
> (also in the master branch. There's also a kernel patch attached
> below.)


PAPI also has the concept of event presets to hide some of the event
selection details. The following URL lists the presets and which
processors they are supported on:

http://icl.cs.utk.edu/projects/papi/presets.html

Looking through the PAPI preset events, one can see lots of variation due
to differences in precisely what the performance monitoring
hardware supports within the processor. The following events are
defined in include/linux/perf_counter.h. It would be helpful to state
what is meant by each of the following events:

PERF_COUNT_CYCLES
PERF_COUNT_INSTRUCTIONS
PERF_COUNT_CACHE_REFERENCES
PERF_COUNT_CACHE_MISSES
PERF_COUNT_BRANCH_INSTRUCTIONS
PERF_COUNT_BRANCH_MISSES

PERF_COUNT_CYCLES and PERF_COUNT_INSTRUCTIONS

Is this the cpu clock rate to compute clocks per instruction (CPI)? On
some processors there are several possible sources of "cycles":

Reference clock frequency (fixed frequency, e.g. always 2.2GHz)
Cycles of processor subject to frequency changes and halts

Is PERF_COUNT_INSTRUCTIONS the count of instructions actually retired by the
processor, to be used with PERF_COUNT_CYCLES to estimate CPI? In
the case of an SMT (simultaneous multi-threading) processor, are these
going to be kept on a per-virtual-CPU basis?


PERF_COUNT_CACHE_REFERENCES and PERF_COUNT_CACHE_MISSES

PERF_COUNT_CACHE_REFERENCES and PERF_COUNT_CACHE_MISSES are not single
monolithic events on many processors. There are multiple cache
levels, and at the L1 level most processors have separate instruction
and data caches, which require multiple counters to implement. Would
these refer to the last level of cache before memory and just be used
to compute the hit/miss rate for that last level? Some processors in
the same family have L2 and some have L3 cache; the setup code would
need to distinguish between these processor variants.

What memory references and misses are included and excluded in cache operation
counts? TLB accesses? Cache eviction/snooping operations?


PERF_COUNT_BRANCH_INSTRUCTIONS and PERF_COUNT_BRANCH_MISSES

Assume the PERF_COUNT_BRANCH_INSTRUCTIONS and PERF_COUNT_BRANCH_MISSES
refer only to retired instructions and are used to compute the branch
misprediction ratio. Speculative instructions don't count toward these numbers.



Any thought to including events that are not in the Intel architected events
such as ITLB/DTLB accesses and misses?

-Will

2008-12-16 20:04:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Tue 2008-12-16 14:13:30, Arjan van de Ven wrote:
> On Tue, 16 Dec 2008 14:03:02 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > > > - plus you have to forbid SMT CPUs as well. On HT a task could
> > > > co-schedule with your setuid task and observe its timing
> > > > characteristics via its _own_ behavior. (which is impacted by
> > > > whatever is running on another SMT/HT thread.)
> > >
> > > Yes, SMT is evil.
> >
> > HT got added back to Nehalem, so SMT is coming to you in every future
> > x86 CPU. It brings a serious performance win, so nobody will turn
> > off

Fortunately, Intel is not the only x86 vendor :-).

> > SMT threading in practice. If SMT worries you, it needs explicit
> > partitioning of security-relevant processing to different physical
> > CPUs, via cgroups/cpusets/etc.

I guess we should refuse to run threads from different uids on one
physical core...

> and/or you use properly implemented crypto code (see Bruce Schneier's
> books). The timing "problem" isn't really SMT specific. If you have
> improperly implemented crypto (eg crypto code where the code paths and
> not just the data payload are key dependent) then on any system with
> more than one (logical) processor there is interference that an
> attacker can use.
>
> The only possible answer is to use proper implementation; turning off HT
> may make you feel good but you go from shoddy crypto for which there is
> some internet papers on how to crack it, to shoddy crypto for which the
> same papers apply ;)

It is not only timing. If an attacker has access to detailed cache miss
statistics, things are easier for him... I probably should review the
books, but even if code paths are key-independent (hard), you'll get
timing differences due to [data] cache misses...?

Ok, this is getting off-topic.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-12-16 20:51:26

by Vince Weaver

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Tue, 16 Dec 2008, Corey Ashford wrote:
> In common with both perfmon3 and Ingo's solution is that as more and more
> events are scheduled onto the same set of hardware registers, the accuracy
> drops and has to be compensated with longer run times.

There seems to be some confusion.

I want aggregate instruction count. I do not want any sort of scaling or
sampling.

When I count retired instructions and cycles, I want the full counts for
those. The q6600 definitely has more than 2 counters available, so it
should be able to give me exact aggregate counts for those counters. No
sampling or scaling should be involved.

Is it not possible to get raw, aggregate counts with Ingo's infrastructure?
The documentation is vague on this.

Vince

2008-12-16 21:52:44

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Vince Weaver writes:
>
> > On Tue, 2008-12-16 at 08:42 +1100, Paul Mackerras wrote:
> > Furthermore, I think output of tools such as time and now timec are most
> > relevant when compared between runs - that is, the change in values
> > between runs, not the absolute values as such. At least, that's what I
> > usually do:

Just for the record, I didn't write the quoted paragraph, Peter
Zijlstra did.

Paul.

2008-12-16 23:07:30

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Peter Zijlstra writes:

> On Mon, 2008-12-15 at 23:11 +1100, Paul Mackerras wrote:
> > I think the core should put together a list of counters and counter
> > groups that it would like to have on the PMU simultaneously and then
> > make one call to the arch layer to ask if that is possible. That
> > could either return success or failure. If it returns failure then
> > the core needs to ask for something less, or something different. I'm
> > not sure how the core should choose what to ask for instead, though.
>
> I think the constraint set should be applied when we add to a group, if
> when we add a counter to the group, the result isn't schedulable
> anymore, we should fail the group addition - and thereby the counter
> creation.
>
> This would leave us with groups that are always schedulable in an atomic
> fashion.

I agree that if adding a counter to a group results in that group not
being schedulable any more, we should fail the addition.

> From what I understand the code RRs groups (co-scheduling groups where
> possible) (ungrouped counter is a group of one), this means that with
> the above addition you'd have the needed control over things.
>
> If you need things to be atomic, create a single group, if you're fine
> with RR time-sharing, create multiple.

I think we need a "full-time" attribute for counters and groups that
says "I need to be on the whole time", where "whole time" means
whenever the task is running, for a per-task counter or group, or
continuously for per-cpu counters/groups.

With that, the core can check at creation or enable time and return an
error if it can't fit all the full-time counters and groups on, or if
there is any part-time counter/group that would never get to go on
when all the full-time counters/groups are on. (And I guess creation
or enabling of a part-time counter or group should fail if it would
never be able to go on.)

> This seems to leave a hole where multiple monitors collide and create
> multiple groups unaware of each-other - could we plug this hole with a
> group attribute?

There could be a "whole-PMU" group attribute, that says "I need raw
access to the PMU with no other counters scheduled", which would allow
monitoring programs to use arcane PMU features that the kernel doesn't
necessarily know about. But I think that's a bit different from what
you're talking about.

The perf counter subsystem will, in Ingo's design, naturally try to
schedule as many counters and groups on as it can. Given a list of
counters/groups, it could start with the first and keep on trying to
add counters or groups while it can, essentially trying all possible
combinations until it either fills up all the hardware counters or
exhausts the possible combinations. If it moves all the
counters/groups that do fit on up to the head of the list, and then
rotates them to the back of the list when the timeslice expires, that
would probably be OK. In fact the computation about what set of
counters/groups to put on should be done when adding/removing a
counter/group and when the timeslice expires, rather than at context
switch time. (I'm talking about the list of part-time counters/groups
here, of course.)

Paul.
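
A toy sketch of the scheduling Paul outlines, including the proposed
"full-time" attribute: full-time counters/groups go on first, and the
remaining hardware counters are filled greedily from a list of part-time
groups that is rotated when the timeslice expires. The group names, weights
and the four-counter PMU are invented for illustration:

#include <stdio.h>

#define NR_HW_COUNTERS 4

struct grp {
        const char *name;
        int weight;                       /* hardware counters this group needs */
};

/* one full-time group that must be on whenever the context is on */
static struct grp full_time = { "cycles+instructions", 2 };

/* part-time groups, round-robined between timeslices */
static struct grp part_time[] = {
        { "cache refs+misses", 2 },
        { "branches+misses",   2 },
        { "tlb misses",        1 },
};
#define NR_PART ((int)(sizeof(part_time) / sizeof(part_time[0])))

static void schedule_counters(int rotation)
{
        int free = NR_HW_COUNTERS;

        free -= full_time.weight;                 /* full-time groups go on first */
        printf("  on: %s (full-time)\n", full_time.name);

        for (int i = 0; i < NR_PART; i++) {       /* greedy pass over rotated list */
                struct grp *g = &part_time[(rotation + i) % NR_PART];

                if (g->weight <= free) {
                        free -= g->weight;
                        printf("  on: %s\n", g->name);
                }
        }
}

int main(void)
{
        for (int tick = 0; tick < 3; tick++) {    /* rotate at timeslice expiry */
                printf("timeslice %d:\n", tick);
                schedule_counters(tick);
        }
        return 0;
}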

2008-12-16 23:51:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


* Paul Mackerras <[email protected]> wrote:

> Peter Zijlstra writes:
>
> > On Mon, 2008-12-15 at 23:11 +1100, Paul Mackerras wrote:

> > > I think the core should put together a list of counters and counter
> > > groups that it would like to have on the PMU simultaneously and then
> > > make one call to the arch layer to ask if that is possible. That
> > > could either return success or failure. If it returns failure then
> > > the core needs to ask for something less, or something different.
> > > I'm not sure how the core should choose what to ask for instead,
> > > though.
> >
> > I think the constraint set should be applied when we add to a group,
> > if when we add a counter to the group, the result isn't schedulable
> > anymore, we should fail the group addition - and thereby the counter
> > creation.
> >
> > This would leave us with groups that are always schedulable in an
> > atomic fashion.
>
> I agree that if adding a counter to a group results in that group not
> being schedulable any more, we should fail the addition.
>
> > > From what I understand the code RRs groups (co-scheduling groups
> > > where possible) (ungrouped counter is a group of one), this means
> > > that with the above addition you'd have the needed control over
> > > things.
> >
> > If you need things to be atomic, create a single group, if you're fine
> > with RR time-sharing, create multiple.
>
> I think we need a "full-time" attribute for counters and groups that
> says "I need to be on the whole time", where "whole time" means whenever
> the task is running, for a per-task counter or group, or continuously
> for per-cpu counters/groups.
>
> With that, the core can check at creation or enable time and return an
> error if it can't fit all the full-time counters and groups on, or if
> there is any part-time counter/group that would never get to go on when
> all the full-time counters/groups are on. (And I guess creation or
> enabling of a part-time counter or group should fail if it would never
> be able to go on.)
>
> > This seems to leave a hole where multiple monitors collide and create
> > multiple groups unaware of each-other - could we plug this hole with a
> > group attribute?
>
> There could be a "whole-PMU" group attribute, that says "I need raw
> access to the PMU with no other counters scheduled", which would allow
> monitoring programs to use arcane PMU features that the kernel doesn't
> necessarily know about. But I think that's a bit different from what
> you're talking about.
>
> The perf counter subsystem will, in Ingo's design, naturally try to
> schedule as many counters and groups on as it can. Given a list of
> counters/groups, it could start with the first and keep on trying to add
> counters or groups while it can, essentially trying all possible
> combinations until it either fills up all the hardware counters or
> exhausts the possible combinations. If it moves all the counters/groups
> that do fit on up to the head of the list, and then rotates them to the
> back of the list when the timeslice expires, that would probably be OK.
> In fact the computation about what set of counters/groups to put on
> should be done when adding/removing a counter/group and when the
> timeslice expires, rather than at context switch time. (I'm talking
> about the list of part-time counters/groups here, of course.)

yeah, something like that sounds sensible.

The principle is this: as long as it's an add-on attribute with a default
value of zero, so that normal usage does not need to care about it, it's a
reasonable extension. What we want to avoid is burdening the normal case
with the weird cases. (and none of the things you mentioned seem to be in
that category so we are fine.)

Ingo

2008-12-17 01:51:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

William Cohen <[email protected]> writes:
>
> PERF_COUNT_CACHE_REFERENCES and PERF_COUNT_CACHE_MISSES are not single
> monolithic events on many processors. There are multiple cache
> levels. The L1 cache most processors have separate instruction and
> data caches and require multiple counters to implement. Would these
> refer to the last level of cache before memory and just be used to
> compute the hit/miss rate for that last level? Some processors in the
> same family have L2 and some processors have L3 cache. The setup code
> would need to distinguish between these processor variants.

The difference between L1 and L3 caches can be huge (in some cases
two orders of magnitude). With that I'm not sure a single cache
miss/hit event even makes any sense.

On a modern CPU with L2 and L3 caches as soon as you fall out
of the L2 you're going to perform more poorly on parallel workloads.

With that you really have to distinguish some levels.

-Andi


--
[email protected]

2008-12-17 01:54:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Paul Mackerras <[email protected]> writes:
>
> The perf counter subsystem will, in Ingo's design, naturally try to
> schedule as many counters and groups on as it can. Given a list of
> counters/groups, it could start with the first and keep on trying to
> add counters or groups while it can, essentially trying all possible
> combinations until it either fills up all the hardware counters or
> exhausts the possible combinations. If it moves all the
> counters/groups that do fit on up to the head of the list, and then
> rotates them to the back of the list when the timeslice expires, that
> would probably be OK. In fact the computation about what set of
> counters/groups to put on should be done when adding/removing a
> counter/group and when the timeslice expires, rather than at context
> switch time. (I'm talking about the list of part-time counters/groups
> here, of course.)

One issue is that PMU counts can cover more than one CPU. One example
for this are the Uncore events on Nehalem (which cover a whole socket)
or when you are in AnyThreads monitoring mode (then you get events
from both SMT siblings in a core)

With that you would need to examine other CPU's state at context switch
time. Probably not a good idea for scalability.

-Andi

--
[email protected]

2008-12-17 02:22:16

by Samuel Thibault

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

Andi Kleen, on Wed 17 Dec 2008 02:51:54 +0100, wrote:
> William Cohen <[email protected]> writes:
> > PERF_COUNT_CACHE_REFERENCES and PERF_COUNT_CACHE_MISSES are not single
> > monolithic events on many processors. There are multiple cache
> > levels. The L1 cache most processors have separate instruction and
> > data caches and require multiple counters to implement. Would these
> > refer to the last level of cache before memory and just be used to
> > compute the hit/miss rate for that last level? Some processors in the
> > same family have L2 and some processors have L3 cache. The setup code
> > would need to distinguish between these processor variants.
>
> The difference between L1 and L3 caches can be huge (in some cases
> two orders of magnitude). With that I'm not sure a single cache
> miss/hit event even makes any sense.

Confirmed. I have code for which I'd like to know whether it fits
into at least L2 or even L1.

Samuel

2008-12-17 02:28:49

by Dan Terpstra

[permalink] [raw]
Subject: RE: [Perfctr-devel] [patch] Performance Counters for Linux, v4


> Peter Zijlstra writes:
>
> > On Mon, 2008-12-15 at 23:11 +1100, Paul Mackerras wrote:
> > > I think the core should put together a list of counters and counter
> > > groups that it would like to have on the PMU simultaneously and then
> > > make one call to the arch layer to ask if that is possible. That
> > > could either return success or failure. If it returns failure then
> > > the core needs to ask for something less, or something different. I'm
> > > not sure how the core should choose what to ask for instead, though.
> >
> > I think the constraint set should be applied when we add to a group, if
> > when we add a counter to the group, the result isn't schedulable
> > anymore, we should fail the group addition - and thereby the counter
> > creation.
> >
> > This would leave us with groups that are always schedulable in an atomic
> > fashion.
>
> I agree that if adding a counter to a group results in that group not
> being schedulable any more, we should fail the addition.
>
That's what PAPI does.
In userspace.
Using libpfm.
Before counting anything.
On linux, AIX, Windows, Cray...
Talking to perfmon, perfctr, our own drivers, and maybe someday even the
linux performance counter subsystem.
- d

2008-12-17 07:34:21

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Paul,

On Wed, Dec 17, 2008 at 12:06 AM, Paul Mackerras <[email protected]> wrote:
>
> I think we need a "full-time" attribute for counters and groups that
> says "I need to be on the whole time", where "whole time" means
> whenever the task is running, for a per-task counter or group, or
> continuously for per-cpu counters/groups.
>
Who would not want to be the sole owner of the PMU?

Especially when you know that if you don't, you may be sharing it
with another user who may perturb your measurement, for instance
because of heavy sampling while you are just counting. It is a tradeoff
between flexibility and accuracy. Which one is more important?

2008-12-17 16:01:22

by William Cohen

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

Ingo Molnar wrote:
> We are pleased to announce the v4 release of our performance counters
> subsystem implementation. The kernel changes can be picked up from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core
>
> (also in the master branch. There's also a kernel patch attached
> below.)

Machines that support virtualization are becoming very common. How is
this performance monitoring support going to work with virtualization
(e.g. KVM)? Having the performance counters only work on physical
machines would be pretty limiting.

-Will

2008-12-17 20:54:37

by Corey Ashford

[permalink] [raw]
Subject: Re: [Perfctr-devel] [patch] Performance Counters for Linux, v4

William Cohen wrote:
> Ingo Molnar wrote:
>> We are pleased to announce the v4 release of our performance counters
>> subsystem implementation. The kernel changes can be picked up from:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
>> perfcounters/core
>>
>> (also in the master branch. There's also a kernel patch attached
>> below.)
>
> Machines that support virtualization are becoming very common. How is
> this performance monitoring support going to work with virtualization
> (e.g. KVM)? Having the performance counters only work on physical
> machines would be pretty limiting.

On Power machines, the PMU counter registers are virtualized by the
hypervisor, at the request of the OS, on a per-cpu/per-partition basis.
So this issue is handled transparently on Power anyway.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-16 18:01:52

by Corey Ashford

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Andi Kleen wrote:
> Paul Mackerras <[email protected]> writes:
>> The perf counter subsystem will, in Ingo's design, naturally try to
>> schedule as many counters and groups on as it can. Given a list of
>> counters/groups, it could start with the first and keep on trying to
>> add counters or groups while it can, essentially trying all possible
>> combinations until it either fills up all the hardware counters or
>> exhausts the possible combinations. If it moves all the
>> counters/groups that do fit on up to the head of the list, and then
>> rotates them to the back of the list when the timeslice expires, that
>> would probably be OK. In fact the computation about what set of
>> counters/groups to put on should be done when adding/removing a
>> counter/group and when the timeslice expires, rather than at context
>> switch time. (I'm talking about the list of part-time counters/groups
>> here, of course.)
>
> One issue is that PMU counts can cover more than one CPU. One example
> for this are the Uncore events on Nehalem (which cover a whole socket)
> or when you are in AnyThreads monitoring mode (then you get events
> from both SMT siblings in a core)
>
> With that you would need to examine other CPU's state at context switch
> time. Probably not a good idea for scalability.
>
> -Andi
>

Over time, it seems clear that we will see multi-core processor designs
with increasingly large uncore/nest facilities, so this could become
more and more of an issue.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-16 22:14:58

by Maynard Johnson

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Corey Ashford wrote:
> Andi Kleen wrote:
>> Paul Mackerras <[email protected]> writes:
>>> The perf counter subsystem will, in Ingo's design, naturally try to
>>> schedule as many counters and groups on as it can. Given a list of
>>> counters/groups, it could start with the first and keep on trying to
>>> add counters or groups while it can, essentially trying all possible
>>> combinations until it either fills up all the hardware counters or
>>> exhausts the possible combinations. If it moves all the
>>> counters/groups that do fit on up to the head of the list, and then
>>> rotates them to the back of the list when the timeslice expires, that
>>> would probably be OK. In fact the computation about what set of
>>> counters/groups to put on should be done when adding/removing a
>>> counter/group and when the timeslice expires, rather than at context
>>> switch time. (I'm talking about the list of part-time counters/groups
>>> here, of course.)
>> One issue is that PMU counts can cover more than one CPU. One example
>> for this are the Uncore events on Nehalem (which cover a whole socket)
>> or when you are in AnyThreads monitoring mode (then you get events
>> from both SMT siblings in a core)
>>
>> With that you would need to examine other CPU's state at context switch
>> time. Probably not a good idea for scalability.
>>
>> -Andi
>>
>
> Over time, it seems clear that we will see multi-core processor designs
> with increasingly large uncore/nest facilities, so this could become
> more and more of an issue.
Ingo, I'll add my voice to the chorus here. To reiterate the point, some
PMUs count events that are external to the processor cores, and these
events cannot be attributed to any one particular CPU -- and certainly
not to a particular pid. The current interface has a restriction that
the user cannot pass -1 for both pid and cpu. But it seems to me that's
exactly what would be needed for such off-core events. Can this feature
fit in with the current interface or is some sort of extension needed?

Thanks.
-Maynard
>
> - Corey
>
> Corey Ashford
> Software Engineer
> IBM Linux Technology Center, Linux Toolchain
> Beaverton, OR
> 503-578-3507
> [email protected]
>

2009-01-16 23:12:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4


* Maynard Johnson <[email protected]> wrote:

> > Over time, it seems clear that we will see multi-core processor
> > designs with increasingly large uncore/nest facilities, so this could
> > become more and more of an issue.

> Ingo, I'll add my voice to the chorus here. To reiterate the point,
> some PMUs count events that are external to the processor cores, and
> these events cannot be attributed to any one particular CPU -- and
> certainly not to a particular pid. The current interface has a
> restriction that the user cannot pass -1 for both pid and cpu. But it
> seems to me that's exactly what would be needed for such off-core
> events. Can this feature fit in with the current interface or is some
> sort of extension needed?

They fit in just fine, they just will have constraints that dont allow
their scheduling in a conflicting way. I.e. you'll only be able to occupy
it from a single counter per physical package - but otherwise it still
behaves like a normal counter if you define a single such counter per
physical package, as a percpu counter. (they dont make much sense as task
counters)

Btw., those kind of constraints make them quite noisy and hard to
interpret as well - because they summarize per physical package
characteristics (of up to 8 logical CPUs on Nehalem for example) and
cannot be tied to tasks easily.

I'm not dismissing them entirely: they do give an overview of "all stuff
that happens" at that level, and they do show a few things that is
obviously tied to the 'uncore' of a CPU (the cache, the memory interlink,
etc.), which cannot be provided by the normal PMCs - but they are too
highlevel to really be useful for finegrained analysis and for eventing.

In any case, despite their limitations they can still be provided just
fine, and any limitations they have is an inherent limitation of those
hardware counters, not of the perfcounters framework.

Ingo
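
A sketch of the "one counter per physical package, as a percpu counter"
setup Ingo describes: pick one present CPU per package from the standard
sysfs topology files, and open the uncore-style counter only on that CPU.
The perf_counter_open() call is left as a comment, since that syscall only
exists in the patched trees discussed in this thread:

#include <stdio.h>

#define MAX_CPUS 256
#define MAX_PKGS  64

int main(void)
{
        int seen[MAX_PKGS] = { 0 };

        for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
                char path[128];
                int pkg;
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                         cpu);
                f = fopen(path, "r");
                if (!f)
                        continue;                 /* no such CPU */
                if (fscanf(f, "%d", &pkg) == 1 &&
                    pkg >= 0 && pkg < MAX_PKGS && !seen[pkg]) {
                        seen[pkg] = 1;
                        printf("package %d: open the uncore counter on cpu %d\n",
                               pkg, cpu);
                        /* e.g.: perf_counter_open(&uncore_event, -1, cpu, -1); */
                }
                fclose(f);
        }
        return 0;
}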

2009-01-17 01:28:26

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

Corey Ashford writes:

> Over time, it seems clear that we will see multi-core processor designs
> with increasingly large uncore/nest facilities, so this could become
> more and more of an issue.

Those nest events still get counted on counters that are in the CPU
core, right? So that sounds like they can be counted by one or more
per-cpu perf_counter instances. That means that you're measuring them
across all processes. Does it make any sense to try to attribute
those events to individual processes? How would one do that?

Clearly, something has to know enough about the system topology to
know how many counters are needed and which (virtual) cpus they should
be on. At present that defaults to userspace, but we could extend
perf_counters to handle it in the kernel by adding 'core' and 'node'
specifiers to the hw_event structure (assuming a three-level node /
core / cpu hierarchy for the system structure).

Paul.

2009-01-17 09:39:15

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v4

On Sat, Jan 17, 2009 at 12:26:15PM +1100, Paul Mackerras wrote:
> Corey Ashford writes:
>
> > Over time, it seems clear that we will see multi-core processor designs
> > with increasingly large uncore/nest facilities, so this could become
> > more and more of an issue.
>
> Those nest events still get counted on counters that are in the CPU
> core, right?

Nope, Nehalem uncore counters are separate per socket. The uncore
has its own counters.

You can program them to interrupt some fixed CPU thread (but
it's not necessarily the thread that caused the event) or all CPU threads
on the socket.

I found these semantics quite hard to fit into oprofile too because
it also has too many per-cpu event assumptions.

> So that sounds like they can be counted by one or more
> per-cpu perf_counter instances. That means that you're measuring them
> across all processes. Does it make any sense to try to attribute
> those events to individual processes? How would one do that?

You can't (except by forcing only a single thread to run on
the socket). The hardware doesn't know otherwise.

perfmon3 has a concept of counters which are system wide.
While that's also not a 100% match on a multi socket system
it's better than assigning it to some random CPU.

>
> Clearly, something has to know enough about the system topology to
> know how many counters are needed and which (virtual) cpus they should

Hmm you want to create new virtual cpus for this?

-Andi

--
[email protected] -- Speaking for myself only.