2008-12-08 01:23:16

by Ingo Molnar

Subject: [patch] Performance Counters for Linux, v2


[ Performance counters are special hardware registers available on most
modern CPUs. These registers count the number of certain types of hw
events - such as instructions executed, cachemisses suffered, or
branches mis-predicted - without slowing down the kernel or
applications. These registers can also trigger interrupts when a
threshold number of events have passed - and can thus be used to
profile the code that runs on that CPU. ]

This is version 2 of our Performance Counters subsystem implementation.

The biggest user-visible change in this release is a new user-space
text-mode profiling utility that is based on this code: KernelTop.

KernelTop can be downloaded from:

http://redhat.com/~mingo/perfcounters/kerneltop.c

It's a standalone .c file that needs no extra libraries - it only needs a
CONFIG_PERF_COUNTERS=y kernel to run on.

This utility is intended for kernel developers - it's basically a dynamic
kernel profiler that gets hardware counter events dispatched to it
continuously, feeds them into a histogram and outputs it
periodically.

Here is a screenshot of it:

------------------------------------------------------------------------------
KernelTop: 250880 irqs/sec [NMI, 10000 cycles], (all, cpu: 0)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

17319 - ffffffff8106f8fa : audit_syscall_exit
16300 - ffffffff81042ce2 : sys_rt_sigprocmask
11031 - ffffffff8106fdc8 : audit_syscall_entry
10880 - ffffffff8100bd8d : rff_trace
9780 - ffffffff810a232f : kfree [ehci_hcd]
9707 - ffffffff81298cb7 : _spin_lock_irq [ehci_hcd]
7903 - ffffffff8106db17 : unroll_tree_refs
7266 - ffffffff81138d10 : copy_user_generic_string
5751 - ffffffff8100be45 : sysret_check
4803 - ffffffff8100bea8 : sysret_signal
4696 - ffffffff8100bdb0 : system_call
4425 - ffffffff8100bdc0 : system_call_after_swapgs
2855 - ffffffff810ae183 : path_put [ext3]
2773 - ffffffff8100bedb : auditsys
1589 - ffffffff810b6864 : dput [sunrpc]
1253 - ffffffff8100be40 : ret_from_sys_call
690 - ffffffff8105034c : current_kernel_time [ext3]
673 - ffffffff81042bd4 : sys_sigprocmask
611 - ffffffff8100bf25 : sysret_audit

It will correctly profile the core kernel, module space and vsyscall
areas as well. It allows the use of the most common hw counters: cycles,
instructions, branches, cachemisses, cache-references and branch-misses.

KernelTop does not have to be started/stopped - it continuously
profiles the system and updates the histogram as the workload changes.
The histogram is not cumulative: old workload effects time out
gradually. For example, if the system goes idle, the profiler output
drops to near zero within 10-20 seconds. So there's no need to stop or
restart profiling - it all updates automatically as the workload
changes its characteristics.

KernelTop can also profile raw event IDs. For example, on a Core2 CPU,
to profile the "Number of instruction length decoder stalls" event (raw
event 0x0087) during a hackbench run, I did this:

$ ./kerneltop -e -$(printf "%d\n" 0x00000087) -c 10000 -n 1

------------------------------------------------------------------------------
KernelTop: 331 irqs/sec [NMI, 10000 raw:0087], (all, 2 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

1016 - ffffffff802a611e : kmem_cache_alloc_node
898 - ffffffff804ca381 : sock_wfree
64 - ffffffff80567306 : schedule
50 - ffffffff804cdb39 : skb_release_head_state
45 - ffffffff8053ed54 : unix_write_space
33 - ffffffff802a6a4d : __kmalloc_node
18 - ffffffff802a642c : cache_alloc_refill
13 - ffffffff804cdd50 : __alloc_skb
7 - ffffffff8053ec0a : unix_shutdown

[ The printf is done to pass in a negative event number as a parameter. ]
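
The same raw event can also be requested at the syscall level. Below is
a minimal, hypothetical sketch (not part of the patch) that assumes the
x86-64 syscall number 295 from this series and the argument order
documented in Documentation/perf-counters.txt:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_perf_counter_open
#define __NR_perf_counter_open 295	/* x86-64 value in this series */
#endif

int main(void)
{
	int32_t raw_event = -0x0087;	/* negative => raw event+umask code */
	uint64_t count;
	int fd;

	/* args: hw_event_type, hw_event_period, record_type, pid, cpu */
	fd = syscall(__NR_perf_counter_open, raw_event,
		     0 /* non-blocking */, 0 /* PERF_RECORD_SIMPLE */,
		     0 /* current task */, -1 /* any CPU */);
	if (fd < 0) {
		perror("perf_counter_open");
		return 1;
	}

	/* ... run the code to be measured ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("raw 0x0087 events: %llu\n", (unsigned long long)count);

	return 0;
}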

We also made a good number of internal changes to the subsystem:

There's a new "counter group record" facility that is a straightforward
extension of the existing "irq record" notification type. This record
type can be set on a 'master' counter, and if the master counter triggers
an IRQ or an NMI, all the 'secondary' counters are read out atomically
and are put into the counter-group record. The result can then be read()
out by userspace via a single system call. (Based on extensive feedback
from Paul Mackerras and David Miller, thanks guys!)
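
Here is a rough, hypothetical sketch of how such a group is meant to be
used from user-space - the syscall number and argument order are taken
from this series, while the record layout of (event type, counter
value) u64 pairs is inferred from the x86 IRQ handler, so treat the
details as illustrative only:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_perf_counter_open 295	/* x86-64 value in this series */

enum { PERF_COUNT_CYCLES, PERF_COUNT_INSTRUCTIONS, PERF_COUNT_CACHE_REFERENCES,
       PERF_COUNT_CACHE_MISSES, PERF_COUNT_BRANCH_INSTRUCTIONS,
       PERF_COUNT_BRANCH_MISSES };
enum { PERF_RECORD_SIMPLE, PERF_RECORD_IRQ, PERF_RECORD_GROUP };

static int counter_open(uint32_t type, uint32_t period, uint32_t record)
{
	/* pid == 0: current task, cpu == -1: follow the task anywhere */
	return syscall(__NR_perf_counter_open, type, period, record, 0, -1);
}

int main(void)
{
	uint64_t buf[16];
	ssize_t len, i;
	int master;

	/* master counter: deliver a group record every 1000000 cycles */
	master = counter_open(PERF_COUNT_CYCLES, 1000000, PERF_RECORD_GROUP);
	if (master < 0) {
		perror("perf_counter_open");
		return 1;
	}
	/* secondary counters: plain counting, read out on the master's IRQ */
	counter_open(PERF_COUNT_INSTRUCTIONS, 0, PERF_RECORD_SIMPLE);
	counter_open(PERF_COUNT_CACHE_MISSES, 0, PERF_RECORD_SIMPLE);

	for (;;) {
		/* one blocking read() returns the queued group record */
		len = read(master, buf, sizeof(buf));
		if (len <= 0)
			break;
		for (i = 0; i + 1 < len / (ssize_t)sizeof(uint64_t); i += 2)
			printf("event %#llx: %llu\n",
			       (unsigned long long)buf[i],
			       (unsigned long long)buf[i + 1]);
	}
	return 0;
}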

The other big change is support for virtual task counters via counter
scheduling: a task can specify more counters than the CPU has hardware
counters; the kernel then schedules the counters periodically to spread
out the hw resources. So, for example, if a task starts 6 counters on a
CPU that has only two hardware counters, it still gets this output:

counter[0 cycles ]: 5204680573 , delta: 1733680843 events
counter[1 instructions ]: 1364468045 , delta: 454818351 events
counter[2 cache-refs ]: 12732 , delta: 4399 events
counter[3 cache-misses ]: 1009 , delta: 336 events
counter[4 branch-instructions ]: 125993304 , delta: 42006998 events
counter[5 branch-misses ]: 1946 , delta: 649 events

See this sample code at:

http://redhat.com/~mingo/perfcounters/hello-loop.c
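
hello-loop.c is the authoritative example; the following is an
independent, minimal sketch of the same idea (assuming the x86-64
syscall number from this series), not the actual file:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_perf_counter_open 295	/* x86-64 value in this series */
#define NR_COUNTERS 6

static const char *names[NR_COUNTERS] = {
	"cycles", "instructions", "cache-refs",
	"cache-misses", "branch-instructions", "branch-misses",
};

int main(void)
{
	uint64_t val[NR_COUNTERS], prev[NR_COUNTERS] = { 0 };
	int fd[NR_COUNTERS], i;
	unsigned long n;

	for (i = 0; i < NR_COUNTERS; i++) {
		/* generic event type i, non-blocking, simple record,
		   current task, any CPU */
		fd[i] = syscall(__NR_perf_counter_open, i, 0, 0, 0, -1);
		if (fd[i] < 0) {
			perror("perf_counter_open");
			return 1;
		}
	}

	for (;;) {
		/* burn some cycles, then print counts and deltas */
		for (n = 0; n < 100000000UL; n++)
			asm volatile ("" ::: "memory");

		for (i = 0; i < NR_COUNTERS; i++) {
			if (read(fd[i], &val[i], sizeof(val[i])) != sizeof(val[i]))
				continue;
			printf("counter[%d %-20s]: %20llu , delta: %12llu events\n",
			       i, names[i],
			       (unsigned long long)val[i],
			       (unsigned long long)(val[i] - prev[i]));
			prev[i] = val[i];
		}
		printf("\n");
	}
	return 0;
}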

There is now also the ability to do NMI profiling: this works both for
per CPU and per task counters. NMI counters are transparent and are
enabled via the PERF_COUNT_NMI bit in the "hardware event type"
parameter of the sys_perf_counter_open() system call.
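
At the syscall level this just means OR-ing the bit into the event
type. A minimal sketch (the x86 code only honours the NMI bit for
CAP_SYS_ADMIN callers):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_perf_counter_open 295	/* x86-64 value in this series */
#define PERF_COUNT_CYCLES	0
#define PERF_COUNT_NMI		(1 << 30)	/* from <linux/perf_counter.h> */
#define PERF_RECORD_IRQ		1

int main(void)
{
	/* cycles counter, NMI-driven, 10000-event period, current task */
	int fd = syscall(__NR_perf_counter_open,
			 PERF_COUNT_CYCLES | PERF_COUNT_NMI,
			 10000, PERF_RECORD_IRQ, 0, -1);

	if (fd < 0) {
		perror("perf_counter_open");
		return 1;
	}
	printf("NMI cycles counter fd: %d\n", fd);
	return 0;
}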

There's also more generic x86 support: all 4 generic PMCs of Nehalem /
Core i7 are supported - I've run 4 instances of KernelTop and they used
up four separate PMCs.

There's also perf counters debug output that can be triggered via sysrq,
for diagnostic purposes.

Ingo, Thomas

------------------->

The latest performance counters experimental git tree can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

--------------->
Ingo Molnar (2):
performance counters: documentation
performance counters: x86 support

Thomas Gleixner (1):
performance counters: core code


Documentation/perf-counters.txt | 104 +++
arch/x86/Kconfig | 1 +
arch/x86/ia32/ia32entry.S | 3 +-
arch/x86/include/asm/hardirq_32.h | 1 +
arch/x86/include/asm/hw_irq.h | 2 +
arch/x86/include/asm/intel_arch_perfmon.h | 34 +-
arch/x86/include/asm/irq_vectors.h | 5 +
arch/x86/include/asm/mach-default/entry_arch.h | 5 +
arch/x86/include/asm/pda.h | 1 +
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 3 +-
arch/x86/kernel/apic.c | 2 +
arch/x86/kernel/cpu/Makefile | 12 +-
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/cpu/perf_counter.c | 571 ++++++++++++++
arch/x86/kernel/entry_64.S | 6 +
arch/x86/kernel/irq.c | 5 +
arch/x86/kernel/irqinit_32.c | 3 +
arch/x86/kernel/irqinit_64.c | 5 +
arch/x86/kernel/signal_32.c | 8 +-
arch/x86/kernel/signal_64.c | 5 +
arch/x86/kernel/syscall_table_32.S | 1 +
drivers/char/sysrq.c | 2 +
include/linux/perf_counter.h | 171 +++++
include/linux/sched.h | 9 +
include/linux/syscalls.h | 6 +
init/Kconfig | 29 +
kernel/Makefile | 1 +
kernel/fork.c | 1 +
kernel/perf_counter.c | 945 ++++++++++++++++++++++++
kernel/sched.c | 24 +
kernel/sys_ni.c | 3 +
33 files changed, 1954 insertions(+), 21 deletions(-)
create mode 100644 Documentation/perf-counters.txt
create mode 100644 arch/x86/kernel/cpu/perf_counter.c
create mode 100644 include/linux/perf_counter.h
create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..19033a0
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cachemisses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int
+ perf_counter_open(u32 hw_event_type,
+ u32 hw_event_period,
+ u32 record_type,
+ pid_t pid,
+ int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+ enum hw_event_types {
+ PERF_COUNT_CYCLES,
+ PERF_COUNT_INSTRUCTIONS,
+ PERF_COUNT_CACHE_REFERENCES,
+ PERF_COUNT_CACHE_MISSES,
+ PERF_COUNT_BRANCH_INSTRUCTIONS,
+ PERF_COUNT_BRANCH_MISSES,
+ };
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+ specific and are enumerated via /sys on a per CPU basis. Raw hw event
+ types can be passed in as negative numbers. For example, to count
+ "External bus cycles while bus lock signal asserted" events on Intel
+ Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. A zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+ enum perf_record_type {
+ PERF_RECORD_SIMPLE,
+ PERF_RECORD_IRQ,
+ };
+
+a "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide an IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ac22bb7..5a2d74a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -651,6 +651,7 @@ config X86_UP_IOAPIC
config X86_LOCAL_APIC
def_bool y
depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+ select HAVE_PERF_COUNTERS

config X86_IO_APIC
def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..3c14ed0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
.quad compat_sys_signalfd4
.quad sys_eventfd2
.quad sys_epoll_create1
- .quad sys_dup3 /* 330 */
+ .quad sys_dup3 /* 330 */
.quad sys_pipe2
.quad sys_inotify_init1
+ .quad sys_perf_counter_open
ia32_syscall_end:
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index 5ca135e..b3e475d 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int apic_perf_irqs; /* arch dependent */
unsigned int irq0_irqs;
unsigned int irq_resched_count;
unsigned int irq_call_count;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index b97aecb..c22900e 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
/* Interrupt handlers registered during init_IRQ */
extern void apic_timer_interrupt(void);
extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
index fa0fd06..71598a9 100644
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ b/arch/x86/include/asm/intel_arch_perfmon.h
@@ -1,22 +1,24 @@
#ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
#define _ASM_X86_INTEL_ARCH_PERFMON_H

-#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
+#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1 0xc2

-#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1 0x187
+#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1 0x187

-#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)
+#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)

-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL (0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL 0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 0
#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
- (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+ (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6

union cpuid10_eax {
struct {
@@ -28,4 +30,12 @@ union cpuid10_eax {
unsigned int full;
};

+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void) { }
+static inline void perf_counters_lapic_init(int nmi) { }
+#endif
+
#endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 0005adb..b8d277f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
#define LOCAL_TIMER_VECTOR 0xef

/*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR 0xee
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
* a much simpler SMP time architecture:
*/
#ifdef CONFIG_X86_LOCAL_APIC
+
BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)

+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
#ifdef CONFIG_X86_MCE_P4THERMAL
BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
#endif
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..90a8d9d 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
short isidle;
struct mm_struct *active_mm;
unsigned apic_timer_irqs;
+ unsigned apic_perf_irqs;
unsigned irq0_irqs;
unsigned irq_resched_count;
unsigned irq_call_count;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e44d379..810bf26 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,6 +80,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS 11 /* notify perf counter work */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -103,6 +104,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS (1 << TIF_PERF_COUNTERS)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+ (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_perf_counter_open 333

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
__SYSCALL(__NR_pipe2, sys_pipe2)
#define __NR_inotify_init1 294
__SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open 295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 16f9487..8ab8c18 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
#include <linux/dmi.h>
#include <linux/dmar.h>

+#include <asm/intel_arch_perfmon.h>
#include <asm/atomic.h>
#include <asm/smp.h>
#include <asm/mtrr.h>
@@ -1147,6 +1148,7 @@ void __cpuinit setup_local_APIC(void)
apic_write(APIC_ESR, 0);
}
#endif
+ perf_counters_lapic_init(0);

preempt_disable();

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82ec607..89e5336 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
#
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
#

obj-y := intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64) += centaur_64.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_X86_MCE) += mcheck/
-obj-$(CONFIG_MTRR) += mtrr/
-obj-$(CONFIG_CPU_FREQ) += cpufreq/
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE) += mcheck/
+obj-$(CONFIG_MTRR) += mtrr/
+obj-$(CONFIG_CPU_FREQ) += cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o

quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..4461011 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
#include <asm/pat.h>
#include <asm/asm.h>
#include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
#else
vgetcpu_set_mode();
#endif
+ init_hw_perf_counters();
}

void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..82440cb
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,571 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_hw_counters __read_mostly;
+static u32 perf_counter_mask __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_HW_COUNTERS 8
+
+struct cpu_hw_counters {
+ struct perf_counter *counters[MAX_HW_COUNTERS];
+ unsigned long used[BITS_TO_LONGS(MAX_HW_COUNTERS)];
+ int enable_all;
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+const int intel_perfmon_event_map[] =
+{
+ [PERF_COUNT_CYCLES] = 0x003c,
+ [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
+ [PERF_COUNT_CACHE_MISSES] = 0x412e,
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
+ [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+int hw_perf_counter_init(struct perf_counter *counter, s32 hw_event_type)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+
+ if (unlikely(!perf_counters_initialized))
+ return -EINVAL;
+
+ /*
+ * Count user events, and generate PMC IRQs:
+ * (keep 'enabled' bit clear for now)
+ */
+ hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+ /*
+ * If privileged enough, count OS events too, and allow
+ * NMI events as well:
+ */
+ hwc->nmi = 0;
+ if (capable(CAP_SYS_ADMIN)) {
+ hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (hw_event_type & PERF_COUNT_NMI)
+ hwc->nmi = 1;
+ }
+
+ hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+ hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+
+ hwc->irq_period = counter->__irq_period;
+ /*
+ * Intel PMCs cannot be accessed sanely above 32 bit width,
+ * so we install an artificial 1<<31 period regardless of
+ * the generic counter period:
+ */
+ if (!hwc->irq_period)
+ hwc->irq_period = 0x7FFFFFFF;
+
+ hwc->next_count = -((s32) hwc->irq_period);
+
+ /*
+ * Negative event types mean raw encoded event+umask values:
+ */
+ if (hw_event_type < 0) {
+ counter->hw_event_type = -hw_event_type;
+ counter->hw_event_type &= ~PERF_COUNT_NMI;
+ } else {
+ hw_event_type &= ~PERF_COUNT_NMI;
+ if (hw_event_type >= max_intel_perfmon_events)
+ return -EINVAL;
+ /*
+ * The generic map:
+ */
+ counter->hw_event_type = intel_perfmon_event_map[hw_event_type];
+ }
+ hwc->config |= counter->hw_event_type;
+ counter->wakeup_pending = 0;
+
+ return 0;
+}
+
+static void __hw_perf_enable_all(void)
+{
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, perf_counter_mask, 0);
+}
+
+void hw_perf_enable_all(void)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+
+ cpuc->enable_all = 1;
+ __hw_perf_enable_all();
+}
+
+void hw_perf_disable_all(void)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+
+ cpuc->enable_all = 0;
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+}
+
+static DEFINE_PER_CPU(u64, prev_next_count[MAX_HW_COUNTERS]);
+
+static void __hw_perf_counter_enable(struct hw_perf_counter *hwc, int idx)
+{
+ per_cpu(prev_next_count[idx], smp_processor_id()) = hwc->next_count;
+
+ wrmsr(hwc->counter_base + idx, hwc->next_count, 0);
+ wrmsr(hwc->config_base + idx, hwc->config, 0);
+}
+
+void hw_perf_counter_enable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+
+ /* Try to get the previous counter again */
+ if (test_and_set_bit(idx, cpuc->used)) {
+ idx = find_first_zero_bit(cpuc->used, nr_hw_counters);
+ set_bit(idx, cpuc->used);
+ hwc->idx = idx;
+ }
+
+ perf_counters_lapic_init(hwc->nmi);
+
+ wrmsr(hwc->config_base + idx,
+ hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+ cpuc->counters[idx] = counter;
+ counter->hw.config |= ARCH_PERFMON_EVENTSEL0_ENABLE;
+ __hw_perf_counter_enable(hwc, idx);
+}
+
+#ifdef CONFIG_X86_64
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val)
+{
+ atomic64_set(&counter->count, val);
+}
+
+static inline u64 atomic64_counter_read(struct perf_counter *counter)
+{
+ return atomic64_read(&counter->count);
+}
+#else
+/*
+ * Todo: add proper atomic64_t support to 32-bit x86:
+ */
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val64)
+{
+ u32 *val32 = (void *)&val64;
+
+ atomic_set(counter->count32 + 0, *(val32 + 0));
+ atomic_set(counter->count32 + 1, *(val32 + 1));
+}
+
+static inline u64 atomic64_counter_read(struct perf_counter *counter)
+{
+ return atomic_read(counter->count32 + 0) |
+ (u64) atomic_read(counter->count32 + 1) << 32;
+}
+#endif
+
+static void __hw_perf_save_counter(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ s64 raw = -1;
+ s64 delta;
+ int err;
+
+ /*
+ * Get the raw hw counter value:
+ */
+ err = rdmsrl_safe(hwc->counter_base + idx, &raw);
+ WARN_ON_ONCE(err);
+
+ /*
+ * Rebase it to zero (it started counting at -irq_period),
+ * to see the delta since ->prev_count:
+ */
+ delta = (s64)hwc->irq_period + (s64)(s32)raw;
+
+ atomic64_counter_set(counter, hwc->prev_count + delta);
+
+ /*
+ * Adjust the ->prev_count offset - if we went beyond
+ * irq_period of units, then we got an IRQ and the counter
+ * was set back to -irq_period:
+ */
+ while (delta >= (s64)hwc->irq_period) {
+ hwc->prev_count += hwc->irq_period;
+ delta -= (s64)hwc->irq_period;
+ }
+
+ /*
+ * Calculate the next raw counter value we'll write into
+ * the counter at the next sched-in time:
+ */
+ delta -= (s64)hwc->irq_period;
+
+ hwc->next_count = (s32)delta;
+}
+
+void perf_counter_print_debug(void)
+{
+ u64 ctrl, status, overflow, pmc_ctrl, pmc_count, next_count;
+ int cpu, err, idx;
+
+ local_irq_disable();
+
+ cpu = smp_processor_id();
+
+ err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_CTRL, &ctrl);
+ WARN_ON_ONCE(err);
+
+ err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_STATUS, &status);
+ WARN_ON_ONCE(err);
+
+ err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_OVF_CTRL, &overflow);
+ WARN_ON_ONCE(err);
+
+ printk(KERN_INFO "\n");
+ printk(KERN_INFO "CPU#%d: ctrl: %016llx\n", cpu, ctrl);
+ printk(KERN_INFO "CPU#%d: status: %016llx\n", cpu, status);
+ printk(KERN_INFO "CPU#%d: overflow: %016llx\n", cpu, overflow);
+
+ for (idx = 0; idx < nr_hw_counters; idx++) {
+ err = rdmsrl_safe(MSR_ARCH_PERFMON_EVENTSEL0 + idx, &pmc_ctrl);
+ WARN_ON_ONCE(err);
+
+ err = rdmsrl_safe(MSR_ARCH_PERFMON_PERFCTR0 + idx, &pmc_count);
+ WARN_ON_ONCE(err);
+
+ next_count = per_cpu(prev_next_count[idx], cpu);
+
+ printk(KERN_INFO "CPU#%d: PMC%d ctrl: %016llx\n",
+ cpu, idx, pmc_ctrl);
+ printk(KERN_INFO "CPU#%d: PMC%d count: %016llx\n",
+ cpu, idx, pmc_count);
+ printk(KERN_INFO "CPU#%d: PMC%d next: %016llx\n",
+ cpu, idx, next_count);
+ }
+ local_irq_enable();
+}
+
+void hw_perf_counter_disable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned int idx = hwc->idx;
+
+ counter->hw.config &= ~ARCH_PERFMON_EVENTSEL0_ENABLE;
+ wrmsr(hwc->config_base + idx, hwc->config, 0);
+
+ clear_bit(idx, cpuc->used);
+ cpuc->counters[idx] = NULL;
+ __hw_perf_save_counter(counter, hwc, idx);
+}
+
+void hw_perf_counter_read(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned long addr = hwc->counter_base + hwc->idx;
+ s64 offs, val = -1LL;
+ s32 val32;
+ int err;
+
+ /* Careful: NMI might modify the counter offset */
+ do {
+ offs = hwc->prev_count;
+ err = rdmsrl_safe(addr, &val);
+ WARN_ON_ONCE(err);
+ } while (offs != hwc->prev_count);
+
+ val32 = (s32) val;
+ val = (s64)hwc->irq_period + (s64)val32;
+ atomic64_counter_set(counter, hwc->prev_count + val);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+ struct perf_data *irqdata = counter->irqdata;
+
+ if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+ irqdata->overrun++;
+ } else {
+ u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+ *p = data;
+ irqdata->len += sizeof(u64);
+ }
+}
+
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+
+ wrmsr(hwc->config_base + idx,
+ hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+ if (hwc->config & ARCH_PERFMON_EVENTSEL0_ENABLE) {
+ __hw_perf_save_counter(counter, hwc, idx);
+ __hw_perf_counter_enable(hwc, idx);
+ }
+}
+
+static void
+perf_handle_group(struct perf_counter *leader, u64 *status, u64 *overflown)
+{
+ struct perf_counter_context *ctx = leader->ctx;
+ struct perf_counter *counter;
+ int bit;
+
+ list_for_each_entry(counter, &ctx->counters, list) {
+ if (counter->record_type != PERF_RECORD_SIMPLE ||
+ counter == leader)
+ continue;
+
+ if (counter->active) {
+ /*
+ * If the counter was not in the overflow mask we have to
+ * read it from hardware. We also read it when it has not
+ * been processed yet, in which case we clear its bit in
+ * the status mask.
+ */
+ bit = counter->hw.idx;
+ if (!test_bit(bit, (unsigned long *) overflown) ||
+ test_bit(bit, (unsigned long *) status)) {
+ clear_bit(bit, (unsigned long *) status);
+ perf_save_and_restart(counter);
+ }
+ }
+ perf_store_irq_data(leader, counter->hw_event_type);
+ perf_store_irq_data(leader, atomic64_counter_read(counter));
+ }
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+ int bit, cpu = smp_processor_id();
+ struct cpu_hw_counters *cpuc;
+ u64 ack, status;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (!status) {
+ ack_APIC_irq();
+ return;
+ }
+
+ /* Disable counters globally */
+ wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+ ack_APIC_irq();
+
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+again:
+ ack = status;
+ for_each_bit(bit, (unsigned long *) &status, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ clear_bit(bit, (unsigned long *) &status);
+ if (!counter)
+ continue;
+
+ perf_save_and_restart(counter);
+
+ switch (counter->record_type) {
+ case PERF_RECORD_SIMPLE:
+ continue;
+ case PERF_RECORD_IRQ:
+ perf_store_irq_data(counter, instruction_pointer(regs));
+ break;
+ case PERF_RECORD_GROUP:
+ perf_store_irq_data(counter, counter->hw_event_type);
+ perf_store_irq_data(counter,
+ atomic64_counter_read(counter));
+ perf_handle_group(counter, &status, &ack);
+ break;
+ }
+ /*
+ * From NMI context we cannot call into the scheduler to
+ * do a task wakeup - but we mark these counters as
+ * wakeup_pending and initiate a wakeup callback:
+ */
+ if (nmi) {
+ counter->wakeup_pending = 1;
+ set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+ } else {
+ wake_up(&counter->waitq);
+ }
+ }
+
+ wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack, 0);
+
+ /*
+ * Repeat if there is more work to be done:
+ */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (status)
+ goto again;
+
+ /*
+ * Do not reenable when global enable is off:
+ */
+ if (cpuc->enable_all)
+ __hw_perf_enable_all();
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+ irq_enter();
+#ifdef CONFIG_X86_64
+ add_pda(apic_perf_irqs, 1);
+#else
+ per_cpu(irq_stat, smp_processor_id()).apic_perf_irqs++;
+#endif
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ __smp_perf_counter_interrupt(regs, 0);
+
+ irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+ struct cpu_hw_counters *cpuc;
+ unsigned long flags;
+ int bit, cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ for_each_bit(bit, cpuc->used, nr_hw_counters) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ if (!counter)
+ continue;
+
+ if (counter->wakeup_pending) {
+ counter->wakeup_pending = 0;
+ wake_up(&counter->waitq);
+ }
+ }
+
+ local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+ u32 apic_val;
+
+ if (!perf_counters_initialized)
+ return;
+ /*
+ * Enable the performance counter vector in the APIC LVT:
+ */
+ apic_val = apic_read(APIC_LVTERR);
+
+ apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+ if (nmi)
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ else
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+ unsigned long cmd, void *__args)
+{
+ struct die_args *args = __args;
+ struct pt_regs *regs;
+
+ if (likely(cmd != DIE_NMI_IPI))
+ return NOTIFY_DONE;
+
+ regs = args->regs;
+
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ __smp_perf_counter_interrupt(regs, 1);
+
+ return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+ .notifier_call = perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+ union cpuid10_eax eax;
+ unsigned int unused;
+ unsigned int ebx;
+
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+ return;
+
+ /*
+ * Check whether the Architectural PerfMon supports
+ * Branch Misses Retired Event or not.
+ */
+ cpuid(10, &(eax.full), &ebx, &unused, &unused);
+ if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+ return;
+
+ printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+ printk(KERN_INFO "... version: %d\n", eax.split.version_id);
+ printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+ nr_hw_counters = eax.split.num_counters;
+ if (nr_hw_counters > MAX_HW_COUNTERS) {
+ nr_hw_counters = MAX_HW_COUNTERS;
+ WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+ nr_hw_counters, MAX_HW_COUNTERS);
+ }
+ perf_counter_mask = (1 << nr_hw_counters) - 1;
+ perf_max_counters = nr_hw_counters;
+
+ printk(KERN_INFO "... bit_width: %d\n", eax.split.bit_width);
+ printk(KERN_INFO "... mask_length: %d\n", eax.split.mask_length);
+
+ perf_counters_lapic_init(0);
+ register_die_notifier(&perf_counter_nmi_notifier);
+
+ perf_counters_initialized = true;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b86f332..ad70f59 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -869,6 +869,12 @@ END(error_interrupt)
ENTRY(spurious_interrupt)
apicinterrupt SPURIOUS_APIC_VECTOR,smp_spurious_interrupt
END(spurious_interrupt)
+
+#ifdef CONFIG_PERF_COUNTERS
+ENTRY(perf_counter_interrupt)
+ apicinterrupt LOCAL_PERF_VECTOR,smp_perf_counter_interrupt
+END(perf_counter_interrupt)
+#endif

/*
* Exception entry points.
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d1d4dc5..d92bc71 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct seq_file *p)
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_printf(p, " Local timer interrupts\n");
+ seq_printf(p, "CNT: ");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+ seq_printf(p, " Performance counter interrupts\n");
#endif
#ifdef CONFIG_SMP
seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)

#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
+ sum += irq_stats(cpu)->apic_perf_irqs;
#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 845aa98..de2bb7c 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index ff02353..eb04dd9 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -204,6 +204,11 @@ static void __init apic_intr_init(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+ /* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
}

void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index d6dd057..6d39c27 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -6,7 +6,9 @@
*/
#include <linux/list.h>

+#include <linux/perf_counter.h>
#include <linux/personality.h>
+#include <linux/tracehook.h>
#include <linux/binfmts.h>
#include <linux/suspend.h>
#include <linux/kernel.h>
@@ -17,7 +19,6 @@
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/wait.h>
-#include <linux/tracehook.h>
#include <linux/elf.h>
#include <linux/smp.h>
#include <linux/mm.h>
@@ -694,6 +695,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
tracehook_notify_resume(regs);
}

+ if (thread_info_flags & _TIF_PERF_COUNTERS) {
+ clear_thread_flag(TIF_PERF_COUNTERS);
+ perf_counter_notify(regs);
+ }
+
#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
#endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index a5c9627..066a13f 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -7,6 +7,7 @@
* 2000-2002 x86-64 support by Andi Kleen
*/

+#include <linux/perf_counter.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/smp.h>
@@ -493,6 +494,10 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
}
+ if (thread_info_flags & _TIF_PERF_COUNTERS) {
+ clear_thread_flag(TIF_PERF_COUNTERS);
+ perf_counter_notify(regs);
+ }

#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..496726d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_perf_counter_open
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index ce0d9da..52146c2 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
#include <linux/kbd_kern.h>
#include <linux/proc_fs.h>
#include <linux/quotaops.h>
+#include <linux/perf_counter.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
+ perf_counter_print_debug();
}
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..22c4469
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,171 @@
+/*
+ * Performance counters:
+ *
+ * Copyright(C) 2008, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ * Data type definitions, declarations, prototypes.
+ *
+ * Started by: Thomas Gleixner and Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * Generalized hardware event types, used by the hw_event_type parameter
+ * of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ PERF_COUNT_CYCLES,
+ PERF_COUNT_INSTRUCTIONS,
+ PERF_COUNT_CACHE_REFERENCES,
+ PERF_COUNT_CACHE_MISSES,
+ PERF_COUNT_BRANCH_INSTRUCTIONS,
+ PERF_COUNT_BRANCH_MISSES,
+ /*
+ * If this bit is set in the type, then trigger NMI sampling:
+ */
+ PERF_COUNT_NMI = (1 << 30),
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_record_type {
+ PERF_RECORD_SIMPLE,
+ PERF_RECORD_IRQ,
+ PERF_RECORD_GROUP,
+};
+
+/**
+ * struct hw_perf_counter - performance counter hardware details
+ */
+struct hw_perf_counter {
+ u64 config;
+ unsigned long config_base;
+ unsigned long counter_base;
+ int nmi;
+ unsigned int idx;
+ u64 prev_count;
+ s32 next_count;
+ u64 irq_period;
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN 2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+ int len;
+ int rd_idx;
+ int overrun;
+ u8 data[PERF_DATA_BUFLEN];
+};
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+ struct list_head list;
+ int active;
+#if BITS_PER_LONG == 64
+ atomic64_t count;
+#else
+ atomic_t count32[2];
+#endif
+ u64 __irq_period;
+
+ struct hw_perf_counter hw;
+
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * Protect attach/detach:
+ */
+ struct mutex mutex;
+
+ int oncpu;
+ int cpu;
+
+ s32 hw_event_type;
+ enum perf_record_type record_type;
+
+ /* read() / irq related data */
+ wait_queue_head_t waitq;
+ /* optional: for NMIs */
+ int wakeup_pending;
+ struct perf_data *irqdata;
+ struct perf_data *usrdata;
+ struct perf_data data[2];
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+ /*
+ * Protect the list of counters:
+ */
+ spinlock_t lock;
+ struct list_head counters;
+ int nr_counters;
+ int nr_active;
+ struct task_struct *task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+ struct perf_counter_context ctx;
+ struct perf_counter_context *task_ctx;
+ int active_oncpu;
+ int max_pertask;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *task);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu) { }
+static inline void perf_counter_init_task(struct task_struct *task) { }
+static inline void perf_counter_notify(struct pt_regs *regs) { }
+static inline void perf_counter_print_debug(void) { }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..4c53027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/perf_counter.h>
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
@@ -1326,6 +1327,7 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
+ struct perf_counter_context perf_counter_ctx;
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
@@ -2285,6 +2287,13 @@ static inline void inc_syscw(struct task_struct *tsk)
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);
+
+
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..6cce728 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -624,4 +624,10 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+ u32 hw_event_period,
+ u32 record_type,
+ pid_t pid,
+ int cpu);
#endif
diff --git a/init/Kconfig b/init/Kconfig
index f763762..78bede2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -732,6 +732,35 @@ config AIO
by some high performance threaded applications. Disabling
this option saves about 7k.

+config HAVE_PERF_COUNTERS
+ bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+ bool "Kernel Performance Counters"
+ depends on HAVE_PERF_COUNTERS
+ default y
+ help
+ Enable kernel support for performance counter hardware.
+
+ Performance counters are special hardware registers available
+ on most modern CPUs. These registers count the number of certain
+ types of hw events: such as instructions executed, cachemisses
+ suffered, or branches mis-predicted - without slowing down the
+ kernel or applications. These registers can also trigger interrupts
+ when a threshold number of events have passed - and can thus be
+ used to profile the code that runs on that CPU.
+
+ The Linux Performance Counter subsystem provides an abstraction of
+ these hardware capabilities, available via a system call. It
+ provides per task and per CPU counters, and it provides event
+ capabilities on top of those.
+
+ Say Y if unsure.
+
+endmenu
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 19fad00..1f184a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..441fadf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,6 +975,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

rt_mutex_init_task(p);
+ perf_counter_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..f84b400
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,945 @@
+/*
+ * Performance counter core code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+
+int __weak hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type)
+{
+ return -EINVAL;
+}
+
+void __weak hw_perf_counter_enable(struct perf_counter *counter) { }
+void __weak hw_perf_counter_disable(struct perf_counter *counter) { }
+void __weak hw_perf_counter_read(struct perf_counter *counter) { }
+void __weak hw_perf_disable_all(void) { }
+void __weak hw_perf_enable_all(void) { }
+void __weak hw_perf_counter_setup(void) { }
+
+#if BITS_PER_LONG == 64
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 64 bit version - no complications.
+ */
+static inline u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+ return (u64) atomic64_read(&counter->count);
+}
+
+#else
+
+/*
+ * Read the cached counter in counter safe against cross CPU / NMI
+ * modifications. 32 bit version.
+ */
+static u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+ u32 cntl, cnth;
+
+ local_irq_disable();
+ do {
+ cnth = atomic_read(&counter->count32[1]);
+ cntl = atomic_read(&counter->count32[0]);
+ } while (cnth != atomic_read(&counter->count32[1]));
+
+ local_irq_enable();
+
+ return cntl | ((u64) cnth) << 32;
+}
+
+#endif
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_remove_from_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock(&ctx->lock);
+
+ if (counter->active) {
+ hw_perf_counter_disable(counter);
+ counter->active = 0;
+ ctx->nr_active--;
+ cpuctx->active_oncpu--;
+ counter->task = NULL;
+ }
+ ctx->nr_counters--;
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ hw_perf_disable_all();
+ list_del_init(&counter->list);
+ hw_perf_enable_all();
+
+ if (!ctx->task) {
+ /*
+ * Allow more per task counters with respect to the
+ * reservation:
+ */
+ cpuctx->max_pertask =
+ min(perf_max_counters - ctx->nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ }
+
+ spin_unlock(&ctx->lock);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_remove_from_context(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Per cpu counters are removed via an smp call and
+ * the removal is always successful.
+ */
+ smp_call_function_single(counter->cpu,
+ __perf_remove_from_context,
+ counter, 1);
+ return;
+ }
+
+retry:
+ task_oncpu_function_call(task, __perf_remove_from_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && !list_empty(&counter->list)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so
+ * we can remove the counter safely if the call above did not
+ * succeed.
+ */
+ if (!list_empty(&counter->list)) {
+ ctx->nr_counters--;
+ list_del_init(&counter->list);
+ counter->task = NULL;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ int cpu = smp_processor_id();
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ hw_perf_disable_all();
+ list_add_tail(&counter->list, &ctx->counters);
+ hw_perf_enable_all();
+
+ ctx->nr_counters++;
+
+ if (cpuctx->active_oncpu < perf_max_counters) {
+ hw_perf_counter_enable(counter);
+ counter->active = 1;
+ counter->oncpu = cpu;
+ ctx->nr_active++;
+ cpuctx->active_oncpu++;
+ }
+
+ if (!ctx->task && cpuctx->max_pertask)
+ cpuctx->max_pertask--;
+
+ spin_unlock(&ctx->lock);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+ struct perf_counter *counter,
+ int cpu)
+{
+ struct task_struct *task = ctx->task;
+
+ counter->ctx = ctx;
+ if (!task) {
+ /*
+ * Per cpu counters are installed via an smp call and
+ * the install is always successful.
+ */
+ smp_call_function_single(cpu, __perf_install_in_context,
+ counter, 1);
+ return;
+ }
+
+ counter->task = task;
+retry:
+ task_oncpu_function_call(task, __perf_install_in_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active and the counter has not been added
+ * we need to retry the smp call.
+ */
+ if (ctx->nr_active && list_empty(&counter->list)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so
+ * we can add the counter safely if the call above did not
+ * succeed.
+ */
+ if (list_empty(&counter->list)) {
+ list_add_tail(&counter->list, &ctx->counters);
+ ctx->nr_counters++;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!cpuctx->task_ctx))
+ return;
+
+ spin_lock(&ctx->lock);
+ list_for_each_entry(counter, &ctx->counters, list) {
+ if (!ctx->nr_active)
+ break;
+ if (counter->active) {
+ hw_perf_counter_disable(counter);
+ counter->active = 0;
+ counter->oncpu = -1;
+ ctx->nr_active--;
+ cpuctx->active_oncpu--;
+ }
+ }
+ spin_unlock(&ctx->lock);
+ cpuctx->task_ctx = NULL;
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If an NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ spin_lock(&ctx->lock);
+ list_for_each_entry(counter, &ctx->counters, list) {
+ if (ctx->nr_active == cpuctx->max_pertask)
+ break;
+ if (counter->cpu != -1 && counter->cpu != cpu)
+ continue;
+
+ hw_perf_counter_enable(counter);
+ counter->active = 1;
+ counter->oncpu = cpu;
+ ctx->nr_active++;
+ cpuctx->active_oncpu++;
+ }
+ spin_unlock(&ctx->lock);
+ cpuctx->task_ctx = ctx;
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+
+ if (likely(!ctx->nr_counters))
+ return;
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Rotate the first entry last:
+ */
+ hw_perf_disable_all();
+ list_for_each_entry(counter, &ctx->counters, list) {
+ list_del(&counter->list);
+ list_add_tail(&counter->list, &ctx->counters);
+ break;
+ }
+ hw_perf_enable_all();
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *task)
+{
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+
+ spin_lock_init(&ctx->lock);
+ INIT_LIST_HEAD(&ctx->counters);
+ ctx->nr_counters = 0;
+ ctx->task = task;
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+ hw_perf_counter_read(info);
+}
+
+static u64 perf_read_counter(struct perf_counter *counter)
+{
+ /*
+ * If counter is enabled and currently active on a CPU, update the
+ * value in the counter structure:
+ */
+ if (counter->active) {
+ smp_call_function_single(counter->oncpu,
+ __hw_perf_counter_read, counter, 1);
+ }
+
+ return perf_read_counter_safe(counter);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+
+ /*
+ * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not, it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task) {
+ if (cpuctx->task_ctx != ctx)
+ return;
+ spin_lock(&ctx->lock);
+ }
+
+	/* Change the pointer in an NMI-safe way */
+ atomic_long_set((atomic_long_t *)&counter->irqdata,
+ (unsigned long) counter->usrdata);
+ counter->usrdata = oldirqdata;
+
+ if (ctx->task)
+ spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ smp_call_function_single(counter->cpu,
+ __perf_switch_irq_data,
+ counter, 1);
+ return counter->usrdata;
+ }
+
+retry:
+ spin_lock_irq(&ctx->lock);
+ if (!counter->active) {
+ counter->irqdata = counter->usrdata;
+ counter->usrdata = oldirqdata;
+ spin_unlock_irq(&ctx->lock);
+ return oldirqdata;
+ }
+ spin_unlock_irq(&ctx->lock);
+ task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+ /* Might have failed, because task was scheduled out */
+ if (counter->irqdata == oldirqdata)
+ goto retry;
+
+ return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+ if (ctx->task) {
+ put_task_struct(ctx->task);
+ ctx->task = NULL;
+ }
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * If cpu is not a wildcard then this is a percpu counter:
+ */
+ if (cpu != -1) {
+ /* Must be root to operate on a CPU counter: */
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EACCES);
+
+ if (cpu < 0 || cpu > num_possible_cpus())
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * We could be clever and allow to attach a counter to an
+ * offline CPU and activate it when the CPU comes up, but
+ * that's for later.
+ */
+ if (!cpu_isset(cpu, cpu_online_map))
+ return ERR_PTR(-ENODEV);
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ ctx = &cpuctx->ctx;
+
+ WARN_ON_ONCE(ctx->task);
+ return ctx;
+ }
+
+ rcu_read_lock();
+ if (!pid)
+ task = current;
+ else
+ task = find_task_by_vpid(pid);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ if (!task)
+ return ERR_PTR(-ESRCH);
+
+ ctx = &task->perf_counter_ctx;
+ ctx->task = task;
+
+ /* Reuse ptrace permission checks for now. */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ put_context(ctx);
+ return ERR_PTR(-EACCES);
+ }
+
+ return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+ struct perf_counter *counter = file->private_data;
+ struct perf_counter_context *ctx = counter->ctx;
+
+ file->private_data = NULL;
+
+ mutex_lock(&counter->mutex);
+
+ perf_remove_from_context(counter);
+ put_context(ctx);
+
+ mutex_unlock(&counter->mutex);
+
+ kfree(counter);
+
+ return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+ u64 cntval;
+
+ if (count != sizeof(cntval))
+ return -EINVAL;
+
+ mutex_lock(&counter->mutex);
+ cntval = perf_read_counter(counter);
+ mutex_unlock(&counter->mutex);
+
+ return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+ if (!usrdata->len)
+ return 0;
+
+ count = min(count, (size_t)usrdata->len);
+ if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+ return -EFAULT;
+
+ /* Adjust the counters */
+ usrdata->len -= count;
+ if (!usrdata->len)
+ usrdata->rd_idx = 0;
+ else
+ usrdata->rd_idx += count;
+
+ return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter *counter,
+ char __user *buf,
+ size_t count,
+ int nonblocking)
+{
+ struct perf_data *irqdata, *usrdata;
+ DECLARE_WAITQUEUE(wait, current);
+ ssize_t res;
+
+ irqdata = counter->irqdata;
+ usrdata = counter->usrdata;
+
+ if (usrdata->len + irqdata->len >= count)
+ goto read_pending;
+
+ if (nonblocking)
+ return -EAGAIN;
+
+ spin_lock_irq(&counter->waitq.lock);
+ __add_wait_queue(&counter->waitq, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (usrdata->len + irqdata->len >= count)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ spin_unlock_irq(&counter->waitq.lock);
+ schedule();
+ spin_lock_irq(&counter->waitq.lock);
+ }
+ __remove_wait_queue(&counter->waitq, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&counter->waitq.lock);
+
+ if (usrdata->len + irqdata->len < count)
+ return -ERESTARTSYS;
+read_pending:
+ mutex_lock(&counter->mutex);
+
+ /* Drain pending data first: */
+ res = perf_copy_usrdata(usrdata, buf, count);
+ if (res < 0 || res == count)
+ goto out;
+
+ /* Switch irq buffer: */
+ usrdata = perf_switch_irq_data(counter);
+ if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+ if (!res)
+ res = -EFAULT;
+ } else {
+ res = count;
+ }
+out:
+ mutex_unlock(&counter->mutex);
+
+ return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ struct perf_counter *counter = file->private_data;
+
+ switch (counter->record_type) {
+ case PERF_RECORD_SIMPLE:
+ return perf_read_hw(counter, buf, count);
+
+ case PERF_RECORD_IRQ:
+ case PERF_RECORD_GROUP:
+ return perf_read_irq_data(counter, buf, count,
+ file->f_flags & O_NONBLOCK);
+ }
+ return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+ struct perf_counter *counter = file->private_data;
+ unsigned int events = 0;
+ unsigned long flags;
+
+ poll_wait(file, &counter->waitq, wait);
+
+ spin_lock_irqsave(&counter->waitq.lock, flags);
+ if (counter->usrdata->len || counter->irqdata->len)
+ events |= POLLIN;
+ spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+ return events;
+}
+
+static const struct file_operations perf_fops = {
+ .release = perf_release,
+ .read = perf_read,
+ .poll = perf_poll,
+};
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(u32 hw_event_period, int cpu, u32 record_type)
+{
+ struct perf_counter *counter = kzalloc(sizeof(*counter), GFP_KERNEL);
+
+ if (!counter)
+ return NULL;
+
+ mutex_init(&counter->mutex);
+ INIT_LIST_HEAD(&counter->list);
+ init_waitqueue_head(&counter->waitq);
+
+ counter->irqdata = &counter->data[0];
+ counter->usrdata = &counter->data[1];
+ counter->cpu = cpu;
+ counter->record_type = record_type;
+ counter->__irq_period = hw_event_period;
+ counter->wakeup_pending = 0;
+
+ return counter;
+}
+
+/**
+ * sys_perf_counter_open - open a performance counter and associate it to a task
+ * @hw_event_type: event type for monitoring/sampling...
+ * @pid: target pid
+ */
+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+ u32 hw_event_period,
+ u32 record_type,
+ pid_t pid,
+ int cpu)
+{
+ struct perf_counter_context *ctx;
+ struct perf_counter *counter;
+ int ret;
+
+ ctx = find_get_context(pid, cpu);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = -ENOMEM;
+ counter = perf_counter_alloc(hw_event_period, cpu, record_type);
+ if (!counter)
+ goto err_put_context;
+
+ ret = hw_perf_counter_init(counter, hw_event_type);
+ if (ret)
+ goto err_free_put_context;
+
+ perf_install_in_context(ctx, counter, cpu);
+
+ ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+ if (ret < 0)
+ goto err_remove_free_put_context;
+
+ return ret;
+
+err_remove_free_put_context:
+ mutex_lock(&counter->mutex);
+ perf_remove_from_context(counter);
+ mutex_unlock(&counter->mutex);
+
+err_free_put_context:
+ kfree(counter);
+
+err_put_context:
+ put_context(ctx);
+
+ return ret;
+}
+
+static void __cpuinit perf_init_cpu(int cpu)
+{
+ struct perf_cpu_context *ctx;
+
+ ctx = &per_cpu(perf_cpu_context, cpu);
+ spin_lock_init(&ctx->ctx.lock);
+ INIT_LIST_HEAD(&ctx->ctx.counters);
+
+ mutex_lock(&perf_resource_mutex);
+ ctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+ mutex_unlock(&perf_resource_mutex);
+ hw_perf_counter_setup();
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_exit_cpu(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+ struct perf_counter *counter, *tmp;
+
+ list_for_each_entry_safe(counter, tmp, &ctx->counters, list)
+ __perf_remove_from_context(counter);
+}
+
+static void perf_exit_cpu(int cpu)
+{
+ smp_call_function_single(cpu, __perf_exit_cpu, NULL, 1);
+}
+#else
+static inline void perf_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ switch (action) {
+
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ perf_init_cpu(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ perf_exit_cpu(cpu);
+ break;
+
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+ .notifier_call = perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+ perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_cpu_notifier(&perf_cpu_nb);
+
+ return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+ const char *buf,
+ size_t count)
+{
+ struct perf_cpu_context *cpuctx;
+ unsigned long val;
+ int err, cpu, mpt;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > perf_max_counters)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_reserved_percpu = val;
+ for_each_online_cpu(cpu) {
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ spin_lock_irq(&cpuctx->ctx.lock);
+ mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ cpuctx->max_pertask = mpt;
+ spin_unlock_irq(&cpuctx->ctx.lock);
+ }
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+ unsigned long val;
+ int err;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > 1)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_overcommit = val;
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+ reserve_percpu,
+ 0644,
+ perf_show_reserve_percpu,
+ perf_set_reserve_percpu
+ );
+
+static SYSDEV_CLASS_ATTR(
+ overcommit,
+ 0644,
+ perf_show_overcommit,
+ perf_set_overcommit
+ );
+
+static struct attribute *perfclass_attrs[] = {
+ &attr_reserve_percpu.attr,
+ &attr_overcommit.attr,
+ NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+ .attrs = perfclass_attrs,
+ .name = "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+ return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+ &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
diff --git a/kernel/sched.c b/kernel/sched.c
index b7480fb..254d56d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2212,6 +2212,27 @@ static int sched_balance_self(int cpu, int flag)

#endif /* CONFIG_SMP */

+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p: the task to evaluate
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, in which case the function is called directly.
+ */
+void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info)
+{
+ int cpu;
+
+ preempt_disable();
+ cpu = task_cpu(p);
+ if (task_curr(p))
+ smp_call_function_single(cpu, func, info, 1);
+ preempt_enable();
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2534,6 +2555,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
fire_sched_out_preempt_notifiers(prev, next);
+ perf_counter_task_sched_out(prev, cpu_of(rq));
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
}
@@ -2574,6 +2596,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
finish_arch_switch(prev);
+ perf_counter_task_sched_in(current, cpu_of(rq));
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
if (current->sched_class->post_schedule)
@@ -4296,6 +4319,7 @@ void scheduler_tick(void)
rq->idle_at_tick = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
+ perf_counter_task_tick(curr, cpu);
}

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..4be8bbc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
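
For illustration, a minimal user-space sketch (not part of the patch) that
drives the new syscall in its plain read() mode. The syscall number below
is a placeholder assumption - use whatever __NR_perf_counter_open resolves
to on the test kernel - and hw_event_type 0 is assumed to select the
cycles counter:

	#include <stdio.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define __NR_perf_counter_open	333	/* placeholder, arch specific */

	int main(void)
	{
		uint64_t count;
		int fd;

		/* cycles (assumed type 0), no irq period, simple record type,
		 * pid 0 == current task, cpu -1 == any cpu: */
		fd = syscall(__NR_perf_counter_open, 0, 0, 0, 0, -1);
		if (fd < 0)
			return 1;

		/* ... run the workload to be measured ... */

		if (read(fd, &count, sizeof(count)) == sizeof(count))
			printf("events: %llu\n", (unsigned long long)count);

		close(fd);
		return 0;
	}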


2008-12-08 01:48:26

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Mon, 8 Dec 2008 02:22:12 +0100
Ingo Molnar <[email protected]> wrote:

>
> [ Performance counters are special hardware registers available on
> most modern CPUs. These register count the number of certain types of
> hw events: such as instructions executed, cachemisses suffered, or
> branches mis-predicted, without slowing down the kernel or
> applications. These registers can also trigger interrupts when a
> threshold number of events have passed - and can thus be used to
> profile the code that runs on that CPU. ]
>
> This is version 2 of our Performance Counters subsystem
> implementation.
>
> The biggest user-visible change in this release is a new user-space
> text-mode profiling utility that is based on this code: KernelTop.
>
> KernelTop can be downloaded from:
>
> http://redhat.com/~mingo/perfcounters/kerneltop.c
>
> It's a standalone .c file that needs no extra libraries - it only
> needs a CONFIG_PERF_COUNTERS=y kernel to run on.
>
> This utility is intended for kernel developers - it's basically a
> dynamic kernel profiler that gets hardware counter events dispatched
> to it continuously, which it feeds into a histogram and outputs it
> periodically.
>

I played with this a little, and while it works neat, I wanted a
feature added where it shows a detailed profile for the top function.
I've hacked this in quickly (the usability isn't all that great yet)
and put the source code up at
http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz

with this it looks like this:

$ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux

------------------------------------------------------------------------------
KernelTop: 274 irqs/sec [NMI, 1000000 cycles], (all, 2 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

230 - 00000000c04189e9 : read_hpet
82 - 00000000c0409439 : mwait_idle_with_hints
77 - 00000000c051a7b7 : acpi_os_read_port
52 - 00000000c053cb3a : acpi_idle_enter_bm
38 - 00000000c0418d93 : hpet_next_event
19 - 00000000c051a802 : acpi_os_write_port
14 - 00000000c04f8704 : __copy_to_user_ll
13 - 00000000c0460c20 : get_page_from_freelist
7 - 00000000c041c96c : kunmap_atomic
5 - 00000000c06a30d2 : _spin_lock [joydev]
4 - 00000000c04f79b7 : vsnprintf [snd_seq]
4 - 00000000c06a3048 : _spin_lock_irqsave [pcspkr]
3 - 00000000c0403b3c : irq_entries_start
3 - 00000000c0423fee : run_rebalance_domains
3 - 00000000c0425e2c : scheduler_tick
3 - 00000000c0430938 : get_next_timer_interrupt
3 - 00000000c043cdfa : __update_sched_clock
3 - 00000000c0448b14 : update_iter
2 - 00000000c04304bd : run_timer_softirq

Showing details for read_hpet
0 c04189e9 <read_hpet>:
2 c04189e9: a1 b0 e0 89 c0 mov 0xc089e0b0,%eax
0
0 /*
0 * Clock source related code
0 */
0 static cycle_t read_hpet(void)
0 {
1 c04189ee: 55 push %ebp
0 c04189ef: 89 e5 mov %esp,%ebp
1 c04189f1: 05 f0 00 00 00 add $0xf0,%eax
0 c04189f6: 8b 00 mov (%eax),%eax
0 return (cycle_t)hpet_readl(HPET_COUNTER);
0 }
300 c04189f8: 31 d2 xor %edx,%edx
0 c04189fa: 5d pop %ebp
0 c04189fb: c3 ret
0

As is usual with profile outputs, the cost for the function always gets added to the instruction after
the really guilty one. I'd move the count one back, but this is hard if the previous instruction was a
(conditional) jump...

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-12-08 03:25:26

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar writes:

> There's a new "counter group record" facility that is a straightforward
> extension of the existing "irq record" notification type. This record
> type can be set on a 'master' counter, and if the master counter triggers
> an IRQ or an NMI, all the 'secondary' counters are read out atomically
> and are put into the counter-group record. The result can then be read()
> out by userspace via a single system call. (Based on extensive feedback
> from Paul Mackerras and David Miller, thanks guys!)
>
> The other big change is the support of virtual task counters via counter
> scheduling: a task can specify more counters than there are on the CPU,
> the kernel will then schedule the counters periodically to spread out hw
> resources.

Still not good enough, I'm sorry.

* I have no guarantee that the secondary counters were all counting
at the same time(s) as the master counter, so the numbers are
virtually useless.

* I might legitimately want to be notified based on any of the
"secondary" counters reaching particular values. The "master"
vs. "secondary" distinction is an artificial one that is going to
make certain reasonable use-cases impossible.

These things are both symptoms of the fact that you still have the
abstraction at the wrong level. The basic abstraction really needs to
be a counter-set, not an individual counter.

I think your patch can be extended to do counter-sets without
complicating the interface too much. We could have:

struct event_spec {
u32 hw_event_type;
u32 hw_event_period;
u64 hw_raw_ctrl;
};

int perf_counterset_open(u32 n_counters,
struct event_spec *counters,
u32 record_type,
pid_t pid,
int cpu);

and then you could have perf_counter_open as a simple wrapper around
perf_counterset_open.
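
Roughly along these lines, say (just a sketch, keeping the argument order
of the existing sys_perf_counter_open):

	int perf_counter_open(u32 hw_event_type,
			      u32 hw_event_period,
			      u32 record_type,
			      pid_t pid,
			      int cpu)
	{
		struct event_spec spec = {
			.hw_event_type		= hw_event_type,
			.hw_event_period	= hw_event_period,
			.hw_raw_ctrl		= 0,
		};

		/* a counter-set with a single counter in it: */
		return perf_counterset_open(1, &spec, record_type, pid, cpu);
	}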

With an approach like this we can also provide an "exclusive" mode for
the PMU (e.g. with a flag bit in record_type or n_counters), which
means that the counter-set occupies the whole PMU. That will give a
way for userspace to specify all the details of how the PMU is to be
programmed, which in turn means that the kernel doesn't need to know
all the arcane details of every event on every processor; it just
needs to know the common events.

I notice the implementation also still assumes it can add any counter
at any time subject only to a limit on the number of counters in use.
That will have to be fixed before it is usable on powerpc (and
apparently on some x86 processors too).

Paul.

2008-12-08 08:32:57

by Corey J Ashford

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

[email protected] wrote on 12/07/2008 05:22:12 PM:
[snip]
>
> The other big change is the support of virtual task counters via counter
> scheduling: a task can specify more counters than there are on the CPU,
> the kernel will then schedule the counters periodically to spread out hw
> resources. So for example if a task starts 6 counters on a CPU that has
> only two hardware counters, it still gets this output:
>
> counter[0 cycles ]: 5204680573 , delta: 1733680843 events
> counter[1 instructions ]: 1364468045 , delta: 454818351 events
> counter[2 cache-refs ]: 12732 , delta: 4399 events
> counter[3 cache-misses ]: 1009 , delta: 336 events
> counter[4 branch-instructions ]: 125993304 , delta: 42006998 events
> counter[5 branch-misses ]: 1946 , delta: 649 events
>

Hello Ingo,

I posted some questions about this capability in your proposal on LKML,
but I wasn't able to get the reply threaded in properly. Could you take a
look at this post, please?

http://lkml.org/lkml/2008/12/5/299

Thanks for your consideration,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2008-12-08 11:34:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > There's a new "counter group record" facility that is a straightforward
> > extension of the existing "irq record" notification type. This record
> > type can be set on a 'master' counter, and if the master counter triggers
> > an IRQ or an NMI, all the 'secondary' counters are read out atomically
> > and are put into the counter-group record. The result can then be read()
> > out by userspace via a single system call. (Based on extensive feedback
> > from Paul Mackerras and David Miller, thanks guys!)
> >
> > The other big change is the support of virtual task counters via counter
> > scheduling: a task can specify more counters than there are on the CPU,
> > the kernel will then schedule the counters periodically to spread out hw
> > resources.
>
> Still not good enough, I'm sorry.
>
> * I have no guarantee that the secondary counters were all counting
> at the same time(s) as the master counter, so the numbers are
> virtually useless.

If you want a _guarantee_ that multiple counters can count at once you
can still do it: for example by using the separate, orthogonal
reservation mechanism we had in -v1 already.

Also, you dont _have to_ overcommit counters.

Your whole statistical argument that group readout is a must-have for
precision is fundamentally flawed as well: counters _themselves_, as used
by most applications, by their nature, are a statistical sample to begin
with. There's way too many hardware events to track each of them
unintrusively - so this type of instrumentation is _all_ sampling based,
and fundamentally so. (with a few narrow exceptions such as single-event
interrupts for certain rare event types)

This means that the only correct technical/mathematical argument is to
talk about "levels of noise" and how they compare and correlate - and
i've seen no actual measurements or estimations pro or contra. Group
readout of counters can reduce noise for sure, but it is wrong for you to
try to turn this into some sort of all-or-nothing property. Other sources
of noise tend to be of much higher magnitude.

You need really stable workloads to see such low noise levels that group
readout of counters starts to matter - and the thing is that often such
'stable' workloads are rather boringly artificial, because in real life
there's no such thing as a stable workload.

Finally, the basic API to user-space is not the way to impose rigid "I
own the whole PMU" notion that you are pushing. That notion can be
achieved in different, system administration means - and a perf-counter
reservation facility was included in the v1 patchset.

Note that you are doing something that is a kernel design no-no: you are
trying to design a "guarantee" for hardware constraints by complicating
it into the userspace ABI - and that is a fundamentally losing
proposition.

It's a tail-wags-the-dog design situation that we are routinely resisting
in the upstream kernel: you are putting hardware constraints ahead of
usability, you are putting hardware constraints ahead of sane interface
design - and such an approach is wrong and shortsighted on every level.

It's also shortsighted because it's a red herring: there's nothing that
forbids the counter scheduler from listening to the hw constraints, for
CPUs where there's a lot of counter constraints.

> * I might legitimately want to be notified based on any of the
> "secondary" counters reaching particular values. The "master" vs.
> "secondary" distinction is an artificial one that is going to make
> certain reasonable use-cases impossible.

the secondary counters can cause records too - independently of the
master counter. This is because the objects (and fds) are separate so
there's no restriction at all on the secondary counters. This is a lot
less natural to do if you have a "vector of counters" abstraction.

> These things are both symptoms of the fact that you still have the
> abstraction at the wrong level. The basic abstraction really needs to
> be a counter-set, not an individual counter.

Being per object is a very fundamental property of Linux, and you have to
understand and respect that down to your bone if you want to design new
syscall ABIs for Linux.

The "perfmon v3 light" system calls, all five of them, are a classic
laundry list of what _not_ to do in new Linux APIs: they are too
specific, too complex and way too limited on every level.

Per object and per fd abstractions are a _very strong_ conceptual
property of Linux. Look at what they bring in the performance counters
case:

- All the VFS syscalls work naturally: sys_read(), sys_close(),
sys_dup(), you name it.

- It makes all counters poll()able. Any subset of them, and at any time,
independently of any context descriptor. Look at kerneltop.c: it has a
USE_POLLING switch to switch to a poll() loop, and it just works the
way you'd expect it to work.

- We can share fds between monitor threads and you can do a thread pool
that works down new events - without forcing any counter scheduling in
the monitored task.

- It makes the same task monitorable by multiple monitors, trivially
so. There's no forced context notion that privatizes the PMU - with
some 'virtual context' extra dimension slapped on top of it.

- Doing a proper per object abstraction simplifies event and error
handling significantly: instead of having to work down a vector of
counters and demultiplexing events and matching them up to individual
counters, the demultiplexing is done by the _kernel_.

- It makes counter scheduling very dynamic. Instead of exposing
user-space to a static "counter allocation" (with all the insane ABI
and kernel internal complications this brings), perf-counters
subsystem does not expose user-space to such scheduling details
_at all_.

- Difference in complexity. The "v3 light" version of perfmon (which
does not even schedule any PMU contexts), contains these core kernel
files:

19 files changed, 4424 insertions(+)

Our code has this core kernel impact:

10 files changed, 1191 insertions(+)

And in some areas it's already more capable than "perfmon v3".
The difference is very obvious.

All in one, using the 1:1 fd:counter design is a powerful, modern Linux
abstraction to its core. It's much easier to think about for application
developers as well, so we'll see a much sharper adoption rate.
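
To illustrate the poll() point: a monitor thread can simply gather a
handful of counter fds and block on all of them - a sketch only, assuming
the counters were opened with an IRQ-style record type and that each
sample reads back as a single u64 RIP (kerneltop's USE_POLLING mode does
something similar):

	#include <poll.h>
	#include <stdio.h>
	#include <stdint.h>
	#include <unistd.h>

	/* fds[] come from sys_perf_counter_open(): */
	static void monitor(int *fds, int nr_fds)
	{
		struct pollfd pfds[nr_fds];
		uint64_t ip;
		int i;

		for (i = 0; i < nr_fds; i++) {
			pfds[i].fd = fds[i];
			pfds[i].events = POLLIN;
		}

		for (;;) {
			poll(pfds, nr_fds, -1);

			for (i = 0; i < nr_fds; i++) {
				if (!(pfds[i].revents & POLLIN))
					continue;
				/* assumed sample format: one u64 RIP per read */
				if (read(pfds[i].fd, &ip, sizeof(ip)) == sizeof(ip))
					printf("counter %d: RIP %016llx\n",
					       i, (unsigned long long)ip);
			}
		}
	}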

Also, i noticed that your claims about our code tend to be rather
abstract and are often dwelling on issues that IMO have no big practical
relevance - so may i suggest the following approach instead to break the
(mutual!) cycle of miscommunication: if you think an issue is important,
could you please point out the problem in practical terms what you think
would not be possible with our scheme? We tend to prioritize items by
practical value.

Things like: "kerneltop would not be as accurate with: ..., to the level
of adding 5% of extra noise.". Would that work for you?

> I think your patch can be extended to do counter-sets without
> complicating the interface too much. We could have:
>
> struct event_spec {
> u32 hw_event_type;
> u32 hw_event_period;
> u64 hw_raw_ctrl;
> };

This needless vectoring and the exposing of contexts would kill many good
properties of the new subsystem, without any tangible benefits - see
above.

This is really scheduling school 101: a hardware context allocation is
the _last_ thing we want to expose to user-space in this particular case.
This is a fundamental property of hardware resource scheduling. We _dont_
want to tie the hands of the kernel by putting resource scheduling into
user-space!

Your arguments remind me a bit of the "user-space threads have to be
scheduled in user-space!" N:M threading design discussions we had years
ago. IBM folks were pushing NGPT very strongly back then and claimed that
it's the right design for high-performance threading, etc. etc.

In reality, doing user-space scheduling for cheap-to-context-switch
hardware resources was a fundamentally wrong proposition back then too,
and it is still the wrong concept today as well.

> int perf_counterset_open(u32 n_counters,
> struct event_spec *counters,
> u32 record_type,
> pid_t pid,
> int cpu);
>
> and then you could have perf_counter_open as a simple wrapper around
> perf_counterset_open.
>
> With an approach like this we can also provide an "exclusive" mode for
> the PMU [...]

You can already allocate "exclusive" counters in a guaranteed way via our
code, here and today.

> [...] (e.g. with a flag bit in record_type or n_counters), which means
> that the counter-set occupies the whole PMU. That will give a way for
> userspace to specify all the details of how the PMU is to be
> programmed, which in turn means that the kernel doesn't need to know
> all the arcane details of every event on every processor; it just needs
> to know the common events.
>
> I notice the implementation also still assumes it can add any counter
> at any time subject only to a limit on the number of counters in use.
> That will have to be fixed before it is usable on powerpc (and
> apparently on some x86 processors too).

There's constrained PMCs on x86 too, as you mention. Instead of repeating
the answer that i gave before (that this is easy and natural), how about
this approach: if we added real, working support for constrained PMCs on
x86, that will then address this point of yours rather forcefully,
correct?

Ingo

2008-12-08 11:54:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* Arjan van de Ven <[email protected]> wrote:

> On Mon, 8 Dec 2008 02:22:12 +0100
> Ingo Molnar <[email protected]> wrote:
>
> >
> > [ Performance counters are special hardware registers available on
> > most modern CPUs. These register count the number of certain types of
> > hw events: such as instructions executed, cachemisses suffered, or
> > branches mis-predicted, without slowing down the kernel or
> > applications. These registers can also trigger interrupts when a
> > threshold number of events have passed - and can thus be used to
> > profile the code that runs on that CPU. ]
> >
> > This is version 2 of our Performance Counters subsystem
> > implementation.
> >
> > The biggest user-visible change in this release is a new user-space
> > text-mode profiling utility that is based on this code: KernelTop.
> >
> > KernelTop can be downloaded from:
> >
> > http://redhat.com/~mingo/perfcounters/kerneltop.c
> >
> > It's a standalone .c file that needs no extra libraries - it only
> > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> >
> > This utility is intended for kernel developers - it's basically a
> > dynamic kernel profiler that gets hardware counter events dispatched
> > to it continuously, which it feeds into a histogram and outputs it
> > periodically.
> >
>
> I played with this a little, and while it works neat, I wanted a
> feature added where it shows a detailed profile for the top function.

ah, very nice idea!

> I've hacked this in quickly (the usability isn't all that great yet)
> and put the source code up at
>
> http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz

ok, picked it up :-)

> with this it looks like this:
>
> $ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux
>
> ------------------------------------------------------------------------------
> KernelTop: 274 irqs/sec [NMI, 1000000 cycles], (all, 2 CPUs)
> ------------------------------------------------------------------------------
>
> events RIP kernel function
> ______ ________________ _______________
>
> 230 - 00000000c04189e9 : read_hpet
> 82 - 00000000c0409439 : mwait_idle_with_hints
> 77 - 00000000c051a7b7 : acpi_os_read_port
> 52 - 00000000c053cb3a : acpi_idle_enter_bm
> 38 - 00000000c0418d93 : hpet_next_event
> 19 - 00000000c051a802 : acpi_os_write_port
> 14 - 00000000c04f8704 : __copy_to_user_ll
> 13 - 00000000c0460c20 : get_page_from_freelist
> 7 - 00000000c041c96c : kunmap_atomic
> 5 - 00000000c06a30d2 : _spin_lock [joydev]
> 4 - 00000000c04f79b7 : vsnprintf [snd_seq]
> 4 - 00000000c06a3048 : _spin_lock_irqsave [pcspkr]
> 3 - 00000000c0403b3c : irq_entries_start
> 3 - 00000000c0423fee : run_rebalance_domains
> 3 - 00000000c0425e2c : scheduler_tick
> 3 - 00000000c0430938 : get_next_timer_interrupt
> 3 - 00000000c043cdfa : __update_sched_clock
> 3 - 00000000c0448b14 : update_iter
> 2 - 00000000c04304bd : run_timer_softirq
>
> Showing details for read_hpet
> 0 c04189e9 <read_hpet>:
> 2 c04189e9: a1 b0 e0 89 c0 mov 0xc089e0b0,%eax
> 0
> 0 /*
> 0 * Clock source related code
> 0 */
> 0 static cycle_t read_hpet(void)
> 0 {
> 1 c04189ee: 55 push %ebp
> 0 c04189ef: 89 e5 mov %esp,%ebp
> 1 c04189f1: 05 f0 00 00 00 add $0xf0,%eax
> 0 c04189f6: 8b 00 mov (%eax),%eax
> 0 return (cycle_t)hpet_readl(HPET_COUNTER);
> 0 }
> 300 c04189f8: 31 d2 xor %edx,%edx
> 0 c04189fa: 5d pop %ebp
> 0 c04189fb: c3 ret
> 0

very nice and useful output! This for example shows that it's the readl()
on the HPET_COUNTER IO address that is causing the overhead. That is to
be expected - HPET is mapped uncached and the access goes out to the
chipset.

> As is usual with profile outputs, the cost for the function always gets
> added to the instruction after the really guilty one. I'd move the
> count one back, but this is hard if the previous instruction was a
> (conditional) jump...

yeah. Sometimes the delay can be multiple instructions - so it's best to
leave the profiling picture as pristine as possible, and let the kernel
developer choose the right counter type that displays particular problem
areas in the most expressive way.

For example when i'm doing SMP scalability work, i generally look at
cachemiss counts, for cacheline bouncing. The following kerneltop output
shows last-level data-cache misses in the kernel during a tbench 64 run
on a 16-way box, using latest mainline -git:

------------------------------------------------------------------------------
KernelTop: 3744 irqs/sec [NMI, 1000 cache-misses], (all, 16 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

7757 - ffffffff804d723e : dst_release
7649 - ffffffff804e3611 : eth_type_trans
6402 - ffffffff8050e470 : tcp_established_options
5975 - ffffffff804fa054 : ip_rcv_finish
5530 - ffffffff80365fb0 : copy_user_generic_string!
3979 - ffffffff804ccf0c : skb_push
3474 - ffffffff804fe6cb : ip_queue_xmit
1950 - ffffffff804cdcdd : skb_release_head_state
1595 - ffffffff804cce4f : skb_copy_and_csum_dev
1365 - ffffffff80501079 : __inet_lookup_established
908 - ffffffff804fa5fc : ip_local_deliver_finish
743 - ffffffff8036cbcc : unmap_single
452 - ffffffff80569402 : _read_lock
411 - ffffffff80283010 : get_page_from_freelist
410 - ffffffff80505b16 : tcp_sendmsg
406 - ffffffff8028631a : put_page
386 - ffffffff80509067 : tcp_ack
204 - ffffffff804d2d55 : netif_rx
194 - ffffffff8050b94b : tcp_data_queue

Cachemiss event samples tend to line up quite close to the instruction
that causes them.

Looking at pure cycles (same workload) gives a different view:

------------------------------------------------------------------------------
KernelTop: 27357 irqs/sec [NMI, 1000000 cycles], (all, 16 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

16602 - ffffffff80365fb0 : copy_user_generic_string!
7947 - ffffffff80505b16 : tcp_sendmsg
7450 - ffffffff80509067 : tcp_ack
7384 - ffffffff80332881 : avc_has_perm_noaudit
6888 - ffffffff80504e7c : tcp_recvmsg
6564 - ffffffff8056745e : schedule
6170 - ffffffff8050ecd5 : tcp_transmit_skb
4949 - ffffffff8020a75b : __switch_to
4417 - ffffffff8050cc4f : tcp_rcv_established
4283 - ffffffff804d723e : dst_release
3842 - ffffffff804fed58 : ip_finish_output
3760 - ffffffff804fe6cb : ip_queue_xmit
3580 - ffffffff80501079 : __inet_lookup_established
3540 - ffffffff80514ce5 : tcp_v4_rcv
3475 - ffffffff8026c31f : audit_syscall_exit
3411 - ffffffffff600130 : vread_hpet
3267 - ffffffff802a73de : kfree
3058 - ffffffff804d39ed : dev_queue_xmit
3047 - ffffffff804eecf8 : nf_iterate

Cycles overhead tends to be harder to match up with instructions.

Ingo

2008-12-08 12:02:22

by David Miller

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

From: Ingo Molnar <[email protected]>
Date: Mon, 8 Dec 2008 12:33:18 +0100

> Your whole statistical argument that group readout is a must-have for
> precision is fundamentally flawed as well: counters _themselves_, as used
> by most applications, by their nature, are a statistical sample to begin
> with. There's way too many hardware events to track each of them
> unintrusively - so this type of instrumentation is _all_ sampling based,
> and fundamentally so. (with a few narrow exceptions such as single-event
> interrupts for certain rare event types)

There are a lot of people who are going to fundamentally
disagree with this, myself included.

A lot of things are being stated about what people do with this stuff,
but I think there are people working longer in this area who quite
possibly know a lot better. But they were blindsided by this new work
instead of being consulted, which was pretty unnice.

2008-12-08 14:41:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar <[email protected]> writes:

> This means that the only correct technical/mathematical argument is to
> talk about "levels of noise" and how they compare and correlate - and
> i've seen no actual measurements or estimations pro or contra. Group
> readout of counters can reduce noise for sure, but it is wrong for you to
> try to turn this into some sort of all-or-nothing property. Other sources
> of noise tend to be of much higher magnitude.

Ingo, could you please describe how PEBS and IBS fit into your model?

Thanks.

-Andi

--
[email protected]

2008-12-08 22:04:58

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar writes:

> If you want a _guarantee_ that multiple counters can count at once you
> can still do it: for example by using the separate, orthogonal
> reservation mechanism we had in -v1 already.

Is that this?

" - There's a /sys based reservation facility that allows the allocation
of a certain number of hw counters for guaranteed sysadmin access."

Sounds like I can't do that as an ordinary user, even on my own
processes...

I don't want the whole PMU all the time, I just want it while my
monitored process is running, and only on the CPU where it is
running.

> Also, you dont _have to_ overcommit counters.
>
> Your whole statistical argument that group readout is a must-have for
> precision is fundamentally flawed as well: counters _themselves_, as used
> by most applications, by their nature, are a statistical sample to begin
> with. There's way too many hardware events to track each of them
> unintrusively - so this type of instrumentation is _all_ sampling based,
> and fundamentally so. (with a few narrow exceptions such as single-event
> interrupts for certain rare event types)

No - at least on the machines I'm familiar with, I can count every
single cache miss and hit at every level of the memory hierarchy,
every single TLB miss, every load and store instruction, etc. etc.

I want to be able to work out things like cache hit rates, just as one
example. To do that I need two numbers that are directly comparable
because they relate to the same set of instructions. If I have a
count of L1 Dcache hits for one set of instructions and a count of L1
Dcache misses over some different stretch of instructions, the ratio
of them doesn't mean anything.

Your argument about "it's all statistical" is bogus because even if
the things we are measuring are statistical, that's still no excuse
for being sloppy about how we make our estimates. And not being able
to have synchronized counters is just sloppy. The users want it, the
hardware provides it, so that makes it a must-have as far as I am
concerned.

> This means that the only correct technical/mathematical argument is to
> talk about "levels of noise" and how they compare and correlate - and
> i've seen no actual measurements or estimations pro or contra. Group
> readout of counters can reduce noise for sure, but it is wrong for you to
> try to turn this into some sort of all-or-nothing property. Other sources
> of noise tend to be of much higher magnitude.

What can you back that assertion up with?

> You need really stable workloads to see such low noise levels that group
> readout of counters starts to matter - and the thing is that often such
> 'stable' workloads are rather boringly artificial, because in real life
> there's no such thing as a stable workload.

More unsupported assertions, that sound wrong to me...

> Finally, the basic API to user-space is not the way to impose rigid "I
> own the whole PMU" notion that you are pushing. That notion can be
> achieved in different, system administration means - and a perf-counter
> reservation facility was included in the v1 patchset.

Only for root, which isn't good enough.

What I was proposing was NOT a rigid notion - you don't have to own
the whole PMU if you are happy to use the events that the kernel knows
about. If you do want the whole PMU, you can have it while the
process you're monitoring is running, and the kernel will
context-switch it between you and other users, who can also have the
whole PMU when their processes are running.

> Note that you are doing something that is a kernel design no-no: you are
> trying to design a "guarantee" for hardware constraints by complicating
> it into the userspace ABI - and that is a fundamentally losing
> proposition.

Perhaps you have misunderstood my proposal. A counter-set doesn't
have to be the whole PMU, and you can have multiple counter-sets
active at the same time as long as they fit. You can even have
multiple "whole PMU" counter-sets and the kernel will multiplex them
onto the real PMU.

> It's a tail-wags-the-dog design situation that we are routinely resisting
> in the upstream kernel: you are putting hardware constraints ahead of
> usability, you are putting hardware constraints ahead of sane interface
> design - and such an approach is wrong and shortsighted on every level.

Well, I'll ignore the patronizing tone (but please try to avoid it in
future).

The PRIMARY reason for wanting counter-sets is because THAT IS WHAT
THE USERS WANT. A "usable" and "sane" interface design that doesn't
do what users want is useless.

Anyway, my proposal is just as "usable" as yours, since users still
have perf_counter_open, exactly as in your proposal. Users with
simpler requirements can do things exactly the same way as with your
proposal.

> It's also shortsighted because it's a red herring: there's nothing that
> forbids the counter scheduler from listening to the hw constraints, for
> CPUs where there's a lot of counter constraints.

Handling the counter constraints is indeed a matter of implementation,
and as I noted previously, your current proposed implementation
doesn't handle them.

> Being per object is a very fundamental property of Linux, and you have to
> understand and respect that down to your bone if you want to design new
> syscall ABIs for Linux.

It's the choice of a single counter as being your "object" that I
object to. :)

> - It makes counter scheduling very dynamic. Instead of exposing
> user-space to a static "counter allocation" (with all the insane ABI
> and kernel internal complications this brings), perf-counters
> subsystem does not expose user-space to such scheduling details
> _at all_.

Which is not necessarily a good thing. Fundamentally, if you are
trying to measure something, and you get a number, you need to know
what exactly got measured.

For example, suppose I am trying to count TLB misses during the
execution of a program. If my TLB miss counter keeps getting bumped
off because the kernel is scheduling my counter along with a dozen
other counters, then I *at least* want to know about it, and
preferably control it. Otherwise I'll be getting results that vary by
an order of magnitude with no way to tell why.

> All in one, using the 1:1 fd:counter design is a powerful, modern Linux
> abstraction to its core. It's much easier to think about for application
> developers as well, so we'll see a much sharper adoption rate.

For simple things, yes it is simpler. But it can't do the more
complex things in any sort of clean or sane way.

> Also, i noticed that your claims about our code tend to be rather
> abstract

... because the design of your code is wrong at an abstract level ...

> and are often dwelling on issues that IMO have no big practical
> relevance - so may i suggest the following approach instead to break the
> (mutual!) cycle of miscommunication: if you think an issue is important,
> could you please point out the problem in practical terms what you think
> would not be possible with our scheme? We tend to prioritize items by
> practical value.
>
> Things like: "kerneltop would not be as accurate with: ..., to the level
> of adding 5% of extra noise.". Would that work for you?

OK, here's an example. I have an application whose execution has
several different phases, and I want to measure the L1 Icache hit rate
and the L1 Dcache hit rate as a function of time and make a graph. So
I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache
accesses, and L1 Dcache misses. I want to sample at 1ms intervals.
The CPU I'm running on has two counters.

With your current proposal, I don't see any way to make sure that the
counter scheduler counts L1 Dcache accesses and L1 Dcache misses at
the same time, then schedules L1 Icache accesses and L1 Icache
misses. I could end up with L1 Dcache accesses and L1 Icache
accesses, then L1 Dcache misses and L1 Icache misses - and get a
nonsensical situation like the misses being greater than the accesses.

> This needless vectoring and the exposing of contexts would kill many good
> properties of the new subsystem, without any tangible benefits - see
> above.

No. Where did you get contexts from? I didn't write anything about
contexts. Please read what I wrote.

> This is really scheduling school 101: a hardware context allocation is
> the _last_ thing we want to expose to user-space in this particular case.

Please drop the patronizing tone, again.

What user-space applications want to be able to do is this:

* Ensure that a set of counters are all counting at the same time.

* Know when counters get scheduled on and off the process so that the
results can be interpreted properly. Either that or be able to
control the scheduling.

* Sophisticated applications want to be able to do things with the PMU
that the kernel doesn't necessarily understand.

> This is a fundamental property of hardware resource scheduling. We _dont_
> want to tie the hands of the kernel by putting resource scheduling into
> user-space!

You'd rather provide useless numbers to userspace? :)

> Your arguments remind me a bit of the "user-space threads have to be
> scheduled in user-space!" N:M threading design discussions we had years
> ago. IBM folks were pushing NGPT very strongly back then and claimed that
> it's the right design for high-performance threading, etc. etc.

Your arguments remind me of a filesystem that a colleague of mine once
designed that only had files, but no directories (you could have "/"
characters in the filenames, though). This whole discussion is a bit
like you arguing that directories are an unnecessary complication that
only messes up the interface and adds extra system calls.

> You can already allocate "exclusive" counters in a guaranteed way via our
> code, here and today.

But then I don't get context-switching between processes.

> There's constrained PMCs on x86 too, as you mention. Instead of repeating
> the answer that i gave before (that this is easy and natural), how about
> this approach: if we added real, working support for constrained PMCs on
> x86, that will then address this point of yours rather forcefully,
> correct?

It still means we end up having to add something approaching 29,000
lines of code and 320kB to the kernel, just for the IBM 64-bit PowerPC
processors. (I don't guarantee that code is optimal, but that is some
indication of the complexity required.)

I am perfectly happy to add code for the kernel to know about the most
commonly-used, simple events on those processors. But I surely don't
want to have to teach the kernel about every last event and every last
capability of those machines' PMUs.

For example, there is a facility on POWER6 where certain instructions
can be selected (based on (instruction_word & mask) == value) and
marked, and then there are events that allow you to measure how long
marked instructions take in various stages of execution. How would I
make such a feature available for applications to use, within your
framework?

Paul.

2008-12-09 06:37:42

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Hi,

On Mon, Dec 8, 2008 at 2:22 AM, Ingo Molnar <[email protected]> wrote:
>
>
> There's a new "counter group record" facility that is a straightforward
> extension of the existing "irq record" notification type. This record
> type can be set on a 'master' counter, and if the master counter triggers
> an IRQ or an NMI, all the 'secondary' counters are read out atomically
> and are put into the counter-group record. The result can then be read()
> out by userspace via a single system call. (Based on extensive feedback
> from Paul Mackerras and David Miller, thanks guys!)
>

That is unfortunately not generic enough. You need a bit more
flexibility than master/secondaries, I am afraid. What tools want
is to be able to express:
- when event X overflows, record values of events J, K
- when event Y overflows, record values of events Z, J

I am not making this up. I know tools that do just that, i.e., that
collect two distinct profiles in a single run. This is how, for
instance, you can collect a flat profile and the call graph in one
run, very much like gprof.

When you get a notification and you read out the sample, you'd like
to know in which order the values are returned. Given that you do not
expose counters, I would assume the only possibility would be to
return them in file descriptor order. But that assumes that at the
time you create the file descriptor for an event you already have all
the other file descriptors you need...

> There's also more generic x86 support: all 4 generic PMCs of Nehalem /
> Core i7 are supported - i've run 4 instances of KernelTop and they used
> up four separate PMCs.
>
Core/Atom have 5 counters, Nehalem has 7.
Why are you not using all of them already?

2008-12-09 11:03:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* stephane eranian <[email protected]> wrote:

> > There's also more generic x86 support: all 4 generic PMCs of Nehalem
> > / Core i7 are supported - i've run 4 instances of KernelTop and they
> > used up four separate PMCs.
>
> Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
> them already?

no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
total), Core/Atom has 2 generic PMCs and 3 fixed-purpose PMCs (5 total).
Saying that it has 7 is misleading. (and even the generic PMCs have
constraints)

Ingo

2008-12-09 11:11:43

by David Miller

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

From: Ingo Molnar <[email protected]>
Date: Tue, 9 Dec 2008 12:02:46 +0100

>
> * stephane eranian <[email protected]> wrote:
>
> > > There's also more generic x86 support: all 4 generic PMCs of Nehalem
> > > / Core i7 are supported - i've run 4 instances of KernelTop and they
> > > used up four separate PMCs.
> >
> > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
> > them already?
>
> no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
> total),
...
> Saying that it has 7 is misleading.

Even you just did.

2008-12-09 11:23:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 9 Dec 2008 12:02:46 +0100
>
> >
> > * stephane eranian <[email protected]> wrote:
> >
> > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem
> > > > / Core i7 are supported - i've run 4 instances of KernelTop and they
> > > > used up four separate PMCs.
> > >
> > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
> > > them already?
> >
> > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
> > total),
> ...
> > Saying that it has 7 is misleading.
>
> Even you just did.

which portion of my point stressing the general purpose attribute was
unclear to you?

Saying it has 7 is misleading in the same way as if i told you now:
"look, i have four eyes!". (they are: left eye looking right, left eye
looking left, right eye looking left and right eye looking right)

Nehalem has 4 general purpose PMCs, not 7. Yes, it has 7 counters but
they are not all general purpose. The P4 has 18.

Ingo

2008-12-09 11:30:09

by David Miller

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

From: Ingo Molnar <[email protected]>
Date: Tue, 9 Dec 2008 12:22:25 +0100

>
> * David Miller <[email protected]> wrote:
>
> > From: Ingo Molnar <[email protected]>
> > Date: Tue, 9 Dec 2008 12:02:46 +0100
> >
> > >
> > > * stephane eranian <[email protected]> wrote:
> > >
> > > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem
> > > > > / Core i7 are supported - i've run 4 instances of KernelTop and they
> > > > > used up four separate PMCs.
> > > >
> > > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
> > > > them already?
> > >
> > > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
> > > total),
> > ...
> > > Saying that it has 7 is misleading.
> >
> > Even you just did.
>
> which portion of my point stressing the general purpose attribute was
> unclear to you?

I'm just teasing you because you picked a trite point from stephane's
email instead of the meat later on, which I would have found more
interesting to hear you comment on.

2008-12-09 12:14:25

by Paolo Ciarrocchi

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Tue, Dec 9, 2008 at 12:29 PM, David Miller <[email protected]> wrote:
> From: Ingo Molnar <[email protected]>
> Date: Tue, 9 Dec 2008 12:22:25 +0100
>
>>
>> * David Miller <[email protected]> wrote:
>>
>> > From: Ingo Molnar <[email protected]>
>> > Date: Tue, 9 Dec 2008 12:02:46 +0100
>> >
>> > >
>> > > * stephane eranian <[email protected]> wrote:
>> > >
>> > > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem
>> > > > > / Core i7 are supported - i've run 4 instances of KernelTop and they
>> > > > > used up four separate PMCs.
>> > > >
>> > > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
>> > > > them already?
>> > >
>> > > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
>> > > total),
>> > ...
>> > > Saying that it has 7 is misleading.
>> >
>> > Even you just did.
>>
>> which portion of my point stressing the general purpose attribute was
>> unclear to you?
>
> I'm just teasing you because you picked a trite point from stephane's
> email instead of the meat later on, which I would have found more
> interesting to hear you comment on.

I'm interested in Ingo's comments on that argument as well but I don't
feel the need to act like we are all in a kindergarten.

You are two outstanding developers and I'm sure you can demonstrate
that you can have a purely technical discussion on this topic, as you
have done several times in the past.

Regards,
--
Paolo
http://paolo.ciarrocchi.googlepages.com/
http://mypage.vodafone.it/

2008-12-09 13:01:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* Paul Mackerras <[email protected]> wrote:

> > Things like: "kerneltop would not be as accurate with: ..., to the
> > level of adding 5% of extra noise.". Would that work for you?
>
> OK, here's an example. I have an application whose execution has
> several different phases, and I want to measure the L1 Icache hit rate
> and the L1 Dcache hit rate as a function of time and make a graph. So
> I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache
> accesses, and L1 Dcache misses. I want to sample at 1ms intervals. The
> CPU I'm running on has two counters.
>
> With your current proposal, I don't see any way to make sure that the
> counter scheduler counts L1 Dcache accesses and L1 Dcache misses at the
> same time, then schedules L1 Icache accesses and L1 Icache misses. I
> could end up with L1 Dcache accesses and L1 Icache accesses, then L1
> Dcache misses and L1 Icache misses - and get a nonsensical situation
> like the misses being greater than the accesses.

yes, agreed, this is a valid special case of simple counter readout -
we'll add support to couple counters like that.

Note that this issue does not impact use of multiple counters in
profilers. (i.e. anything that is not a pure readout of the counter,
along linear time, as your example above suggests).

Once we start sampling the context, grouping of counters becomes
irrelevant (and a hindrance) and static frequency sampling becomes an
inferior method of sampling.

( The highest quality statistical approach is the kind of multi-counter
sampling model you can see implemented in KernelTop for example, where
the counters are independently sampled. I can go into great detail
about this if you are interested - this is the far more interesting
usecase in practice. )

Ingo

2008-12-09 13:47:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* stephane eranian <[email protected]> wrote:

> > There's a new "counter group record" facility that is a
> > straightforward extension of the existing "irq record" notification
> > type. This record type can be set on a 'master' counter, and if the
> > master counter triggers an IRQ or an NMI, all the 'secondary'
> > counters are read out atomically and are put into the counter-group
> > record. The result can then be read() out by userspace via a single
> > system call. (Based on extensive feedback from Paul Mackerras and
> > David Miller, thanks guys!)
>
> That is unfortunately not generic enough. You need a bit more
> flexibility than master/secondaries, I am afraid. What tools want is
> to be able to express:
>
> - when event X overflows, record values of events J, K
> - when event Y overflows, record values of events Z, J

hm, the new group code in perfcounters-v2 can already do this. Have you
tried to use it and found that it didn't work? If so then that's a bug.
Nothing in the design prevents that kind of group readout.

[ We could (and probably will) enhance the grouping relationship some
more, but group readouts are a fundamentally inferior mode of
profiling. (see below for the explanation) ]

> I am not making this up. I know tools that do just that, i.e. they
> collect two distinct profiles in a single run. This is how, for
> instance, you can collect a flat profile and the call graph in one run,
> very much like gprof.

yeah, but it's still the fundamentally wrong thing to do.

Being able to extract high-quality performance information from the
system is the cornerstone of our design, and choosing the right sampling
model permeates the whole issue of single-counter versus group-readout.

I don't think the finer design aspects of kernel support for performance
counters can be argued without being on the same page about this, so
please let me outline our view on these things, in (boringly) verbose
detail - spiked with examples and code as well.

Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
wrong mindset - and cross-sampling counters is a similarly wrong mindset.

When there are two (or more) hw metrics to profile, the ideally best
(i.e. the statistically most stable and most relevant) sampling for the
two statistical variables (say of l2_misses versus l2_accesses) is to
sample them independently, via their own metric. Not via a static 1khz
rate - or via picking one of the variables to generate samples.

[ Sidenote: this assumes the hw supports this sort of independent
sampling - let's assume so for the sake of argument. Not all CPUs are
capable of it, but most modern CPUs are. ]

Static frequency [time] sampling has a number of disadvantages that
drastically reduce its precision and reduce its utility, and 'group'
sampling where one counter controls the events has similar problems:

- It under-samples rare events such as cachemisses.

An example: say we have a workload that executes 1 billion instructions
a second, of which 5000 generate a cachemiss. Only one in 200,000
instructions generates a cachemiss. The chance that any of the 1000
static sampling IRQs in a given second hits exactly an instruction that
causes a cachemiss is about 1:200 (0.5%). That is a very low
probability, and the profile would not be very helpful - even though it
samples at a seemingly adequate frequency of 1000 events per second!

With per event counters and per event sampling that KernelTop uses, we
get an event next to the instruction that causes a cachemiss with a
100% certainty, all the time. The profile and its per instruction
aspects suddenly become a whole lot more accurate and whole lot more
interesting.

- Static frequency and group sampling also run the risk of systematic
error/skew of sampling if any workload component has any correlation
with the "1msec" global sampling period.

For example: say we profile a workload that runs a timer every 20
msecs. In such a case the profile could be skewed asymmetrically
against [or in favor of] the timer activity the workload performs every
20 milliseconds.

Good sampling wants the samples to be generated in proportion to the
variable itself, not proportional to absolute time.

- Static sampling also over-samples when the workload activity goes
down (when it goes more idle).

For example: we profile a fluctuating workload that is sometimes only
0.2% busy, i.e. running only for 2 milliseconds every second. Still we
keep interrupting it at 1 kHz - that can be a very brutal systematic
skew: if the sampling overhead is 2 microseconds per sample, that totals
2 msecs of overhead every second - so 50% of what runs on the CPU will
be sampling code, impacting/skewing the sampled code.

Good sampling wants to 'follow' the ebb and flow of the actual hw
events that the CPU has.

The best way to sample two metrics such as "cache accesses" and "cache
misses" (or say "cache misses" versus "TLB misses") is to sample the two
variables _independently_, and to build independent histograms out of
them.

The combination (or 'grouping') of the measured variables is thus done at
the output stage _after_ data acquisition, to provide a weighted
histogram (or a split-view double histogram).

For example, in a "l2 misses" versus "l2 accesses" case, the highest
quality of sampling is to use two independent sampling IRQs with such
sampling parameters:

- one notification every 200 L2 cache misses
- one notification every 10,000 L2 cache accesses

[ this is a ballpark figure - the sample rate is a function of the
averages of the workload and the characteristics of the CPU. ]

And at the output stage display a combination of:

l2_accesses[pc]
l2_misses[pc]
l2_misses[pc] / l2_accesses[pc]

Note that if we had a third variable as well - say icache_misses[], we
could combine the three metrics:

l2_misses[pc] / l2_accesses[pc] / icache_misses[pc]

( such a view expresses the miss/access ratio in a branch-weighted
fashion: it weighs down instructions that also show signs of icache
pressure and goes for the functions with a high dcache rate but low
icache pressure - i.e. commonly executed functions with a high data
miss rate. )
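
( Purely as an illustration - this is not KernelTop's actual code, and
the histogram arrays and slot count are made up for the example - a
minimal sketch of such an output-stage combination could look like
this: )

#include <stdio.h>

#define NR_SLOTS 1024	/* assumed number of profiled code locations */

static unsigned long l2_accesses[NR_SLOTS];	/* filled by one sample stream */
static unsigned long l2_misses[NR_SLOTS];	/* filled by the other stream  */

static void print_weighted_profile(void)
{
	int pc;

	for (pc = 0; pc < NR_SLOTS; pc++) {
		double weight;

		if (!l2_accesses[pc] || !l2_misses[pc])
			continue;
		/* weight each location by its relative miss rate: */
		weight = (double)l2_misses[pc] / (double)l2_accesses[pc];
		printf("slot %4d: %8lu refs %8lu misses, weight %6.2f\n",
		       pc, l2_accesses[pc], l2_misses[pc], weight);
	}
}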

Sampling at a static frequency is acceptable as well in some cases, and
will lead to an output that is usable for some things. It's just not the
best sampling model, and it's not usable at all for certain important
things such as highly derived views, good instruction level profiles or
rare hw events.

I've uploaded a new version of kerneltop.c that has such a multi-counter
sampling model that follows this statistical model:

http://redhat.com/~mingo/perfcounters/kerneltop.c

Example of usage:

I've started a tbench 64 localhost workload on a 16way x86 box. I want to
check the miss/refs ratio. I first sampled one of the metrics,
cache-references:

$ ./kerneltop -e 2 -c 100000 -C 2

------------------------------------------------------------------------------
KernelTop: 1311 irqs/sec [NMI, 10000 cache-refs], (all, cpu: 2)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

5717.00 - ffffffff803666c0 : copy_user_generic_string!
355.00 - ffffffff80507646 : tcp_sendmsg
315.00 - ffffffff8050abcb : tcp_ack
222.00 - ffffffff804fbb20 : ip_rcv_finish
215.00 - ffffffff8020a75b : __switch_to
194.00 - ffffffff804d0b76 : skb_copy_datagram_iovec
187.00 - ffffffff80502b5d : __inet_lookup_established
183.00 - ffffffff8051083d : tcp_transmit_skb
160.00 - ffffffff804e4fc9 : eth_type_trans
156.00 - ffffffff8026ae31 : audit_syscall_exit


Then i checked the characteristics of the other metric [cache-misses]:

$ ./kerneltop -e 3 -c 200 -C 2

------------------------------------------------------------------------------
KernelTop: 1362 irqs/sec [NMI, 200 cache-misses], (all, cpu: 2)
------------------------------------------------------------------------------

events RIP kernel function
______ ________________ _______________

1419.00 - ffffffff803666c0 : copy_user_generic_string!
1075.00 - ffffffff804e4fc9 : eth_type_trans
1059.00 - ffffffff804d8baa : dst_release
949.00 - ffffffff80510004 : tcp_established_options
841.00 - ffffffff804fbb20 : ip_rcv_finish
569.00 - ffffffff804ce808 : skb_push
454.00 - ffffffff80502b5d : __inet_lookup_established
453.00 - ffffffff805001a3 : ip_queue_xmit
298.00 - ffffffff804cf5d8 : skb_release_head_state
247.00 - ffffffff804ce74b : skb_copy_and_csum_dev

then, to get the "combination" view of the two counters, i appended the
two command lines:

$ ./kerneltop -e 3 -c 200 -e 2 -c 10000 -C 2

------------------------------------------------------------------------------
KernelTop: 2669 irqs/sec [NMI, cache-misses/cache-refs], (all, cpu: 2)
------------------------------------------------------------------------------

weight RIP kernel function
______ ________________ _______________

35.20 - ffffffff804ce74b : skb_copy_and_csum_dev
33.00 - ffffffff804cb740 : sock_alloc_send_skb
31.26 - ffffffff804ce808 : skb_push
22.43 - ffffffff80510004 : tcp_established_options
19.00 - ffffffff8027d250 : find_get_page
15.76 - ffffffff804e4fc9 : eth_type_trans
15.20 - ffffffff804d8baa : dst_release
14.86 - ffffffff804cf5d8 : skb_release_head_state
14.00 - ffffffff802217d5 : read_hpet
12.00 - ffffffff804ffb7f : __ip_local_out
11.97 - ffffffff804fc0c8 : ip_local_deliver_finish
8.54 - ffffffff805001a3 : ip_queue_xmit

[ It's interesting to see that a seemingly common function,
copy_user_generic_string(), got eliminated from the top spots - because
there are other functions whose relative cachemiss rate is far more
serious. ]

The above "derived" profile output is relatively stable under kerneltop
with the use of ~2600 sample irqs/sec and the 2 seconds default refresh.
I'd encourage you to try to achieve the same quality of output with
static 2600 hz sampling - it wont work with the kind of event rates i've
worked with above, no matter whether you read out a single counter or a
group of counters, atomically or not. (because we just dont get
notification PCs at the relevant hw events - we get PCs with a time
sample)

And that is just one 'rare' event type (cachemisses) - if we had two such
sources (say l2 cachemisses and TLB misses) then this type of combined
view would only be possible if we got independent events from both
hardware sources.

And note that once you accept that the highest quality approach is to
sample the hw events independently, all the "group readout" approaches
become a second-tier mechanism. KernelTop uses that model and works just
fine without any group readout and it is making razor sharp profiles,
down to the instruction level.

[ Note that there are special cases where group-sampling can limp along
with acceptable results: if one of the two counters has so many events
that sampling by time or sampling by the rare event type gives relevant
context info. But the moment both event sources are rare, the group
model breaks down completely and produces meaningless results. It's
just a fundamentally wrong kind of abstraction to mix together
unrelated statistical variables. And that's one of the fundamental
design problems i see with perfmon-v3. ]

Ingo

2008-12-09 16:40:26

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar wrote:

> When there are two (or more) hw metrics to profile, the ideally best
> (i.e. the statistically most stable and most relevant) sampling for the
> two statistical variables (say of l2_misses versus l2_accesses) is to
> sample them independently, via their own metric. Not via a static 1khz
> rate - or via picking one of the variables to generate samples.

Regardless of sampling method, don't you still want some way to
enable/disable the various counters as close to simultaneously as possible?

Chris

2008-12-09 16:46:42

by Will Newton

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Tue, Dec 9, 2008 at 1:46 PM, Ingo Molnar <[email protected]> wrote:

> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
> wrong mindset - and cross-sampling counters is a similarly wrong mindset.

If your hardware does not interrupt on overflow I don't think you have
any choice in the matter. I know such hardware is less than ideal but
it exists so it should be supported.

2008-12-09 17:35:49

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Will Newton wrote:
> On Tue, Dec 9, 2008 at 1:46 PM, Ingo Molnar <[email protected]> wrote:
>
>> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
>> wrong mindset - and cross-sampling counters is a similarly wrong mindset.
>
> If your hardware does not interrupt on overflow I don't think you have
> any choice in the matter. I know such hardware is less than ideal but
> it exists so it should be supported.

I think you could still set up the counters as Ingo describes and then
sample the counters (as opposed to the program) at a suitable interval
(chosen such that the counters won't overflow more than once between
samples).
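
(Something like this minimal sketch - read_hw_counter() is a hypothetical
accessor and the counter width is an assumption, it varies by CPU:)

#include <stdint.h>

#define COUNTER_BITS	40	/* assumed hw counter width */
#define COUNTER_MASK	(((uint64_t)1 << COUNTER_BITS) - 1)

extern uint64_t read_hw_counter(int idx);	/* hypothetical accessor */

static uint64_t last[8], total[8];

/*
 * Poll often enough that the counter cannot wrap more than once between
 * two reads; the modular subtraction then handles a single wrap.
 */
static void poll_counter(int idx)
{
	uint64_t now = read_hw_counter(idx) & COUNTER_MASK;

	total[idx] += (now - last[idx]) & COUNTER_MASK;
	last[idx] = now;
}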

Chris

2008-12-09 19:03:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2


* Chris Friesen <[email protected]> wrote:

> Ingo Molnar wrote:
>
>> When there are two (or more) hw metrics to profile, the ideally best
>> (i.e. the statistically most stable and most relevant) sampling for
>> the two statistical variables (say of l2_misses versus l2_accesses) is
>> to sample them independently, via their own metric. Not via a static
>> 1khz rate - or via picking one of the variables to generate samples.
>
> Regardless of sampling method, don't you still want some way to
> enable/disable the various counters as close to simultaneously as
> possible?

If it's about counter control for the monitored task, then we sure could
do something about that. (apps/libraries could thus select a subset of
functions to profile/measure, runtime, etc.)

If it's about counter control for the profiler/debugger, i'm not sure how
useful that is - do you have a good usecase for it?

Ingo

2008-12-09 19:52:20

by Chris Friesen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar wrote:
> * Chris Friesen <[email protected]> wrote:

>> Regardless of sampling method, don't you still want some way to
>> enable/disable the various counters as close to simultaneously as
>> possible?
>
> If it's about counter control for the monitored task, then we sure could
> do something about that. (apps/libraries could thus select a subset of
> functions to profile/measure, runtime, etc.)
>
> If it's about counter control for the profiler/debugger, i'm not sure how
> useful that is - do you have a good usecase for it?

I'm sure that others could give more usecases, but I was thinking about
cases like "I want to test _these_ multiple metrics simultaneously over
_this_ specific section of code". In a case like this, it seems
desirable to start/stop the various performance counters as close
together as possible, especially if the section of code being tested is
short.
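
(Roughly this usage pattern - enable_counters()/disable_counters() are
hypothetical helpers here, not an existing API:)

extern void enable_counters(const int *fds, int n);	/* hypothetical */
extern void disable_counters(const int *fds, int n);	/* hypothetical */
extern void short_section_under_test(void);

static void measure_section(const int *fds, int n)
{
	enable_counters(fds, n);	/* ideally started back-to-back */
	short_section_under_test();
	disable_counters(fds, n);	/* and stopped back-to-back too */
}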

Chris

2008-12-09 21:23:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Hi,

On Tue, Dec 9, 2008 at 2:46 PM, Ingo Molnar <[email protected]> wrote:
>
> * stephane eranian <[email protected]> wrote:
>
>> > There's a new "counter group record" facility that is a
>> > straightforward extension of the existing "irq record" notification
>> > type. This record type can be set on a 'master' counter, and if the
>> > master counter triggers an IRQ or an NMI, all the 'secondary'
>> > counters are read out atomically and are put into the counter-group
>> > record. The result can then be read() out by userspace via a single
>> > system call. (Based on extensive feedback from Paul Mackerras and
>> > David Miller, thanks guys!)
>>
>> That is unfortunately not generic enough. You need a bit more
>> flexibility than master/secondaries, I am afraid. What tools want is
>> to be able to express:
>>
>> - when event X overflows, record values of events J, K
>> - when event Y overflows, record values of events Z, J
>
> hm, the new group code in perfcounters-v2 can already do this. Have you
> tried to use it and it didnt work? If so then that's a bug. Nothing in
> the design prevents that kind of group readout.
>
> [ We could (and probably will) enhance the grouping relationship some
> more, but group readouts are a fundamentally inferior mode of
> profiling. (see below for the explanation) ]
>
>> I am not making this up. I know tools that do just that, i.e. they
>> collect two distinct profiles in a single run. This is how, for
>> instance, you can collect a flat profile and the call graph in one run,
>> very much like gprof.
>
> yeah, but it's still the fundamentally wrong thing to do.
>
That's not for you to say. This is a decision for the tool writers.

There is absolutely nothing wrong with this. In fact, people do this
kind of measurement all the time. Your horizon seems a bit too
limited, maybe.

Certain PMU features do not count events, they capture information about
where events occur, so they are more like buffers. Sometimes they are
hosted in registers. For instance, Itanium has long been able to capture
where cache misses occur. The data is stored in a couple of PMU registers,
one cache miss at a time. There is a PMU event that counts how many misses
are captured. So you program that event into a counter and, when it
overflows, you read out the pair of data registers containing the last
captured cache miss. Thus, when event X overflows, you capture the values
in registers Z, Y. There is nothing wrong with this. You do the same thing
when you want to sample on a branch trace buffer, like the x86 LBR. Again,
nothing wrong with this. In fact you can collect both at the same time and
in an independent manner.
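
(A purely illustrative sketch of that pattern - read_pmd() and the
register indices are made up for the example, they are not a real API:)

#include <stdint.h>

extern uint64_t read_pmd(int idx);	/* hypothetical data-register accessor */

struct miss_sample {
	uint64_t addr;	/* e.g. captured data address of the miss */
	uint64_t lat;	/* e.g. captured miss latency */
};

/*
 * When the "misses captured" counting event overflows, read out the pair
 * of data registers holding the last captured cache miss:
 */
static void record_last_miss(struct miss_sample *s)
{
	s->addr = read_pmd(34);
	s->lat  = read_pmd(35);
}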


> Being able to extract high-quality performance information from the
> system is the cornerstone of our design, and chosing the right sampling
> model permeates the whole issue of single-counter versus group-readout.
>
> I dont think finer design aspects of kernel support for performance
> counters can be argued without being on the same page about this, so
> please let me outline our view on these things, in (boringly) verbose
> detail - spiked with examples and code as well.
>
> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
> wrong mindset - and cross-sampling counters is a similarly wrong mindset.
>
> When there are two (or more) hw metrics to profile, the ideally best
> (i.e. the statistically most stable and most relevant) sampling for the
> two statistical variables (say of l2_misses versus l2_accesses) is to
> sample them independently, via their own metric. Not via a static 1khz
> rate - or via picking one of the variables to generate samples.
>

Did I talk about a static sampling period?

> [ Sidenote: as long as the hw supports such sort of independent sampling
> - lets assume so for the sake of argument - not all CPUs are capable of
> that - most modern CPUs do though. ]
>
> Static frequency [time] sampling has a number of disadvantages that
> drastically reduce its precision and reduce its utility, and 'group'
> sampling where one counter controls the events has similar problems:
>
> - It under-samples rare events such as cachemisses.
>
> An example: say we have a workload that executes 1 billion instructions
> a second, of which 5000 generate a cachemiss. Only one in 200,000
> instructions generates a cachemiss. The chance that any of the 1000
> static sampling IRQs in a given second hits exactly an instruction that
> causes a cachemiss is about 1:200 (0.5%). That is a very low
> probability, and the profile would not be very helpful - even though it
> samples at a seemingly adequate frequency of 1000 events per second!
>
Who talked about periods expressed as events per second?

I did not talk about that. If you had looked at the perfmon API, you would
have noticed that it does not know anything about sampling periods. It
only sees register values. Tools are free to pick whatever value they
like. And the value, by definition, is the number of occurrences of the
event, not the number of occurrences per second. You can say: every 2,000
cache misses, take a sample - just program that counter with -2000.
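
(To make this concrete, a minimal sketch - not actual perfmon code; the
counter width is an assumption and varies by CPU:)

#include <stdint.h>

#define COUNTER_WIDTH	48	/* assumed hw counter width */

/*
 * Preloading the two's-complement of the period makes the counter
 * overflow after exactly 'period' events, e.g. -2000 for a sample
 * every 2,000 cache misses.
 */
static uint64_t period_to_preload(uint64_t period)
{
	return ((uint64_t)1 << COUNTER_WIDTH) - period;
}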

> With per event counters and per event sampling that KernelTop uses, we
> get an event next to the instruction that causes a cachemiss with a

You have no guarantee on how close the RIP is to where the cache miss
actually occurred. It can be several instructions away (NMI or not, by the
way). There is nothing software can do about it, neither my inferior
design nor your superior design.

>
> And note that once you accept that the highest quality approach is to
> sample the hw events independently, all the "group readout" approaches
> become a second-tier mechanism. KernelTop uses that model and works just
> fine without any group readout and it is making razor sharp profiles,
> down to the instruction level.
>

And you think you cannot do independent sampling with perfmon3?

As for 'razor sharp', that is your interpretation. As far as I know, a RIP
always points to an instruction anyway. What you seem to be ignoring here
is the fact that the RIP is only as good as the hardware can give you. And
it just happens that on ALL processor architectures it is off compared to
where the event actually occurred. It can be several cycles away, in fact:
skid. Your superior design does not improve that precision whatsoever. It
has to be handled at the hardware level. Why do you think AMD added IBS,
why Intel added PEBS on x86, and why Intel added IP-EAR on Itanium 2? Even
PEBS does not solve that issue completely. As far as I know, the quality
of your profiles is as good as that of OProfile, VTune, or perfmon.


> [ Note that there are special cases where group-sampling can limp along
> with acceptable results: if one of the two counters has so many events
> that sampling by time or sampling by the rare event type gives relevant
> context info. But the moment both event sources are rare, the group

> model breaks down completely and produces meaningless results. It's
> just a fundamentally wrong kind of abstraction to mix together
> unrelated statistical variables. And that's one of the fundamental
> design problems i see with perfmon-v3. ]
>

Again, an unfounded statement: perfmon3 does not mandate what is recorded
on overflow. It does not mandate how many events you can sample on at the
same time. It does not know about sampling periods; it only knows about
data register values and reset values on overflow. For each counter, you
can freely specify what you want recorded using a simple bitmask.

Are we on the same page, then?

2008-12-09 22:19:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar writes:

> yeah, but it's still the fundamentally wrong thing to do.
>
> Being able to extract high-quality performance information from the
> system is the cornerstone of our design, and chosing the right sampling
> model permeates the whole issue of single-counter versus group-readout.

Thanks for taking the time to write all this down, and I will respond
in detail once I have thought about it some more.

The thing that stands out to me immediately, though, is that you are
concentrating entirely on _sampling_ as opposed to _counting_.
Perhaps this is the main reason we have been disagreeing.

Now of course sampling is interesting, but counting is also
interesting, whether over the whole execution of a program or over
short intervals during the execution of a program (based either on
time, or on the execution of certain functions).

It seems to me that a well-designed performance monitor infrastructure
should support both counting and sampling. And for counting, getting
high-quality data requires synchronized counters (ones that all start
counting and stop counting at the same times).

Looking back at the discussion so far, I can see that your arguments
make more sense if you are only wanting to do sampling. And I have
been arguing for what I believe we need to do counting properly (I
have focused more on counting because we already have infrastructure
for sampling, namely oprofile).

So, can we agree to discuss both sampling and counting, and design an
infrastructure that's good for both?

Paul.

2008-12-09 22:40:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Paul Mackerras <[email protected]> writes:

> So, can we agree to discuss both sampling and counting, and design an
> infrastructure that's good for both?

When you say counting you should also include "event ring buffers with
metadata", like PEBS on Intel x86.

-Andi

--
[email protected]

2008-12-09 23:01:32

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:
>
> > > Things like: "kerneltop would not be as accurate with: ..., to the
> > > level of adding 5% of extra noise.". Would that work for you?
> >
> > OK, here's an example. I have an application whose execution has
> > several different phases, and I want to measure the L1 Icache hit rate
> > and the L1 Dcache hit rate as a function of time and make a graph. So
> > I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache
> > accesses, and L1 Dcache misses. I want to sample at 1ms intervals. The
> > CPU I'm running on has two counters.
> >
> > With your current proposal, I don't see any way to make sure that the
> > counter scheduler counts L1 Dcache accesses and L1 Dcache misses at the
> > same time, then schedules L1 Icache accesses and L1 Icache misses. I
> > could end up with L1 Dcache accesses and L1 Icache accesses, then L1
> > Dcache misses and L1 Icache misses - and get a nonsensical situation
> > like the misses being greater than the accesses.
>
> yes, agreed, this is a valid special case of simple counter readout -
> we'll add support to couple counters like that.

This is an example of a sampling problem, but one where the thing
being sampled is a derived statistic from two counter values.

I don't agree that this is really a "special case". There are lots of
derived statistics that are interesting for performance analysis,
starting with CPI (cycles per instruction), proportions of various
instructions in the code, cache hit/miss rates for various different
caches, etc., etc.
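
(For instance, a trivial sketch of one such derived statistic, computed
from two counts that must cover exactly the same interval - the struct
is made up for the example:)

struct counts {
	unsigned long long cycles;
	unsigned long long instructions;
};

/* cycles per instruction over the measured interval */
static double cpi(const struct counts *c)
{
	return c->instructions ?
		(double)c->cycles / (double)c->instructions : 0.0;
}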

> Note that this issue does not impact use of multiple counters in
> profilers. (i.e. anything that is not a pure readout of the counter,
> along linear time, as your example above suggests).

Well, that's the sampling vs. counting distinction that I made in my
other email. We need to do both well.

As far as I can see, my "counter set" proposal does everything yours
does (since a counter set can be just a single counter), and also
cleanly accommodates what's needed for counting and for sampling
derived statistics. No?

Paul.

2008-12-10 04:45:32

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Andi Kleen writes:

> When you say counting you should also include "event ring buffers with
> metadata", like PEBS on Intel x86.

I'm not familiar with PEBS. Maybe it's something different again,
neither sampling nor counting, but a third thing?

Paul.

2008-12-10 05:03:22

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

Paul,

On Wed, Dec 10, 2008 at 5:44 AM, Paul Mackerras <[email protected]> wrote:
> Andi Kleen writes:
>
>> When you say counting you should also include "event ring buffers with
>> metadata", like PEBS on Intel x86.
>
> I'm not familiar with PEBS. Maybe it's something different again,
> neither sampling nor counting, but a third thing?
>
PEBS is an Intel-only feature used for sampling. However, this time
the hardware (and microcode) does the sampling for you. You point
the CPU to a structure in memory, called DS, which in turn points to
a region of memory you designate, i.e., the sampling buffer. The
buffer can be any size you want.

Then you program counter 0 with an event and a sampling period.
When the counter overflows, there is no interrupt: the microcode
records the RIP and full machine state, and reloads the counter with
the period specified in DS. The OS gets an interrupt ONLY when the
buffer fills up. Overhead is thus minimized, but you have no control
over the format of the samples. The precision (P) comes from the fact
that the RIP is guaranteed to point to an instruction that is just after
the instruction which generated the event you're sampling on. The catch
is that not all events support PEBS, and only one counter works with PEBS
on Core 2. Nehalem is better: more events support PEBS, and all 4 generic
counters support it. Furthermore, PEBS can now capture where cache
misses occur, very much like what Itanium can do.
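
(For illustration only, a rough C sketch of the kind of 64-bit DS
management area described above - field names and ordering reflect my
reading of the Intel documentation and are an approximation, not a
definitive kernel structure:)

#include <stdint.h>

struct ds_area {
	uint64_t bts_buffer_base;
	uint64_t bts_index;
	uint64_t bts_absolute_max;
	uint64_t bts_interrupt_threshold;
	uint64_t pebs_buffer_base;		/* sampling buffer designated by the OS */
	uint64_t pebs_index;			/* advanced by the microcode per sample */
	uint64_t pebs_absolute_max;		/* end of the buffer */
	uint64_t pebs_interrupt_threshold;	/* interrupt raised when index reaches this */
	uint64_t pebs_counter0_reset;		/* period reloaded into counter 0 */
	/* further reset fields / reserved space follow */
};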

Needless to say all of this is supported by perfmon.

Hope this helps.

2008-12-10 10:14:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Wed, Dec 10, 2008 at 03:44:31PM +1100, Paul Mackerras wrote:
> Andi Kleen writes:
>
> > When you say counting you should also include "event ring buffers with
> > metadata", like PEBS on Intel x86.
>
> I'm not familiar with PEBS. Maybe it's something different again,
> neither sampling nor counting, but a third thing?

Yes, it's a third thing: a CPU-controlled event ring buffer. See
Stephane's description.

There are also some crosses, e.g. AMD's IBS (which is essentially a
counter plus some additional registers that give more details about the
interrupted instruction).

-Andi

--
[email protected]

2009-01-07 07:43:58

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Mon, 2008-12-08 at 12:49 +0100, Ingo Molnar wrote:
> * Arjan van de Ven <[email protected]> wrote:
>
> > On Mon, 8 Dec 2008 02:22:12 +0100
> > Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > [ Performance counters are special hardware registers available on
> > > most modern CPUs. These register count the number of certain types of
> > > hw events: such as instructions executed, cachemisses suffered, or
> > > branches mis-predicted, without slowing down the kernel or
> > > applications. These registers can also trigger interrupts when a
> > > threshold number of events have passed - and can thus be used to
> > > profile the code that runs on that CPU. ]
> > >
> > > This is version 2 of our Performance Counters subsystem
> > > implementation.
> > >
> > > The biggest user-visible change in this release is a new user-space
> > > text-mode profiling utility that is based on this code: KernelTop.
> > >
> > > KernelTop can be downloaded from:
> > >
> > > http://redhat.com/~mingo/perfcounters/kerneltop.c
> > >
> > > It's a standalone .c file that needs no extra libraries - it only
> > > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> > >
> > > This utility is intended for kernel developers - it's basically a
> > > dynamic kernel profiler that gets hardware counter events dispatched
> > > to it continuously, which it feeds into a histogram and outputs it
> > > periodically.
> > >
> >
> > I played with this a little, and while it works neat, I wanted a
> > feature added where it shows a detailed profile for the top function.
>
> ah, very nice idea!
>
> > I've hacked this in quickly (the usability isn't all that great yet)
> > and put the source code up at
> >
> > http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz
>
> ok, picked it up :-)
Ingo,

I tried to use patch V5 and the latest kerneltop to collect some cachemiss data.

It seems kerneltop just shows the first-instruction IP address of each function. Does
the latest kerneltop include the enhancement from Arjan? As you know, with oprofile,
we can get the detailed instruction IP address which causes the cache miss, although the IP
address mostly needs to go back one instruction.

>
> > with this it looks like this:
> >
> > $ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux
> >
> > ------------------------------------------------------------------------------
> > KernelTop: 274 irqs/sec [NMI, 1000000 cycles], (all, 2 CPUs)
> > ------------------------------------------------------------------------------
> >
> > events RIP kernel function
> > ______ ________________ _______________
> >
> > 230 - 00000000c04189e9 : read_hpet
> > 82 - 00000000c0409439 : mwait_idle_with_hints
> > 77 - 00000000c051a7b7 : acpi_os_read_port
> > 52 - 00000000c053cb3a : acpi_idle_enter_bm
> > 38 - 00000000c0418d93 : hpet_next_event
> > 19 - 00000000c051a802 : acpi_os_write_port
> > 14 - 00000000c04f8704 : __copy_to_user_ll
> > 13 - 00000000c0460c20 : get_page_from_freelist
> > 7 - 00000000c041c96c : kunmap_atomic
> > 5 - 00000000c06a30d2 : _spin_lock [joydev]
> > 4 - 00000000c04f79b7 : vsnprintf [snd_seq]
> > 4 - 00000000c06a3048 : _spin_lock_irqsave [pcspkr]
> > 3 - 00000000c0403b3c : irq_entries_start
> > 3 - 00000000c0423fee : run_rebalance_domains
> > 3 - 00000000c0425e2c : scheduler_tick
> > 3 - 00000000c0430938 : get_next_timer_interrupt
> > 3 - 00000000c043cdfa : __update_sched_clock
> > 3 - 00000000c0448b14 : update_iter
> > 2 - 00000000c04304bd : run_timer_softirq
> >
> > Showing details for read_hpet
> > 0 c04189e9 <read_hpet>:
> > 2 c04189e9: a1 b0 e0 89 c0 mov 0xc089e0b0,%eax
> > 0
> > 0 /*
> > 0 * Clock source related code
> > 0 */
> > 0 static cycle_t read_hpet(void)
> > 0 {
> > 1 c04189ee: 55 push %ebp
> > 0 c04189ef: 89 e5 mov %esp,%ebp
> > 1 c04189f1: 05 f0 00 00 00 add $0xf0,%eax
> > 0 c04189f6: 8b 00 mov (%eax),%eax
> > 0 return (cycle_t)hpet_readl(HPET_COUNTER);
> > 0 }
> > 300 c04189f8: 31 d2 xor %edx,%edx
> > 0 c04189fa: 5d pop %ebp
> > 0 c04189fb: c3 ret
> > 0
>
I'd like to get detailed information just like the above.

Thanks,
Yanmin

2009-01-09 01:07:51

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [patch] Performance Counters for Linux, v2

On Wed, 2009-01-07 at 15:43 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-12-08 at 12:49 +0100, Ingo Molnar wrote:
> > * Arjan van de Ven <[email protected]> wrote:
> >
> > > On Mon, 8 Dec 2008 02:22:12 +0100
> > > Ingo Molnar <[email protected]> wrote:
> > >
> > > >
> > > > [ Performance counters are special hardware registers available on
> > > > most modern CPUs. These register count the number of certain types of
> > > > hw events: such as instructions executed, cachemisses suffered, or
> > > > branches mis-predicted, without slowing down the kernel or
> > > > applications. These registers can also trigger interrupts when a
> > > > threshold number of events have passed - and can thus be used to
> > > > profile the code that runs on that CPU. ]
> > > >
> > > > This is version 2 of our Performance Counters subsystem
> > > > implementation.
> > > >
> > > > The biggest user-visible change in this release is a new user-space
> > > > text-mode profiling utility that is based on this code: KernelTop.
> > > >
> > > > KernelTop can be downloaded from:
> > > >
> > > > http://redhat.com/~mingo/perfcounters/kerneltop.c
> > > >
> > > > It's a standalone .c file that needs no extra libraries - it only
> > > > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> > > >
> > > > This utility is intended for kernel developers - it's basically a
> > > > dynamic kernel profiler that gets hardware counter events dispatched
> > > > to it continuously, which it feeds into a histogram and outputs it
> > > > periodically.
> > > >
> > >
> > > I played with this a little, and while it works neat, I wanted a
> > > feature added where it shows a detailed profile for the top function.
> >
> > ah, very nice idea!
> >
> > > I've hacked this in quickly (the usability isn't all that great yet)
> > > and put the source code up at
> > >
> > > http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz
> >
> > ok, picked it up :-)
> Ingo,
>
> I tried to use patch V5 and the latest kerneltop to collect some cachemiss data.
>
> It seems kerneltop just shows the first-instruction IP address of each function. Does
> the latest kerneltop include the enhancement from Arjan? As you know, with oprofile,
> we can get the detailed instruction IP address which causes the cache miss, although the IP
> address mostly needs to go back one instruction.
As a matter of fact, the original kerneltop has a -s parameter to support this, but
kerneltop has a bug when showing details of the symbol: sym_filter_entry should be
initialized after the qsort.

Below is an example.
#./kerneltop --vmlinux=/root/linux-2.6.28_slqb1230flush/vmlinux -d 20 -e 3 -f 1000 -s flush_free_list

------------------------------------------------------------------------------
KernelTop: 20297 irqs/sec [NMI, 10000 cache-misses], (all, 8 CPUs)
------------------------------------------------------------------------------

events RIP kernel function
______ ______ ________________ _______________

12816.00 - ffffffff803d5760 : copy_user_generic_string!
11751.00 - ffffffff80647a2c : unix_stream_recvmsg
10215.00 - ffffffff805eda5f : sock_alloc_send_skb
9738.00 - ffffffff80284821 : flush_free_list
6749.00 - ffffffff802854a1 : __kmalloc_track_caller
3663.00 - ffffffff805f09fa : skb_dequeue
3591.00 - ffffffff80284be2 : kmem_cache_alloc [qla2xxx]
3501.00 - ffffffff805f15f5 : __alloc_skb
1296.00 - ffffffff803d8eb4 : list_del [qla2xxx]
1110.00 - ffffffff805f0ed2 : kfree_skb
Showing details for flush_free_list
0 ffffffff8028488a: 78 00 00
0 ffffffff8028488d: 49 8d 04 00 lea (%r8,%rax,1),%rax
0 ffffffff80284891: 4c 8b 31 mov (%rcx),%r14
1143 ffffffff80284894: 48 c1 e8 0c shr $0xc,%rax
0 ffffffff80284898: 48 6b c0 38 imul $0x38,%rax,%rax
0 ffffffff8028489c: 48 8d 1c 10 lea (%rax,%rdx,1),%rbx
0 ffffffff802848a0: 48 8b 03 mov (%rbx),%rax
3195 ffffffff802848a3: 25 00 40 00 00 and $0x4000,%eax


The disassembly of many functions is large, so the new kerneltop truncates it by filtering out
instructions whose count is smaller than count_filter. It only shows the instructions whose count
is more than count_filter, plus the 3 instructions just ahead of each reported instruction. For example, before
printing
3195 ffffffff802848a3: 25 00 40 00 00 and $0x4000,%eax
the new kerneltop prints the 3 preceding instructions:
0 ffffffff80284898: 48 6b c0 38 imul $0x38,%rax,%rax
0 ffffffff8028489c: 48 8d 1c 10 lea (%rax,%rdx,1),%rbx
0 ffffffff802848a0: 48 8b 03 mov (%rbx),%rax

So users can quickly go back to find the instruction which really causes the event (not the
instruction reported by the performance counter).

Below is the patch against the Dec/23/2008 version of kerneltop.

yanmin

---

--- kerneltop.c.orig 2009-01-08 16:39:16.000000000 +0800
+++ kerneltop.c 2009-01-08 16:39:16.000000000 +0800
@@ -3,7 +3,7 @@

Build with:

- cc -O6 -Wall `pkg-config --cflags glib-2.0` -c -o kerneltop.o kerneltop.c
+ cc -O6 -Wall `pkg-config --cflags --libs glib-2.0` -o kerneltop kerneltop.c

Sample output:

@@ -291,8 +291,6 @@ static void process_options(int argc, ch
else
event_count[counter] = 100000;
}
- if (nr_counters == 1)
- count_filter = 0;
}

static uint64_t min_ip;
@@ -307,7 +305,7 @@ struct sym_entry {

#define MAX_SYMS 100000

-static unsigned int sym_table_count;
+static int sym_table_count;

struct sym_entry *sym_filter_entry;

@@ -350,7 +348,7 @@ static struct sym_entry tmp[MAX_SYMS];

static void print_sym_table(void)
{
- unsigned int i, printed;
+ int i, printed;
int counter;

memcpy(tmp, sym_table, sizeof(sym_table[0])*sym_table_count);
@@ -494,7 +492,6 @@ static int read_symbol(FILE *in, struct
printf("symbol filter start: %016lx\n", filter_start);
printf(" end: %016lx\n", filter_end);
filter_end = filter_start = 0;
- sym_filter_entry = NULL;
sym_filter = NULL;
sleep(1);
}
@@ -502,7 +499,6 @@ static int read_symbol(FILE *in, struct
if (filter_match == 0 && sym_filter && !strcmp(s->sym, sym_filter)) {
filter_match = 1;
filter_start = s->addr;
- sym_filter_entry = s;
}

return 0;
@@ -538,6 +534,16 @@ static void parse_symbols(void)
last->sym = "<end>";

qsort(sym_table, sym_table_count, sizeof(sym_table[0]), compare_addr);
+
+ if (filter_end) {
+ int count;
+ for (count=0; count < sym_table_count; count ++) {
+ if (!strcmp(sym_table[count].sym, sym_filter)) {
+ sym_filter_entry = &sym_table[count];
+ break;
+ }
+ }
+ }
}


@@ -617,11 +623,27 @@ static void lookup_sym_in_vmlinux(struct
}
}

+void show_lines(GList *item_queue, int item_queue_count)
+{
+ int i;
+ struct source_line *line;
+
+ for (i = 0; i < item_queue_count; i++) {
+ line = item_queue->data;
+ printf("%8li\t%s\n", line->count, line->line);
+ item_queue = g_list_next(item_queue);
+ }
+}
+
+#define TRACE_COUNT 3
+
static void show_details(struct sym_entry *sym)
{
struct source_line *line;
GList *item;
int displayed = 0;
+ GList *item_queue;
+ int item_queue_count = 0;

if (!sym->source)
lookup_sym_in_vmlinux(sym);
@@ -633,16 +655,28 @@ static void show_details(struct sym_entr
item = sym->source;
while (item) {
line = item->data;
- item = g_list_next(item);
if (displayed && strstr(line->line, ">:"))
break;

- printf("%8li\t%s\n", line->count, line->line);
+ if (!item_queue_count)
+ item_queue = item;
+ item_queue_count ++;
+
+ if (line->count >= count_filter) {
+ show_lines(item_queue, item_queue_count);
+ item_queue_count = 0;
+ item_queue = NULL;
+ } else if (item_queue_count > TRACE_COUNT) {
+ item_queue = g_list_next(item_queue);
+ item_queue_count --;
+ }
+
+ line->count = 0;
displayed++;
if (displayed > 300)
break;
+ item = g_list_next(item);
}
- exit(0);
}

/*