2013-09-03 13:29:39

by Ingo Molnar

Subject: [GIT PULL] perf changes for v3.12

Linus,

Please pull the latest perf-core-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf-core-for-linus

HEAD: 61bf86ad86443a710ae7eed372cccb8ed5038eda Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

As a first remark I'd like to point out that the obsolete '-f' (--force)
option, which has not done anything for several releases, has been removed
from 'perf record' and related utilities. Everyone please update muscle
memory accordingly! :-)

Main changes on the perf kernel side:

* Performance optimizations:
. for trace events, by Steve Rostedt.
. for time values, by Peter Zijlstra.

* New hardware support:
. for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
. for Intel SNB-EP uncore PMUs, by Zheng Yan

* Enhanced hardware support:
. for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

* Core perf events code enhancements and fixes:
. for full-nohz feature handling, by Frederic Weisbecker
. for group events, by Jiri Olsa
. for call chains, by Frederic Weisbecker
. for event stream parsing, by Adrian Hunter

* New ABI details:
. Add attr->mmap2 attribute, by Stephane Eranian
. Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
. Export u64 time_zero on the mmap header page to allow TSC
calculation, by Adrian Hunter (see the conversion sketch below
this list).
. Add dummy software event, by Adrian Hunter.
. Add a new PERF_SAMPLE_IDENTIFIER to make samples always
parseable, by Adrian Hunter.
. Make Power7 events available via sysfs, by Runzhen Wang.
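
For reference, the time_zero field, together with the existing
time_mult/time_shift fields on the mmap header page, lets user-space
convert a raw TSC value to perf time. A minimal sketch along the
lines of the new tools/perf/arch/x86/util/tsc.c (the documented
seqlock retry loop around reading the fields is omitted here):

   static u64 tsc_to_perf_time(u64 cyc, const struct perf_event_mmap_page *pc)
   {
           u64 quot = cyc >> pc->time_shift;
           u64 rem  = cyc & (((u64)1 << pc->time_shift) - 1);

           /* time = time_zero + (cyc * time_mult) >> time_shift,
              split up to avoid overflowing the 64-bit multiply: */
           return pc->time_zero + quot * pc->time_mult +
                  ((rem * pc->time_mult) >> pc->time_shift);
   }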

* Code cleanups and refactorings:
. for nohz-full, by Frederic Weisbecker
. for group events, by Jiri Olsa

* Documentation updates:
. for perf_event_type, by Peter Zijlstra

Main changes on the perf tooling side (some of these tooling changes
utilize the above kernel side changes):

* Lots of 'perf trace' enhancements:

. Make 'perf trace' command line arguments consistent with
'perf record', by David Ahern.

. Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

. Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

. Support ! in -e expressions, to filter a list of syscalls,
by Arnaldo Carvalho de Melo (see the usage sketch at the end of
this section).

. Arg formatting improvements to allow masking arguments in
syscalls such as futex and open, where some arguments are
ignored (and thus should not be printed) depending on other args,
by Arnaldo Carvalho de Melo.

. Beautify the open, openat, open_by_handle_at, lseek and futex
syscalls, by Arnaldo Carvalho de Melo.

. Add option to analyze events in a file versus live, so that one can do:

[root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
[root@zoo ~]# perf trace -i perf.data -e futex --duration 1
17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
[root@zoo ~]#

By David Ahern.

. Honor target pid / tid options when analyzing a file, by David Ahern.

. Introduce better formatting of syscall arguments, including,
so far, beautifiers for mmap, madvise and syscall return values,
by Arnaldo Carvalho de Melo.

. Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.
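
The syscall filtering options above combine along these lines (a
usage sketch - note that '!' needs escaping from the shell):

   perf trace -e open,close             # trace only these syscalls
   perf trace -e \!futex,epoll_wait     # trace everything but these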


* 'perf report/top' enhancements:

. Do annotation using /proc/kcore and /proc/kallsyms when
available, removing the forced need for a vmlinux file for kernel
assembly annotation. This also improves this use case because
vmlinux has just the initial kernel image, not what is actually
in use after various code patching passes by things like
alternatives. By Adrian Hunter.

. Add --ignore-callees=<regex> option to collapse undesired parts
of call graphs, by Greg Price.

. Simplify symbol filtering by doing it at machine class level,
by Adrian Hunter.

. Add support for callchains in the gtk UI, by Namhyung Kim.

. Add --objdump option to 'perf top', by Sukadev Bhattiprolu.


* 'perf kvm' enhancements:

. Add option to print only events that exceed a specified time
duration, by David Ahern.

. Improve stack trace printing, by David Ahern.

. Update documentation of the live command, by David Ahern.

. Add perf kvm stat live mode that combines aspects of 'perf kvm
stat' record and report, by David Ahern.

. Add option to analyze specific VM in perf kvm stat report, by
David Ahern.

. Do not require /lib/modules/* on a guest, by Jason Wessel.


* 'perf script' enhancements:

. Fix symbol offset computation for some dsos, by David Ahern.

. Fix named threads support, by David Ahern.

. Don't install scripting files when perl/python support
is disabled, by Arnaldo Carvalho de Melo.


* 'perf test' enhancements:

. Add various improvements and fixes to the "vmlinux matches
kallsyms" 'perf test' entry, related to the /proc/kcore
annotation feature. By Arnaldo Carvalho de Melo.

. Add sample parsing test, by Adrian Hunter.

. Add test for reading object code, by Adrian Hunter.

. Add attr record group sampling test, by Jiri Olsa.

. Misc testing infrastructure improvements and other details,
by Jiri Olsa.


* 'perf list' enhancements:

. Skip unsupported hardware events, by Namhyung Kim.

. List pmu events, by Andi Kleen.


* 'perf diff' enhancements:

. Add support for more than two files comparison, by Jiri Olsa.


* 'perf sched' enhancements:

. Various improvements, including removing reliance on some
scheduler tracepoints that provide the same information as the
PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.

. Remove odd build stall by moving a large struct initialization
from a local variable to a global one, by Namhyung Kim.

* 'perf stat' enhancements:

. Add --initial-delay option to skip measuring for a defined
startup phase, by Andi Kleen.
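
For example, skipping the first 500 msecs of a workload's startup
would look something like this (the delay argument is in msecs):

   perf stat --initial-delay 500 -- make -j64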

* Generic perf tooling infrastructure/plumbing changes:

. Tidy up sample parsing validation, by Adrian Hunter.

. Fix up jobserver setup in libtraceevent Makefile.
by Arnaldo Carvalho de Melo.

. Debug improvements, by Adrian Hunter.

. Fix correlation of samples coming after PERF_RECORD_EXIT event,
by David Ahern.

. Improve robustness of the topology parsing code,
by Stephane Eranian.

. Add group leader sampling, which allows just one event in a group
to sample while the other events just have their values read,
by Jiri Olsa (see the example at the end of this section).

. Add support for a new modifier "D", which requests that the
event, or group of events, be pinned to the PMU.
By Michael Ellerman.

. Support callchain sorting based on addresses, by Andi Kleen.

. Prep work for multi perf data file storage, by Jiri Olsa.

. libtraceevent cleanups, by Namhyung Kim.
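
As an example of the group leader sampling above: the new 'S'
modifier applies to an event group, so an invocation along these
lines samples only on the group leader while reading the values of
the other group members with each sample:

   perf record -e '{cycles,instructions,cache-misses}:S' -a sleep 1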

And lots and lots of other fixes and code reorganizations that did not
make it into the list, see the shortlog, diffstat and the Git log for
details!

Thanks,

Ingo

------------------>
Adrian Hunter (48):
perf inject: Remove unused parameter
perf tools: Fix missing tool parameter
perf inject: Add missing 'finished_round'
perf tools: Add const specifier to perf_pmu__find name parameter
perf evlist: Tidy duplicated munmap code
perf tools: Validate perf event header size
perf tools: struct thread has a tid not a pid
perf: Fix broken union in 'struct perf_event_mmap_page'
perf/x86: Add ability to calculate TSC from perf sample timestamps
perf tools: Add test for converting perf time to/from TSC
perf symbols: avoid SyS kernel syscall aliases
perf tests: Add test for reading object code
perf symbols: Load kernel maps before using
perf tools: Make it possible to read object code from vmlinux
perf tests: Adjust the vmlinux symtab matches kallsyms test
perf tools: Make it possible to read object code from kernel modules
perf symbols: Add support for reading from /proc/kcore
perf tests: Adjust the vmlinux symtab matches kallsyms test again
perf tests: Add kcore to the object code reading test
perf annotate: Allow disassembly using /proc/kcore
perf annotate: Put dso name in symbol annotation title
perf annotate: Remove nop at end of annotation
perf annotate: Add call target name if it is missing
perf machine: Add symbol filter to struct machine
perf top: Set the machines symbol filter
perf report: Set the machines symbol filter
perf mem: Remove unused symbol filter member
perf annotate: Set the machines symbol filter
perf tools: Remove filter parameter of perf_event__preprocess_sample()
perf tools: Remove filter parameter of thread__find_addr_location()
perf tools: Remove filter parameter of thread__find_addr_map()
perf tools: Re-implement debug print function for linking python/perf.so
perf tools: Add debug prints
perf tools: Add pid to struct thread
perf tools: change machine__findnew_thread() to set thread pid
perf evsel: Tidy up sample parsing overflow checking
perf callchain: Remove unnecessary validation
perf tools: Remove references to struct ip_event
perf: make events stream always parsable
perf evlist: Move perf_evlist__config() to a new source file
perf tools: Add support for PERF_SAMPLE_IDENTIFIER
perf tools: Add missing 'abi' member to 'struct regs_dump'
perf tools: Expand perf_event__synthesize_sample()
perf tools: Add a function to calculate sample event size
perf tests: Add a sample parsing test
perf: Add a dummy software event to keep tracking
perf tools: Add support for PERF_COUNT_SW_DUMMY
perf tests: Add 'keep tracking' test

Andi Kleen (10):
perf tools: Default to cpu// for events v5
perf list: List kernel supplied event aliases
perf bench: Fix memcpy benchmark for large sizes
perf tools: Support callchain sorting based on addresses
perf tools: Move weight back to common sort keys
perf evlist: Remove obsolete dummy execve
perf evsel: Add support for enabling counters
perf stat: Add support for --initial-delay option
perf stat: Flush output after each line in interval mode
perf tools: Try to increase the file descriptor limits on EMFILE

Arnaldo Carvalho de Melo (23):
perf trace: Beautify 'connect' result
perf python: Remove duplicate TID bit from mask
perf annotate browser: Improve description of '?' hotkey
perf trace: Allow specifying which syscalls to trace
perf trace: Implement -o/--output filename
perf tools: Don't install scripting files files when disabled
perf trace: Support ! in -e expressions
perf trace: Add --verbose option
perf trace: Hide sys_exit messages about syscall id = -1
perf trace: Introduce syscall arg formatters
perf trace: Simplify sys_exit return printing
perf trace: Allow printing syscall return values in hex
perf trace: Add aliases to remaining syscalls of the sys_enter_newfoo
perf trace: Allow overiding the formatting of syscall fields
perf trace: Add beautifier for mmap prot parm
perf trace: Add beautifier for mmap flags parm
perf trace: Add beautifier for madvise behaviour/advice parm
tools lib traceevent: Fixup jobserver setup
perf trace: Allow syscall arg formatters to mask args
perf trace: Add beautifier for futex 'operation' parm
perf trace: Add beautifier for lseek's whence arg
perf trace: Add beautifier for open's flags arg
perf trace: Tell arg formatters the arg index

David Ahern (36):
perf tools: Add methods for setting/retrieving priv element of thread struct
perf evlist: Fix use of uninitialized variable
perf tools: Don't free list head in parse_events__free_terms
perf tests: Make terms a stack variable in test_term
perf parse events: Demystify memory allocations
perf script: Fix named threads support
perf evsel: Handle ENODEV on default cycles event
perf evsel: Actually show symbol offset in stack trace when requested
perf tools: Fix compile of util/tsc.c
perf top: move CONSOLE_CLEAR to header file
perf stats: Add max and min stats
perf session: Export a few functions for event processing
perf kvm: Split out tracepoints from record args
perf session: Export queue_event function
perf kvm: Add live mode
perf kvm: Add min and max stats to display
perf kvm stat report: Add option to analyze specific VM
perf kvm: Option to print events that exceed a duration
perf kvm: Update documentation with live command
perf sched: Simplify arguments to read_events
perf sched: Remove thread lookup in sample handler
perf sched: Remove sched_process_exit tracepoint
perf sched: Remove sched_process_fork tracepoint
perf tool: Simplify options to perf_evsel__print_ip
perf evsel: Add option to print stack trace on single line
perf evsel: Add option to limit stack depth in callchain dumps
perf session: Change perf_session__has_traces to actually check for tracepoints
perf tests: Fix compile failure on do_sort_something
perf kvm: Remove force option to cmd_record
perf trace: Make command line arguments consistent with perf-record
perf tools: Sample after exit loses thread correlation
perf evlist: Add tracepoint lookup by name
perf trace: Add option to analyze events in a file versus live
perf trace: Honor target pid / tid options when analyzing a file
perf trace: Handle missing HUGEPAGE defines
perf tools: Fix symbol offset computation for some dsos

Frederic Weisbecker (10):
perf: Fix branch stack refcount leak on callchain init failure
perf: Sanitize get_callchain_buffer()
perf: Factor out event accounting code to account_event()/__free_event()
perf: Split the per-cpu accounting part of the event accounting code
perf: Migrate per cpu event accounting
perf: Account freq events per cpu
perf: Implement finer grained full dynticks kick
watchdog: Make it work under full dynticks
perf: Roll back callchain buffer refcount under the callchain mutex
perf: Account freq events globally

Greg Price (1):
perf report/top: Add option to collapse undesired parts of call graph

Ingo Molnar (1):
perf annotate browser: Fix typo

Jason Wessel (1):
perf machine: Do not require /lib/modules/* on a guest

Jiri Kosina (3):
x86: Introduce int3 (breakpoint)-based instruction patching
x86: Make jump_label use int3-based patching
kprobes/x86: Call out into INT3 handler directly instead of using notifier

Jiri Olsa (46):
perf: Remove the 'match' callback for auxiliary events processing
perf tools: Remove cwd from perf_session struct
perf tests: Omit end of the symbol check failure for test 1
perf tests: Make TEST_ASSERT_VAL global
perf tools: Remove callchain_cursor_reset call
perf tools: Do not elide parent symbol column
perf report: Fix perf_session__delete removal
perf tools: Add struct perf_hpp_fmt into hpp callbacks
perf tools: Centralize default columns init in perf_hpp__init
perf diff: Introducing diff_data object to hold files
perf diff: Switching the base hists to be pairs head
perf hists: Marking dummy hists entries
perf diff: Display data file info ahead of the diff output
perf diff: Move diff related columns into diff command
perf diff: Move columns into struct data__file
perf diff: Change diff command to work over multiple data files
perf diff: Update perf diff documentation for multiple data comparison
perf diff: Making compute functions static
perf diff: Add generic order option for compute sorting
perf tools: Move hist_entry__period_snprintf into stdio code
perf timechart: Use traceevent lib event-parse.h include
perf timechart: Remove event types framework only user
perf tools: Remove event types from perf data file
perf record: Remove event types pushing
perf tools: Remove event types framework completely
perf tests: Check proper prev_state size for sched_switch tp
perf session: Use session->fd instead of passing fd as argument
perf header: Remove data_offset seek as it's not needed
perf header: Remove attr_offset from perf_header
perf header: Introduce feat_offset into perf_header
perf header: Recognize version number for perf data file
perf tests: Run ctags/cscope make tests only with needed binaries
perf tests: Rename TMP to TMP_O tests/make variable
perf tests: Add DESTDIR=TMP_DEST tests/make variable
perf tests: Add 'make install/install-bin' tests into tests/make
perf tests: Add broken install-* tests into tests/make
perf: Add PERF_EVENT_IOC_ID ioctl to return event ID
perf: Do not get values from disabled counters in group format read
perf evlist: Use PERF_EVENT_IOC_ID perf ioctl to read event id
perf tools: Add support for parsing PERF_SAMPLE_READ sample type
perf evlist: Fix event ID retrieval for group format read case
perf evlist: Add perf_evlist__id2sid method to get event ID related data
perf evsel: Add PERF_SAMPLE_READ sample related processing
perf tools: Add 'S' event/group modifier to read sample value
perf tests: Add attr record group sampling test
perf tests: Add parse events tests for leader sampling

Joe Perches (1):
perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()

Joonsoo Kim (1):
Revert "tools lib lk: Fix for cross build"

Kirill A. Shutemov (1):
perf tools: Fix build with perl 5.18

Masami Hiramatsu (3):
kprobes/x86: Remove an incorrect comment about int3 in NMI/MCE
kprobes/x86: Use text_poke_bp() instead of text_poke_smp*()
kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions

Michael Ellerman (2):
perf tools: Add support for pinned modifier
perf tests: Add tests of new pinned modifier

Michael Wang (1):
sched: Implement smarter wake-affine logic

Namhyung Kim (29):
perf util: Move debugfs/tracing helper functions to util.c
perf util: Use evsel->name to get tracepoint_paths
tools lib traceevent: Remove unused install targets
tools lib traceevent: Get rid of unused gui target
tools lib traceevent: Add const qualifier to string arguments
tools lib traceevent: Add trace_seq_reset()
tools lib traceevent: Add page_size field to pevent
tools lib traceevent: Port kbuffer parser routines
perf util: Save page size in a trace file to pevent
perf util: Save long size of traced system
perf util: Make file/host_bigendian variable local
perf util: Skip reading header_event file
perf util: Parse header_page to get proper long size
perf util: Get rid of unused header_page_* variables
perf script: Adopt latency_format variable
perf util: Rename read_*() functions in trace-event-info.c
perf util: No need to call read_trace_init() in tracing_data_header()
perf util: Remove unused enum and macro in trace-event.h
perf sched: Move struct perf_sched definition out of cmd_sched()
perf gtk/hists: Use GtkTreeStore instead of GtkListStore
perf gtk/hists: Add support for callchains
perf gtk/hists: Display callchain overhead also
perf gtk/hists: Make column headers resizable
perf gtk/hists: Add a double-click handler for callchains
perf gtk/hists: Set rules hint for the hist browser
perf symbols: Do not apply symfs for an absolute vmlinux path
perf util: Add parse_nsec_time() function
perf ui/gtk: Fix segmentation fault on perf_hpp__for_each_format loop
perf list: Skip unsupported events

Oleg Nesterov (3):
tracing/perf: Expand TRACE_EVENT(sched_stat_runtime)
tracing/perf: Reimplement TP_perf_assign() logic
tracing/perf: Avoid perf_trace_buf_*() in perf_trace_##call() when possible

Peter Zijlstra (3):
perf: Update perf_event_type documentation
sched: Micro-optimize the smart wake-affine logic
perf: Do not compute time values unnecessarily

Robert Richter (2):
perf tools: Use default include path notation for libtraceevent headers
perf tools: Fix 'make tools/perf'

Runzhen Wang (1):
perf tools: Make Power7 events available for perf

Stephane Eranian (2):
perf tools: Improve robustness of topology parsing code
perf: Add attr->mmap2 attribute to an event

Sukadev Bhattiprolu (2):
perf tools: Rename cpu_map__all() to cpu_map__empty()
perf top: Add --objdump option

Vince Weaver (1):
perf: Export struct perf_branch_entry to userspace

Vladimir Davydov (1):
sched: Move h_load calculation to task_h_load()

Yan, Zheng (5):
perf/x86/intel/uncore: Add auxiliary pci device support
perf/x86/intel/uncore: Add filter support for QPI boxes
perf/x86/intel/uncore: Enable EV_SEL_EXT bit for PCU
perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
perf/x86: Add Silvermont (22nm Atom) support


arch/powerpc/include/asm/perf_event_server.h | 4 +-
arch/powerpc/perf/power7-events-list.h | 548 +++++++++++++++
arch/powerpc/perf/power7-pmu.c | 148 +---
arch/x86/Kconfig | 5 -
arch/x86/include/asm/alternative.h | 14 +-
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/alternative.c | 155 +++--
arch/x86/kernel/cpu/perf_event.c | 6 +
arch/x86/kernel/cpu/perf_event.h | 2 +
arch/x86/kernel/cpu/perf_event_amd.c | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 180 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 +-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 254 +++++--
arch/x86/kernel/cpu/perf_event_intel_uncore.h | 10 +
arch/x86/kernel/jump_label.c | 16 +-
arch/x86/kernel/kprobes/common.h | 5 -
arch/x86/kernel/kprobes/core.c | 2 +-
arch/x86/kernel/kprobes/opt.c | 110 +--
arch/x86/kernel/traps.c | 4 +
arch/x86/kernel/tsc.c | 6 +
include/linux/perf_event.h | 24 -
include/linux/sched.h | 3 +
include/trace/events/sched.h | 22 +-
include/trace/ftrace.h | 33 +-
include/uapi/linux/perf_event.h | 123 +++-
kernel/events/callchain.c | 3 +
kernel/events/core.c | 381 +++++++----
kernel/sched/core.c | 7 +-
kernel/sched/fair.c | 105 ++-
kernel/sched/sched.h | 8 +-
kernel/watchdog.c | 8 -
tools/lib/lk/Makefile | 15 -
tools/lib/traceevent/Makefile | 20 +-
tools/lib/traceevent/event-parse.c | 7 +-
tools/lib/traceevent/event-parse.h | 15 +-
tools/lib/traceevent/kbuffer-parse.c | 732 ++++++++++++++++++++
tools/lib/traceevent/kbuffer.h | 67 ++
tools/lib/traceevent/trace-seq.c | 13 +
tools/perf/Documentation/perf-diff.txt | 79 ++-
tools/perf/Documentation/perf-kvm.txt | 46 +-
tools/perf/Documentation/perf-list.txt | 6 +-
tools/perf/Documentation/perf-report.txt | 13 +-
tools/perf/Documentation/perf-stat.txt | 5 +
tools/perf/Documentation/perf-top.txt | 5 +
tools/perf/Documentation/perf-trace.txt | 24 +
tools/perf/Makefile | 21 +-
tools/perf/arch/x86/Makefile | 2 +
tools/perf/arch/x86/util/tsc.c | 59 ++
tools/perf/arch/x86/util/tsc.h | 20 +
tools/perf/bench/mem-memcpy.c | 2 +
tools/perf/builtin-annotate.c | 5 +-
tools/perf/builtin-diff.c | 664 ++++++++++++++----
tools/perf/builtin-inject.c | 53 +-
tools/perf/builtin-kmem.c | 5 +-
tools/perf/builtin-kvm.c | 754 +++++++++++++++++++--
tools/perf/builtin-list.c | 3 +
tools/perf/builtin-lock.c | 3 +-
tools/perf/builtin-mem.c | 6 +-
tools/perf/builtin-record.c | 13 -
tools/perf/builtin-report.c | 88 +--
tools/perf/builtin-sched.c | 161 ++---
tools/perf/builtin-script.c | 42 +-
tools/perf/builtin-stat.c | 24 +-
tools/perf/builtin-timechart.c | 176 ++---
tools/perf/builtin-top.c | 36 +-
tools/perf/builtin-trace.c | 721 ++++++++++++++++++--
tools/perf/config/Makefile | 5 +-
tools/perf/perf.h | 3 +
tools/perf/python/twatch.py | 2 +-
tools/perf/tests/attr/test-record-group-sampling | 36 +
tools/perf/tests/builtin-test.c | 18 +
tools/perf/tests/code-reading.c | 572 ++++++++++++++++
tools/perf/tests/dso-data.c | 8 -
tools/perf/tests/evsel-tp-sched.c | 4 +-
tools/perf/tests/hists_link.c | 27 +-
tools/perf/tests/keep-tracking.c | 154 +++++
tools/perf/tests/make | 67 +-
tools/perf/tests/mmap-basic.c | 2 +-
tools/perf/tests/parse-events.c | 190 +++++-
tools/perf/tests/perf-time-to-tsc.c | 177 +++++
tools/perf/tests/sample-parsing.c | 316 +++++++++
tools/perf/tests/tests.h | 12 +
tools/perf/tests/vmlinux-kallsyms.c | 49 +-
tools/perf/ui/browsers/annotate.c | 20 +-
tools/perf/ui/browsers/hists.c | 18 +-
tools/perf/ui/gtk/hists.c | 128 +++-
tools/perf/ui/hist.c | 258 +------
tools/perf/ui/setup.c | 1 -
tools/perf/ui/stdio/hist.c | 45 +-
tools/perf/util/annotate.c | 60 +-
tools/perf/util/build-id.c | 11 +-
tools/perf/util/callchain.c | 15 +-
tools/perf/util/callchain.h | 11 +-
tools/perf/util/cpumap.h | 2 +-
tools/perf/util/dso.c | 10 +-
tools/perf/util/dso.h | 17 +
tools/perf/util/event.c | 31 +-
tools/perf/util/event.h | 41 +-
tools/perf/util/evlist.c | 294 ++++++--
tools/perf/util/evlist.h | 21 +-
tools/perf/util/evsel.c | 581 ++++++++++++++--
tools/perf/util/evsel.h | 19 +-
tools/perf/util/header.c | 161 +----
tools/perf/util/header.h | 40 +-
tools/perf/util/hist.c | 4 +-
tools/perf/util/hist.h | 26 +-
tools/perf/util/include/linux/string.h | 1 +
tools/perf/util/machine.c | 155 +++--
tools/perf/util/machine.h | 14 +-
tools/perf/util/map.c | 67 +-
tools/perf/util/map.h | 13 +
tools/perf/util/parse-events.c | 174 +++--
tools/perf/util/parse-events.h | 11 +-
tools/perf/util/parse-events.l | 4 +-
tools/perf/util/parse-events.y | 62 +-
tools/perf/util/pmu.c | 87 ++-
tools/perf/util/pmu.h | 5 +-
tools/perf/util/python.c | 21 +
tools/perf/util/record.c | 108 +++
.../perf/util/scripting-engines/trace-event-perl.c | 14 +-
.../util/scripting-engines/trace-event-python.c | 9 +-
tools/perf/util/session.c | 241 +++++--
tools/perf/util/session.h | 14 +-
tools/perf/util/sort.c | 12 +-
tools/perf/util/sort.h | 13 +-
tools/perf/util/stat.c | 6 +
tools/perf/util/stat.h | 9 +
tools/perf/util/string.c | 24 +
tools/perf/util/symbol-elf.c | 174 ++++-
tools/perf/util/symbol-minimal.c | 7 +
tools/perf/util/symbol.c | 285 +++++++-
tools/perf/util/symbol.h | 5 +
tools/perf/util/thread.c | 11 +-
tools/perf/util/thread.h | 23 +-
tools/perf/util/tool.h | 10 +-
tools/perf/util/top.h | 2 +
tools/perf/util/trace-event-info.c | 96 +--
tools/perf/util/trace-event-parse.c | 6 -
tools/perf/util/trace-event-read.c | 52 +-
tools/perf/util/trace-event-scripting.c | 3 +-
tools/perf/util/trace-event.h | 21 +-
tools/perf/util/unwind.c | 2 +-
tools/perf/util/util.c | 92 +++
tools/perf/util/util.h | 5 +
144 files changed, 9132 insertions(+), 2309 deletions(-)
create mode 100644 arch/powerpc/perf/power7-events-list.h
create mode 100644 tools/lib/traceevent/kbuffer-parse.c
create mode 100644 tools/lib/traceevent/kbuffer.h
create mode 100644 tools/perf/arch/x86/util/tsc.c
create mode 100644 tools/perf/arch/x86/util/tsc.h
create mode 100644 tools/perf/tests/attr/test-record-group-sampling
create mode 100644 tools/perf/tests/code-reading.c
create mode 100644 tools/perf/tests/keep-tracking.c
create mode 100644 tools/perf/tests/perf-time-to-tsc.c
create mode 100644 tools/perf/tests/sample-parsing.c
create mode 100644 tools/perf/util/record.c

[ Full diff omitted for size reasons. ]


2013-09-03 13:38:09

by Arnaldo Carvalho de Melo

Subject: Re: [GIT PULL] perf changes for v3.12

On Tue, Sep 03, 2013 at 03:29:33PM +0200, Ingo Molnar wrote:
> Linus,
>
> Please pull the latest perf-core-for-linus git tree from:

There were some misattributions I found, from memory, here, clarifying
FWIW:

> * 'perf test' enhancements:
>
> . Add various improvements and fixes to the "vmlinux matches
> kallsyms" 'perf test' entry, related to the /proc/kcore
> annotation feature. By Arnaldo Carvalho de Melo.

by Adrian Hunter

> * 'perf sched' enhancements:
>
> . Various improvements, including removing reliance on some
> scheduler tracepoints that provide the same information as the
> PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.

By David Ahern.

:-)

- Arnaldo

2013-09-03 13:43:24

by Ingo Molnar

Subject: Re: [GIT PULL] perf changes for v3.12


* Arnaldo Carvalho de Melo <[email protected]> wrote:

> On Tue, Sep 03, 2013 at 03:29:33PM +0200, Ingo Molnar wrote:
> > Linus,
> >
> > Please pull the latest perf-core-for-linus git tree from:
>
> There were some misattributions I found, from memory, here, clarifying
> FWIW:
>
> > * 'perf test' enhancements:
> >
> > . Add various improvements and fixes to the "vmlinux matches
> > kallsyms" 'perf test' entry, related to the /proc/kcore
> > annotation feature. By Arnaldo Carvalho de Melo.
>
> by Adrian Hunter
>
> > * 'perf sched' enhancements:
> >
> > . Various improvements, including removing reliance on some
> > scheduler tracepoints that provide the same information as the
> > PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.
>
> By David Ahern.
>
> :-)

Thanks, I knew I'd get something wrong when reorganizing and summarizing
all the changelogs topically! ;-)

Ingo

2013-09-03 17:01:28

by Vince Weaver

Subject: Re: [GIT PULL] perf changes for v3.12

On Tue, 3 Sep 2013, Ingo Molnar wrote:
> * New ABI details:
> . Make Power7 events available via sysfs, by Runzhen Wang.

So we're really going to add 100+ Power7 events to the stable sysfs ABI?
Are all the new events listed under the sysfs ABI documentation?

Are we going to add all of the x86 events too, adding 100kB of unswappable
event listings to every user's kernel whether they use perf_event or not?

Yes, I know I am a broken record on this issue, but I don't feel like
anyone has clarified the future plans here.

Vince

2013-09-04 17:53:50

by Linus Torvalds

Subject: Re: [GIT PULL] perf changes for v3.12

On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <[email protected]> wrote:
>
> Please pull the latest perf-core-for-linus git tree from:

I don't think this is new at all, but I just tried to do a perf
record/report of "make -j64 test" on git:

It's a big perf.data file (1.6G), but after it has done the
"processing time ordered events" thing it results in:

┌─Warning:───────────────────────────────────┐
│Processed 8672030 events and lost 71 chunks!│
│Check IO/CPU overload! │
│ │
│ │
│Press any key... │
└────────────────────────────────────────────┘

and then it just hangs using 100% CPU time. Pressing any key doesn't
do anything.

It may well still be *doing* something, and maybe it will come back
some day with results. But it sure doesn't show any indication that it
will.

Try this (in a current git source tree: note, by "git" I actually mean
git itself, not some random git repository):

perf record -g -e cycles:pp make -j64 test >& out
perf report

maybe you can reproduce it.

Linus

2013-09-05 10:56:46

by Ingo Molnar

Subject: Re: [GIT PULL] perf changes for v3.12


(Cc:-ed Frederic and Namhyung as well, it's about bad overhead in
tools/perf/util/hist.c.)

* Linus Torvalds <[email protected]> wrote:

> On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <[email protected]> wrote:
> >
> > Please pull the latest perf-core-for-linus git tree from:
>
> I don't think this is new at all, but I just tried to do a perf
> record/report of "make -j64 test" on git:
>
> It's a big perf.data file (1.6G), but after it has done the
> "processing time ordered events" thing it results in:
>
> ┌─Warning:───────────────────────────────────┐
> │Processed 8672030 events and lost 71 chunks!│
> │Check IO/CPU overload! │
> │ │
> │ │
> │Press any key... │
> └────────────────────────────────────────────┘
>
> and then it just hangs using 100% CPU time. Pressing any key doesn't
> do anything.
>
> It may well still be *doing* something, and maybe it will come back
> some day with results. But it sure doesn't show any indication that it
> will.
>
> Try this (in a current git source tree: note, by "git" I actually mean
> git itself, not some random git repository):
>
> perf record -g -e cycles:pp make -j64 test >& out
> perf report
>
> maybe you can reproduce it.

I managed to reproduce it on a 32-way box via:

perf record -g make -j64 bzImage >/dev/null 2>&1

It's easier to debug it without the TUI:

perf --no-pager report --stdio

It turns out that even with a 400 MB perf.data the 'perf report' call will
eventually finish - here it ran for almost half an hour(!) on a fast box.

Arnaldo, the large overhead is in hists__collapse_resort(), in particular
it's doing append_chain_children() 99% of the time:

- 99.74% perf perf [.] append_chain_children ◆
- append_chain_children ▒
- 99.76% merge_chain_branch ▒
- merge_chain_branch ▒
+ 98.04% hists__collapse_resort ▒
+ 1.96% merge_chain_branch ▒
+ 0.05% perf perf [.] merge_chain_branch ▒
+ 0.03% perf libc-2.17.so [.] _int_free ▒
+ 0.03% perf libc-2.17.so [.] __libc_calloc ▒
+ 0.02% perf [kernel.kallsyms] [k] account_user_time ▒
+ 0.02% perf libc-2.17.so [.] _int_malloc ▒

It seems to be stuck in hists__collapse_resort().

In particular the overhead arises because the following loop in
append_chain_children():

/* lookup in childrens */
chain_for_each_child(rnode, root) {
unsigned int ret = append_chain(rnode, cursor, period);

Reaches very long counts and the algorithm gets quadratic (at least). The
child count reaches over 100,000 entries in the end (!).

I don't think the high child count in itself is anomalous: a kernel build
generates thousands of processes, tons of symbol ranges and tens of
millions of call chain entries.

So I think what we need here is to speed up the lookup: put children into
a secondary, ->pos,len indexed range-rbtree and do a binary search instead
of a linear search over 100,000 child entries ... or something like that.
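
Roughly along these lines, using the kernel's rbtree helpers (only a
sketch: the chain_node layout, the start_ip key and find_child() are
made up here - the real key would have to match whatever
append_chain() compares):

   #include <linux/rbtree.h>
   #include <linux/types.h>

   /* Hypothetical layout: children indexed in an rbtree by start ip */
   struct chain_node {
           struct rb_node  rb_node_in;     /* link in the parent's tree  */
           struct rb_root  rb_root_in;     /* this node's children       */
           u64             start_ip;       /* key: ip of the first entry */
   };

   /* O(log n) child lookup replacing the linear chain_for_each_child() scan */
   static struct chain_node *find_child(struct chain_node *parent, u64 ip)
   {
           struct rb_node *n = parent->rb_root_in.rb_node;

           while (n) {
                   struct chain_node *child =
                           rb_entry(n, struct chain_node, rb_node_in);

                   if (ip < child->start_ip)
                           n = n->rb_left;
                   else if (ip > child->start_ip)
                           n = n->rb_right;
                   else
                           return child;
           }
           return NULL;
   }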

Btw., a side note, append_chain() is a rather confusing function in
itself, with logic-inversion gems like:

if (!found)
found = true;

All that should be cleaned up as well I guess.

The 'IO overload' message appears to be a separate, unrelated bug, it just
annoyingly does not get refreshed away in the TUI before
hists__collapse_resort() is called, and there's also no progress bar for
the hists__collapse_resort() pass, so to the user it all looks like a
deadlock.

So there's at least two bugs here:

- the bad overhead in hists__collapse_resort()

- bad usability if hists__collapse_resort() takes more than 1 second to finish

Thanks,

Ingo

2013-09-05 12:42:50

by Frederic Weisbecker

Subject: Re: [GIT PULL] perf changes for v3.12

On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
>
> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in
> tools/perf/util/hist.c.)
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <[email protected]> wrote:
> > >
> > > Please pull the latest perf-core-for-linus git tree from:
> >
> > I don't think this is new at all, but I just tried to do a perf
> > record/report of "make -j64 test" on git:
> >
> > It's a big perf.data file (1.6G), but after it has done the
> > "processing time ordered events" thing it results in:
> >
> > ┌─Warning:───────────────────────────────────┐
> > │Processed 8672030 events and lost 71 chunks!│
> > │Check IO/CPU overload! │
> > │ │
> > │ │
> > │Press any key... │
> > └────────────────────────────────────────────┘
> >
> > and then it just hangs using 100% CPU time. Pressing any key doesn't
> > do anything.
> >
> > It may well still be *doing* something, and maybe it will come back
> > some day with results. But it sure doesn't show any indication that it
> > will.
> >
> > Try this (in a current git source tree: note, by "git" I actually mean
> > git itself, not some random git repository):
> >
> > perf record -g -e cycles:pp make -j64 test >& out
> > perf report
> >
> > maybe you can reproduce it.
>
> I managed to reproduce it on a 32-way box via:
>
> perf record -g make -j64 bzImage >/dev/null 2>&1
>
> It's easier to debug it without the TUI:
>
> perf --no-pager report --stdio
>
> It turns out that even with a 400 MB perf.data the 'perf report' call will
> eventually finish - here it ran for almost half an hour(!) on a fast box.
>
> Arnaldo, the large overhead is in hists__collapse_resort(), in particular
> it's doing append_chain_children() 99% of the time:
>
> - 99.74% perf perf [.] append_chain_children ◆
> - append_chain_children ▒
> - 99.76% merge_chain_branch ▒
> - merge_chain_branch ▒
> + 98.04% hists__collapse_resort ▒
> + 1.96% merge_chain_branch ▒
> + 0.05% perf perf [.] merge_chain_branch ▒
> + 0.03% perf libc-2.17.so [.] _int_free ▒
> + 0.03% perf libc-2.17.so [.] __libc_calloc ▒
> + 0.02% perf [kernel.kallsyms] [k] account_user_time ▒
> + 0.02% perf libc-2.17.so [.] _int_malloc ▒
>
> It seems to be stuck in hists__collapse_resort().
>
> In particular the overhead arises because the following loop in
> append_chain_children():
>
> /* lookup in childrens */
> chain_for_each_child(rnode, root) {
> unsigned int ret = append_chain(rnode, cursor, period);
>
> Reaches very long counts and the algorithm gets quadratic (at least). The
> child count reaches over 100,000 entries in the end (!).
>
> I don't think the high child count in itself is anomalous: a kernel build
> generates thousands of processes, tons of symbol ranges and tens of
> millions of call chain entries.
>
> So I think what we need here is to speed up the lookup: put children into
> a secondary, ->pos,len indexed range-rbtree and do a binary search instead
> of a linear search over 100,000 child entries ... or something like that.

You're right it's worth trying.

At least it might give better results for such high scale callchain trees.
I'll see what I can come up with.

>
> Btw., a side note, append_chain() is a rather confusing function in
> itself, with logic-inversion gems like:
>
> if (!found)
> found = true;

The check is pointless yeah, I'll remove that.

>
> All that should be cleaned up as well I guess.
>
> The 'IO overload' message appears to be a separate, unrelated bug, it just
> annoyingly does not get refreshed away in the TUI before
> hists__collapse_resort() is called, and there's also no progress bar for
> the hists__collapse_resort() pass, so to the user it all looks like a
> deadlock.
>
> So there's at least two bugs here:
>
> - the bad overhead in hists__collapse_resort()
>
> - bad usability if hists__collapse_resort() takes more than 1 second to finish

Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
set_task_comm after exec.

Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
and comm events? This way we can avoid late collapses/merges. In the best case we
could get rid of collapses entirely and that would be some nice fresh air for util/hist.c

And ideally, the comm should be associated to a lifetime as a thread can change
its comm anytime.

At least we should be able to limit collapses significantly.
I can try to make callchains more scalable, but collapsing is a heavyweight operation
for them in any case.

> Thanks,
>
> Ingo

2013-09-05 12:51:08

by Ingo Molnar

Subject: Re: [GIT PULL] perf changes for v3.12


* Frederic Weisbecker <[email protected]> wrote:

> > Btw., a side note, append_chain() is a rather confusing function in
> > itself, with logic-inversion gems like:
> >
> > if (!found)
> > found = true;
>
> The check is pointless yeah, I'll remove that.

Are you sure it can be removed as-is? It inverts the 'found' condition,
which then can be used later on AFAICS. (I haven't looked very deeply
though.)

Thanks,

Ingo

2013-09-05 12:58:13

by Frederic Weisbecker

Subject: Re: [GIT PULL] perf changes for v3.12

On Thu, Sep 05, 2013 at 02:51:01PM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > > Btw., a side note, append_chain() is a rather confusing function in
> > > itself, with logic-inversion gems like:
> > >
> > > if (!found)
> > > found = true;
> >
> > The check is pointless yeah, I'll remove that.
>
> Are you sure it can be removed as-is? It inverts the 'found' condition,
> which then can be used later on AFAICS. (I haven't looked very deeply
> though.)

Yeah, in fact found is initialized to false, so the if (!found) is always true there.
Maybe it's a leftover from a past version that relied on gotos or so.

2013-09-05 13:38:42

by Ingo Molnar

Subject: Re: [GIT PULL] perf changes for v3.12


* Ingo Molnar <[email protected]> wrote:

> * 'perf report/top' enhancements:
>
> . Do annotation using /proc/kcore and /proc/kallsyms when
> available, removing the forced need for a vmlinux file kernel
> assembly annotation. This also improves this use case because
> vmlinux has just the initial kernel image, not what is actually
> in use after various code patchings by things like alternatives.
> By Adrian Hunter.

Btw., this new feature is most noticeable if you run 'perf top' as root on
any random box: you can annotate based on the live kernel image straight
away, without having the vmlinux.

This works on various distro kernels as well.

Thanks,

Ingo

2013-09-08 02:17:31

by Linus Torvalds

Subject: Re: [GIT PULL] perf changes for v3.12

This is new for me, but I suspect it is more related to the new
Haswell CPU I have than necessarily the 3.12 perf pull request.

Regardless, nothing bad happened, but my dmesg has this in it:

Unexpected number of pebs records 10

when I was profiling some git workloads. Full trace appended.

Linus

---
perf samples too long (2749 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
perf samples too long (5200 > 5000), lowering
kernel.perf_event_max_sample_rate to 25000
.. many seconds pass ..
------------[ cut here ]------------
WARNING: CPU: 1 PID: 3270 at
arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
intel_pmu_drain_pebs_hsw+0x91/0xa0()
Unexpected number of pebs records 10
Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
i2c_algo_bit drm_kms_helper drm i2c_core video
CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
Hardware name: /DH87RL, BIOS
RLH8710H.86A.0317.2013.0426.1724 04/26/2013
0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
Call Trace:
<NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
[<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
[<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
[<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
[<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
[<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
[<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
[<ffffffff81604840>] do_nmi+0xd0/0x310
[<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
<<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
[<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
[<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
[<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
[<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
[<ffffffff8107c300>] finish_task_switch+0x70/0xa0
[<ffffffff81600f48>] __schedule+0x368/0x7c0
[<ffffffff816013c4>] schedule+0x24/0x70
[<ffffffff816038a4>] retint_careful+0xd/0x21
---[ end trace 52dad7a9d8d96b5f ]---

2013-09-09 10:06:01

by Peter Zijlstra

Subject: Re: [GIT PULL] perf changes for v3.12

On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
> This is new for me, but I suspect it is more related to the new
> Haswell CPU I have than necessarily the 3.12 perf pull request.
>
> Regardless, nothing bad happened, but my dmesg has this in it:
>
> Unexpected number of pebs records 10
>
> when I was profiling some git workloads. Full trace appended.

> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 3270 at
> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
> intel_pmu_drain_pebs_hsw+0x91/0xa0()
> Unexpected number of pebs records 10
> Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
> i2c_algo_bit drm_kms_helper drm i2c_core video
> CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
> Hardware name: /DH87RL, BIOS
> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
> 0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
> ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
> ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
> Call Trace:
> <NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
> [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
> [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
> [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
> [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
> [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
> [<ffffffff81604840>] do_nmi+0xd0/0x310
> [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
> <<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
> [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
> [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
> [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
> [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
> [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
> [<ffffffff81600f48>] __schedule+0x368/0x7c0
> [<ffffffff816013c4>] schedule+0x24/0x70
> [<ffffffff816038a4>] retint_careful+0xd/0x21
> ---[ end trace 52dad7a9d8d96b5f ]---

So I'm not entirely sure how this can happen. We do _not_ set
ds->pebs_event_reset[] so once a PEBS assist happens we do not
automatically re-engage. We wait for the PMI to happen, consume the data
and reprogram the hardware.

We allow up to x86_pmu.max_pebs_events (4 on hsw afaik) PEBS
entries to be in the buffer so each PEBS capable counter can trigger the
assist once while we 'wait' for the PMI to hit.

There's a little 'funny' with the hsw PEBS code in that it doesn't
unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
would only make a difference if n actually was < 0 -- again something
that shouldn't ever happen.

That said, there's two peculiar things about your report. First and
foremost HSW; I'm not at all sure how it deals with a transaction being
undone if a PEBS assist had happened during it. Does it reset the
hardware but not remove the data? If so, this could cause excess
entries.

Stephane, Andi, is this at all possible? Section 18.10.5.1 (June 2013)
seems to suggest not; the way I read it PEBS assist will cause an abort
(and the PMI certainly would) and we'll process the entire thing outside
of transaction context.

That said, 18.10.5.1 mentions that entries from inside a transaction
will only have 0xb0 and 0xb8 aka perf_record::real_ip and
perf_record::tsx_tuning set; I don't think __intel_pmu_drain_pebs_nhm()
will do the right thing for them. In particular, when no
perf_record::status bit is set we'll drop the event on the ground.

Dropping them means we do not do intel_pmu_save_and_restart() to
reprogram the hardware and the counter would get 'stuck'. Then again,
without a status bit we have no idea what event to re-program so we're
pretty much up shit creek without no stinkin' paddle.

Another possibility is that the meaning of pebs_record::status has
changed -- it was 'broken' and Peggy agreed and said she would look into
it, not sure this has been 'fixed' but I couldn't find anything on it.
That said, I think the code always assumed it was 'sane' so it having
been fixed shouldn't matter.

Secondly you seem to have triggered the nested NMI replay path. I'm not
immediately sure how that could have messed up the PEBS state, but it
sure is a complication.

I think we can do something like the below as a pure code reduction --
completely untested:

---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
1 file changed, 32 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 63438aa..bfda50e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -178,20 +178,14 @@ struct pebs_record_nhm {
u64 status, dla, dse, lat;
};

-/*
- * Same as pebs_record_nhm, with two additional fields.
- */
struct pebs_record_hsw {
- struct pebs_record_nhm nhm;
- /*
- * Real IP of the event. In the Intel documentation this
- * is called eventingrip.
- */
- u64 real_ip;
- /*
- * TSX tuning information field: abort cycles and abort flags.
- */
- u64 tsx_tuning;
+ u64 flags, ip;
+ u64 ax, bx, cx, dx;
+ u64 si, di, bp, sp;
+ u64 r8, r9, r10, r11;
+ u64 r12, r13, r14, r15;
+ u64 status, dla, dse, lat;
+ u64 real_ip, tsx_tuning;
};

void init_debug_store_on_cpu(int cpu)
@@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs, void *__pebs)
{
/*
- * We cast to pebs_record_nhm to get the load latency data
- * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
+ * We cast to the biggest pebs_record but are careful not to
+ * unconditionally access the 'extra' entries.
*/
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct pebs_record_nhm *pebs = __pebs;
- struct pebs_record_hsw *pebs_hsw = __pebs;
+ struct pebs_record_hsw *pebs = __pebs;
struct perf_sample_data data;
struct pt_regs regs;
u64 sample_type;
@@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
regs.sp = pebs->sp;

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs_hsw->real_ip;
+ regs.ip = pebs->real_ip;
regs.flags |= PERF_EFLAGS_EXACT;
} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
regs.flags |= PERF_EFLAGS_EXACT;
@@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
__intel_pmu_pebs_event(event, iregs, at);
}

-static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
- void *top)
+static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct debug_store *ds = cpuc->ds;
struct perf_event *event = NULL;
+ void *at, *top;
u64 status = 0;
- int bit;
+ int bit, n;
+
+ if (!x86_pmu.pebs_active)
+ return;
+
+ at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

+ n = (top - at) / x86_pmu.pebs_record_size;
+ if (n <= 0)
+ return;
+
+ /*
+ * Should not happen, we program the threshold at 1 and do not
+ * set a reset value.
+ */
+ WARN_ONCE(n > x86_pmu.max_pebs_events,
+ "Unexpected number of pebs records %d\n", n);
+
for (; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

@@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
}
}

-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct pebs_record_nhm *at, *top;
- int n;
-
- if (!x86_pmu.pebs_active)
- return;
-
- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
-
- ds->pebs_index = ds->pebs_buffer_base;
-
- n = top - at;
- if (n <= 0)
- return;
-
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > x86_pmu.max_pebs_events,
- "Unexpected number of pebs records %d\n", n);
-
- return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
-static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct pebs_record_hsw *at, *top;
- int n;
-
- if (!x86_pmu.pebs_active)
- return;
-
- at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
-
- n = top - at;
- if (n <= 0)
- return;
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > x86_pmu.max_pebs_events,
- "Unexpected number of pebs records %d\n", n);
-
- return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
/*
* BTS, PEBS probe and setup
*/
@@ -1039,7 +994,7 @@ void intel_ds_init(void)
case 2:
pr_cont("PEBS fmt2%c, ", pebs_type);
x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
- x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
+ x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
break;

default:

2013-09-10 08:06:13

by Namhyung Kim

Subject: Re: [GIT PULL] perf changes for v3.12

Hi,

On Thu, 5 Sep 2013 14:42:44 +0200, Frederic Weisbecker wrote:
> On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
>>
>> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in
>> tools/perf/util/hist.c.)
>>
>> * Linus Torvalds <[email protected]> wrote:
>>
>> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <[email protected]> wrote:
>> > >
>> > > Please pull the latest perf-core-for-linus git tree from:
>> >
>> > I don't think this is new at all, but I just tried to do a perf
>> > record/report of "make -j64 test" on git:
>> >
>> > It's a big perf.data file (1.6G), but after it has done the
>> > "processing time ordered events" thing it results in:
>> >
>> > ┌─Warning:───────────────────────────────────┐
>> > │Processed 8672030 events and lost 71 chunks!│
>> > │Check IO/CPU overload! │
>> > │ │
>> > │ │
>> > │Press any key... │
>> > └────────────────────────────────────────────┘
>> >
>> > and then it just hangs using 100% CPU time. Pressing any key doesn't
>> > do anything.
>> >
>> > It may well still be *doing* something, and maybe it will come back
>> > some day with results. But it sure doesn't show any indication that it
>> > will.
>> >
>> > Try this (in a current git source tree: note, by "git" I actually mean
> >> > git itself, not some random git repository):
>> >
>> > perf record -g -e cycles:pp make -j64 test >& out
>> > perf report
>> >
>> > maybe you can reproduce it.
>>
>> I managed to reproduce it on a 32-way box via:
>>
>> perf record -g make -j64 bzImage >/dev/null 2>&1
>>
>> It's easier to debug it without the TUI:
>>
>> perf --no-pager report --stdio
>>
>> It turns out that even with a 400 MB perf.data the 'perf report' call will
>> eventually finish - here it ran for almost half an hour(!) on a fast box.
>>
>> Arnaldo, the large overhead is in hists__collapse_resort(), in particular
>> it's doing append_chain_children() 99% of the time:
>>
>> - 99.74% perf perf [.] append_chain_children ◆
>> - append_chain_children ▒
>> - 99.76% merge_chain_branch ▒
>> - merge_chain_branch ▒
>> + 98.04% hists__collapse_resort ▒
>> + 1.96% merge_chain_branch ▒
>> + 0.05% perf perf [.] merge_chain_branch ▒
>> + 0.03% perf libc-2.17.so [.] _int_free ▒
>> + 0.03% perf libc-2.17.so [.] __libc_calloc ▒
>> + 0.02% perf [kernel.kallsyms] [k] account_user_time ▒
>> + 0.02% perf libc-2.17.so [.] _int_malloc ▒
>>
>> It seems to be stuck in hists__collapse_resort().
>>
>> In particular the overhead arises because the following loop in
>> append_chain_children():
>>
>> /* lookup in childrens */
>> chain_for_each_child(rnode, root) {
>> unsigned int ret = append_chain(rnode, cursor, period);
>>
>> Reaches very long counts and the algorithm gets quadratic (at least). The
>> child count reaches over 100,000 entries in the end (!).
>>
>> I don't think the high child count in itself is anomalous: a kernel build
>> generates thousands of processes, tons of symbol ranges and tens of
>> millions of call chain entries.
>>
>> So I think what we need here is to speed up the lookup: put children into
>> a secondary, ->pos,len indexed range-rbtree and do a binary search instead
>> of a linear search over 100,000 child entries ... or something like that.
>
> You're right it's worth trying.
>
> At least it might give better results for such high scale callchain trees.
> I'll see what I can come up with.

I just converted it to a normal rbtree and the total processing time went
down from 380s to 20s! I couldn't see how to use the
range-rbtree in this case, so there's probably room for further
enhancement. I'll send the patches soon.

>
>>
>> Btw., a side note, append_chain() is a rather confusing function in
>> itself, with logic-inversion gems like:
>>
>> if (!found)
>> found = true;
>
> The check is pointless yeah, I'll remove that.
>
>>
>> All that should be cleaned up as well I guess.
>>
>> The 'IO overload' message appears to be a separate, unrelated bug, it just
>> annoyingly does not get refreshed away in the TUI before
>> hists__collapse_resort() is called, and there's also no progress bar for
>> the hists__collapse_resort() pass, so to the user it all looks like a
>> deadlock.
>>
>> So there's at least two bugs here:
>>
>> - the bad overhead in hists__collapse_resort()
>>
>> - bad usability if hists__collapse_resort() takes more than 1 second to finish
>
> Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
> set_task_comm after exec.
>
> Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
> and comm events? This way we can avoid late collapses/merges. In the best case we
> could get rid of collapses entirely and that would be some nice fresh air for util/hist.c
>
> And ideally, the comm should be associated to a lifetime as a thread can change
> its comm anytime.

Right. I also thought about why the separate collapsing stage is
needed. Maybe we can collect hist entries that have the same comm at
insert time. One problem I can imagine is that the target thread changes
its comm after collecting some hist entries. In this case we should look
up another thread which has the same old comm and pass the entries to it.
But we don't have information about which entries belong to a given
thread, so for now it would require a full traversal of the hist entries.
If we add that info to threads, I think we can get rid of collapses
entirely.
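
Attaching that info to threads could be a time-ordered list of comm
lifetimes, something like (purely illustrative):

  /* illustrative: one entry per comm lifetime, kept on the thread */
  struct comm {
      struct list_head list;   /* the thread's comm list, newest first */
      u64              start;  /* timestamp of the COMM event */
      char            *str;    /* the comm string itself */
  };

Each hist entry would then point at the struct comm that was current at
its timestamp, instead of carrying a bare string.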

Thanks,
Namhyung

2013-09-10 11:19:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [GIT PULL] perf changes for v3.12

On Tue, Sep 10, 2013 at 05:06:06PM +0900, Namhyung Kim wrote:
> Hi,
>
> On Thu, 5 Sep 2013 14:42:44 +0200, Frederic Weisbecker wrote:
> > On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
> >>
> >> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in
> >> tools/perf/util/hist.c.)
> >>
> >> * Linus Torvalds <[email protected]> wrote:
> >>
> >> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <[email protected]> wrote:
> >> > >
> >> > > Please pull the latest perf-core-for-linus git tree from:
> >> >
> >> > I don't think this is new at all, but I just tried to do a perf
> >> > record/report of "make -j64 test" on git:
> >> >
> >> > It's a big perf.data file (1.6G), but after it has done the
> >> > "processing time ordered events" thing it results in:
> >> >
> >> > ┌─Warning:───────────────────────────────────┐
> >> > │Processed 8672030 events and lost 71 chunks!│
> >> > │Check IO/CPU overload! │
> >> > │ │
> >> > │ │
> >> > │Press any key... │
> >> > └────────────────────────────────────────────┘
> >> >
> >> > and then it just hangs using 100% CPU time. Pressing any key doesn't
> >> > do anything.
> >> >
> >> > It may well still be *doing* something, and maybe it will come back
> >> > some day with results. But it sure doesn't show any indication that it
> >> > will.
> >> >
> >> > Try this (in a current git source tree: note, by "git" I actually mean
> >> > git itself, not some random git repository):
> >> >
> >> > perf record -g -e cycles:pp make -j64 test >& out
> >> > perf report
> >> >
> >> > maybe you can reproduce it.
> >>
> >> I managed to reproduce it on a 32-way box via:
> >>
> >> perf record -g make -j64 bzImage >/dev/null 2>&1
> >>
> >> It's easier to debug it without the TUI:
> >>
> >> perf --no-pager report --stdio
> >>
> >> It turns out that even with a 400 MB perf.data the 'perf report' call will
> >> eventually finish - here it ran for almost half an hour(!) on a fast box.
> >>
> >> Arnaldo, the large overhead is in hists__collapse_resort(), in particular
> >> it's doing append_chain_children() 99% of the time:
> >>
> >> - 99.74% perf perf [.] append_chain_children ◆
> >> - append_chain_children ▒
> >> - 99.76% merge_chain_branch ▒
> >> - merge_chain_branch ▒
> >> + 98.04% hists__collapse_resort ▒
> >> + 1.96% merge_chain_branch ▒
> >> + 0.05% perf perf [.] merge_chain_branch ▒
> >> + 0.03% perf libc-2.17.so [.] _int_free ▒
> >> + 0.03% perf libc-2.17.so [.] __libc_calloc ▒
> >> + 0.02% perf [kernel.kallsyms] [k] account_user_time ▒
> >> + 0.02% perf libc-2.17.so [.] _int_malloc ▒
> >>
> >> It seems to be stuck in hists__collapse_resort().
> >>
> >> In particular the overhead arises because the following loop in
> >> append_chain_children():
> >>
> >> /* lookup in childrens */
> >> chain_for_each_child(rnode, root) {
> >> unsigned int ret = append_chain(rnode, cursor, period);
> >>
> >> reaches very long counts and the algorithm gets quadratic (at least). The
> >> child count reaches over 100,000 entries in the end (!).
> >>
> >> I don't think the high child count in itself is anomalous: a kernel build
> >> generates thousands of processes, tons of symbol ranges and tens of
> >> millions of call chain entries.
> >>
> >> So I think what we need here is to speed up the lookup: put children into
> >> a secondary, ->pos,len indexed range-rbtree and do a binary search instead
> >> of a linear search over 100,000 child entries ... or something like that.
> >
> > You're right it's worth trying.
> >
> > At least it might give better results for such high scale callchain trees.
> > I'll see what I can come up with.
>
> I just converted it to a normal rbtree and the total processing time went
> down from 380s to 20s! I couldn't see how to use the range-rbtree in this
> case, so there's probably room for further enhancement. I'll send the
> patches soon.
>
> >
> >>
> >> Btw., a side note, append_chain() is a rather confusing function in
> >> itself, with logic-inversion gems like:
> >>
> >> if (!found)
> >> found = true;
> >
> > The check is pointless yeah, I'll remove that.
> >
> >>
> >> All that should be cleaned up as well I guess.
> >>
> >> The 'IO overload' message appears to be a separate, unrelated bug, it just
> >> annoyingly does not get refreshed away in the TUI before
> >> hists__collapse_resort() is called, and there's also no progress bar for
> >> the hists__collapse_resort() pass, so to the user it all looks like a
> >> deadlock.
> >>
> >> So there's at least two bugs here:
> >>
> >> - the bad overhead in hists__collapse_resort()
> >>
> >> - bad usability if hists__collapse_resort() takes more than 1 second to finish
> >
> > Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
> > set_task_comm after exec.
> >
> > Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
> > and comm events? This way we can avoid late collapses/merges. In the best case we
> > could get rid of collapses entirely and that would be some nice fresh air for util/hist.c
> >
> > And ideally, the comm should be associated to a lifetime as a thread can change
> > its comm anytime.
>
> Right. I also thought about why the separate collapsing stage is
> needed. Maybe we can collect hist entries that have the same comm at
> insert time. One problem I can imagine is that the target thread changes
> its comm after collecting some hist entries. In this case we should look
> up another thread which has the same old comm and pass the entries to it.
> But we don't have information about which entries belong to a given
> thread, so for now it would require a full traversal of the hist entries.
> If we add that info to threads, I think we can get rid of collapses
> entirely.

Right, that's exactly what I'm working on. I should have something ready soon.

Thanks.

> Thanks,
> Namhyung

2013-09-10 11:28:47

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf changes for v3.12

Hi,


And what was the perf record command line for this crash?


On Mon, Sep 9, 2013 at 12:05 PM, Peter Zijlstra <[email protected]> wrote:
> On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
>> This is new for me, but I suspect it is more related to the new
>> Haswell CPU I have than necessarily the 3.12 perf pull request.
>>
>> Regardless, nothing bad happened, but my dmesg has this in it:
>>
>> Unexpected number of pebs records 10
>>
>> when I was profiling some git workloads. Full trace appended.
>
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 3270 at
>> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
>> intel_pmu_drain_pebs_hsw+0x91/0xa0()
>> Unexpected number of pebs records 10
>> Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
>> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
>> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
>> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
>> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
>> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
>> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
>> i2c_algo_bit drm_kms_helper drm i2c_core video
>> CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
>> Hardware name: /DH87RL, BIOS
>> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
>> 0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
>> ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
>> ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
>> Call Trace:
>> <NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
>> [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>> [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>> [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>> [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>> [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>> [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>> [<ffffffff81604840>] do_nmi+0xd0/0x310
>> [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>> <<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>> [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>> [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>> [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>> [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>> [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>> [<ffffffff81600f48>] __schedule+0x368/0x7c0
>> [<ffffffff816013c4>] schedule+0x24/0x70
>> [<ffffffff816038a4>] retint_careful+0xd/0x21
>> ---[ end trace 52dad7a9d8d96b5f ]---
>
> So I'm not entirely sure how this can happen. We do _not_ set
> ds->pebs_event_reset[] so once a PEBS assist happens we do not
> automatically re-engage. We wait for the PMI to happen, consume the data
> and reprogram the hardware.
>
> We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS
> entries to be in the buffer so each PEBS capable counter can trigger the
> assist once while we 'wait' for the PMI to hit.
>
> There's a little 'funny' with the hsw PEBS code in that it doesn't
> unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
> would only make a difference if n actually was < 0 -- again something
> that shouldn't ever happen.
>
> That said, there's two peculiar things about your report. First and
> foremost HSW; I'm not at all sure how it deals with a transaction being
> undone if a PEBS assist had happened during it. Does it reset the
> hardware but not remove the data? If so, this could cause excess
> entries.
>
> Stephane, Andi, is this at all possible? Section 18.10.5.1 (June 2013)
> seems to suggest not; the way I read it PEBS assist will cause an abort
> (and the PMI certainly would) and we'll process the entire thing outside
> of transaction context.
>
> That said, 18.10.5.1 mentions that entries from inside a transaction
> will only have 0xb0 and 0xb8 aka perf_record::real_ip and
> perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm()
> will do the right thing for them. In particular, when no
> perf_record::status bit is set we'll drop the event on the ground.
>
> Dropping them means we do not do intel_pmu_save_and_restart() to
> reprogram the hardware and the counter would get 'stuck'. Then again,
> without a status bit we have no idea what event to re-program so we're
> pretty much up shit creek without no stinkin' paddle.
>
> Another possibility is that the meaning of pebs_record::status has
> changed -- it was 'broken' and Peggy agreed and said she would look into
> it, not sure this has been 'fixed' but I couldn't find anything on it.
> That said, I think the code always assumed it was 'sane' so it having
> been fixed shouldn't matter.
>
> Secondly you seem to have triggered the nested NMI replay path. I'm not
> immediately sure how that could have messed up the PEBS state, but it
> sure is a complication.
>
> I think we can do something like the below as a pure code reduction --
> completely untested:
>
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
> 1 file changed, 32 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 63438aa..bfda50e 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -178,20 +178,14 @@ struct pebs_record_nhm {
> u64 status, dla, dse, lat;
> };
>
> -/*
> - * Same as pebs_record_nhm, with two additional fields.
> - */
> struct pebs_record_hsw {
> - struct pebs_record_nhm nhm;
> - /*
> - * Real IP of the event. In the Intel documentation this
> - * is called eventingrip.
> - */
> - u64 real_ip;
> - /*
> - * TSX tuning information field: abort cycles and abort flags.
> - */
> - u64 tsx_tuning;
> + u64 flags, ip;
> + u64 ax, bx, cx, dx;
> + u64 si, di, bp, sp;
> + u64 r8, r9, r10, r11;
> + u64 r12, r13, r14, r15;
> + u64 status, dla, dse, lat;
> + u64 real_ip, tsx_tuning;
> };
>
> void init_debug_store_on_cpu(int cpu)
> @@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs, void *__pebs)
> {
> /*
> - * We cast to pebs_record_nhm to get the load latency data
> - * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
> + * We cast to the biggest pebs_record but are careful not to
> + * unconditionally access the 'extra' entries.
> */
> struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> - struct pebs_record_nhm *pebs = __pebs;
> - struct pebs_record_hsw *pebs_hsw = __pebs;
> + struct pebs_record_hsw *pebs = __pebs;
> struct perf_sample_data data;
> struct pt_regs regs;
> u64 sample_type;
> @@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
> regs.sp = pebs->sp;
>
> if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
> - regs.ip = pebs_hsw->real_ip;
> + regs.ip = pebs->real_ip;
> regs.flags |= PERF_EFLAGS_EXACT;
> } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
> regs.flags |= PERF_EFLAGS_EXACT;
> @@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
> __intel_pmu_pebs_event(event, iregs, at);
> }
>
> -static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
> - void *top)
> +static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> {
> struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> struct debug_store *ds = cpuc->ds;
> struct perf_event *event = NULL;
> + void *at, *top;
> u64 status = 0;
> - int bit;
> + int bit, n;
> +
> + if (!x86_pmu.pebs_active)
> + return;
> +
> + at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> + top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>
> ds->pebs_index = ds->pebs_buffer_base;
>
> + n = (top - at) / x86_pmu.pebs_record_size;
> + if (n <= 0)
> + return;
> +
> + /*
> + * Should not happen, we program the threshold at 1 and do not
> + * set a reset value.
> + */
> + WARN_ONCE(n > x86_pmu.max_pebs_events,
> + "Unexpected number of pebs records %d\n", n);
> +
> for (; at < top; at += x86_pmu.pebs_record_size) {
> struct pebs_record_nhm *p = at;
>
> @@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
> }
> }
>
> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> -{
> - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> - struct debug_store *ds = cpuc->ds;
> - struct pebs_record_nhm *at, *top;
> - int n;
> -
> - if (!x86_pmu.pebs_active)
> - return;
> -
> - at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> - top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> -
> - ds->pebs_index = ds->pebs_buffer_base;
> -
> - n = top - at;
> - if (n <= 0)
> - return;
> -
> - /*
> - * Should not happen, we program the threshold at 1 and do not
> - * set a reset value.
> - */
> - WARN_ONCE(n > x86_pmu.max_pebs_events,
> - "Unexpected number of pebs records %d\n", n);
> -
> - return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
> -static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
> -{
> - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> - struct debug_store *ds = cpuc->ds;
> - struct pebs_record_hsw *at, *top;
> - int n;
> -
> - if (!x86_pmu.pebs_active)
> - return;
> -
> - at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
> - top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
> -
> - n = top - at;
> - if (n <= 0)
> - return;
> - /*
> - * Should not happen, we program the threshold at 1 and do not
> - * set a reset value.
> - */
> - WARN_ONCE(n > x86_pmu.max_pebs_events,
> - "Unexpected number of pebs records %d\n", n);
> -
> - return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
> /*
> * BTS, PEBS probe and setup
> */
> @@ -1039,7 +994,7 @@ void intel_ds_init(void)
> case 2:
> pr_cont("PEBS fmt2%c, ", pebs_type);
> x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
> - x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
> + x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
> break;
>
> default:

2013-09-10 11:53:12

by Ingo Molnar

[permalink] [raw]
Subject: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> Hi,
>
>
> And what was the perf record command line for this crash?

AFAICS it wasn't a crash but the WARN_ON() in intel_pmu_drain_pebs_hsw(),
at arch/x86/kernel/cpu/perf_event_intel_ds.c:1003.

  at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
  top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;

  n = top - at;
  if (n <= 0)
      return;

  /*
   * Should not happen, we program the threshold at 1 and do not
   * set a reset value.
   */
  WARN_ONCE(n > x86_pmu.max_pebs_events,
            "Unexpected number of pebs records %d\n", n);

The command line Linus used was probably close to:

perf record -e cycles:pp -g make -j64 bzImage

i.e. PEBS precise profiling, call chains, LBR is used to figure out the
real instruction, but no '-a' per CPU profiling option, i.e. high
frequency per task PMU context switching.

Note that AFAIK neither the kernel nor user-space used any TSX extensions,
so this is the Haswell PMU in pure compatibility mode.

My (wild) guess is that unless all of us missed some subtle race in the
PEBS code it's an (unknown?) erratum: the hardware got confused by the
high frequency PMU switches, in this particular case where we got a new
PMI right after a very short interval was programmed:

>> Call Trace:
>> <NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
>> [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>> [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>> [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>> [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>> [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>> [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>> [<ffffffff81604840>] do_nmi+0xd0/0x310
>> [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>> <<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>> [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>> [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>> [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>> [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>> [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>> [<ffffffff81600f48>] __schedule+0x368/0x7c0
>> [<ffffffff816013c4>] schedule+0x24/0x70

Note that due to per task profiling the default (long, about 1 KHz)
interval can get chopped up and can result in a very small period value
being reprogrammed at PMU-sched-in time.

That kind of high-freq back-to-back activity could, in theory, confuse the
PEBS hardware. Or the kernel :-)

Thanks,

Ingo

2013-09-10 12:32:17

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

Hi,

Ok, so I am able to reproduce the problem using a simpler
test case with a simple multithreaded program where
#threads >> #CPUs.

[ 2229.021934] WARNING: CPU: 6 PID: 17496 at
arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
intel_pmu_drain_pebs_hsw+0xa8/0xc0()
[ 2229.021936] Unexpected number of pebs records 21

[ 2229.021966] Call Trace:
[ 2229.021967] <NMI> [<ffffffff8159dcd6>] dump_stack+0x46/0x58
[ 2229.021976] [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
[ 2229.021979] [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
[ 2229.021982] [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
[ 2229.021986] [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
[ 2229.021991] [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
[ 2229.021995] [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
[ 2229.021998] [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
[ 2229.022001] [<ffffffff815a5490>] do_nmi+0xe0/0x330
[ 2229.022004] [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
[ 2229.022008] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022011] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022015] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022016] <<EOE>> [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
[ 2229.022021] [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
[ 2229.022025] [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
[ 2229.022029] [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0

Could be a HW race whereby the PEBS buffers of the two HT threads get mixed up.
I will add a couple more checks to verify that. The intr_thres should not
have changed. Yet it looks like we have a situation where the index is way
past the threshold.
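
Something like this in the drain path should show it (an illustrative
printk using the existing debug_store fields):

  printk("CPU%d pebs base=%llx index=%llx intr=%llx max=%llx\n",
         smp_processor_id(),
         ds->pebs_buffer_base, ds->pebs_index,
         ds->pebs_interrupt_threshold, ds->pebs_absolute_maximum);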



On Tue, Sep 10, 2013 at 4:53 AM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> Hi,
>>
>>
>> And what was the perf record command line for this crash?
>
> AFAICS it wasn't a crash but the WARN_ON() in intel_pmu_drain_pebs_hsw(),
> at arch/x86/kernel/cpu/perf_event_intel_ds.c:1003.
>
> at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
>
> n = top - at;
> if (n <= 0)
> return;
>
> /*
> * Should not happen, we program the threshold at 1 and do not
> * set a reset value.
> */
> WARN_ONCE(n > x86_pmu.max_pebs_events,
> "Unexpected number of pebs records %d\n", n);
>
> The command line Linus used was probably close to:
>
> perf record -e cycles:pp -g make -j64 bzImage
>
> i.e. PEBS precise profiling, call chains, LBR is used to figure out the
> real instruction, but no '-a' per CPU profiling option, i.e. high
> frequency per task PMU context switching.
>
> Note that AFAIK neither the kernel nor user-space used any TSX extensions,
> so this is the Haswell PMU in pure compatibility mode.
>
> My (wild) guess is that unless all of us missed some subtle race in the
> PEBS code it's an (unknown?) erratum: the hardware got confused by the
> high frequency PMU switches, in this particular case where we got a new
> PMI right after a very short interval was programmed:
>
>>> Call Trace:
>>> <NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
>>> [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>>> [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>>> [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>>> [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>>> [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>>> [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>>> [<ffffffff81604840>] do_nmi+0xd0/0x310
>>> [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>>> <<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>>> [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>>> [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>>> [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>>> [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>>> [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>>> [<ffffffff81600f48>] __schedule+0x368/0x7c0
>>> [<ffffffff816013c4>] schedule+0x24/0x70
>
> Note that due to per task profiling the default (long, about 1 KHz)
> interval can get chopped up and can result in a very small period value
> being reprogrammed at PMU-sched-in time.
>
> That kind of high-freq back-to-back activity could, in theory, confuse the
> PEBS hardware. Or the kernel :-)
>
> Thanks,
>
> Ingo

2013-09-10 12:43:12

by Ramkumar Ramachandra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

Stephane Eranian wrote:
> [ 2229.021966] Call Trace:
> [ 2229.021967] <NMI> [<ffffffff8159dcd6>] dump_stack+0x46/0x58
> [ 2229.021976] [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
> [ 2229.021979] [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
> [ 2229.021982] [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
> [ 2229.021986] [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
> [ 2229.021991] [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
> [ 2229.021995] [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
> [ 2229.021998] [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
> [ 2229.022001] [<ffffffff815a5490>] do_nmi+0xe0/0x330
> [ 2229.022004] [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
> [ 2229.022008] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022011] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022015] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022016] <<EOE>> [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
> [ 2229.022021] [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
> [ 2229.022025] [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
> [ 2229.022029] [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0

Unnecessary repetition of the same task on multiple cores seems to be
the issue, although I'm not sure. Haven't gone through mingo's trace
yet.

2013-09-10 12:51:45

by Ramkumar Ramachandra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

Stephane Eranian wrote:
> a simple multithreaded program where
> #threads >> #CPUs

To put it another way, does Intel's HT work for CPU intensive and IO
minimal tasks? I think HT assumes some amount of inefficient IO
coupled with pure CPU usage.

2013-09-10 12:55:47

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 5:51 AM, Ramkumar Ramachandra
<[email protected]> wrote:
> Stephane Eranian wrote:
>> a simple multithreaded program where
>> #threads >> #CPUs
>
> To put it another way, does Intel's HT work for CPU intensive and IO
> minimal tasks? I think HT assumes some amount of inefficient IO
> coupled with pure CPU usage.

But your remark has nothing to do with the PMU issue.

2013-09-10 13:22:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> On Tue, Sep 10, 2013 at 5:51 AM, Ramkumar Ramachandra
> <[email protected]> wrote:
> > Stephane Eranian wrote:
> >> a simple multithreaded program where
> >> #threads >> #CPUs
> >
> > To put it another way, does Intel's HT work for CPU intensive and IO
> > minimal tasks? I think HT assumes some amount of inefficient IO
> > coupled with pure CPU usage.
>
> But your remark has nothing to do with the PMU issue.

To quote Al Viro:

"One name: Mark V. Shaney..."

Update your email filters if it gets annoying.

Thanks,

Ingo

2013-09-10 13:38:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> Hi,
>
> Ok, so I am able to reproduce the problem using a simpler
> test case with a simple multithreaded program where
> #threads >> #CPUs.

Does it go away if you use 'perf record --all-cpus'?

> [ 2229.021934] WARNING: CPU: 6 PID: 17496 at
> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
> intel_pmu_drain_pebs_hsw+0xa8/0xc0()
> [ 2229.021936] Unexpected number of pebs records 21
>
> [ 2229.021966] Call Trace:
> [ 2229.021967] <NMI> [<ffffffff8159dcd6>] dump_stack+0x46/0x58
> [ 2229.021976] [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
> [ 2229.021979] [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
> [ 2229.021982] [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
> [ 2229.021986] [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
> [ 2229.021991] [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
> [ 2229.021995] [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
> [ 2229.021998] [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
> [ 2229.022001] [<ffffffff815a5490>] do_nmi+0xe0/0x330
> [ 2229.022004] [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
> [ 2229.022008] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022011] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022015] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022016] <<EOE>> [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
> [ 2229.022021] [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
> [ 2229.022025] [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
> [ 2229.022029] [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0
>
> Could be a HW race whereby the PEBS buffers of the two HT threads get mixed up.

Yes, that seems plausible and would explain why the overrun is usually a
small integer. We set up the DS with PEBS_BUFFER_SIZE == 4096, so with a
record size of 192 bytes on HSW we should get index values of 0-21.

That fits within the index range reported so far.
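
(4096 / 192 = 21 with a remainder of 64 bytes, so at most 21 complete HSW
records fit in the buffer - hence index values of 0-21.)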

> [...] I will add a couple more checks to verify that. The intr_thres
> should not have changed. Yet it looks like we have a situation where the
> index is way past the threshold.

Btw., it would also be nice to add a check of ds->pebs_index against
ds->pebs_absolute_maximum, to make sure the PEBS record index never goes
outside the DS area. I.e. to protect against random corruption.

Right now we do only half a check:

  n = top - at;
  if (n <= 0)
      return;

this still allows an upwards overflow. We check x86_pmu.max_pebs_events
but then let it continue:

  WARN_ONCE(n > x86_pmu.max_pebs_events,
            "Unexpected number of pebs records %d\n", n);

  return __intel_pmu_drain_pebs_nhm(iregs, at, top);

Instead it should be something more robust, like:

  if (WARN_ONCE(n > max ...)) {
      /* Drain the PEBS buffer: */
      ds->pebs_index = ds->pebs_buffer_base;
      return;
  }
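
Spelled out against the existing variables, that would be roughly the
following (a sketch; WARN_ONCE() returns its condition, so it can be
tested directly):

  if (WARN_ONCE(n > x86_pmu.max_pebs_events,
                "Unexpected number of pebs records %d\n", n)) {
      /* Discard all records: mark the PEBS buffer empty and bail out */
      ds->pebs_index = ds->pebs_buffer_base;
      return;
  }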

Thanks,

Ingo

2013-09-10 14:15:21

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> Hi,
>>
>> Ok, so I am able to reproduce the problem using a simpler
>> test case with a simple multithreaded program where
>> #threads >> #CPUs.
>
> Does it go away if you use 'perf record --all-cpus'?
>
Haven't tried that yet.

But I verified the DS pointers:
init:
CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
intr=ffff8808262de0c0 max=ffff8808262defc0
crash:
CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
intr=ffff8808262de0c0 max=ffff8808262defc0

Neither the base nor the max are modified.
The index simply goes beyond the threshold but that's not a bug.
It is 12 past the threshold of 1, so 13 in total in my new crash report.

Two things to try:
- measure only one thread/core
- move the threshold a bit farther away (to get 2 or 3 entries)

The threshold is where to generate the interrupt. It does not mean where
to stop PEBS recording. So it is possible that in HSW, we may get into a
situation where it takes time to get to the handler to stop the PMU. I
don't know how given we use NMI. Well, unless we were already servicing
an NMI at the time. But given that we stop the PMU almost immediately in
the handler, I don't see how that would be possible. The other oddity in
HSW is that we clear the NMI on entry to the handler and not at the end.
I never got a good explanation as to why that was necessary. So maybe it
is related...
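
For reference, moving the threshold out to, say, 3 records would be a
one-liner in the DS setup path, roughly (illustrative; the kernel
currently programs it at a single record):

  /* generate the PMI only once 3 records have been written */
  ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
                                 3 * x86_pmu.pebs_record_size;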





>> [ 2229.021934] WARNING: CPU: 6 PID: 17496 at
>> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
>> intel_pmu_drain_pebs_hsw+0xa8/0xc0()
>> [ 2229.021936] Unexpected number of pebs records 21
>>
>> [ 2229.021966] Call Trace:
>> [ 2229.021967] <NMI> [<ffffffff8159dcd6>] dump_stack+0x46/0x58
>> [ 2229.021976] [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
>> [ 2229.021979] [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
>> [ 2229.021982] [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
>> [ 2229.021986] [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
>> [ 2229.021991] [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
>> [ 2229.021995] [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
>> [ 2229.021998] [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
>> [ 2229.022001] [<ffffffff815a5490>] do_nmi+0xe0/0x330
>> [ 2229.022004] [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
>> [ 2229.022008] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022011] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022015] [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022016] <<EOE>> [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
>> [ 2229.022021] [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
>> [ 2229.022025] [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
>> [ 2229.022029] [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0
>>
>> Could be a HW race whereby the PEBS buffers of the two HT threads get mixed up.
>
> Yes, that seems plausible and would explain why the overrun is usually a
> small integer. We set up the DS with PEBS_BUFFER_SIZE == 4096, so with a
> record size of 192 bytes on HSW we should get index values of 0-21.
>
> That fits within the indices range reported so far.
>
>> [...] I will add a couple more checks to verify that. The intr_thres
>> should not have changed. Yet it looks like we have a situation where the
>> index is way past the threshold.
>
> Btw., it would also be nice to add a check of ds->pebs_index against
> ds->pebs_absolute_maximum, to make sure the PEBS record index never goes
> outside the DS area. I.e. to protect against random corruption.
>
> Right now we do only half a check:
>
> n = top - at;
> if (n <= 0)
> return;
>
> this still allows an upwards overflow. We check x86_pmu.max_pebs_events
> but then let it continue:
>
> WARN_ONCE(n > x86_pmu.max_pebs_events,
> "Unexpected number of pebs records %d\n", n);
>
> return __intel_pmu_drain_pebs_nhm(iregs, at, top);
>
> Instead it should be something more robust, like:
>
> if (WARN_ONCE(n > max ...)) {
> /* Drain the PEBS buffer: */
> ds->pebs_index = ds->pebs_buffer_base;
> return;
> }
>
> Thanks,
>
> Ingo

2013-09-10 14:29:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Ok, so I am able to reproduce the problem using a simpler
> >> test case with a simple multithreaded program where
> >> #threads >> #CPUs.
> >
> > Does it go away if you use 'perf record --all-cpus'?
> >
> Haven't tried that yet.
>
> But I verified the DS pointers:
> init:
> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
> intr=ffff8808262de0c0 max=ffff8808262defc0
> crash:
> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
> intr=ffff8808262de0c0 max=ffff8808262defc0
>
> Neither the base nor the max are modified.
> The index simply goes beyond the threshold but that's not a bug.
> It is 12 past the threshold of 1, so 13 in total in my new crash report.
>
> Two things to try:
> - measure only one thread/core
> - move the threshold a bit farther away (to get 2 or 3 entries)
>
> The threshold is where to generate the interrupt. It does not mean where
> to stop PEBS recording. So it is possible that in HSW, we may get into a
> situation where it takes time to get to the handler to stop the PMU. I
> don't know how given we use NMI. Well, unless we were already servicing
> an NMI at the time. But given that we stop the PMU almost immediately in
> the handler, I don't see how that would be possible. The other oddity in
> HSW is that we clear the NMI on entry to the handler and not at the end.
> I never got a good explanation as to why that was necessary. So
> maybe it is related...

Do you mean:

  if (!x86_pmu.late_ack)
      apic_write(APIC_LVTPC, APIC_DM_NMI);

AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
before return, after we've processed the PMU.

Do the symptoms change if you remove the x86_pmu.late_ack setting line
from:

  case 60: /* Haswell Client */
  case 70:
  case 71:
  case 63:
  case 69:
      x86_pmu.late_ack = true;

?

Thanks,

Ingo

2013-09-10 14:34:43

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Stephane Eranian <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> Ok, so I am able to reproduce the problem using a simpler
>> >> test case with a simple multithreaded program where
>> >> #threads >> #CPUs.
>> >
>> > Does it go away if you use 'perf record --all-cpus'?
>> >
>> Haven't tried that yet.
>>
>> But I verified the DS pointers:
>> init:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>> crash:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>>
>> Neither the base nor the max are modified.
>> The index simply goes beyond the threshold but that's not a bug.
>> It is 12 past the threshold of 1, so 13 in total in my new crash report.
>>
>> Two things to try:
>> - measure only one thread/core
>> - move the threshold a bit farther away (to get 2 or 3 entries)
>>
>> The threshold is where to generate the interrupt. It does not mean where
>> to stop PEBS recording. So it is possible that in HSW, we may get into a
>> situation where it takes time to get to the handler to stop the PMU. I
>> don't know how given we use NMI. Well, unless we were already servicing
>> an NMI at the time. But given that we stop the PMU almost immediately in
>> the handler, I don't see how that would be possible. The other oddity in
>> HSW is that we clear the NMI on entry to the handler and not at the end.
>> I never got a good explanation as to why that was necessary. So
>> maybe it is related...
>
> Do you mean:
>
> if (!x86_pmu.late_ack)
> apic_write(APIC_LVTPC, APIC_DM_NMI);
>
> AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
> before return, after we've processed the PMU.
>
Yeah, the opposite, I got confused.

Let me try reverting that.
Also curious about the influence of the LBR here.

> Do the symptoms change if you remove the x86_pmu.late_ack setting line
> from:
>
> case 60: /* Haswell Client */
> case 70:
> case 71:
> case 63:
> case 69:
> x86_pmu.late_ack = true;
>
> ?
>
> Thanks,
>
> Ingo

2013-09-10 15:29:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 07:15:19AM -0700, Stephane Eranian wrote:
> The threshold is where to generate the interrupt. It does not mean
> where to stop PEBS recording.

It does, since we don't set a reset value. So once a PEBS assist
happens, that counter stops until we reprogram it in the PMI.

> So it is possible that in HSW, we may
> get into a situation where it takes time to get to the handler to stop
> the PMU. I don't know how given we use NMI. Well, unless we were
> already servicing an NMI at the time. But given that we stop the PMU
> almost immediately in the handler, I don't see how that would be
> possible. The other oddity in HSW is that we clear the NMI on entry
> to the handler and not at the end. I never got a good explanation
> as to why that was necessary. So maybe it is related...

Even if the PMI was delayed that counter would never do another entry.
So I really don't see how we can have more than nr_counters PEBS entries
in the buffer.
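
The reset value in question is the per-counter auto-reload slot in the DS
area; the kernel simply never writes it, which is equivalent to
(illustrative):

  /* a zero reset value: the counter is not re-armed after the assist */
  ds->pebs_event_reset[hwc->idx] = 0;

so each PEBS-enabled counter should contribute at most one record per PMI.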

2013-09-10 16:14:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 5:28 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Sep 10, 2013 at 07:15:19AM -0700, Stephane Eranian wrote:
>> The threshold is where to generate the interrupt. It does not mean
>> where to stop PEBS recording.
>
> It does, since we don't set a reset value. So once a PEBS assist
> happens, that counter stops until we reprogram it in the PMI.
>
I agree with you.

I am curious to see what's in those entries. If they are all zeroes, then
we know only the pointer is bogus. Let me add some instrumentation
to check that.
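
A minimal version of that instrumentation could look like this (a sketch,
walking the buffer the same way the drain routine does):

  /* dump the overflow status and IP of every record in the buffer */
  void *at = (void *)(unsigned long)ds->pebs_buffer_base;
  void *top = (void *)(unsigned long)ds->pebs_index;

  for (; at < top; at += x86_pmu.pebs_record_size) {
      struct pebs_record_nhm *p = at;

      printk("CPU%d p=%p ovf=%llx rip=%llx\n",
             smp_processor_id(), p, p->status, p->ip);
  }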

>> So it is possible that in HSW, we may
>> get into a situation where it takes time to get to the handler to stop
>> the PMU. I don't know how given we use NMI. Well, unless we were
>> already servicing an NMI at the time. But given that we stop the PMU
>> almost immediately in the handler, I don't see how that would
>> possible. The other oddity in HSW is that we clear the NMI on entry
>> to the handler and not at the end. I never gotten an good explanation
>> as to why that was necessary. So maybe it is related...
>
> Even if the PMI was delayed that counter would never do another entry.
> So I really don't see how we can have more than nr_counters PEBS entries
> in the buffer.

I agree.

2013-09-10 17:14:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> >> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <[email protected]> wrote:
> >> >
> >> > * Stephane Eranian <[email protected]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Ok, so I am able to reproduce the problem using a simpler
> >> >> test case with a simple multithreaded program where
> >> >> #threads >> #CPUs.
> >> >
> >> > Does it go away if you use 'perf record --all-cpus'?
> >> >
> >> Haven't tried that yet.
> >>
> >> But I verified the DS pointers:
> >> init:
> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
> >> intr=ffff8808262de0c0 max=ffff8808262defc0
> >> crash:
> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
> >> intr=ffff8808262de0c0 max=ffff8808262defc0
> >>
> >> Neither the base nor the max are modified.
> >> The index simply goes beyond the threshold but that's not a bug.
> >> It is 12 past the threshold of 1, so 13 in total in my new crash report.
> >>
> >> Two things to try:
> >> - measure only one thread/core
> >> - move the threshold a bit farther away (to get 2 or 3 entries)
> >>
> >> The threshold is where to generate the interrupt. It does not mean where
> >> to stop PEBS recording. So it is possible that in HSW, we may get into a
> >> situation where it takes time to get to the handler to stop the PMU. I
> >> don't know how given we use NMI. Well, unless we were already servicing
> >> an NMI at the time. But given that we stop the PMU almost immediately in
> >> the handler, I don't see how that would be possible. The other oddity in
> >> HSW is that we clear the NMI on entry to the handler and not at the end.
> >> I never got a good explanation as to why that was necessary. So
> >> maybe it is related...
> >
> > Do you mean:
> >
> > if (!x86_pmu.late_ack)
> > apic_write(APIC_LVTPC, APIC_DM_NMI);
> >
> > AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
> > before return, after we've processed the PMU.
> >
> Yeah, the opposite, I got confused.
>
> Let me try reverting that.
> Also curious about the influence of the LBR here.

You could exclude any LBR interaction by doing tests with "-e cycles:p".

Thanks,

Ingo

2013-09-16 11:07:10

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

Hi,

Some updates on this problem.
I have been running tests all week-end long on my HSW.
I can reproduce the problem. What I know:

- It is not linked with callchain
- The extra entries are valid
- The reset values are still zeroes
- The problem does not happen on SNB with the same test case
- The PMU state looks sane when that happens.
- The problem occurs even when restricting to one CPU/core (taskset -c 0-3)

So it seems like the threshold is ignored. But I don't understand where
the reset values are coming from. So it looks more like a bug in micro-code
where under certain circumstances multiple entries get written.

[132537.622177] Unexpected number of pebs records 5
[132537.622177] CPU0 base=ffff88023681a000 index=ffff88023681a000
intr=ffff88023681a0c0 max=ffff88023681afc0
[132537.622181] CPU0 0 p=ffff88023681a000 ovf=1 rip=c3830b
[132537.622182] CPU0 1 p=ffff88023681a0c0 ovf=1 rip=c3ab65
[132537.622183] CPU0 2 p=ffff88023681a180 ovf=1 rip=4f1b72
[132537.622184] CPU0 3 p=ffff88023681a240 ovf=1 rip=8dde78
[132537.622185] CPU0 4 p=ffff88023681a300 ovf=1 rip=4d24aa
[132537.622185] CPU0 0 reset=0
[132537.622186] CPU0 1 reset=0
[132537.622186] CPU0 2 reset=0
[132537.622187] CPU0 3 reset=0
[132537.622187] CPU0 4 reset=0
[132537.622188] CPU0 5 reset=0
[132537.622188] CPU0 6 reset=0
[132537.622189] CPU0 7 reset=0
[132537.622189]
[132537.622190] CPU#0: ctrl: 0000000000000000
[132537.622191] CPU#0: status: 0000000000000000
[132537.622191] CPU#0: overflow: 0000000000000000
[132537.622192] CPU#0: fixed: 00000000000000b0
[132537.622192] CPU#0: pebs: 0000000000000000
[132537.622193] CPU#0: active: 0000000200000001
[132537.622194] CPU#0: gen-PMC0 ctrl: 0000000010c301c2
[132537.622194] CPU#0: gen-PMC0 count: 0000fffffff214ea
[132537.622195] CPU#0: gen-PMC0 left: 00000000000deb16
[132537.622196] CPU#0: gen-PMC1 ctrl: 0000000000000000
[132537.622196] CPU#0: gen-PMC1 count: 0000000000000000
[132537.622197] CPU#0: gen-PMC1 left: 0000000000000000
[132537.622197] CPU#0: gen-PMC2 ctrl: 0000000000000000
[132537.622198] CPU#0: gen-PMC2 count: 0000000000000000
[132537.622198] CPU#0: gen-PMC2 left: 0000000000000000
[132537.622199] CPU#0: gen-PMC3 ctrl: 0000000000000000
[132537.622199] CPU#0: gen-PMC3 count: 0000000000000000
[132537.622200] CPU#0: gen-PMC3 left: 0000000000000000
[132537.622200] CPU#0: fixed-PMC0 count: 0000000000000000
[132537.622201] CPU#0: fixed-PMC1 count: 0000fff819c21c2c
[132537.622202] CPU#0: fixed-PMC2 count: 0000000000000000

Something must be happening with the interrupt or HT. I will disable
HT next and also disable the NMI watchdog.
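
For the record, the watchdog can be turned off at runtime with:

  echo 0 > /proc/sys/kernel/nmi_watchdog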


On Tue, Sep 10, 2013 at 7:14 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Stephane Eranian <[email protected]> wrote:
>> >
>> >> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <[email protected]> wrote:
>> >> >
>> >> > * Stephane Eranian <[email protected]> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> Ok, so I am able to reproduce the problem using a simpler
>> >> >> test case with a simple multithreaded program where
>> >> >> #threads >> #CPUs.
>> >> >
>> >> > Does it go away if you use 'perf record --all-cpus'?
>> >> >
>> >> Haven't tried that yet.
>> >>
>> >> But I verified the DS pointers:
>> >> init:
>> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
>> >> intr=ffff8808262de0c0 max=ffff8808262defc0
>> >> crash:
>> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
>> >> intr=ffff8808262de0c0 max=ffff8808262defc0
>> >>
>> >> Neither the base nor the max are modified.
>> >> The index simply goes beyond the threshold but that's not a bug.
>> >> It is 12 past the threshold of 1, so 13 in total in my new crash report.
>> >>
>> >> Two things to try:
>> >> - measure only one thread/core
>> >> - move the threshold a bit farther away (to get 2 or 3 entries)
>> >>
>> >> The threshold is where to generate the interrupt. It does not mean where
>> >> to stop PEBS recording. So it is possible that in HSW, we may get into a
>> >> situation where it takes time to get to the handler to stop the PMU. I
>> >> don't know how given we use NMI. Well, unless we were already servicing
>> >> an NMI at the time. But given that we stop the PMU almost immediately in
>> >> the handler, I don't see how that would be possible. The other oddity in
>> >> HSW is that we clear the NMI on entry to the handler and not at the end.
>> >> I never got a good explanation as to why that was necessary. So
>> >> maybe it is related...
>> >
>> > Do you mean:
>> >
>> > if (!x86_pmu.late_ack)
>> > apic_write(APIC_LVTPC, APIC_DM_NMI);
>> >
>> > AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
>> > before return, after we've processed the PMU.
>> >
>> Yeah, the opposite, I got confused.
>>
>> Let me try reverting that.
>> Also curious about the influence of the LBR here.
>
> You could exclude any LBR interaction by doing tests with "-e cycles:p".
>
> Thanks,
>
> Ingo

2013-09-16 15:41:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Stephane Eranian <[email protected]> wrote:

> Hi,
>
> Some updates on this problem.
> I have been running tests all week-end long on my HSW.
> I can reproduce the problem. What I know:
>
> - It is not linked with callchain
> - The extra entries are valid
> - The reset values are still zeroes
> - The problem does not happen on SNB with the same test case
> - The PMU state looks sane when that happens.
> - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
>
> So it seems like the threshold is ignored. But I don't understand where
> the reset values are coming from. So it looks more like a bug in
> micro-code where under certain circumstances multiple entries get
> written.

Either multiple entries are written, or the PMI/NMI is not asserted as it
should be?

> Something must be happening with the interrupt or HT. I will disable HT
> next and also disable the NMI watchdog.

Yes, interaction with the NMI watchdog events might also be possible.

If it's truly just the threshold that is broken occasionally in a
statistically insignificant manner then the bug is relatively benign and
we could work it around in the kernel by ignoring excess entries.

In that case we should probably not annoy users with the scary kernel
warning and instead increase a debug count somewhere so that it's still
detectable.
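
Ignoring the excess entries could be as simple as clamping the record
count before walking the buffer, roughly (a sketch using byte pointers;
the debug counter name is made up):

  /* workaround sketch: consume at most max_pebs_events records */
  if (n > x86_pmu.max_pebs_events) {
      pebs_excess_records += n - x86_pmu.max_pebs_events;
      n = x86_pmu.max_pebs_events;
      top = at + n * x86_pmu.pebs_record_size; /* drop the tail */
  }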

Thanks,

Ingo

2013-09-16 16:29:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
> > Hi,
> >
> > Some updates on this problem.
> > I have been running tests all week-end long on my HSW.
> > I can reproduce the problem. What I know:
> >
> > - It is not linked with callchain
> > - The extra entries are valid
> > - The reset values are still zeroes
> > - The problem does not happen on SNB with the same test case
> > - The PMU state looks sane when that happens.
> > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
> >
> > So it seems like the threshold is ignored. But I don't understand where
> > the reset values are coming from. So it looks more like a bug in
> > micro-code where under certain circumstances multiple entries get
> > written.
>
> Either multiple entries are written, or the PMI/NMI is not asserted as it
> should be?

No, both :-)

> > Something must be happening with the interrupt or HT. I will disable HT
> > next and also disable the NMI watchdog.
>
> Yes, interaction with the NMI watchdog events might also be possible.
>
> If it's truly just the threshold that is broken occasionally in a
> statistically insignificant manner then the bug is relatively benign and
> we could work it around in the kernel by ignoring excess entries.
>
> In that case we should probably not annoy users with the scary kernel
> warning and instead increase a debug count somewhere so that it's still
> detectable.

It's not just a broken threshold. When a PEBS event happens it can re-arm
itself but only if you program a RESET value !0. We don't do that, so
each counter should only ever fire once.

We must do this because PEBS is broken on NHM+ in that the
pebs_record::status is a direct copy of the overflow status field at
time of the assist and if you use the RESET thing nothing will clear the
status bits and you cannot demux the PEBS events back to the event that
generated them.

Worse, since it's the overflow that arms the assist, and the assist
happens at some undefined amount of cycles after this event it is
possible for another assist to happen first.

That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows
first it is possible to find the CNT1 entry first in the buffer with
both of them having status := 0x03.

Complete and utter trainwreck.

This is why we have a threshold of 1 and use NMI for PMI even for pure
PEBS, it minimizes the complete clusterfuck described above.

2013-09-17 07:00:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)


* Peter Zijlstra <[email protected]> wrote:

> On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Some updates on this problem.
> > > I have been running tests all week-end long on my HSW.
> > > I can reproduce the problem. What I know:
> > >
> > > - It is not linked with callchain
> > > - The extra entries are valid
> > > - The reset values are still zeroes
> > > - The problem does not happen on SNB with the same test case
> > > - The PMU state looks sane when that happens.
> > > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
> > >
> > > So it seems like the threshold is ignored. But I don't understand where
> > > the reset values are coming from. So it looks more like a bug in
> > > micro-code where under certain circumstances multiple entries get
> > > written.
> >
> > Either multiple entries are written, or the PMI/NMI is not asserted as it
> > should be?
>
> No, both :-)
>
> > > Something must be happening with the interrupt or HT. I will disable HT
> > > next and also disable the NMI watchdog.
> >
> > Yes, interaction with the NMI watchdog events might also be possible.
> >
> > If it's truly just the threshold that is broken occasionally in a
> > statistically insignificant manner then the bug is relatively benign and
> > we could work it around in the kernel by ignoring excess entries.
> >
> > In that case we should probably not annoy users with the scary kernel
> > warning and instead increase a debug count somewhere so that it's still
> > detectable.
>
> It's not just a broken threshold. When a PEBS event happens it can re-arm
> itself but only if you program a RESET value !0. We don't do that, so
> each counter should only ever fire once.
>
> We must do this because PEBS is broken on NHM+ in that the
> pebs_record::status is a direct copy of the overflow status field at
> time of the assist and if you use the RESET thing nothing will clear the
> status bits and you cannot demux the PEBS events back to the event that
> generated them.
>
> Worse, since it's the overflow that arms the assist, and the assist
> happens at some undefined amount of cycles after this event it is
> possible for another assist to happen first.
>
> That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows
> first it is possible to find the CNT1 entry first in the buffer with
> both of them having status := 0x03.
>
> Complete and utter trainwreck.
>
> This is why we have a threshold of 1 and use NMI for PMI even for pure
> PEBS, it minimizes the complete clusterfuck described above.

What I mean is that, as per the observations, the problem seems to be
statistical: it happens only once every couple of million records. So,
as long as no memory is corrupted (the PEBS records don't go outside
the DS area), it could be ignored when it happens and still produce a
valid, usable profile.

( Btw., we might want to introduce an 'error' event passed to tools,
  which they could process in a soft, statistical manner: only warn the
  user if the erroneous events go beyond 1% or 5%, etc. Kernel warnings
  are really not the best fit for such purposes. )
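
A user-space sketch of what that soft accounting could look like on the
tooling side (hypothetical structure and threshold, purely
illustrative):

  #include <stdio.h>

  /* Hypothetical soft accounting for an 'error' stream event: count
   * erroneous records alongside good ones and only complain once the
   * error ratio crosses a threshold. */
  struct err_stats {
          unsigned long long total;
          unsigned long long errors;
  };

  static void account(struct err_stats *s, int is_error)
  {
          s->total++;
          if (is_error)
                  s->errors++;
  }

  static void maybe_warn(const struct err_stats *s)
  {
          double ratio = s->total ? (double)s->errors / s->total : 0.0;

          if (ratio > 0.01)       /* warn past 1%; tune as needed */
                  fprintf(stderr,
                          "warning: %.2f%% of samples erroneous (%llu/%llu)\n",
                          100.0 * ratio, s->errors, s->total);
  }

  int main(void)
  {
          struct err_stats s = { 0, 0 };

          /* Pretend 5 bad records out of a million sampled ones. */
          for (int i = 0; i < 1000000; i++)
                  account(&s, i % 200000 == 0);

          maybe_warn(&s); /* stays silent: 0.0005% is well below 1% */
          return 0;
  }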

Thanks,

Ingo

2013-09-23 15:25:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Mon, Sep 16, 2013 at 6:29 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
>>
>> * Stephane Eranian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Some updates on this problem.
>> > I have been running tests all week-end long on my HSW.
>> > I can reproduce the problem. What I know:
>> >
>> > - It is not linked with callchain
>> > - The extra entries are valid
>> > - The reset values are still zeroes
>> > - The problem does not happen on SNB with the same test case
>> > - The PMU state looks sane when that happens.
>> > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
>> >
>> > So it seems like the threshold is ignored. But I don't understand where
>> > there reset values are coming from. So it looks more like a bug in
>> > micro-code where under certain circumstances multiple entries get
>> > written.
>>
>> Either multiple entries are written, or the PMI/NMI is not asserted as it
>> should be?
>
> No, both :-)
>
>> > Something must be happening with the interrupt or HT. I will disable HT
>> > next and also disable the NMI watchdog.
>>
>> Yes, interaction with the NMI watchdog events might also be possible.
>>
>> If it's truly just the threshold that is broken occasionally in a
>> statistically insignificant manner then the bug is relatively benign and
>> we could work it around in the kernel by ignoring excess entries.
>>
>> In that case we should probably not annoy users with the scary kernel
>> warning and instead increase a debug count somewhere so that it's still
>> detectable.
>
> It's not just a broken threshold. When a PEBS event happens it can
> re-arm itself, but only if you program a RESET value != 0. We don't do
> that, so each counter should only ever fire once.
>
> We must do this because PEBS is broken on NHM+: pebs_record::status is
> a direct copy of the overflow status field at the time of the assist,
> and if you use the RESET mechanism nothing will clear the status bits,
> so you cannot demux the PEBS records back to the event that generated
> them.
>
Trying to understand this problem better. You are saying that when
sampling multiple PEBS events there is a problem if you allow more than
one record in the PEBS buffer, because the overflow status is not reset
properly.

For instance, if the first record is caused by counter 0,
ovfl_status=0x1, then the counter is reset. Then, if counter 1 is the
cause of the next record, that record has ovfl_status=0x3 instead of
ovfl_status=0x2? Is that what you are saying?

If so then yes, I agree this is a serious bug and we need to have Intel fix it.
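
Spelled out as a sequence, my understanding of the failure mode (still
to be confirmed) would be:

  CNT0 overflows -> record 0 written, status copy = 0x1   (correct)
  bit 0 is never cleared from the status field
  CNT1 overflows -> record 1 written, status copy = 0x3   (bit 0 stale)
  demux record 1 -> cannot tell whether CNT0 or CNT1 caused it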

> Worse, since it's the overflow that arms the assist, and the assist
> happens some undefined number of cycles after that overflow, it is
> possible for another assist to happen first.
>
> That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0
> overflows first; it is then possible to find the CNT1 entry first in
> the buffer, with both entries having status := 0x03.
>
> Complete and utter trainwreck.

2013-09-23 15:34:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Mon, Sep 23, 2013 at 05:25:19PM +0200, Stephane Eranian wrote:
> > It's not just a broken threshold. When a PEBS event happens it can
> > re-arm itself, but only if you program a RESET value != 0. We don't
> > do that, so each counter should only ever fire once.
> >
> > We must do this because PEBS is broken on NHM+: pebs_record::status
> > is a direct copy of the overflow status field at the time of the
> > assist, and if you use the RESET mechanism nothing will clear the
> > status bits, so you cannot demux the PEBS records back to the event
> > that generated them.
> >
> Trying to understand this problem better. You are saying that when
> sampling multiple PEBS events there is a problem if you allow more
> than one record in the PEBS buffer, because the overflow status is not
> reset properly.

That is what I wrote; but I'm not entirely sure that's correct. I think
it will reset the overflow bits once it does an actual reset after the
PEBS assist triggers, but see below.

> For instance, if the first record is caused by counter 0,
> ovfl_status=0x1, then the counter is reset. Then, if counter 1 is the
> cause of the next record, that record has ovfl_status=0x3 instead of
> ovfl_status=0x2? Is that what you are saying?
>
> If so then yes, I agree this is a serious bug and we need to have Intel fix it.

But there's still the case where with 2 counters you can get:

cnt0 overflows; sets status |= 1 << 0, arms PEBS0 assist
cnt1 overflows; sets status |= 1 << 1, arms PEBS1 assist

PEBS0 ready to trigger
PEBS1 ready to trigger

Cnt1 event -> PEBS1 trigger, writes entry with status := 0x03
Cnt0 event -> PEBS0 trigger, writes entry with status := 0x03

At which point you'll have 2 entries with the same status bits, in
'reverse' order.

If we'd set RESET, the second entry would have status := 0x01, which
would be unambiguous again. But we'd still not know where to place the
0x03 entry.

With more PEBSn counters enabled and a threshold > 1, the chance of
such scenarios is greatly increased.

The threshold := 1 case tries to avoid these cases by getting the
records out as fast as possible, hopefully avoiding the second trigger.
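
For reference, in DS-area terms the threshold := 1 arrangement just
means programming the interrupt threshold one record past the buffer
base and leaving all reset values at 0; roughly (a user-space mock of
the layout, record size illustrative):

  #include <stdint.h>

  #define MAX_PEBS_EVENTS   4
  #define PEBS_RECORD_SIZE  176   /* illustrative; depends on CPU model */

  /* Mock of the DS save area fields relevant to PEBS. */
  struct debug_store {
          uint64_t pebs_buffer_base;
          uint64_t pebs_index;
          uint64_t pebs_absolute_maximum;
          uint64_t pebs_interrupt_threshold;
          uint64_t pebs_event_reset[MAX_PEBS_EVENTS];
  };

  static void setup_pebs_threshold_one(struct debug_store *ds,
                                       uint64_t base, uint64_t size)
  {
          ds->pebs_buffer_base = base;
          ds->pebs_index = base;
          ds->pebs_absolute_maximum = base + size;
          /* PMI as soon as the first record lands: threshold := 1. */
          ds->pebs_interrupt_threshold = base + PEBS_RECORD_SIZE;
          /* RESET values of 0: the assist does not re-arm the counter,
           * so each counter contributes at most one record per PMI. */
          for (int i = 0; i < MAX_PEBS_EVENTS; i++)
                  ds->pebs_event_reset[i] = 0;
  }

  int main(void)
  {
          static uint64_t buf[512];
          struct debug_store ds;

          setup_pebs_threshold_one(&ds, (uint64_t)(uintptr_t)buf,
                                   sizeof(buf));
          return 0;
  }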

2013-09-23 17:11:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Mon, Sep 23, 2013 at 5:33 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Sep 23, 2013 at 05:25:19PM +0200, Stephane Eranian wrote:
>> > It's not just a broken threshold. When a PEBS event happens it can
>> > re-arm itself, but only if you program a RESET value != 0. We
>> > don't do that, so each counter should only ever fire once.
>> >
>> > We must do this because PEBS is broken on NHM+: pebs_record::status
>> > is a direct copy of the overflow status field at the time of the
>> > assist, and if you use the RESET mechanism nothing will clear the
>> > status bits, so you cannot demux the PEBS records back to the
>> > event that generated them.
>> >
>> Trying to understand this problem better. You are saying that when
>> sampling multiple PEBS events there is a problem if you allow more
>> than one record in the PEBS buffer, because the overflow status is
>> not reset properly.
>
> That is what I wrote; but I'm not entirely sure that's correct. I
> think it will reset the overflow bits once it does an actual reset
> after the PEBS assist triggers, but see below.
>
>> For instance, if the first record is caused by counter 0,
>> ovfl_status=0x1, then the counter is reset. Then, if counter 1 is the
>> cause of the next record, that record has ovfl_status=0x3 instead of
>> ovfl_status=0x2? Is that what you are saying?
>>
>> If so then yes, I agree this is a serious bug and we need to have Intel fix it.
>
> But there's still the case where with 2 counters you can get:
>
> cnt0 overflows; sets status |= 1 << 0, arms PEBS0 assist
> cnt1 overflows; sets status |= 1 << 1, arms PEBS1 assist
>
> PEBS0 ready to trigger
> PEBS1 ready to trigger
>
> Cnt1 event -> PEBS1 trigger, writes entry with status := 0x03
> Cnt0 event -> PEBS0 trigger, writes entry with status := 0x03
>
> At which point you'll have 2 entries with the same status bits, in
> 'reverse' order.
>
OK, so what you are saying is that the ovfl_status is not kept private
to each counter but is shared among all PEBS counters by the ucode.
That's how you end up leaking between counters like that.

But the other thing I remember is that if two PEBS events overflow at
the same time, PEBS only writes one record, with 2 bits set in the
ovfl_status field. There is no point in creating two, because the
machine state will be the same for both. The kernel would just need to
dispatch the same PEBS record to all the events that overflowed.
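
A user-space flavoured sketch of that dispatch (simplified record,
hypothetical deliver() helper standing in for perf's per-event
delivery):

  #include <stdint.h>
  #include <stdio.h>

  struct pebs_record {
          uint64_t status;
          uint64_t ip;
  };

  /* Hypothetical stand-in for handing a sample to one event. */
  static void deliver(int counter, const struct pebs_record *rec)
  {
          printf("counter %d gets record at ip 0x%llx\n",
                 counter, (unsigned long long)rec->ip);
  }

  /* Hand the same record to every counter whose status bit is set,
   * instead of only the first one. */
  static void dispatch_all(const struct pebs_record *rec,
                           uint64_t pebs_enabled)
  {
          uint64_t status = rec->status & pebs_enabled;

          while (status) {
                  deliver(__builtin_ctzll(status), rec);
                  status &= status - 1;   /* clear lowest set bit */
          }
  }

  int main(void)
  {
          struct pebs_record rec = { 0x3, 0x400300 }; /* CNT0 and CNT1 */

          dispatch_all(&rec, 0x3);
          return 0;
  }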

Now, your case looks just like that, except that is not what happened.
So you're misled into believing both counters overflowed at the same
time when in reality they did not.

I'd like to have a test case where I could reproduce this.

> If we'd set RESET, the second entry would have status := 0x01, which
> would be unambiguous again. But we'd still not know where to place the
> 0x03 entry.
>
> With more PEBSn counters enabled and a threshold > 1, the chance of
> such scenarios is greatly increased.
>
> The threshold := 1 case tries to avoid these cases by getting the
> records out as fast as possible, hopefully avoiding the second
> trigger.

I understand. We need to verify that the HSW problem is not related to
having 2 events running: cycles:pp and cycles for the NMI watchdog.
What if both interrupt at the same time? Normally, I'd expect the PEBS
code to drain the buffer for the PEBS event and then the regular
handler to run for the NMI watchdog event. But obviously, this does not
happen that way. I will continue to investigate.

2013-09-23 17:25:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

On Mon, Sep 23, 2013 at 07:11:21PM +0200, Stephane Eranian wrote:
> OK, so what you are saying is that the ovfl_status is not kept private
> to each counter but is shared among all PEBS counters by the ucode.
> That's how you end up leaking between counters like that.

I only remember asking for clarification because the SDM isn't clear on
this subject; the answer was that it simply copies whatever is in
MSR_CORE_PERF_GLOBAL_STATUS.

I explained how this would be a problem and it was agreed this needed
fixing -- not sure if that ever happened.

> But the other thing I remember is that if two PEBS events overflow at
> the same time, PEBS only writes one record, with 2 bits set in the
> ovfl_status field. There is no point in creating two, because the
> machine state will be the same for both. The kernel would just need to
> dispatch the same PEBS record to all the events that overflowed.

Hurm.. that makes life more interesting still. The current code only
delivers the record to the event for the first bit set. Changing this
would be simple though.

> Now, your case looks just like that, except that is not what happened.
> So you're misled into believing both counters overflowed at the same
> time when in reality they did not.
>
> I'd like to have a test case where I could reproduce this.

Agreed, I've never actually tried to reproduce this. I suppose it would
be easiest to trigger in a setup where one event is very rare and
controlled.
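
A starting point might be something along these lines (untested; the
event selection and the ./controlled-testcase workload are purely
illustrative): one frequent PEBS event alongside a rare, precisely
controlled one, then watch for oversized drains:

  perf record -e cycles:pp -e branch-misses:pp ./controlled-testcase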