2020-10-12 12:57:05

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation


Patch set provides threaded trace streaming for base perf record
operation. Provided streaming mode (--threads) mitigates profiling
data losses and resolves scalability issues of serial and asynchronous
(--aio) trace streaming modes on multicore server systems. The patch
set is based on the prototype [1], [2] and the most closely relates
to mode 3) "mode that creates thread for every monitored memory map".

The threaded mode executes one-to-one mapping of trace streaming threads
to mapped data buffers and streaming into per-CPU trace files located
at data directory. The data buffers and threads are affined to NUMA
nodes and monitored CPUs according to system topology. --cpu option
can be used to specify exact CPUs to be monitored.

Basic analysis of data directories is provided for perf report mode.
Raw dump (-D) and aggregated reports are available for data directories,
still with no memory consumption optimizations. However data directories
collected with --compression-level option enabled can be analyzed with
little less memory because trace files are unmaped from tool process
memory after loading collected data.

Provided streaming mode is available with Zstd compression/decompression
(--compression-level) and handling of external commands (--control).
AUX area tracing, related and derived modes like --snapshot or
--aux-sample are not enabled. --switch-output, --switch-output-event,
--switch-max-files and --timestamp-filename options are not enabled.
Threaded trace streaming is not enabled for pipe mode. Asynchronous
(--aio) trace streaming and affinity (--affinity) modes are mutually
exclusive to threaded streaming mode.

See testing results and validation examples below.

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

---
Alexey Budankov (15):
perf session: introduce trace file path to be shown in raw trace dump
perf report: output trace file name in raw trace dump
perf data: open data directory in read access mode
perf session: move reader object definition to header file
perf session: introduce decompressor into trace reader object
perf session: load data directory into tool process memory
perf record: introduce trace file, compressor and stats in mmap object
perf record: write trace data into mmap trace files
perf record: introduce thread specific objects for trace streaming
perf record: manage thread specific data array
perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
perf record: introduce thread local variable for trace streaming
perf record: stop threads in the end of trace streaming
perf record: start threads in the beginning of trace streaming
perf record: introduce --threads command line option

tools/perf/Documentation/perf-record.txt | 7 +
tools/perf/builtin-record.c | 514 +++++++++++++++++++++--
tools/perf/util/data.c | 4 +
tools/perf/util/evlist.c | 16 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/mmap.c | 6 +
tools/perf/util/mmap.h | 5 +
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 150 ++++---
tools/perf/util/session.h | 28 ++
tools/perf/util/tool.h | 3 +-
12 files changed, 635 insertions(+), 101 deletions(-)

---
Testing results:

$ perf test
1: vmlinux symtab matches kallsyms : Skip
2: Detect openat syscall event : Ok
3: Detect openat syscall event on all cpus : Ok
4: Read samples using the mmap interface : Ok
5: Test data source output : Ok
6: Parse event definition strings : Ok
7: Simple expression parser : Ok
8: PERF_RECORD_* events & perf_sample fields : Ok
9: Parse perf pmu format : Ok
10: PMU events :
10.1: PMU event table sanity : Ok
10.2: PMU event map aliases : Ok
10.3: Parsing of PMU event table metrics : Skip (some metrics failed)
10.4: Parsing of PMU event table metrics with fake PMUs : Ok
11: DSO data read : Ok
12: DSO data cache : Ok
13: DSO data reopen : FAILED!
14: Roundtrip evsel->name : Ok
15: Parse sched tracepoints fields : Ok
16: syscalls:sys_enter_openat event fields : Ok
17: Setup struct perf_event_attr : Ok
18: Match and link multiple hists : Ok
19: 'import perf' in python : FAILED!
20: Breakpoint overflow signal handler : Ok
21: Breakpoint overflow sampling : Ok
22: Breakpoint accounting : Ok
23: Watchpoint :
23.1: Read Only Watchpoint : Skip
23.2: Write Only Watchpoint : Ok
23.3: Read / Write Watchpoint : Ok
23.4: Modify Watchpoint : Ok
24: Number of exit events of a simple workload : Ok
25: Software clock events period values : Ok
26: Object code reading : Ok
27: Sample parsing : Ok
28: Use a dummy software event to keep tracking : Ok
29: Parse with no sample_id_all bit set : Ok
30: Filter hist entries : Ok
31: Lookup mmap thread : Ok
32: Share thread maps : Ok
33: Sort output of hist entries : Ok
34: Cumulate child hist entries : Ok
35: Track with sched_switch : Ok
36: Filter fds with revents mask in a fdarray : Ok
37: Add fd to a fdarray, making it autogrow : Ok
38: kmod_path__parse : Ok
39: Thread map : Ok
40: LLVM search and compile :
40.1: Basic BPF llvm compile : Skip
40.2: kbuild searching : Skip
40.3: Compile source for BPF prologue generation : Skip
40.4: Compile source for BPF relocation : Skip
41: Session topology : Ok
42: BPF filter :
42.1: Basic BPF filtering : Skip
42.2: BPF pinning : Skip
42.3: BPF prologue generation : Skip
42.4: BPF relocation checker : Skip
43: Synthesize thread map : Ok
44: Remove thread map : Ok
45: Synthesize cpu map : Ok
46: Synthesize stat config : Ok
47: Synthesize stat : Ok
48: Synthesize stat round : Ok
49: Synthesize attr update : Ok
50: Event times : Ok
51: Read backward ring buffer : Ok
52: Print cpu map : Ok
53: Merge cpu map : Ok
54: Probe SDT events : Ok
55: is_printable_array : Ok
56: Print bitmap : Ok
57: perf hooks : Ok
58: builtin clang support : Skip (not compiled in)
59: unit_number__scnprintf : Ok
60: mem2node : Ok
61: time utils : Ok
62: Test jit_write_elf : Ok
63: Test libpfm4 support : Skip (not compiled in)
64: Test api io : Ok
65: maps__merge_in : Ok
66: Demangle Java : Ok
67: Parse and process metrics : Ok
68: PE file support : Ok
69: Event expansion for cgroups : Ok
70: x86 rdpmc : Ok
71: Convert perf time to TSC : Ok
72: DWARF unwind : Ok
73: x86 instruction decoder - new instructions : Ok
74: Intel PT packet decoder : Ok
75: x86 bp modify : Ok
76: Check open filename arg using perf trace + vfs_getname : FAILED!
77: Add vfs_getname probe to get syscall args filenames : FAILED!
78: probe libc's inet_pton & backtrace it with ping : Ok
79: Use vfs_getname probe to get syscall args filenames : FAILED!
80: Zstd perf.data compression/decompression : Ok
81: Check Arm CoreSight trace data recording and synthesized samples: Skip

Validation:

$ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
Addr of buf1 = 0x7f8af24ae010
Offs of buf1 = 0x7f8af24ae180
Addr of buf2 = 0x7f8af04ad010
Offs of buf2 = 0x7f8af04ad1c0
Addr of buf3 = 0x7f8aee4ac010
Offs of buf3 = 0x7f8aee4ac100
Addr of buf4 = 0x7f8aec4ab010
Offs of buf4 = 0x7f8aec4ab140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 24.579 seconds
[ perf record: Woken up 972 times to write data ]
[ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]

$ perf report --header --stats
# ========
# captured on : Wed Oct 7 19:07:50 2020
# header version : 1
# data offset : 504
# data size : 30056
# feat offset : 30560
# hostname : nntvtune39
# os release : 5.8.8-100.fc31.x86_64
# perf version : 4.13.rc5.gdf19b1347b82
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
# cpuid : GenuineIntel,6,94,3
# total memory : 16119484 kB
# cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench /matrix/linux/matrix.gcc.g.O3
# event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
# event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
# sibling sockets : 0-7
# sibling dies : 0-7
# sibling threads : 0,4
# sibling threads : 1,5
# sibling threads : 2,6
# sibling threads : 3,7
# CPU 0: Core ID 0, Die ID 0, Socket ID 0
# CPU 1: Core ID 1, Die ID 0, Socket ID 0
# CPU 2: Core ID 2, Die ID 0, Socket ID 0
# CPU 3: Core ID 3, Die ID 0, Socket ID 0
# CPU 4: Core ID 0, Die ID 0, Socket ID 0
# CPU 5: Core ID 1, Die ID 0, Socket ID 0
# CPU 6: Core ID 2, Die ID 0, Socket ID 0
# CPU 7: Core ID 3, Die ID 0, Socket ID 0
# node0 meminfo : total = 16119484 kB, free = 933088 kB
# node0 cpu list : 0-7
# pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
# CPU cache info:
# L1 Data 32K [0,4]
# L1 Instruction 32K [0,4]
# L1 Data 32K [1,5]
# L1 Instruction 32K [1,5]
# L1 Data 32K [1,5]
# L1 Instruction 32K [1,5]
# L1 Data 32K [2,6]
# L1 Instruction 32K [2,6]
# L1 Data 32K [3,7]
# L1 Instruction 32K [3,7]
# L2 Unified 256K [0,4]
# L2 Unified 256K [1,5]
# L2 Unified 256K [2,6]
# L2 Unified 256K [3,7]
# L3 Unified 8192K [0-7]
# time of first sample : 0.000000
# time of last sample : 0.000000
# sample duration : 0.000 ms
# memory nodes (nr 1, block size 0x8000000):
# 0 [16G]: 0-23,32-135
# directory data version : 1
# bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
# bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
# bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
# bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
# bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
# bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
# bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
# bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
# bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
# bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
# bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
# bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
# bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
# compressed : Zstd, level = 3, ratio = 6
# cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
# missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA
# ========
#
Aggregated stats:
TOTAL events: 1457107
MMAP events: 98
LOST events: 0
COMM events: 2
EXIT events: 6
THROTTLE events: 0
UNTHROTTLE events: 0
FORK events: 8
READ events: 0
SAMPLE events: 1456388
MMAP2 events: 7
AUX events: 0
ITRACE_START events: 0
LOST_SAMPLES events: 0
SWITCH events: 0
SWITCH_CPU_WIDE events: 0
NAMESPACES events: 0
KSYMBOL events: 13
BPF_EVENT events: 13
CGROUP events: 166
TEXT_POKE events: 0
ATTR events: 0
EVENT_TYPE events: 0
TRACING_DATA events: 0
BUILD_ID events: 0
FINISHED_ROUND events: 0
ID_INDEX events: 0
AUXTRACE_INFO events: 0
AUXTRACE events: 0
AUXTRACE_ERROR events: 0
THREAD_MAP events: 1
CPU_MAP events: 1
STAT_CONFIG events: 0
STAT events: 0
STAT_ROUND events: 0
EVENT_UPDATE events: 0
TIME_CONV events: 1
FEATURE events: 0
COMPRESSED events: 403

$ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
Addr of buf1 = 0x7f15e8b89010
Offs of buf1 = 0x7f15e8b89180
Addr of buf2 = 0x7f15e6b88010
Offs of buf2 = 0x7f15e6b881c0
Addr of buf3 = 0x7f15e4b87010
Offs of buf3 = 0x7f15e4b87100
Addr of buf4 = 0x7f15e2b86010
Offs of buf4 = 0x7f15e2b86140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 24.281 seconds
[ perf record: Woken up 6923 times to write data ]
[ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]

$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 708K of event 'cycles'
# Event count (approx.): 277718834163
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ........................ ........................................................
#
99.97% 0.00% matrix.gcc.g.O3 libc-2.30.so [.] __GI___clone (inlined)
|
---__GI___clone (inlined)
|
--99.97%--start_thread
|
--99.97%--ThreadFunction
multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 0.00% matrix.gcc.g.O3 libpthread-2.30.so [.] start_thread
|
---start_thread
|
--99.97%--ThreadFunction
|
--99.97%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 0.00% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] ThreadFunction
|
---ThreadFunction
|
--99.97%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 98.68% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] multiply1
|
|--98.68%--__GI___clone (inlined)
| start_thread
| ThreadFunction
| multiply1
|
--1.28%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

0.97% 0.05% matrix.gcc.g.O3 [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
|
--0.91%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock


--
2.24.1


2020-10-12 12:57:22

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 03/15] perf data: open data directory in read access mode


Open files located at data directory in case of read access mode.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/data.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
index c47aa34fdc0a..6ad61ac6ba67 100644
--- a/tools/perf/util/data.c
+++ b/tools/perf/util/data.c
@@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
return -1;

ret = open_file(data);
+ if (!ret && perf_data__is_dir(data)) {
+ if (perf_data__is_read(data))
+ ret = perf_data__open_dir(data);
+ }

/* Cleanup whatever we managed to create so far. */
if (ret && perf_data__is_write(data))
--
2.24.1


2020-10-12 12:57:33

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 05/15] perf session: introduce decompressor into trace reader object


Introduce decompressor to trace reader object so that decompression
could be executed on per trace file basis separately for every
trace file located at data directory.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/session.c | 4 +++-
tools/perf/util/session.h | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 911b2dbcd0ac..6afc670fdf0c 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -44,6 +44,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
u64 decomp_last_rem = 0;
size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
struct decomp *decomp, *decomp_last = session->decomp_last;
+ struct zstd_data *zstd_data = session->reader ?
+ &(session->reader->zstd_data) : &(session->zstd_data);

if (decomp_last) {
decomp_last_rem = decomp_last->size - decomp_last->head;
@@ -71,7 +73,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
src = (void *)event + sizeof(struct perf_record_compressed);
src_size = event->pack.header.size - sizeof(struct perf_record_compressed);

- decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
+ decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
&(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
if (!decomp_size) {
munmap(decomp, mmap_len);
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index abdb8518a81f..4fc9ccdf7970 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -42,6 +42,7 @@ struct reader {
u64 data_size;
u64 data_offset;
reader_cb_t process;
+ struct zstd_data zstd_data;
};

struct perf_session {
--
2.24.1


2020-10-12 12:58:09

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 10/15] perf record: manage thread specific data array


Provide allocation, initialization, finalization and releasing of
thread specific objects at thread specific data array. Allocate
thread specific object for every data buffer making one-to-one
relation between data buffer and a thread processing the buffer.
Deliver event fd related signals to thread's pollfd object.
Deliver thread control commands to ctlfd_pos fd of thread 1+.
Deliver tool external control commands via ctlfd_pos fd of thread 0.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8e512096a060..89cb8e913fb3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
return kcore_copy(from_dir, kcore_dir);
}

+static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
+ struct fdarray *evlist_pollfd, int ctlfd_pos)
+{
+ int i, j, k, nr_thread_data;
+ struct thread_data *thread_data;
+
+ rec->nr_thread_data = nr_thread_data = nr_mmaps;
+ rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
+ if (!thread_data) {
+ pr_err("Failed to allocate thread data\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < nr_thread_data; i++) {
+ short revents;
+ int pos, fd;
+
+ thread_data[i].tid = -1;
+
+ if (pipe(thread_data[i].comm.msg) ||
+ pipe(thread_data[i].comm.ack)) {
+ pr_err("Failed to create thread comm pipes, errno %d\n", errno);
+ return -ENOMEM;
+ }
+
+ thread_data[i].maps = &mmaps[i];
+ thread_data[i].nr_mmaps = 1;
+
+ thread_data[i].rec = rec;
+
+ fdarray__init(&(thread_data[i].pollfd), 64);
+
+ for (j = 0; j < thread_data[i].nr_mmaps; j++) {
+ struct mmap *map = &(thread_data[i].maps[j]);
+
+ for (k = 0; k < evlist_pollfd->nr; k++) {
+ if (evlist_pollfd->priv[k].ptr != map)
+ continue;
+
+ fd = evlist_pollfd->entries[k].fd;
+ revents = evlist_pollfd->entries[k].events;
+ pos = fdarray__add(&(thread_data[i].pollfd),
+ fd, revents | POLLERR | POLLHUP,
+ fdarray_flag__default);
+ if (pos >= 0)
+ thread_data[i].pollfd.priv[pos].ptr = map;
+ else
+ return -ENOMEM;
+ }
+ }
+
+ if (i) {
+ fd = thread_data[i].comm.msg[0];
+ revents = POLLIN | POLLERR | POLLHUP;
+ } else {
+ if (ctlfd_pos == -1)
+ continue;
+ fd = evlist_pollfd->entries[ctlfd_pos].fd;
+ revents = evlist_pollfd->entries[ctlfd_pos].events;
+ }
+ thread_data[i].ctlfd_pos =
+ fdarray__add(&(thread_data[i].pollfd),
+ fd, revents, fdarray_flag__nonfilterable);
+ if (thread_data[i].ctlfd_pos < 0)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int record__free_thread_data(struct record *rec)
+{
+ int i;
+
+ if (rec->thread_data) {
+ for (i = 0; i < rec->nr_thread_data; i++) {
+ close(rec->thread_data[i].comm.msg[0]);
+ close(rec->thread_data[i].comm.msg[1]);
+ close(rec->thread_data[i].comm.ack[0]);
+ close(rec->thread_data[i].comm.ack[1]);
+ fdarray__exit(&(rec->thread_data[i].pollfd));
+ }
+ zfree(&rec->thread_data);
+ }
+
+ return 0;
+}
+
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
@@ -918,6 +1006,9 @@ static int record__mmap_evlist(struct record *rec,
}
}

+ if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
+ return -1;
+
if (record__threads_enabled(rec)) {
int i, ret, nr = evlist->core.nr_mmaps;
struct mmap *mmaps = rec->opts.overwrite ?
@@ -929,6 +1020,12 @@ static int record__mmap_evlist(struct record *rec,

for (i = 0; i < nr; i++)
mmaps[i].file = &rec->data.dir.files[i];
+
+ ret = record__alloc_thread_data(rec, mmaps, nr,
+ &evlist->core.pollfd,
+ evlist->ctl_fd.pos);
+ if (ret)
+ return ret;
}

return 0;
@@ -1910,9 +2007,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
perf_evlist__start_workload(rec->evlist);
}

- if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
- goto out_child;
-
if (opts->initial_delay) {
pr_info(EVLIST_DISABLED_MSG);
if (opts->initial_delay > 0) {
@@ -2063,6 +2157,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);

out_child:
+ record__free_thread_data(rec);
evlist__finalize_ctlfd(rec->evlist);
record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
--
2.24.1

2020-10-12 12:58:58

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 14/15] perf record: start threads in the beginning of trace streaming


Start threads in detached state because its management is possible
via messaging. Block signals prior the threads start so only main
tool thread would be notified on external async signals during data
collection. Streaming threads connect one-to-one to mapped data
buffers and write into per-CPU trace files located at data directory.
Data buffers and threads are affined to local NUMA nodes and monitored
CPUs according to system topology. --cpu option can be used to specify
exact CPUs to be monitored.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
1 file changed, 120 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a15642656066..1d41e996a994 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -56,6 +56,7 @@
#include <poll.h>
#include <pthread.h>
#include <unistd.h>
+#include <sys/syscall.h>
#include <sched.h>
#include <signal.h>
#ifdef HAVE_EVENTFD_SUPPORT
@@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
perf_mmap__put(map);
}

+static void *record__thread(void *arg)
+{
+ enum thread_msg msg = THREAD_MSG__READY;
+ bool terminate = false;
+ struct fdarray *pollfd;
+ int err, ctlfd_pos;
+
+ thread = arg;
+ thread->tid = syscall(SYS_gettid);
+
+ err = write(thread->comm.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
+
+ pollfd = &(thread->pollfd);
+ ctlfd_pos = thread->ctlfd_pos;
+
+ for (;;) {
+ unsigned long long hits = thread->samples;
+
+ if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+ break;
+
+ if (hits == thread->samples) {
+
+ err = fdarray__poll(pollfd, -1);
+ /*
+ * Propagate error, only if there's any. Ignore positive
+ * number of returned events and interrupt error.
+ */
+ if (err > 0 || (err < 0 && errno == EINTR))
+ err = 0;
+ thread->waking++;
+
+ if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+ record__thread_munmap_filtered, NULL) == 0)
+ break;
+ }
+
+ if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+ terminate = true;
+ close(thread->comm.msg[0]);
+ pollfd->entries[ctlfd_pos].fd = -1;
+ pollfd->entries[ctlfd_pos].events = 0;
+ }
+
+ pollfd->entries[ctlfd_pos].revents = 0;
+ }
+
+ err = write(thread->comm.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads: %d failed to notify on termination. Error %m", thread->tid);
+
+ return NULL;
+}
+
static void record__init_features(struct record *rec)
{
struct perf_session *session = rec->session;
@@ -1823,6 +1880,58 @@ static int record__terminate_thread(struct thread_data *thread_data)
return 0;
}

+static int record__start_threads(struct record *rec)
+{
+ int i, j, ret = 0;
+ sigset_t full, mask;
+ pthread_t handle;
+ pthread_attr_t attrs;
+ int nr_thread_data = rec->nr_thread_data;
+ struct thread_data *thread_data = rec->thread_data;
+
+ if (!record__threads_enabled(rec))
+ return 0;
+
+ sigfillset(&full);
+ if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+ pr_err("Failed to block signals on threads start. Error: %m\n");
+ return -1;
+ }
+
+ pthread_attr_init(&attrs);
+ pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+ for (i = 1; i < nr_thread_data; i++) {
+ int res = 0;
+ enum thread_msg msg = THREAD_MSG__UNSUPPORTED;
+
+ if (pthread_create(&handle, &attrs, record__thread, &thread_data[i])) {
+ for (j = 1; j < i; j++)
+ record__terminate_thread(&thread_data[i]);
+ pr_err("Failed to start threads. Error: %m\n");
+ ret = -1;
+ goto out_err;
+ }
+
+ res = read(thread_data[i].comm.ack[0], &msg, sizeof(msg));
+ if (res > 0)
+ pr_debug("threads: %d -> %s\n", rec->thread_data[i].tid,
+ thread_msg_tags[msg]);
+ }
+
+ thread = &thread_data[0];
+ thread->tid = syscall(SYS_gettid);
+ pr_debug("threads: %d -> %s\n", thread->tid, thread_msg_tags[THREAD_MSG__READY]);
+
+out_err:
+ if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+ pr_err("Failed to unblock signals on threads start. Error: %m\n");
+ ret = -1;
+ }
+
+ return ret;
+}
+
static int record__stop_threads(struct record *rec, unsigned long *waking)
{
int i, j, nr_thread_data = rec->nr_thread_data;
@@ -1965,7 +2074,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
err = record__kcore_copy(&session->machines.host, data);
if (err) {
pr_err("ERROR: Failed to copy kcore\n");
- goto out_child;
+ goto out_free_threads;
}
}

@@ -1976,7 +2085,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
pr_err("ERROR: Apply config to BPF failed: %s\n",
errbuf);
- goto out_child;
+ goto out_free_threads;
}

/*
@@ -1994,11 +2103,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (data->is_pipe) {
err = perf_header__write_pipe(fd);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
} else {
err = perf_session__write_header(session, rec->evlist, fd, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
}

err = -1;
@@ -2006,16 +2115,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
&& !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
pr_err("Couldn't generate buildids. "
"Use --no-buildid to profile anyway.\n");
- goto out_child;
+ goto out_free_threads;
}

err = record__setup_sb_evlist(rec);
if (err)
- goto out_child;
+ goto out_free_threads;

err = record__synthesize(rec, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;

if (rec->realtime_prio) {
struct sched_param param;
@@ -2024,10 +2133,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (sched_setscheduler(0, SCHED_FIFO, &param)) {
pr_err("Could not set realtime priority.\n");
err = -1;
- goto out_child;
+ goto out_free_threads;
}
}

+ if (record__start_threads(rec))
+ goto out_free_threads;
+
/*
* When perf is starting the traced process, all the events
* (apart from group members) have enable_on_exec=1 set,
--
2.24.1

2020-10-12 12:59:32

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 09/15] perf record: introduce thread specific objects for trace streaming


Introduce thread local data object and its array to be used for
threaded trace streaming.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba26d75c51d6..8e512096a060 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -85,11 +85,29 @@ struct switch_output {
int cur_file;
};

+struct thread_data {
+ pid_t tid;
+ struct {
+ int msg[2];
+ int ack[2];
+ } comm;
+ struct fdarray pollfd;
+ int ctlfd_pos;
+ struct mmap *maps;
+ int nr_mmaps;
+ struct record *rec;
+ unsigned long long samples;
+ unsigned long waking;
+ u64 bytes_written;
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
u64 bytes_written;
struct perf_data data;
+ struct thread_data *thread_data;
+ int nr_thread_data;
struct auxtrace_record *itr;
struct evlist *evlist;
struct perf_session *session;
--
2.24.1


2020-10-12 16:08:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 05/15] perf session: introduce decompressor into trace reader object

On Mon, Oct 12, 2020 at 11:57:00AM +0300, Alexey Budankov wrote:
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index abdb8518a81f..4fc9ccdf7970 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -42,6 +42,7 @@ struct reader {
> u64 data_size;
> u64 data_offset;
> reader_cb_t process;
> + struct zstd_data zstd_data;

Just to make sure, there is a suitable definition of zstd_data in case
the Zstd library not available?

-Andi

2020-10-12 16:51:55

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 05/15] perf session: introduce decompressor into trace reader object


On 12.10.2020 19:05, Andi Kleen wrote:
> On Mon, Oct 12, 2020 at 11:57:00AM +0300, Alexey Budankov wrote:
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index abdb8518a81f..4fc9ccdf7970 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -42,6 +42,7 @@ struct reader {
>> u64 data_size;
>> u64 data_offset;
>> reader_cb_t process;
>> + struct zstd_data zstd_data;
>
> Just to make sure, there is a suitable definition of zstd_data in case
> the Zstd library not available?

That's right.

Alexei

>
> -Andi
>

2020-10-12 16:56:04

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 03/15] perf data: open data directory in read access mode


On 12.10.2020 19:03, Andi Kleen wrote:
> On Mon, Oct 12, 2020 at 11:55:07AM +0300, Alexey Budankov wrote:
>>
>> Open files located at data directory in case of read access mode.
>
> Need some rationale. Is this a bug fix? Did the case not matter
> before?

This is not a bug fix. The case didn't matter before.

>
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/util/data.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
>> index c47aa34fdc0a..6ad61ac6ba67 100644
>> --- a/tools/perf/util/data.c
>> +++ b/tools/perf/util/data.c
>> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>> return -1;
>>
>> ret = open_file(data);
>> + if (!ret && perf_data__is_dir(data)) {
>> + if (perf_data__is_read(data))
>> + ret = perf_data__open_dir(data);
>> + }
>>
>> /* Cleanup whatever we managed to create so far. */
>> if (ret && perf_data__is_write(data))
>> --
>> 2.24.1

Alexei

2020-10-12 20:21:18

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 02/15] perf report: output trace file name in raw trace dump


Output path of a trace file into raw dump (-D) <file_offset>@<path/file>.
Print offset of PERF_RECORD_COMPRESSED record instead of zero for
decompressed records:
[email protected] [0x30]: event: 9
or
[email protected]/data.7 [0x30]: event: 9

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/session.c | 75 +++++++++++++++++++++++----------------
tools/perf/util/tool.h | 3 +-
2 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 4478ddae485b..6f09d506b2f6 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -36,7 +36,8 @@

#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
void *src;
size_t decomp_size, src_size;
@@ -58,6 +59,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
}

decomp->file_pos = file_offset;
+ decomp->file_path = file_path;
decomp->mmap_len = mmap_len;
decomp->head = 0;

@@ -98,7 +100,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

static int perf_session__open(struct perf_session *session)
{
@@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
ordered_events);

return perf_session__deliver_event(session, event->event,
- session->tool, event->file_offset);
+ session->tool, event->file_offset,
+ event->file_path);
}

struct perf_session *perf_session__new(struct perf_data *data,
@@ -452,7 +456,8 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu

static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
union perf_event *event __maybe_unused,
- u64 file_offset __maybe_unused)
+ u64 file_offset __maybe_unused,
+ const char *file_path __maybe_unused)
{
dump_printf(": unhandled!\n");
return 0;
@@ -1227,13 +1232,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
}

static void dump_event(struct evlist *evlist, union perf_event *event,
- u64 file_offset, struct perf_sample *sample)
+ u64 file_offset, struct perf_sample *sample,
+ const char *file_path)
{
if (!dump_trace)
return;

- printf("\n%#" PRIx64 " [%#x]: event: %d\n",
- file_offset, event->header.size, event->header.type);
+ printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+ file_offset, file_path, event->header.size, event->header.type);

trace_event(event);
if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1424,12 +1430,13 @@ static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
- struct perf_tool *tool, u64 file_offset)
+ struct perf_tool *tool, u64 file_offset,
+ const char *file_path)
{
struct evsel *evsel;
struct machine *machine;

- dump_event(evlist, event, file_offset, sample);
+ dump_event(evlist, event, file_offset, sample, file_path);

evsel = perf_evlist__id2evsel(evlist, sample->id);

@@ -1506,7 +1513,8 @@ static int machines__deliver_event(struct machines *machines,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct perf_sample sample;
int ret;
@@ -1524,7 +1532,7 @@ static int perf_session__deliver_event(struct perf_session *session,
return 0;

ret = machines__deliver_event(&session->machines, session->evlist,
- event, &sample, tool, file_offset);
+ event, &sample, tool, file_offset, file_path);

if (dump_trace && sample.aux_sample.size)
auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1534,7 +1542,8 @@ static int perf_session__deliver_event(struct perf_session *session,

static s64 perf_session__process_user_event(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -1544,7 +1553,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,

if (event->header.type != PERF_RECORD_COMPRESSED ||
tool->compressed == perf_session__process_compressed_event_stub)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);

/* These events are processed right away */
switch (event->header.type) {
@@ -1603,9 +1612,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
case PERF_RECORD_COMPRESSED:
- err = tool->compressed(session, event, file_offset);
+ err = tool->compressed(session, event, file_offset, file_path);
if (err)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);
return err;
default:
return -EINVAL;
@@ -1622,9 +1631,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, 0);
+ return perf_session__process_user_event(session, event, 0, NULL);

- return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+ return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
}

static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1720,7 +1729,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
}

static s64 perf_session__process_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
struct evlist *evlist = session->evlist;
struct perf_tool *tool = session->tool;
@@ -1735,7 +1745,7 @@ static s64 perf_session__process_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, file_offset);
+ return perf_session__process_user_event(session, event, file_offset, file_path);

if (tool->ordered_events) {
u64 timestamp = -1ULL;
@@ -1749,7 +1759,7 @@ static s64 perf_session__process_event(struct perf_session *session,
return ret;
}

- return perf_session__deliver_event(session, event, tool, file_offset);
+ return perf_session__deliver_event(session, event, tool, file_offset, file_path);
}

void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -1987,7 +1997,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
}
}

- if ((skip = perf_session__process_event(session, event, head)) < 0) {
+ skip = perf_session__process_event(session, event, head, "pipe");
+ if (skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
head, event->header.size, event->header.type);
err = -EINVAL;
@@ -2068,7 +2079,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
static int __perf_session__process_decomp_events(struct perf_session *session)
{
s64 skip;
- u64 size, file_pos = 0;
+ u64 size;
struct decomp *decomp = session->decomp_last;

if (!decomp)
@@ -2082,9 +2093,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
break;

size = event->header.size;
-
- if (size < sizeof(struct perf_event_header) ||
- (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ skip = perf_session__process_event(session, event, decomp->file_pos,
+ decomp->file_path);
+ if (size < sizeof(struct perf_event_header) || skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
decomp->file_pos + decomp->head, event->header.size, event->header.type);
return -EINVAL;
@@ -2115,7 +2126,8 @@ struct reader;

typedef s64 (*reader_cb_t)(struct perf_session *session,
union perf_event *event,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

struct reader {
int fd;
@@ -2198,9 +2210,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, file_pos)) < 0) {
- pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
- file_offset + head, event->header.size,
+ (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+ pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
+ file_offset + head, rd->path, event->header.size,
event->header.type, strerror(-skip));
err = skip;
goto out;
@@ -2230,9 +2242,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,

static s64 process_simple(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
- return perf_session__process_event(session, event, file_offset);
+ return perf_session__process_event(session, event, file_offset, file_path);
}

static int __perf_session__process_events(struct perf_session *session)
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index bbbc0dcd461f..c966531d3eca 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,

typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+ const char *str);

typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
--
2.24.1

2020-10-12 20:21:27

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 01/15] perf session: introduce trace file path to be shown in raw trace dump


Extend reader, ordered_event and decomp objects to contain path
of a trace file being displayed.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/session.c | 2 ++
tools/perf/util/session.h | 1 +
3 files changed, 4 insertions(+)

diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 75345946c4b9..42c9764c6b5b 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -9,6 +9,7 @@ struct perf_sample;
struct ordered_event {
u64 timestamp;
u64 file_offset;
+ const char *file_path;
union perf_event *event;
struct list_head list;
};
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7a5f03764702..4478ddae485b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2119,6 +2119,7 @@ typedef s64 (*reader_cb_t)(struct perf_session *session,

struct reader {
int fd;
+ const char *path;
u64 data_size;
u64 data_offset;
reader_cb_t process;
@@ -2241,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
.data_size = session->header.data_size,
.data_offset = session->header.data_offset,
.process = process_simple,
+ .path = session->data->file.path,
};
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index f76480166d38..378ffc3e2809 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -46,6 +46,7 @@ struct perf_session {
struct decomp {
struct decomp *next;
u64 file_pos;
+ const char *file_path;
size_t mmap_len;
u64 head;
size_t size;
--
2.24.1

2020-10-12 20:22:20

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 06/15] perf session: load data directory into tool process memory


Read trace files located at data directory into tool process memory.
Basic analysis support of data directories is provided for report
mode. Raw dump (-D) and aggregated reports are available for data
directories, still with no memory consumption optimizations. However
data directories collected with --compression-level option enabled
can be analyzed with little less memory because trace files are
unmaped from tool process memory after loading collected data.
The implementation is based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/session.h | 1 +
2 files changed, 49 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6afc670fdf0c..0752eec19813 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
goto more;

out:
+ if (rd->unmap_file) {
+ int i;
+
+ for (i = 0; i < NUM_MMAPS; i++) {
+ if (mmaps[i]) {
+ munmap(mmaps[i], mmap_size);
+ mmaps[i] = NULL;
+ }
+ }
+ }
+
return err;
}

@@ -2231,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
.data_offset = session->header.data_offset,
.process = process_simple,
.path = session->data->file.path,
+ .unmap_file = false,
};
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -2247,6 +2259,42 @@ static int __perf_session__process_events(struct perf_session *session)
err = reader__process_events(&rd, session, &prog);
if (err)
goto out_err;
+
+ if (perf_data__is_dir(session->data)) {
+ int i, nr = session->data->dir.nr;
+ struct reader file_rd[nr];
+ u64 total_size = perf_data__size(session->data);
+
+ total_size -= session->data->file.size;
+ ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+ memset(&file_rd, 0, nr * sizeof(file_rd[0]));
+
+ for (i = 0; i < nr ; i++) {
+ struct perf_data_file *file;
+
+ file = &session->data->dir.files[i];
+ file_rd[i] = (struct reader) {
+ .fd = file->fd,
+ .path = file->path,
+ .data_size = file->size,
+ .data_offset = 0,
+ .process = process_simple,
+ };
+ file_rd[i].unmap_file = perf_header__has_feat(&session->header,
+ HEADER_COMPRESSED);
+ session->reader = &file_rd[i];
+
+ if (zstd_init(&(file_rd[i].zstd_data), 0))
+ goto out_err;
+ err = reader__process_events(&file_rd[i], session, &prog);
+ zstd_fini(&(file_rd[i].zstd_data));
+ session->reader = NULL;
+ if (err)
+ goto out_err;
+ }
+ }
+
/* do the final flush for ordered samples */
err = ordered_events__flush(oe, OE_FLUSH__FINAL);
if (err)
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 4fc9ccdf7970..d428f3eaf7fd 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -43,6 +43,7 @@ struct reader {
u64 data_offset;
reader_cb_t process;
struct zstd_data zstd_data;
+ bool unmap_file;
};

struct perf_session {
--
2.24.1


2020-10-12 20:23:01

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 08/15] perf record: write trace data into mmap trace files


Write trace data into per mmap trace files located
at data directory. Streaming thread adjusts its affinity
according to mask of the buffer being processed.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 44 ++++++++++++++++++++++++++++++++-----
tools/perf/util/record.h | 1 +
2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 619aaee11231..ba26d75c51d6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -120,6 +120,11 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
"SYS", "NODE", "CPU"
};

+static int record__threads_enabled(struct record *rec)
+{
+ return rec->opts.threads;
+}
+
static bool switch_output_signal(struct record *rec)
{
return rec->switch_output.signal &&
@@ -894,6 +899,20 @@ static int record__mmap_evlist(struct record *rec,
return -EINVAL;
}
}
+
+ if (record__threads_enabled(rec)) {
+ int i, ret, nr = evlist->core.nr_mmaps;
+ struct mmap *mmaps = rec->opts.overwrite ?
+ evlist->overwrite_mmap : evlist->mmap;
+
+ ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < nr; i++)
+ mmaps[i].file = &rec->data.dir.files[i];
+ }
+
return 0;
}

@@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
/*
* Mark the round finished in case we wrote
* at least one event.
+ *
+ * No need for round events in directory mode,
+ * because per-cpu maps and files have data
+ * sorted by kernel.
*/
- if (bytes_written != rec->bytes_written)
+ if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));

if (overwrite)
@@ -1231,7 +1254,9 @@ static void record__init_features(struct record *rec)
if (!rec->opts.use_clockid)
perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);

- perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__threads_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
if (!record__comp_enabled(rec))
perf_header__clear_feat(&session->header, HEADER_COMPRESSED);

@@ -1242,15 +1267,21 @@ static void
record__finish_output(struct record *rec)
{
struct perf_data *data = &rec->data;
- int fd = perf_data__fd(data);
+ int i, fd = perf_data__fd(data);

if (data->is_pipe)
return;

rec->session->header.data_size += rec->bytes_written;
data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+ if (record__threads_enabled(rec)) {
+ for (i = 0; i < data->dir.nr; i++)
+ data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+ }

if (!rec->no_buildid) {
+ /* this will be recalculated during process_buildids() */
+ rec->samples = 0;
process_buildids(rec);

if (rec->buildid_all)
@@ -2041,8 +2072,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
status = err;

record__synthesize(rec, true);
- /* this will be recalculated during process_buildids() */
- rec->samples = 0;

if (!err) {
if (!rec->timestamp_filename) {
@@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)

}

- if (rec->opts.kcore)
+ if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

+ if (record__threads_enabled(rec))
+ rec->opts.affinity = PERF_AFFINITY_CPU;
+
if (rec->opts.comp_level != 0) {
pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
rec->no_buildid = true;
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 266760ac9143..aeda3cdaa3e9 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -74,6 +74,7 @@ struct record_opts {
int ctl_fd;
int ctl_fd_ack;
bool ctl_fd_close;
+ bool threads;
};

extern const char * const *record_usage;
--
2.24.1

2020-10-12 20:25:05

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status


Introduce evlist__ctlfd_update() to update ctl fd poll status
in evlist pollfd array using other pollfd object.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/evlist.c | 16 ++++++++++++++++
tools/perf/util/evlist.h | 1 +
2 files changed, 17 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8bdf3d2c907c..758a4896fedd 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1970,6 +1970,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
return err;
}

+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
+{
+ int ctlfd_pos = evlist->ctl_fd.pos;
+ struct pollfd *entries = evlist->core.pollfd.entries;
+
+ if (!evlist__ctlfd_initialized(evlist))
+ return 0;
+
+ if (entries[ctlfd_pos].fd != update->fd ||
+ entries[ctlfd_pos].events != update->events)
+ return -1;
+
+ entries[ctlfd_pos].revents = update->revents;
+ return 0;
+}
+
struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
{
struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index e1a450322bc5..9b73d6ccf066 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -380,6 +380,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
int evlist__finalize_ctlfd(struct evlist *evlist);
bool evlist__ctlfd_initialized(struct evlist *evlist);
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
int evlist__ctlfd_ack(struct evlist *evlist);

--
2.24.1

2020-10-12 20:25:53

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v1 13/15] perf record: stop threads in the end of trace streaming


Close write fd of comm.msg pipe to signal thread to terminate
and receive THREAD_MSG__READY confirmation on termination.
Accumulate thread stats into global stats to be correctly
calculated and displayed in perf tool output.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 64 ++++++++++++++++++++++++++++++++++---
1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3b7e9026f25b..a15642656066 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -85,6 +85,16 @@ struct switch_output {
int cur_file;
};

+enum thread_msg {
+ THREAD_MSG__UNSUPPORTED = 0,
+ THREAD_MSG__READY,
+ THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+ "UNSUPPORTED", "READY"
+};
+
struct thread_data {
pid_t tid;
struct {
@@ -1796,6 +1806,50 @@ static void hit_auxtrace_snapshot_trigger(struct record *rec)
}
}

+static int record__terminate_thread(struct thread_data *thread_data)
+{
+ int res;
+ enum thread_msg ack = THREAD_MSG__UNSUPPORTED;
+ pid_t tid = thread_data->tid;
+
+ close(thread_data->comm.msg[1]);
+ res = read(thread_data->comm.ack[0], &ack, sizeof(ack));
+ if (res != -1)
+ pr_debug("threads: %d -> %s\n", tid, thread_msg_tags[ack]);
+ else
+ pr_err("threads: failed to recv msg=%s from %d\n",
+ thread_msg_tags[ack], tid);
+
+ return 0;
+}
+
+static int record__stop_threads(struct record *rec, unsigned long *waking)
+{
+ int i, j, nr_thread_data = rec->nr_thread_data;
+ struct thread_data *thread_data = rec->thread_data;
+
+ if (!record__threads_enabled(rec))
+ return 0;
+
+ for (i = 1; i < nr_thread_data; i++)
+ record__terminate_thread(&thread_data[i]);
+
+ for (i = 0; i < nr_thread_data; i++) {
+ pr_debug("threads: %d : samples %lld, wakes %ld\n",
+ thread_data[i].tid, thread_data[i].samples,
+ thread_data[i].waking);
+
+ rec->samples += thread_data[i].samples;
+ *waking += thread_data[i].waking;
+ for (j = 0; j < thread_data[i].nr_mmaps; j++) {
+ rec->session->bytes_transferred += thread_data[i].maps[j].bytes_transferred;
+ rec->session->bytes_compressed += thread_data[i].maps[j].bytes_compressed;
+ }
+ }
+
+ return 0;
+}
+
static int __cmd_record(struct record *rec, int argc, const char **argv)
{
int err;
@@ -1903,7 +1957,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)

if (record__open(rec) != 0) {
err = -1;
- goto out_child;
+ goto out_free_threads;
}
session->header.env.comp_mmap_len = session->evlist->core.mmap_len;

@@ -2203,18 +2257,20 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
goto out_child;
}

- if (!quiet)
- fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
-
if (target__none(&rec->opts.target))
record__synthesize_workload(rec, true);

out_child:
+ record__stop_threads(rec, &waking);
+out_free_threads:
record__free_thread_data(rec);
evlist__finalize_ctlfd(rec->evlist);
record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);

+ if (!quiet)
+ fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
+
if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
session->header.env.comp_ratio = ratio + 0.5;
--
2.24.1

2020-10-13 07:10:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 02/15] perf report: output trace file name in raw trace dump

On Mon, Oct 12, 2020 at 11:54:24AM +0300, Alexey Budankov wrote:
> @@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
> ordered_events);
>
> return perf_session__deliver_event(session, event->event,
> - session->tool, event->file_offset);
> + session->tool, event->file_offset,
> + event->file_path);

Wouldn't it be better to pass "event" around now, which would contain at least
four of the arguments.

These functions are getting entirely too many arguments.

-Andi

2020-10-13 07:15:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 03/15] perf data: open data directory in read access mode

On Mon, Oct 12, 2020 at 11:55:07AM +0300, Alexey Budankov wrote:
>
> Open files located at data directory in case of read access mode.

Need some rationale. Is this a bug fix? Did the case not matter
before?

>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/util/data.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index c47aa34fdc0a..6ad61ac6ba67 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
> return -1;
>
> ret = open_file(data);
> + if (!ret && perf_data__is_dir(data)) {
> + if (perf_data__is_read(data))
> + ret = perf_data__open_dir(data);
> + }
>
> /* Cleanup whatever we managed to create so far. */
> if (ret && perf_data__is_write(data))
> --
> 2.24.1
>
>

2020-10-13 07:17:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 06/15] perf session: load data directory into tool process memory

On Mon, Oct 12, 2020 at 11:58:58AM +0300, Alexey Budankov wrote:
>
> Read trace files located at data directory into tool process memory.
> Basic analysis support of data directories is provided for report
> mode. Raw dump (-D) and aggregated reports are available for data
> directories, still with no memory consumption optimizations. However
> data directories collected with --compression-level option enabled
> can be analyzed with little less memory because trace files are
> unmaped from tool process memory after loading collected data.
> The implementation is based on the prototype [1], [2].

Should credit the author(s) of the prototypes.
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
> tools/perf/util/session.h | 1 +
> 2 files changed, 49 insertions(+)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6afc670fdf0c..0752eec19813 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
> goto more;
>
> out:
> + if (rd->unmap_file) {
> + int i;
> +
> + for (i = 0; i < NUM_MMAPS; i++) {
> + if (mmaps[i]) {
> + munmap(mmaps[i], mmap_size);
> + mmaps[i] = NULL;

Okay so where is the mmap? Would make more sense to put that
into the same patch as who adds the mmap. Or is the mmap
code already in the perf source? In that case it should
probably be some common helper with the existing users.

2020-10-13 19:57:26

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v1 02/15] perf report: output trace file name in raw trace dump

On Mon, Oct 12, 2020 at 11:54:24AM +0300, Alexey Budankov wrote:
>
> Output path of a trace file into raw dump (-D) <file_offset>@<path/file>.
> Print offset of PERF_RECORD_COMPRESSED record instead of zero for
> decompressed records:
> [email protected] [0x30]: event: 9
> or
> [email protected]/data.7 [0x30]: event: 9
>
> Signed-off-by: Alexey Budankov <[email protected]>

hi,
I'm getting:

CC builtin-inject.o
builtin-inject.c: In function ‘cmd_inject’:
builtin-inject.c:850:18: error: initialization of ‘int (*)(struct perf_session *, union perf_event *, u64, const char *)’ {aka ‘int (*)(struct perf_session *, union perf_event *, long unsigned int, const char *)’} from incompatible pointer type ‘int (*)(struct perf_session *, union perf_event *, u64)’ {aka ‘int (*)(struct perf_session *, union perf_event *, long unsigned int)’} [-Werror=incompatible-pointer-types]
850 | .compressed = perf_event__repipe_op4_synth,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
builtin-inject.c:850:18: note: (near initialization for ‘inject.tool.compressed’)

it's probably recent build id changes

jirka


> ---
> tools/perf/util/session.c | 75 +++++++++++++++++++++++----------------
> tools/perf/util/tool.h | 3 +-
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 4478ddae485b..6f09d506b2f6 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -36,7 +36,8 @@
>
> #ifdef HAVE_ZSTD_SUPPORT
> static int perf_session__process_compressed_event(struct perf_session *session,
> - union perf_event *event, u64 file_offset)
> + union perf_event *event, u64 file_offset,
> + const char *file_path)
> {
> void *src;
> size_t decomp_size, src_size;
> @@ -58,6 +59,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
> }
>
> decomp->file_pos = file_offset;
> + decomp->file_path = file_path;
> decomp->mmap_len = mmap_len;
> decomp->head = 0;
>
> @@ -98,7 +100,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
> static int perf_session__deliver_event(struct perf_session *session,
> union perf_event *event,
> struct perf_tool *tool,
> - u64 file_offset);
> + u64 file_offset,
> + const char *file_path);
>
> static int perf_session__open(struct perf_session *session)
> {
> @@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
> ordered_events);
>
> return perf_session__deliver_event(session, event->event,
> - session->tool, event->file_offset);
> + session->tool, event->file_offset,
> + event->file_path);
> }
>
> struct perf_session *perf_session__new(struct perf_data *data,
> @@ -452,7 +456,8 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu
>
> static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
> union perf_event *event __maybe_unused,
> - u64 file_offset __maybe_unused)
> + u64 file_offset __maybe_unused,
> + const char *file_path __maybe_unused)
> {
> dump_printf(": unhandled!\n");
> return 0;
> @@ -1227,13 +1232,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
> }
>
> static void dump_event(struct evlist *evlist, union perf_event *event,
> - u64 file_offset, struct perf_sample *sample)
> + u64 file_offset, struct perf_sample *sample,
> + const char *file_path)
> {
> if (!dump_trace)
> return;
>
> - printf("\n%#" PRIx64 " [%#x]: event: %d\n",
> - file_offset, event->header.size, event->header.type);
> + printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
> + file_offset, file_path, event->header.size, event->header.type);
>
> trace_event(event);
> if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
> @@ -1424,12 +1430,13 @@ static int machines__deliver_event(struct machines *machines,
> struct evlist *evlist,
> union perf_event *event,
> struct perf_sample *sample,
> - struct perf_tool *tool, u64 file_offset)
> + struct perf_tool *tool, u64 file_offset,
> + const char *file_path)
> {
> struct evsel *evsel;
> struct machine *machine;
>
> - dump_event(evlist, event, file_offset, sample);
> + dump_event(evlist, event, file_offset, sample, file_path);
>
> evsel = perf_evlist__id2evsel(evlist, sample->id);
>
> @@ -1506,7 +1513,8 @@ static int machines__deliver_event(struct machines *machines,
> static int perf_session__deliver_event(struct perf_session *session,
> union perf_event *event,
> struct perf_tool *tool,
> - u64 file_offset)
> + u64 file_offset,
> + const char *file_path)
> {
> struct perf_sample sample;
> int ret;
> @@ -1524,7 +1532,7 @@ static int perf_session__deliver_event(struct perf_session *session,
> return 0;
>
> ret = machines__deliver_event(&session->machines, session->evlist,
> - event, &sample, tool, file_offset);
> + event, &sample, tool, file_offset, file_path);
>
> if (dump_trace && sample.aux_sample.size)
> auxtrace__dump_auxtrace_sample(session, &sample);
> @@ -1534,7 +1542,8 @@ static int perf_session__deliver_event(struct perf_session *session,
>
> static s64 perf_session__process_user_event(struct perf_session *session,
> union perf_event *event,
> - u64 file_offset)
> + u64 file_offset,
> + const char *file_path)
> {
> struct ordered_events *oe = &session->ordered_events;
> struct perf_tool *tool = session->tool;
> @@ -1544,7 +1553,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,
>
> if (event->header.type != PERF_RECORD_COMPRESSED ||
> tool->compressed == perf_session__process_compressed_event_stub)
> - dump_event(session->evlist, event, file_offset, &sample);
> + dump_event(session->evlist, event, file_offset, &sample, file_path);
>
> /* These events are processed right away */
> switch (event->header.type) {
> @@ -1603,9 +1612,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
> case PERF_RECORD_HEADER_FEATURE:
> return tool->feature(session, event);
> case PERF_RECORD_COMPRESSED:
> - err = tool->compressed(session, event, file_offset);
> + err = tool->compressed(session, event, file_offset, file_path);
> if (err)
> - dump_event(session->evlist, event, file_offset, &sample);
> + dump_event(session->evlist, event, file_offset, &sample, file_path);
> return err;
> default:
> return -EINVAL;
> @@ -1622,9 +1631,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
> events_stats__inc(&evlist->stats, event->header.type);
>
> if (event->header.type >= PERF_RECORD_USER_TYPE_START)
> - return perf_session__process_user_event(session, event, 0);
> + return perf_session__process_user_event(session, event, 0, NULL);
>
> - return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
> + return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
> }
>
> static void event_swap(union perf_event *event, bool sample_id_all)
> @@ -1720,7 +1729,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
> }
>
> static s64 perf_session__process_event(struct perf_session *session,
> - union perf_event *event, u64 file_offset)
> + union perf_event *event, u64 file_offset,
> + const char *file_path)
> {
> struct evlist *evlist = session->evlist;
> struct perf_tool *tool = session->tool;
> @@ -1735,7 +1745,7 @@ static s64 perf_session__process_event(struct perf_session *session,
> events_stats__inc(&evlist->stats, event->header.type);
>
> if (event->header.type >= PERF_RECORD_USER_TYPE_START)
> - return perf_session__process_user_event(session, event, file_offset);
> + return perf_session__process_user_event(session, event, file_offset, file_path);
>
> if (tool->ordered_events) {
> u64 timestamp = -1ULL;
> @@ -1749,7 +1759,7 @@ static s64 perf_session__process_event(struct perf_session *session,
> return ret;
> }
>
> - return perf_session__deliver_event(session, event, tool, file_offset);
> + return perf_session__deliver_event(session, event, tool, file_offset, file_path);
> }
>
> void perf_event_header__bswap(struct perf_event_header *hdr)
> @@ -1987,7 +1997,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
> }
> }
>
> - if ((skip = perf_session__process_event(session, event, head)) < 0) {
> + skip = perf_session__process_event(session, event, head, "pipe");
> + if (skip < 0) {
> pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
> head, event->header.size, event->header.type);
> err = -EINVAL;
> @@ -2068,7 +2079,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
> static int __perf_session__process_decomp_events(struct perf_session *session)
> {
> s64 skip;
> - u64 size, file_pos = 0;
> + u64 size;
> struct decomp *decomp = session->decomp_last;
>
> if (!decomp)
> @@ -2082,9 +2093,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
> break;
>
> size = event->header.size;
> -
> - if (size < sizeof(struct perf_event_header) ||
> - (skip = perf_session__process_event(session, event, file_pos)) < 0) {
> + skip = perf_session__process_event(session, event, decomp->file_pos,
> + decomp->file_path);
> + if (size < sizeof(struct perf_event_header) || skip < 0) {
> pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
> decomp->file_pos + decomp->head, event->header.size, event->header.type);
> return -EINVAL;
> @@ -2115,7 +2126,8 @@ struct reader;
>
> typedef s64 (*reader_cb_t)(struct perf_session *session,
> union perf_event *event,
> - u64 file_offset);
> + u64 file_offset,
> + const char *file_path);
>
> struct reader {
> int fd;
> @@ -2198,9 +2210,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
> skip = -EINVAL;
>
> if (size < sizeof(struct perf_event_header) ||
> - (skip = rd->process(session, event, file_pos)) < 0) {
> - pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
> - file_offset + head, event->header.size,
> + (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
> + pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
> + file_offset + head, rd->path, event->header.size,
> event->header.type, strerror(-skip));
> err = skip;
> goto out;
> @@ -2230,9 +2242,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,
>
> static s64 process_simple(struct perf_session *session,
> union perf_event *event,
> - u64 file_offset)
> + u64 file_offset,
> + const char *file_path)
> {
> - return perf_session__process_event(session, event, file_offset);
> + return perf_session__process_event(session, event, file_offset, file_path);
> }
>
> static int __perf_session__process_events(struct perf_session *session)
> diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
> index bbbc0dcd461f..c966531d3eca 100644
> --- a/tools/perf/util/tool.h
> +++ b/tools/perf/util/tool.h
> @@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
>
> typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
> typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
> -typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
> +typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
> + const char *str);
>
> typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
> struct ordered_events *oe);
> --
> 2.24.1
>

2020-10-14 02:11:49

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v1 03/15] perf data: open data directory in read access mode

Em Mon, Oct 12, 2020 at 07:52:33PM +0300, Alexey Budankov escreveu:
>
> On 12.10.2020 19:03, Andi Kleen wrote:
> > On Mon, Oct 12, 2020 at 11:55:07AM +0300, Alexey Budankov wrote:
> >>
> >> Open files located at data directory in case of read access mode.
> >
> > Need some rationale. Is this a bug fix? Did the case not matter
> > before?
>
> This is not a bug fix. The case didn't matter before.

In such cases can you please write a paragraph explaining why it didn't
matter before and now it is required?

- Arnaldo

>
> >>
> >> Signed-off-by: Alexey Budankov <[email protected]>
> >> ---
> >> tools/perf/util/data.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> >> index c47aa34fdc0a..6ad61ac6ba67 100644
> >> --- a/tools/perf/util/data.c
> >> +++ b/tools/perf/util/data.c
> >> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
> >> return -1;
> >>
> >> ret = open_file(data);
> >> + if (!ret && perf_data__is_dir(data)) {
> >> + if (perf_data__is_read(data))
> >> + ret = perf_data__open_dir(data);
> >> + }
> >>
> >> /* Cleanup whatever we managed to create so far. */
> >> if (ret && perf_data__is_write(data))
> >> --
> >> 2.24.1
>
> Alexei
>

--

- Arnaldo

2020-10-14 02:11:50

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation

Em Mon, Oct 12, 2020 at 11:50:29AM +0300, Alexey Budankov escreveu:
>
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".
>
> The threaded mode executes one-to-one mapping of trace streaming threads
> to mapped data buffers and streaming into per-CPU trace files located
> at data directory. The data buffers and threads are affined to NUMA
> nodes and monitored CPUs according to system topology. --cpu option
> can be used to specify exact CPUs to be monitored.
>
> Basic analysis of data directories is provided for perf report mode.
> Raw dump (-D) and aggregated reports are available for data directories,
> still with no memory consumption optimizations. However data directories
> collected with --compression-level option enabled can be analyzed with
> little less memory because trace files are unmaped from tool process
> memory after loading collected data.
>
> Provided streaming mode is available with Zstd compression/decompression
> (--compression-level) and handling of external commands (--control).
> AUX area tracing, related and derived modes like --snapshot or
> --aux-sample are not enabled. --switch-output, --switch-output-event,
> --switch-max-files and --timestamp-filename options are not enabled.

Would be interesting to spell out what are the difficulties to have
those options working with this threaded mode, as I expect that once
this is all reviewed and tested we should switch to it by default, no?

- Arnaldo

> Threaded trace streaming is not enabled for pipe mode. Asynchronous
> (--aio) trace streaming and affinity (--affinity) modes are mutually
> exclusive to threaded streaming mode.
>
> See testing results and validation examples below.
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> ---
> Alexey Budankov (15):
> perf session: introduce trace file path to be shown in raw trace dump
> perf report: output trace file name in raw trace dump
> perf data: open data directory in read access mode
> perf session: move reader object definition to header file
> perf session: introduce decompressor into trace reader object
> perf session: load data directory into tool process memory
> perf record: introduce trace file, compressor and stats in mmap object
> perf record: write trace data into mmap trace files
> perf record: introduce thread specific objects for trace streaming
> perf record: manage thread specific data array
> perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
> perf record: introduce thread local variable for trace streaming
> perf record: stop threads in the end of trace streaming
> perf record: start threads in the beginning of trace streaming
> perf record: introduce --threads command line option
>
> tools/perf/Documentation/perf-record.txt | 7 +
> tools/perf/builtin-record.c | 514 +++++++++++++++++++++--
> tools/perf/util/data.c | 4 +
> tools/perf/util/evlist.c | 16 +
> tools/perf/util/evlist.h | 1 +
> tools/perf/util/mmap.c | 6 +
> tools/perf/util/mmap.h | 5 +
> tools/perf/util/ordered-events.h | 1 +
> tools/perf/util/record.h | 1 +
> tools/perf/util/session.c | 150 ++++---
> tools/perf/util/session.h | 28 ++
> tools/perf/util/tool.h | 3 +-
> 12 files changed, 635 insertions(+), 101 deletions(-)
>
> ---
> Testing results:
>
> $ perf test
> 1: vmlinux symtab matches kallsyms : Skip
> 2: Detect openat syscall event : Ok
> 3: Detect openat syscall event on all cpus : Ok
> 4: Read samples using the mmap interface : Ok
> 5: Test data source output : Ok
> 6: Parse event definition strings : Ok
> 7: Simple expression parser : Ok
> 8: PERF_RECORD_* events & perf_sample fields : Ok
> 9: Parse perf pmu format : Ok
> 10: PMU events :
> 10.1: PMU event table sanity : Ok
> 10.2: PMU event map aliases : Ok
> 10.3: Parsing of PMU event table metrics : Skip (some metrics failed)
> 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
> 11: DSO data read : Ok
> 12: DSO data cache : Ok
> 13: DSO data reopen : FAILED!
> 14: Roundtrip evsel->name : Ok
> 15: Parse sched tracepoints fields : Ok
> 16: syscalls:sys_enter_openat event fields : Ok
> 17: Setup struct perf_event_attr : Ok
> 18: Match and link multiple hists : Ok
> 19: 'import perf' in python : FAILED!
> 20: Breakpoint overflow signal handler : Ok
> 21: Breakpoint overflow sampling : Ok
> 22: Breakpoint accounting : Ok
> 23: Watchpoint :
> 23.1: Read Only Watchpoint : Skip
> 23.2: Write Only Watchpoint : Ok
> 23.3: Read / Write Watchpoint : Ok
> 23.4: Modify Watchpoint : Ok
> 24: Number of exit events of a simple workload : Ok
> 25: Software clock events period values : Ok
> 26: Object code reading : Ok
> 27: Sample parsing : Ok
> 28: Use a dummy software event to keep tracking : Ok
> 29: Parse with no sample_id_all bit set : Ok
> 30: Filter hist entries : Ok
> 31: Lookup mmap thread : Ok
> 32: Share thread maps : Ok
> 33: Sort output of hist entries : Ok
> 34: Cumulate child hist entries : Ok
> 35: Track with sched_switch : Ok
> 36: Filter fds with revents mask in a fdarray : Ok
> 37: Add fd to a fdarray, making it autogrow : Ok
> 38: kmod_path__parse : Ok
> 39: Thread map : Ok
> 40: LLVM search and compile :
> 40.1: Basic BPF llvm compile : Skip
> 40.2: kbuild searching : Skip
> 40.3: Compile source for BPF prologue generation : Skip
> 40.4: Compile source for BPF relocation : Skip
> 41: Session topology : Ok
> 42: BPF filter :
> 42.1: Basic BPF filtering : Skip
> 42.2: BPF pinning : Skip
> 42.3: BPF prologue generation : Skip
> 42.4: BPF relocation checker : Skip
> 43: Synthesize thread map : Ok
> 44: Remove thread map : Ok
> 45: Synthesize cpu map : Ok
> 46: Synthesize stat config : Ok
> 47: Synthesize stat : Ok
> 48: Synthesize stat round : Ok
> 49: Synthesize attr update : Ok
> 50: Event times : Ok
> 51: Read backward ring buffer : Ok
> 52: Print cpu map : Ok
> 53: Merge cpu map : Ok
> 54: Probe SDT events : Ok
> 55: is_printable_array : Ok
> 56: Print bitmap : Ok
> 57: perf hooks : Ok
> 58: builtin clang support : Skip (not compiled in)
> 59: unit_number__scnprintf : Ok
> 60: mem2node : Ok
> 61: time utils : Ok
> 62: Test jit_write_elf : Ok
> 63: Test libpfm4 support : Skip (not compiled in)
> 64: Test api io : Ok
> 65: maps__merge_in : Ok
> 66: Demangle Java : Ok
> 67: Parse and process metrics : Ok
> 68: PE file support : Ok
> 69: Event expansion for cgroups : Ok
> 70: x86 rdpmc : Ok
> 71: Convert perf time to TSC : Ok
> 72: DWARF unwind : Ok
> 73: x86 instruction decoder - new instructions : Ok
> 74: Intel PT packet decoder : Ok
> 75: x86 bp modify : Ok
> 76: Check open filename arg using perf trace + vfs_getname : FAILED!
> 77: Add vfs_getname probe to get syscall args filenames : FAILED!
> 78: probe libc's inet_pton & backtrace it with ping : Ok
> 79: Use vfs_getname probe to get syscall args filenames : FAILED!
> 80: Zstd perf.data compression/decompression : Ok
> 81: Check Arm CoreSight trace data recording and synthesized samples: Skip
>
> Validation:
>
> $ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
> Addr of buf1 = 0x7f8af24ae010
> Offs of buf1 = 0x7f8af24ae180
> Addr of buf2 = 0x7f8af04ad010
> Offs of buf2 = 0x7f8af04ad1c0
> Addr of buf3 = 0x7f8aee4ac010
> Offs of buf3 = 0x7f8aee4ac100
> Addr of buf4 = 0x7f8aec4ab010
> Offs of buf4 = 0x7f8aec4ab140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 24.579 seconds
> [ perf record: Woken up 972 times to write data ]
> [ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]
>
> $ perf report --header --stats
> # ========
> # captured on : Wed Oct 7 19:07:50 2020
> # header version : 1
> # data offset : 504
> # data size : 30056
> # feat offset : 30560
> # hostname : nntvtune39
> # os release : 5.8.8-100.fc31.x86_64
> # perf version : 4.13.rc5.gdf19b1347b82
> # arch : x86_64
> # nrcpus online : 8
> # nrcpus avail : 8
> # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> # cpuid : GenuineIntel,6,94,3
> # total memory : 16119484 kB
> # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench /matrix/linux/matrix.gcc.g.O3
> # event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
> # event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
> # sibling sockets : 0-7
> # sibling dies : 0-7
> # sibling threads : 0,4
> # sibling threads : 1,5
> # sibling threads : 2,6
> # sibling threads : 3,7
> # CPU 0: Core ID 0, Die ID 0, Socket ID 0
> # CPU 1: Core ID 1, Die ID 0, Socket ID 0
> # CPU 2: Core ID 2, Die ID 0, Socket ID 0
> # CPU 3: Core ID 3, Die ID 0, Socket ID 0
> # CPU 4: Core ID 0, Die ID 0, Socket ID 0
> # CPU 5: Core ID 1, Die ID 0, Socket ID 0
> # CPU 6: Core ID 2, Die ID 0, Socket ID 0
> # CPU 7: Core ID 3, Die ID 0, Socket ID 0
> # node0 meminfo : total = 16119484 kB, free = 933088 kB
> # node0 cpu list : 0-7
> # pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
> # CPU cache info:
> # L1 Data 32K [0,4]
> # L1 Instruction 32K [0,4]
> # L1 Data 32K [1,5]
> # L1 Instruction 32K [1,5]
> # L1 Data 32K [1,5]
> # L1 Instruction 32K [1,5]
> # L1 Data 32K [2,6]
> # L1 Instruction 32K [2,6]
> # L1 Data 32K [3,7]
> # L1 Instruction 32K [3,7]
> # L2 Unified 256K [0,4]
> # L2 Unified 256K [1,5]
> # L2 Unified 256K [2,6]
> # L2 Unified 256K [3,7]
> # L3 Unified 8192K [0-7]
> # time of first sample : 0.000000
> # time of last sample : 0.000000
> # sample duration : 0.000 ms
> # memory nodes (nr 1, block size 0x8000000):
> # 0 [16G]: 0-23,32-135
> # directory data version : 1
> # bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
> # bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
> # bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
> # bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
> # bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
> # bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
> # bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
> # bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
> # bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
> # bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
> # bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
> # bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
> # bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
> # compressed : Zstd, level = 3, ratio = 6
> # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
> # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA
> # ========
> #
> Aggregated stats:
> TOTAL events: 1457107
> MMAP events: 98
> LOST events: 0
> COMM events: 2
> EXIT events: 6
> THROTTLE events: 0
> UNTHROTTLE events: 0
> FORK events: 8
> READ events: 0
> SAMPLE events: 1456388
> MMAP2 events: 7
> AUX events: 0
> ITRACE_START events: 0
> LOST_SAMPLES events: 0
> SWITCH events: 0
> SWITCH_CPU_WIDE events: 0
> NAMESPACES events: 0
> KSYMBOL events: 13
> BPF_EVENT events: 13
> CGROUP events: 166
> TEXT_POKE events: 0
> ATTR events: 0
> EVENT_TYPE events: 0
> TRACING_DATA events: 0
> BUILD_ID events: 0
> FINISHED_ROUND events: 0
> ID_INDEX events: 0
> AUXTRACE_INFO events: 0
> AUXTRACE events: 0
> AUXTRACE_ERROR events: 0
> THREAD_MAP events: 1
> CPU_MAP events: 1
> STAT_CONFIG events: 0
> STAT events: 0
> STAT_ROUND events: 0
> EVENT_UPDATE events: 0
> TIME_CONV events: 1
> FEATURE events: 0
> COMPRESSED events: 403
>
> $ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
> Addr of buf1 = 0x7f15e8b89010
> Offs of buf1 = 0x7f15e8b89180
> Addr of buf2 = 0x7f15e6b88010
> Offs of buf2 = 0x7f15e6b881c0
> Addr of buf3 = 0x7f15e4b87010
> Offs of buf3 = 0x7f15e4b87100
> Addr of buf4 = 0x7f15e2b86010
> Offs of buf4 = 0x7f15e2b86140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 24.281 seconds
> [ perf record: Woken up 6923 times to write data ]
> [ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]
>
> $ perf report --stdio
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 708K of event 'cycles'
> # Event count (approx.): 277718834163
> #
> # Children Self Command Shared Object Symbol
> # ........ ........ ............... ........................ ........................................................
> #
> 99.97% 0.00% matrix.gcc.g.O3 libc-2.30.so [.] __GI___clone (inlined)
> |
> ---__GI___clone (inlined)
> |
> --99.97%--start_thread
> |
> --99.97%--ThreadFunction
> multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 0.00% matrix.gcc.g.O3 libpthread-2.30.so [.] start_thread
> |
> ---start_thread
> |
> --99.97%--ThreadFunction
> |
> --99.97%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 0.00% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] ThreadFunction
> |
> ---ThreadFunction
> |
> --99.97%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 98.68% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] multiply1
> |
> |--98.68%--__GI___clone (inlined)
> | start_thread
> | ThreadFunction
> | multiply1
> |
> --1.28%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 0.97% 0.05% matrix.gcc.g.O3 [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
> |
> --0.91%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
>
>
> --
> 2.24.1
>

--

- Arnaldo

2020-10-14 10:43:27

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 02/15] perf report: output trace file name in raw trace dump

Hi,

On 13.10.2020 22:54, Jiri Olsa wrote:
> On Mon, Oct 12, 2020 at 11:54:24AM +0300, Alexey Budankov wrote:
>>
>> Output path of a trace file into raw dump (-D) <file_offset>@<path/file>.
>> Print offset of PERF_RECORD_COMPRESSED record instead of zero for
>> decompressed records:
>> [email protected] [0x30]: event: 9
>> or
>> [email protected]/data.7 [0x30]: event: 9
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>
> hi,
> I'm getting:
>
> CC builtin-inject.o
> builtin-inject.c: In function ‘cmd_inject’:
> builtin-inject.c:850:18: error: initialization of ‘int (*)(struct perf_session *, union perf_event *, u64, const char *)’ {aka ‘int (*)(struct perf_session *, union perf_event *, long unsigned int, const char *)’} from incompatible pointer type ‘int (*)(struct perf_session *, union perf_event *, u64)’ {aka ‘int (*)(struct perf_session *, union perf_event *, long unsigned int)’} [-Werror=incompatible-pointer-types]
> 850 | .compressed = perf_event__repipe_op4_synth,
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> builtin-inject.c:850:18: note: (near initialization for ‘inject.tool.compressed’)
>
> it's probably recent build id changes

Looks like that's it. Fix is in v2 and follows:

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index f3f965157d69..35c005b8da7f 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -106,7 +106,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,

static int perf_event__repipe_op4_synth(struct perf_session *session,
union perf_event *event,
- u64 data __maybe_unused)
+ u64 data __maybe_unused,
+ const char *str __maybe_unused)
{
return perf_event__repipe_synth(session->tool, event);
}

Thanks!
Alexei

2020-10-14 15:53:43

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 03/15] perf data: open data directory in read access mode


On 13.10.2020 19:22, Arnaldo Carvalho de Melo wrote:
> Em Mon, Oct 12, 2020 at 07:52:33PM +0300, Alexey Budankov escreveu:
>>
>> On 12.10.2020 19:03, Andi Kleen wrote:
>>> On Mon, Oct 12, 2020 at 11:55:07AM +0300, Alexey Budankov wrote:
>>>>
>>>> Open files located at data directory in case of read access mode.
>>>
>>> Need some rationale. Is this a bug fix? Did the case not matter
>>> before?
>>
>> This is not a bug fix. The case didn't matter before.
>
> In such cases can you please write a paragraph explaining why it didn't
> matter before and now it is required?

Sure. Will do. Already in v2.

Alexei

2020-10-14 16:00:12

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v1 08/15] perf record: write trace data into mmap trace files

Hi,

On Mon, Oct 12, 2020 at 6:01 PM Alexey Budankov
<[email protected]> wrote:
>
>
> Write trace data into per mmap trace files located
> at data directory. Streaming thread adjusts its affinity
> according to mask of the buffer being processed.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
[SNIP]
> @@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
> /*
> * Mark the round finished in case we wrote
> * at least one event.
> + *
> + * No need for round events in directory mode,
> + * because per-cpu maps and files have data
> + * sorted by kernel.
> */
> - if (bytes_written != rec->bytes_written)
> + if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
> rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));

This means it needs to keep all events in the ordered events queue
when perf report processes the data, right?

Thanks
Namhyung

2020-10-14 16:13:09

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 08/15] perf record: write trace data into mmap trace files

Hi,

On 14.10.2020 13:52, Namhyung Kim wrote:
> Hi,
>
> On Mon, Oct 12, 2020 at 6:01 PM Alexey Budankov
> <[email protected]> wrote:
>>
>>
>> Write trace data into per mmap trace files located
>> at data directory. Streaming thread adjusts its affinity
>> according to mask of the buffer being processed.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
> [SNIP]
>> @@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>> /*
>> * Mark the round finished in case we wrote
>> * at least one event.
>> + *
>> + * No need for round events in directory mode,
>> + * because per-cpu maps and files have data
>> + * sorted by kernel.
>> */
>> - if (bytes_written != rec->bytes_written)
>> + if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
>> rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
>
> This means it needs to keep all events in the ordered events queue
> when perf report processes the data, right?

Looks so.

Alexei

2020-10-14 16:16:18

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation


On 13.10.2020 19:20, Arnaldo Carvalho de Melo wrote:
> Em Mon, Oct 12, 2020 at 11:50:29AM +0300, Alexey Budankov escreveu:
>>
>> Patch set provides threaded trace streaming for base perf record
>> operation. Provided streaming mode (--threads) mitigates profiling
>> data losses and resolves scalability issues of serial and asynchronous
>> (--aio) trace streaming modes on multicore server systems. The patch
>> set is based on the prototype [1], [2] and the most closely relates
>> to mode 3) "mode that creates thread for every monitored memory map".
>>
>> The threaded mode executes one-to-one mapping of trace streaming threads
>> to mapped data buffers and streaming into per-CPU trace files located
>> at data directory. The data buffers and threads are affined to NUMA
>> nodes and monitored CPUs according to system topology. --cpu option
>> can be used to specify exact CPUs to be monitored.
>>
>> Basic analysis of data directories is provided for perf report mode.
>> Raw dump (-D) and aggregated reports are available for data directories,
>> still with no memory consumption optimizations. However data directories
>> collected with --compression-level option enabled can be analyzed with
>> little less memory because trace files are unmaped from tool process
>> memory after loading collected data.
>>
>> Provided streaming mode is available with Zstd compression/decompression
>> (--compression-level) and handling of external commands (--control).
>> AUX area tracing, related and derived modes like --snapshot or
>> --aux-sample are not enabled. --switch-output, --switch-output-event,
>> --switch-max-files and --timestamp-filename options are not enabled.
>
> Would be interesting to spell out what are the difficulties to have
> those options working with this threaded mode, as I expect that once
> this is all reviewed and tested we should switch to it by default, no?

At the moment I am not sure about this as the default mode. It all depends
on specifics of HW configuration and workload to be monitored and analyzed.
On middle and small sized systems --aio could still fit better from HW/OS
resource consumption perspective.

Initial intent to enable AUX area tracing faced the need to define some
(optimal?) way to store index data at data directory, thus left aside of
this first step to bring threaded trace streaming into Perf tool.
--switch-output-* and --timestamp-filename use cases are not yet clear
for data directories and thus look like features of the second order.

Addressing all that issues in a single patch set looks too much. Proper
way to have it all in is step-by-step. Also I should say that it is aside
of the scope of current Intel VTune specific needs.

Alexei

2020-10-14 17:46:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation


* Alexey Budankov <[email protected]> wrote:

>
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".
>
> The threaded mode executes one-to-one mapping of trace streaming threads
> to mapped data buffers and streaming into per-CPU trace files located
> at data directory. The data buffers and threads are affined to NUMA
> nodes and monitored CPUs according to system topology. --cpu option
> can be used to specify exact CPUs to be monitored.

Yay! This should really be the default trace capture model everywhere
possible.

Can we do this for perf top too? It's really struggling with lots of cores.

If on a 64-core system I run just a moderately higher frequency 'perf top'
of 1 kHz:

perf top -e cycles -F 1000

perf stays stuck forever in 'Collecting samples...', and I also get a lot
of:

[548112.871089] Uhhuh. NMI received for unknown reason 31 on CPU 25.
[548112.871089] Do you have a strange power saving mode enabled?

Thanks,

Ingo

2020-10-15 12:04:23

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation


On 14.10.2020 20:27, Ingo Molnar wrote:
>
> * Alexey Budankov <[email protected]> wrote:
>
>>
>> Patch set provides threaded trace streaming for base perf record
>> operation. Provided streaming mode (--threads) mitigates profiling
>> data losses and resolves scalability issues of serial and asynchronous
>> (--aio) trace streaming modes on multicore server systems. The patch
>> set is based on the prototype [1], [2] and the most closely relates
>> to mode 3) "mode that creates thread for every monitored memory map".
>>
>> The threaded mode executes one-to-one mapping of trace streaming threads
>> to mapped data buffers and streaming into per-CPU trace files located
>> at data directory. The data buffers and threads are affined to NUMA
>> nodes and monitored CPUs according to system topology. --cpu option
>> can be used to specify exact CPUs to be monitored.
>
> Yay! This should really be the default trace capture model everywhere
> possible.
>
> Can we do this for perf top too? It's really struggling with lots of cores.
>
> If on a 64-core system I run just a moderately higher frequency 'perf top'
> of 1 kHz:
>
> perf top -e cycles -F 1000
>
> perf stays stuck forever in 'Collecting samples...', and I also get a lot
> of:
>
> [548112.871089] Uhhuh. NMI received for unknown reason 31 on CPU 25.
> [548112.871089] Do you have a strange power saving mode enabled?

Yes, we can. I would only prefer to do it in a separate patch set since
for me this patch set is already complex enough as a single change.
Is it ok?

I would also appreciate if you could clarify, advise or guide on the impact
of this perf top advancement or may be even provide some feedback on this feature adoption to help better justify the effort for my management.

Gratefully,
Alexei

>
> Thanks,
>
> Ingo
>

2020-10-21 12:46:42

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v1 08/15] perf record: write trace data into mmap trace files

On Wed, Oct 14, 2020 at 9:09 PM Alexey Budankov
<[email protected]> wrote:
>
> Hi,
>
> On 14.10.2020 13:52, Namhyung Kim wrote:
> > Hi,
> >
> > On Mon, Oct 12, 2020 at 6:01 PM Alexey Budankov
> > <[email protected]> wrote:
> >>
> >>
> >> Write trace data into per mmap trace files located
> >> at data directory. Streaming thread adjusts its affinity
> >> according to mask of the buffer being processed.
> >>
> >> Signed-off-by: Alexey Budankov <[email protected]>
> >> ---
> > [SNIP]
> >> @@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
> >> /*
> >> * Mark the round finished in case we wrote
> >> * at least one event.
> >> + *
> >> + * No need for round events in directory mode,
> >> + * because per-cpu maps and files have data
> >> + * sorted by kernel.
> >> */
> >> - if (bytes_written != rec->bytes_written)
> >> + if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
> >> rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
> >
> > This means it needs to keep all events in the ordered events queue
> > when perf report processes the data, right?
>
> Looks so.

Maybe it's not related to this directly. But we need to think about
how to make perf report faster and more efficient as well.

In my previous attempt, I separated samples from other events
to be in different mmaps so they were saved to different files
(or in a separate part of the data file).

And perf report processes the meta events (FORK/MMAP/...)
first to construct the system image and then processes samples
with multi-threads.

Once it has the image, it could bypass the ordered events queue
entirely.

Thanks

Namhyung

2020-10-21 13:38:34

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v1 08/15] perf record: write trace data into mmap trace files

On Wed, Oct 21, 2020 at 7:25 PM Alexey Budankov
<[email protected]> wrote:
>
>
> On 21.10.2020 10:34, Namhyung Kim wrote:
> > On Wed, Oct 14, 2020 at 9:09 PM Alexey Budankov
> > <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 14.10.2020 13:52, Namhyung Kim wrote:
> >>> Hi,
> >>>
> >>> On Mon, Oct 12, 2020 at 6:01 PM Alexey Budankov
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> Write trace data into per mmap trace files located
> >>>> at data directory. Streaming thread adjusts its affinity
> >>>> according to mask of the buffer being processed.
> >>>>
> >>>> Signed-off-by: Alexey Budankov <[email protected]>
> >>>> ---
> >>> [SNIP]
> >>>> @@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
> >>>> /*
> >>>> * Mark the round finished in case we wrote
> >>>> * at least one event.
> >>>> + *
> >>>> + * No need for round events in directory mode,
> >>>> + * because per-cpu maps and files have data
> >>>> + * sorted by kernel.
> >>>> */
> >>>> - if (bytes_written != rec->bytes_written)
> >>>> + if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
> >>>> rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
> >>>
> >>> This means it needs to keep all events in the ordered events queue
> >>> when perf report processes the data, right?
> >>
> >> Looks so.
> >
> > Maybe it's not related to this directly. But we need to think about
> > how to make perf report faster and more efficient as well.
>
> Makes sense. Agreed.
>
> >
> > In my previous attempt, I separated samples from other events
> > to be in different mmaps so they were saved to different files
> > (or in a separate part of the data file).
> >
> > And perf report processes the meta events (FORK/MMAP/...)
> > first to construct the system image and then processes samples
> > with multi-threads.
>
> Looks like separation to global, per-process events and per-thread
> ones. Alternative algorithm could possibly be multi-passing of trace
> data. First pass is to capture global events and build process state
> overtime progress picture. Second pass is to capture and map per-thread
> samples and/or other events into process state according to samples
> and events time.

Yep, it seems basically the same approach. But it'd be better if
we could do it in a single pass (with some modification in the data
collection).

Thanks
Namhyung

2020-10-21 21:54:18

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v1 08/15] perf record: write trace data into mmap trace files


On 21.10.2020 10:34, Namhyung Kim wrote:
> On Wed, Oct 14, 2020 at 9:09 PM Alexey Budankov
> <[email protected]> wrote:
>>
>> Hi,
>>
>> On 14.10.2020 13:52, Namhyung Kim wrote:
>>> Hi,
>>>
>>> On Mon, Oct 12, 2020 at 6:01 PM Alexey Budankov
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> Write trace data into per mmap trace files located
>>>> at data directory. Streaming thread adjusts its affinity
>>>> according to mask of the buffer being processed.
>>>>
>>>> Signed-off-by: Alexey Budankov <[email protected]>
>>>> ---
>>> [SNIP]
>>>> @@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>>>> /*
>>>> * Mark the round finished in case we wrote
>>>> * at least one event.
>>>> + *
>>>> + * No need for round events in directory mode,
>>>> + * because per-cpu maps and files have data
>>>> + * sorted by kernel.
>>>> */
>>>> - if (bytes_written != rec->bytes_written)
>>>> + if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
>>>> rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
>>>
>>> This means it needs to keep all events in the ordered events queue
>>> when perf report processes the data, right?
>>
>> Looks so.
>
> Maybe it's not related to this directly. But we need to think about
> how to make perf report faster and more efficient as well.

Makes sense. Agreed.

>
> In my previous attempt, I separated samples from other events
> to be in different mmaps so they were saved to different files
> (or in a separate part of the data file).
>
> And perf report processes the meta events (FORK/MMAP/...)
> first to construct the system image and then processes samples
> with multi-threads.

Looks like separation to global, per-process events and per-thread
ones. Alternative algorithm could possibly be multi-passing of trace
data. First pass is to capture global events and build process state
overtime progress picture. Second pass is to capture and map per-thread
samples and/or other events into process state according to samples
and events time.

>
> Once it has the image, it could bypass the ordered events queue
> entirely.
>
> Thanks
>
> Namhyung
>

Thanks,
Alexei