2020-10-22 10:53:55

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation


Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
additionally to cites [1], [2]
- updated description of 3/15 to explicitly mention the reason
to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- explicitly elaborated on found issues to be resolved for
threaded AUX trace capture

v1: https://lore.kernel.org/lkml/[email protected]/

Patch set provides threaded trace streaming for base perf record
operation. Provided streaming mode (--threads) mitigates profiling
data losses and resolves scalability issues of serial and asynchronous
(--aio) trace streaming modes on multicore server systems. The patch
set is based on the prototype [1], [2] and the most closely relates
to mode 3) "mode that creates thread for every monitored memory map".

The threaded mode executes one-to-one mapping of trace streaming threads
to mapped data buffers and streaming into per-CPU trace files located
at data directory. The data buffers and threads are affined to NUMA
nodes and monitored CPUs according to system topology. --cpu option
can be used to specify exact CPUs to be monitored.

Basic analysis of data directories is provided for perf report mode.
Raw dump (-D) and aggregated reports are available for data directories,
still with no memory consumption optimizations. However data directories
collected with --compression-level option enabled can be analyzed with
little less memory because trace files are unmaped from tool process
memory after loading collected data.

Provided streaming mode is available with Zstd compression/decompression
(--compression-level) and handling of external commands (--control).
AUX area tracing, related and derived modes like --snapshot or
--aux-sample are not enabled. --switch-output, --switch-output-event,
--switch-max-files and --timestamp-filename options are not enabled.
Threaded trace streaming is not enabled for pipe mode. Initial intent
to enable AUX area tracing faced the need to define some optimal way
to store index data in data directory, thus left aside. --switch-output-*
and --timestamp-filename use cases are not clear for data directories.

Asynchronous(--aio) trace streaming and affinity (--affinity) modes
are mutually exclusive to threaded streaming mode.

See testing results and validation examples below.

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

---
Alexey Budankov (15):
perf session: introduce trace file path to be shown in raw trace dump
perf report: output trace file name in raw trace dump
perf data: open data directory in read access mode
perf session: move reader object definition to header file
perf session: introduce decompressor into trace reader object
perf session: load data directory into tool process memory
perf record: introduce trace file, compressor and stats in mmap object
perf record: write trace data into mmap trace files
perf record: introduce thread specific objects for trace streaming
perf record: manage thread specific data array
perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
perf record: introduce thread local variable for trace streaming
perf record: stop threads in the end of trace streaming
perf record: start threads in the beginning of trace streaming
perf record: introduce --threads command line option

tools/perf/Documentation/perf-record.txt | 7 +
tools/perf/builtin-inject.c | 3 +-
tools/perf/builtin-record.c | 514 +++++++++++++++++++++--
tools/perf/util/data.c | 4 +
tools/perf/util/evlist.c | 16 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/mmap.c | 6 +
tools/perf/util/mmap.h | 5 +
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 150 ++++---
tools/perf/util/session.h | 28 ++
tools/perf/util/tool.h | 3 +-
13 files changed, 637 insertions(+), 102 deletions(-)

---
Testing results:

$ perf test
1: vmlinux symtab matches kallsyms : Skip
2: Detect openat syscall event : Ok
3: Detect openat syscall event on all cpus : Ok
4: Read samples using the mmap interface : Ok
5: Test data source output : Ok
6: Parse event definition strings : Ok
7: Simple expression parser : Ok
8: PERF_RECORD_* events & perf_sample fields : Ok
9: Parse perf pmu format : Ok
10: PMU events :
10.1: PMU event table sanity : Ok
10.2: PMU event map aliases : Ok
10.3: Parsing of PMU event table metrics : Skip (some metrics failed)
10.4: Parsing of PMU event table metrics with fake PMUs : Ok
11: DSO data read : Ok
12: DSO data cache : Ok
13: DSO data reopen : FAILED!
14: Roundtrip evsel->name : Ok
15: Parse sched tracepoints fields : Ok
16: syscalls:sys_enter_openat event fields : Ok
17: Setup struct perf_event_attr : Ok
18: Match and link multiple hists : Ok
19: 'import perf' in python : FAILED!
20: Breakpoint overflow signal handler : Ok
21: Breakpoint overflow sampling : Ok
22: Breakpoint accounting : Ok
23: Watchpoint :
23.1: Read Only Watchpoint : Skip
23.2: Write Only Watchpoint : Ok
23.3: Read / Write Watchpoint : Ok
23.4: Modify Watchpoint : Ok
24: Number of exit events of a simple workload : Ok
25: Software clock events period values : Ok
26: Object code reading : Ok
27: Sample parsing : Ok
28: Use a dummy software event to keep tracking : Ok
29: Parse with no sample_id_all bit set : Ok
30: Filter hist entries : Ok
31: Lookup mmap thread : Ok
32: Share thread maps : Ok
33: Sort output of hist entries : Ok
34: Cumulate child hist entries : Ok
35: Track with sched_switch : Ok
36: Filter fds with revents mask in a fdarray : Ok
37: Add fd to a fdarray, making it autogrow : Ok
38: kmod_path__parse : Ok
39: Thread map : Ok
40: LLVM search and compile :
40.1: Basic BPF llvm compile : Skip
40.2: kbuild searching : Skip
40.3: Compile source for BPF prologue generation : Skip
40.4: Compile source for BPF relocation : Skip
41: Session topology : Ok
42: BPF filter :
42.1: Basic BPF filtering : Skip
42.2: BPF pinning : Skip
42.3: BPF prologue generation : Skip
42.4: BPF relocation checker : Skip
43: Synthesize thread map : Ok
44: Remove thread map : Ok
45: Synthesize cpu map : Ok
46: Synthesize stat config : Ok
47: Synthesize stat : Ok
48: Synthesize stat round : Ok
49: Synthesize attr update : Ok
50: Event times : Ok
51: Read backward ring buffer : Ok
52: Print cpu map : Ok
53: Merge cpu map : Ok
54: Probe SDT events : Ok
55: is_printable_array : Ok
56: Print bitmap : Ok
57: perf hooks : Ok
58: builtin clang support : Skip (not compiled in)
59: unit_number__scnprintf : Ok
60: mem2node : Ok
61: time utils : Ok
62: Test jit_write_elf : Ok
63: Test libpfm4 support : Skip (not compiled in)
64: Test api io : Ok
65: maps__merge_in : Ok
66: Demangle Java : Ok
67: Parse and process metrics : Ok
68: PE file support : Ok
69: Event expansion for cgroups : Ok
70: x86 rdpmc : Ok
71: Convert perf time to TSC : Ok
72: DWARF unwind : Ok
73: x86 instruction decoder - new instructions : Ok
74: Intel PT packet decoder : Ok
75: x86 bp modify : Ok
76: Check open filename arg using perf trace + vfs_getname : FAILED!
77: Add vfs_getname probe to get syscall args filenames : FAILED!
78: probe libc's inet_pton & backtrace it with ping : Ok
79: Use vfs_getname probe to get syscall args filenames : FAILED!
80: Zstd perf.data compression/decompression : Ok
81: Check Arm CoreSight trace data recording and synthesized samples: Skip

Validation:

$ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
Addr of buf1 = 0x7f8af24ae010
Offs of buf1 = 0x7f8af24ae180
Addr of buf2 = 0x7f8af04ad010
Offs of buf2 = 0x7f8af04ad1c0
Addr of buf3 = 0x7f8aee4ac010
Offs of buf3 = 0x7f8aee4ac100
Addr of buf4 = 0x7f8aec4ab010
Offs of buf4 = 0x7f8aec4ab140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 24.579 seconds
[ perf record: Woken up 972 times to write data ]
[ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]

$ perf report --header --stats
# ========
# captured on : Wed Oct 7 19:07:50 2020
# header version : 1
# data offset : 504
# data size : 30056
# feat offset : 30560
# hostname : nntvtune39
# os release : 5.8.8-100.fc31.x86_64
# perf version : 4.13.rc5.gdf19b1347b82
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
# cpuid : GenuineIntel,6,94,3
# total memory : 16119484 kB
# cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench /matrix/linux/matrix.gcc.g.O3
# event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
# event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
# sibling sockets : 0-7
# sibling dies : 0-7
# sibling threads : 0,4
# sibling threads : 1,5
# sibling threads : 2,6
# sibling threads : 3,7
# CPU 0: Core ID 0, Die ID 0, Socket ID 0
# CPU 1: Core ID 1, Die ID 0, Socket ID 0
# CPU 2: Core ID 2, Die ID 0, Socket ID 0
# CPU 3: Core ID 3, Die ID 0, Socket ID 0
# CPU 4: Core ID 0, Die ID 0, Socket ID 0
# CPU 5: Core ID 1, Die ID 0, Socket ID 0
# CPU 6: Core ID 2, Die ID 0, Socket ID 0
# CPU 7: Core ID 3, Die ID 0, Socket ID 0
# node0 meminfo : total = 16119484 kB, free = 933088 kB
# node0 cpu list : 0-7
# pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
# CPU cache info:
# L1 Data 32K [0,4]
# L1 Instruction 32K [0,4]
# L1 Data 32K [1,5]
# L1 Instruction 32K [1,5]
# L1 Data 32K [1,5]
# L1 Instruction 32K [1,5]
# L1 Data 32K [2,6]
# L1 Instruction 32K [2,6]
# L1 Data 32K [3,7]
# L1 Instruction 32K [3,7]
# L2 Unified 256K [0,4]
# L2 Unified 256K [1,5]
# L2 Unified 256K [2,6]
# L2 Unified 256K [3,7]
# L3 Unified 8192K [0-7]
# time of first sample : 0.000000
# time of last sample : 0.000000
# sample duration : 0.000 ms
# memory nodes (nr 1, block size 0x8000000):
# 0 [16G]: 0-23,32-135
# directory data version : 1
# bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
# bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
# bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
# bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
# bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
# bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
# bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
# bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
# bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
# bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
# bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
# bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
# bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
# compressed : Zstd, level = 3, ratio = 6
# cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
# missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA
# ========
#
Aggregated stats:
TOTAL events: 1457107
MMAP events: 98
LOST events: 0
COMM events: 2
EXIT events: 6
THROTTLE events: 0
UNTHROTTLE events: 0
FORK events: 8
READ events: 0
SAMPLE events: 1456388
MMAP2 events: 7
AUX events: 0
ITRACE_START events: 0
LOST_SAMPLES events: 0
SWITCH events: 0
SWITCH_CPU_WIDE events: 0
NAMESPACES events: 0
KSYMBOL events: 13
BPF_EVENT events: 13
CGROUP events: 166
TEXT_POKE events: 0
ATTR events: 0
EVENT_TYPE events: 0
TRACING_DATA events: 0
BUILD_ID events: 0
FINISHED_ROUND events: 0
ID_INDEX events: 0
AUXTRACE_INFO events: 0
AUXTRACE events: 0
AUXTRACE_ERROR events: 0
THREAD_MAP events: 1
CPU_MAP events: 1
STAT_CONFIG events: 0
STAT events: 0
STAT_ROUND events: 0
EVENT_UPDATE events: 0
TIME_CONV events: 1
FEATURE events: 0
COMPRESSED events: 403

$ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
Addr of buf1 = 0x7f15e8b89010
Offs of buf1 = 0x7f15e8b89180
Addr of buf2 = 0x7f15e6b88010
Offs of buf2 = 0x7f15e6b881c0
Addr of buf3 = 0x7f15e4b87010
Offs of buf3 = 0x7f15e4b87100
Addr of buf4 = 0x7f15e2b86010
Offs of buf4 = 0x7f15e2b86140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 24.281 seconds
[ perf record: Woken up 6923 times to write data ]
[ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]

$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 708K of event 'cycles'
# Event count (approx.): 277718834163
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ........................ ........................................................
#
99.97% 0.00% matrix.gcc.g.O3 libc-2.30.so [.] __GI___clone (inlined)
|
---__GI___clone (inlined)
|
--99.97%--start_thread
|
--99.97%--ThreadFunction
multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 0.00% matrix.gcc.g.O3 libpthread-2.30.so [.] start_thread
|
---start_thread
|
--99.97%--ThreadFunction
|
--99.97%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 0.00% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] ThreadFunction
|
---ThreadFunction
|
--99.97%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

99.97% 98.68% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] multiply1
|
|--98.68%--__GI___clone (inlined)
| start_thread
| ThreadFunction
| multiply1
|
--1.28%--multiply1
|
--0.96%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock
|
--0.60%--__sysvec_apic_timer_interrupt
|
--0.57%--hrtimer_interrupt

0.97% 0.05% matrix.gcc.g.O3 [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
|
--0.91%--asm_sysvec_apic_timer_interrupt
|
--0.70%--subflow_syn_recv_sock


--
2.24.1


2020-10-22 10:54:34

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 02/15] perf report: output trace file name in raw trace dump


Print path and name of a trace file into raw dump (-D)
<file_offset>@<path/file>. Print offset of PERF_RECORD_COMPRESSED
record instead of zero for decompressed records:
[email protected] [0x30]: event: 9
or
[email protected]/data.7 [0x30]: event: 9

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-inject.c | 3 +-
tools/perf/util/session.c | 75 ++++++++++++++++++++++---------------
tools/perf/util/tool.h | 3 +-
3 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 452a75fe68e5..037f8a98220c 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -106,7 +106,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,

static int perf_event__repipe_op4_synth(struct perf_session *session,
union perf_event *event,
- u64 data __maybe_unused)
+ u64 data __maybe_unused,
+ const char *str __maybe_unused)
{
return perf_event__repipe_synth(session->tool, event);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 4478ddae485b..6f09d506b2f6 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -36,7 +36,8 @@

#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
void *src;
size_t decomp_size, src_size;
@@ -58,6 +59,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
}

decomp->file_pos = file_offset;
+ decomp->file_path = file_path;
decomp->mmap_len = mmap_len;
decomp->head = 0;

@@ -98,7 +100,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

static int perf_session__open(struct perf_session *session)
{
@@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
ordered_events);

return perf_session__deliver_event(session, event->event,
- session->tool, event->file_offset);
+ session->tool, event->file_offset,
+ event->file_path);
}

struct perf_session *perf_session__new(struct perf_data *data,
@@ -452,7 +456,8 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu

static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
union perf_event *event __maybe_unused,
- u64 file_offset __maybe_unused)
+ u64 file_offset __maybe_unused,
+ const char *file_path __maybe_unused)
{
dump_printf(": unhandled!\n");
return 0;
@@ -1227,13 +1232,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
}

static void dump_event(struct evlist *evlist, union perf_event *event,
- u64 file_offset, struct perf_sample *sample)
+ u64 file_offset, struct perf_sample *sample,
+ const char *file_path)
{
if (!dump_trace)
return;

- printf("\n%#" PRIx64 " [%#x]: event: %d\n",
- file_offset, event->header.size, event->header.type);
+ printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+ file_offset, file_path, event->header.size, event->header.type);

trace_event(event);
if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1424,12 +1430,13 @@ static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
- struct perf_tool *tool, u64 file_offset)
+ struct perf_tool *tool, u64 file_offset,
+ const char *file_path)
{
struct evsel *evsel;
struct machine *machine;

- dump_event(evlist, event, file_offset, sample);
+ dump_event(evlist, event, file_offset, sample, file_path);

evsel = perf_evlist__id2evsel(evlist, sample->id);

@@ -1506,7 +1513,8 @@ static int machines__deliver_event(struct machines *machines,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct perf_sample sample;
int ret;
@@ -1524,7 +1532,7 @@ static int perf_session__deliver_event(struct perf_session *session,
return 0;

ret = machines__deliver_event(&session->machines, session->evlist,
- event, &sample, tool, file_offset);
+ event, &sample, tool, file_offset, file_path);

if (dump_trace && sample.aux_sample.size)
auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1534,7 +1542,8 @@ static int perf_session__deliver_event(struct perf_session *session,

static s64 perf_session__process_user_event(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -1544,7 +1553,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,

if (event->header.type != PERF_RECORD_COMPRESSED ||
tool->compressed == perf_session__process_compressed_event_stub)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);

/* These events are processed right away */
switch (event->header.type) {
@@ -1603,9 +1612,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
case PERF_RECORD_COMPRESSED:
- err = tool->compressed(session, event, file_offset);
+ err = tool->compressed(session, event, file_offset, file_path);
if (err)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);
return err;
default:
return -EINVAL;
@@ -1622,9 +1631,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, 0);
+ return perf_session__process_user_event(session, event, 0, NULL);

- return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+ return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
}

static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1720,7 +1729,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
}

static s64 perf_session__process_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
struct evlist *evlist = session->evlist;
struct perf_tool *tool = session->tool;
@@ -1735,7 +1745,7 @@ static s64 perf_session__process_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, file_offset);
+ return perf_session__process_user_event(session, event, file_offset, file_path);

if (tool->ordered_events) {
u64 timestamp = -1ULL;
@@ -1749,7 +1759,7 @@ static s64 perf_session__process_event(struct perf_session *session,
return ret;
}

- return perf_session__deliver_event(session, event, tool, file_offset);
+ return perf_session__deliver_event(session, event, tool, file_offset, file_path);
}

void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -1987,7 +1997,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
}
}

- if ((skip = perf_session__process_event(session, event, head)) < 0) {
+ skip = perf_session__process_event(session, event, head, "pipe");
+ if (skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
head, event->header.size, event->header.type);
err = -EINVAL;
@@ -2068,7 +2079,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
static int __perf_session__process_decomp_events(struct perf_session *session)
{
s64 skip;
- u64 size, file_pos = 0;
+ u64 size;
struct decomp *decomp = session->decomp_last;

if (!decomp)
@@ -2082,9 +2093,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
break;

size = event->header.size;
-
- if (size < sizeof(struct perf_event_header) ||
- (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ skip = perf_session__process_event(session, event, decomp->file_pos,
+ decomp->file_path);
+ if (size < sizeof(struct perf_event_header) || skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
decomp->file_pos + decomp->head, event->header.size, event->header.type);
return -EINVAL;
@@ -2115,7 +2126,8 @@ struct reader;

typedef s64 (*reader_cb_t)(struct perf_session *session,
union perf_event *event,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

struct reader {
int fd;
@@ -2198,9 +2210,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, file_pos)) < 0) {
- pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
- file_offset + head, event->header.size,
+ (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+ pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
+ file_offset + head, rd->path, event->header.size,
event->header.type, strerror(-skip));
err = skip;
goto out;
@@ -2230,9 +2242,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,

static s64 process_simple(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
- return perf_session__process_event(session, event, file_offset);
+ return perf_session__process_event(session, event, file_offset, file_path);
}

static int __perf_session__process_events(struct perf_session *session)
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index bbbc0dcd461f..c966531d3eca 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,

typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+ const char *str);

typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
--
2.24.1

2020-10-22 10:55:29

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 10/15] perf record: manage thread specific data array


Provide allocation, initialization, finalization and releasing of
thread specific objects at thread specific data array. Allocate
thread specific object for every data buffer making one-to-one
relation between data buffer and a thread processing the buffer.
Deliver event fd related signals to thread's pollfd object.
Deliver thread control commands to ctlfd_pos fd of thread 1+.
Deliver tool external control commands via ctlfd_pos fd of thread 0.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8e512096a060..89cb8e913fb3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
return kcore_copy(from_dir, kcore_dir);
}

+static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
+ struct fdarray *evlist_pollfd, int ctlfd_pos)
+{
+ int i, j, k, nr_thread_data;
+ struct thread_data *thread_data;
+
+ rec->nr_thread_data = nr_thread_data = nr_mmaps;
+ rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
+ if (!thread_data) {
+ pr_err("Failed to allocate thread data\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < nr_thread_data; i++) {
+ short revents;
+ int pos, fd;
+
+ thread_data[i].tid = -1;
+
+ if (pipe(thread_data[i].comm.msg) ||
+ pipe(thread_data[i].comm.ack)) {
+ pr_err("Failed to create thread comm pipes, errno %d\n", errno);
+ return -ENOMEM;
+ }
+
+ thread_data[i].maps = &mmaps[i];
+ thread_data[i].nr_mmaps = 1;
+
+ thread_data[i].rec = rec;
+
+ fdarray__init(&(thread_data[i].pollfd), 64);
+
+ for (j = 0; j < thread_data[i].nr_mmaps; j++) {
+ struct mmap *map = &(thread_data[i].maps[j]);
+
+ for (k = 0; k < evlist_pollfd->nr; k++) {
+ if (evlist_pollfd->priv[k].ptr != map)
+ continue;
+
+ fd = evlist_pollfd->entries[k].fd;
+ revents = evlist_pollfd->entries[k].events;
+ pos = fdarray__add(&(thread_data[i].pollfd),
+ fd, revents | POLLERR | POLLHUP,
+ fdarray_flag__default);
+ if (pos >= 0)
+ thread_data[i].pollfd.priv[pos].ptr = map;
+ else
+ return -ENOMEM;
+ }
+ }
+
+ if (i) {
+ fd = thread_data[i].comm.msg[0];
+ revents = POLLIN | POLLERR | POLLHUP;
+ } else {
+ if (ctlfd_pos == -1)
+ continue;
+ fd = evlist_pollfd->entries[ctlfd_pos].fd;
+ revents = evlist_pollfd->entries[ctlfd_pos].events;
+ }
+ thread_data[i].ctlfd_pos =
+ fdarray__add(&(thread_data[i].pollfd),
+ fd, revents, fdarray_flag__nonfilterable);
+ if (thread_data[i].ctlfd_pos < 0)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int record__free_thread_data(struct record *rec)
+{
+ int i;
+
+ if (rec->thread_data) {
+ for (i = 0; i < rec->nr_thread_data; i++) {
+ close(rec->thread_data[i].comm.msg[0]);
+ close(rec->thread_data[i].comm.msg[1]);
+ close(rec->thread_data[i].comm.ack[0]);
+ close(rec->thread_data[i].comm.ack[1]);
+ fdarray__exit(&(rec->thread_data[i].pollfd));
+ }
+ zfree(&rec->thread_data);
+ }
+
+ return 0;
+}
+
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
@@ -918,6 +1006,9 @@ static int record__mmap_evlist(struct record *rec,
}
}

+ if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
+ return -1;
+
if (record__threads_enabled(rec)) {
int i, ret, nr = evlist->core.nr_mmaps;
struct mmap *mmaps = rec->opts.overwrite ?
@@ -929,6 +1020,12 @@ static int record__mmap_evlist(struct record *rec,

for (i = 0; i < nr; i++)
mmaps[i].file = &rec->data.dir.files[i];
+
+ ret = record__alloc_thread_data(rec, mmaps, nr,
+ &evlist->core.pollfd,
+ evlist->ctl_fd.pos);
+ if (ret)
+ return ret;
}

return 0;
@@ -1910,9 +2007,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
perf_evlist__start_workload(rec->evlist);
}

- if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
- goto out_child;
-
if (opts->initial_delay) {
pr_info(EVLIST_DISABLED_MSG);
if (opts->initial_delay > 0) {
@@ -2063,6 +2157,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);

out_child:
+ record__free_thread_data(rec);
evlist__finalize_ctlfd(rec->evlist);
record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
--
2.24.1

2020-10-22 10:55:49

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming


Introduce thread local data object and its array to be used for
threaded trace streaming.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba26d75c51d6..8e512096a060 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -85,11 +85,29 @@ struct switch_output {
int cur_file;
};

+struct thread_data {
+ pid_t tid;
+ struct {
+ int msg[2];
+ int ack[2];
+ } comm;
+ struct fdarray pollfd;
+ int ctlfd_pos;
+ struct mmap *maps;
+ int nr_mmaps;
+ struct record *rec;
+ unsigned long long samples;
+ unsigned long waking;
+ u64 bytes_written;
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
u64 bytes_written;
struct perf_data data;
+ struct thread_data *thread_data;
+ int nr_thread_data;
struct auxtrace_record *itr;
struct evlist *evlist;
struct perf_session *session;
--
2.24.1

2020-10-22 10:55:49

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status


Introduce evlist__ctlfd_update() to update ctl fd poll status
in evlist pollfd array using other pollfd object.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/evlist.c | 16 ++++++++++++++++
tools/perf/util/evlist.h | 1 +
2 files changed, 17 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8bdf3d2c907c..758a4896fedd 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1970,6 +1970,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
return err;
}

+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
+{
+ int ctlfd_pos = evlist->ctl_fd.pos;
+ struct pollfd *entries = evlist->core.pollfd.entries;
+
+ if (!evlist__ctlfd_initialized(evlist))
+ return 0;
+
+ if (entries[ctlfd_pos].fd != update->fd ||
+ entries[ctlfd_pos].events != update->events)
+ return -1;
+
+ entries[ctlfd_pos].revents = update->revents;
+ return 0;
+}
+
struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
{
struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index e1a450322bc5..9b73d6ccf066 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -380,6 +380,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
int evlist__finalize_ctlfd(struct evlist *evlist);
bool evlist__ctlfd_initialized(struct evlist *evlist);
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
int evlist__ctlfd_ack(struct evlist *evlist);

--
2.24.1


2020-10-22 10:56:09

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming


Start threads in detached state because its management is possible
via messaging. Block signals prior the threads start so only main
tool thread would be notified on external async signals during data
collection. Streaming threads connect one-to-one to mapped data
buffers and write into per-CPU trace files located at data directory.
Data buffers and threads are affined to local NUMA nodes and monitored
CPUs according to system topology. --cpu option can be used to specify
CPUs to be monitored.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
1 file changed, 120 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a15642656066..1d41e996a994 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -56,6 +56,7 @@
#include <poll.h>
#include <pthread.h>
#include <unistd.h>
+#include <sys/syscall.h>
#include <sched.h>
#include <signal.h>
#ifdef HAVE_EVENTFD_SUPPORT
@@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
perf_mmap__put(map);
}

+static void *record__thread(void *arg)
+{
+ enum thread_msg msg = THREAD_MSG__READY;
+ bool terminate = false;
+ struct fdarray *pollfd;
+ int err, ctlfd_pos;
+
+ thread = arg;
+ thread->tid = syscall(SYS_gettid);
+
+ err = write(thread->comm.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
+
+ pollfd = &(thread->pollfd);
+ ctlfd_pos = thread->ctlfd_pos;
+
+ for (;;) {
+ unsigned long long hits = thread->samples;
+
+ if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+ break;
+
+ if (hits == thread->samples) {
+
+ err = fdarray__poll(pollfd, -1);
+ /*
+ * Propagate error, only if there's any. Ignore positive
+ * number of returned events and interrupt error.
+ */
+ if (err > 0 || (err < 0 && errno == EINTR))
+ err = 0;
+ thread->waking++;
+
+ if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+ record__thread_munmap_filtered, NULL) == 0)
+ break;
+ }
+
+ if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+ terminate = true;
+ close(thread->comm.msg[0]);
+ pollfd->entries[ctlfd_pos].fd = -1;
+ pollfd->entries[ctlfd_pos].events = 0;
+ }
+
+ pollfd->entries[ctlfd_pos].revents = 0;
+ }
+
+ err = write(thread->comm.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads: %d failed to notify on termination. Error %m", thread->tid);
+
+ return NULL;
+}
+
static void record__init_features(struct record *rec)
{
struct perf_session *session = rec->session;
@@ -1823,6 +1880,58 @@ static int record__terminate_thread(struct thread_data *thread_data)
return 0;
}

+static int record__start_threads(struct record *rec)
+{
+ int i, j, ret = 0;
+ sigset_t full, mask;
+ pthread_t handle;
+ pthread_attr_t attrs;
+ int nr_thread_data = rec->nr_thread_data;
+ struct thread_data *thread_data = rec->thread_data;
+
+ if (!record__threads_enabled(rec))
+ return 0;
+
+ sigfillset(&full);
+ if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+ pr_err("Failed to block signals on threads start. Error: %m\n");
+ return -1;
+ }
+
+ pthread_attr_init(&attrs);
+ pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+ for (i = 1; i < nr_thread_data; i++) {
+ int res = 0;
+ enum thread_msg msg = THREAD_MSG__UNSUPPORTED;
+
+ if (pthread_create(&handle, &attrs, record__thread, &thread_data[i])) {
+ for (j = 1; j < i; j++)
+ record__terminate_thread(&thread_data[i]);
+ pr_err("Failed to start threads. Error: %m\n");
+ ret = -1;
+ goto out_err;
+ }
+
+ res = read(thread_data[i].comm.ack[0], &msg, sizeof(msg));
+ if (res > 0)
+ pr_debug("threads: %d -> %s\n", rec->thread_data[i].tid,
+ thread_msg_tags[msg]);
+ }
+
+ thread = &thread_data[0];
+ thread->tid = syscall(SYS_gettid);
+ pr_debug("threads: %d -> %s\n", thread->tid, thread_msg_tags[THREAD_MSG__READY]);
+
+out_err:
+ if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+ pr_err("Failed to unblock signals on threads start. Error: %m\n");
+ ret = -1;
+ }
+
+ return ret;
+}
+
static int record__stop_threads(struct record *rec, unsigned long *waking)
{
int i, j, nr_thread_data = rec->nr_thread_data;
@@ -1965,7 +2074,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
err = record__kcore_copy(&session->machines.host, data);
if (err) {
pr_err("ERROR: Failed to copy kcore\n");
- goto out_child;
+ goto out_free_threads;
}
}

@@ -1976,7 +2085,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
pr_err("ERROR: Apply config to BPF failed: %s\n",
errbuf);
- goto out_child;
+ goto out_free_threads;
}

/*
@@ -1994,11 +2103,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (data->is_pipe) {
err = perf_header__write_pipe(fd);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
} else {
err = perf_session__write_header(session, rec->evlist, fd, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
}

err = -1;
@@ -2006,16 +2115,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
&& !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
pr_err("Couldn't generate buildids. "
"Use --no-buildid to profile anyway.\n");
- goto out_child;
+ goto out_free_threads;
}

err = record__setup_sb_evlist(rec);
if (err)
- goto out_child;
+ goto out_free_threads;

err = record__synthesize(rec, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;

if (rec->realtime_prio) {
struct sched_param param;
@@ -2024,10 +2133,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (sched_setscheduler(0, SCHED_FIFO, &param)) {
pr_err("Could not set realtime priority.\n");
err = -1;
- goto out_child;
+ goto out_free_threads;
}
}

+ if (record__start_threads(rec))
+ goto out_free_threads;
+
/*
* When perf is starting the traced process, all the events
* (apart from group members) have enable_on_exec=1 set,
--
2.24.1


2020-10-22 10:56:15

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 15/15] perf record: introduce --threads command line option


Provide --threads option in perf record command line interface.
Threaded streaming mode mitigates profiling data losses and
resolves scalability issues of serial and asynchronous (--aio) trace
streaming modes on multicore server systems. The implementation is
based on the prototype [1], [2] and the most closely relates to mode 3)
"mode that creates thread for every monitored memory map".

Threaded streaming mode is available with Zstd compression/decompression
(--compression-level) and handling of external commands (--control).
AUX area tracing, related and derived modes like --snapshot or
--aux-sample are not enabled. --switch-output, --switch-output-event,
--switch-max-files and --timestamp-filename options are not enabled.
Threaded trace streaming is not enabled for pipe mode. Initial intent
to enable AUX area tracing faced the need to define some optimal way
to store index data at data directory thus left aside. --switch-output-*
and --timestamp-filename use cases are not yet clear for data directories.

Asynchronous (--aio) trace streaming and affinity (--affinity) modes
are mutually exclusive to the exposed threaded streaming mode.

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Suggested-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 7 ++++
tools/perf/builtin-record.c | 45 +++++++++++++++++++++---
2 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 768888b9326a..d8fa387da973 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -671,6 +671,13 @@ Example of bash shell script to enable and disable events during measurements:
wait -n ${perf_pid}
exit $?

+--threads::
+Write collected trace data into per-CPU trace files using parallel threads.
+List of monitored CPUs can be configured by a mask provided via --cpu option.
+Trace writing threads correspond one-to-one to mapped data buffers. Threads
+and buffers are affined to monitored CPUs and NUMA nodes according to system
+topology. Threaded trace streaming mode is mutually exclusive to asynchronous
+trace streaming (--aio) and affinity (--affinity) modes.

SEE ALSO
--------
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1d41e996a994..575b0b595081 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -780,6 +780,12 @@ static int record__auxtrace_init(struct record *rec)
{
int err;

+ if ((rec->opts.auxtrace_snapshot_opts || rec->opts.auxtrace_sample_opts)
+ && record__threads_enabled(rec)) {
+ pr_err("AUX area tracing options are not available in threaded streaming mode.\n");
+ return -EINVAL;
+ }
+
if (!rec->itr) {
rec->itr = auxtrace_record__init(rec->evlist, &err);
if (err)
@@ -2008,6 +2014,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
return PTR_ERR(session);
}

+ if (record__threads_enabled(rec) && perf_data__is_pipe(&rec->data)) {
+ pr_err("Threaded trace streaming is not available in pipe mode.\n");
+ return -1;
+ }
+
fd = perf_data__fd(data);
rec->session = session;

@@ -2680,12 +2691,22 @@ static int switch_output_setup(struct record *rec)
* --switch-output=signal, as we'll send a SIGUSR2 from the side band
* thread to its parent.
*/
- if (rec->switch_output_event_set)
+ if (rec->switch_output_event_set) {
+ if (record__threads_enabled(rec)) {
+ pr_warning("WARNING: --switch-output-event option is not available in threaded streaming mode.\n");
+ return 0;
+ }
goto do_signal;
+ }

if (!s->set)
return 0;

+ if (record__threads_enabled(rec)) {
+ pr_warning("WARNING: --switch-output option is not available in threaded streaming mode.\n");
+ return 0;
+ }
+
if (!strcmp(s->str, "signal")) {
do_signal:
s->signal = true;
@@ -2964,8 +2985,8 @@ static struct option __record_options[] = {
"Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
record__parse_affinity),
#ifdef HAVE_ZSTD_SUPPORT
- OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
- "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
+ OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default, "n",
+ "Compress records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
record__parse_comp_level),
#endif
OPT_CALLBACK(0, "max-size", &record.output_max_size,
@@ -2984,6 +3005,8 @@ static struct option __record_options[] = {
"\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
"\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
parse_control_option),
+ OPT_BOOLEAN(0, "threads", &record.opts.threads,
+ "write collected trace data into per-CPU trace files using parallel threads"),
OPT_END()
};

@@ -3046,8 +3069,17 @@ int cmd_record(int argc, const char **argv)
if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

- if (record__threads_enabled(rec))
+ if (record__threads_enabled(rec)) {
+ if (rec->opts.affinity != PERF_AFFINITY_SYS) {
+ pr_err("--affinity option is mutually exclusive to threaded streaming mode.\n");
+ goto out_opts;
+ }
rec->opts.affinity = PERF_AFFINITY_CPU;
+ if (record__aio_enabled(rec)) {
+ pr_err("Asynchronous streaming mode (--aio) is mutually exclusive to threaded streaming mode.\n");
+ goto out_opts;
+ }
+ }

if (rec->opts.comp_level != 0) {
pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
@@ -3082,6 +3114,11 @@ int cmd_record(int argc, const char **argv)
}
}

+ if (rec->timestamp_filename && record__threads_enabled(rec)) {
+ rec->timestamp_filename = false;
+ pr_warning("WARNING: --timestamp-filename option is not available in threaded streaming mode.\n");
+ }
+
/*
* Allow aliases to facilitate the lookup of symbols for address
* filters. Refer to auxtrace_parse_filters().
--
2.24.1


2020-10-22 11:17:02

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 04/15] perf session: move reader object definition to header file


Move definition of reader to session header file to be shared
among different source files. Introduce reference to active
reader object from session object.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/session.c | 27 ---------------------------
tools/perf/util/session.h | 25 +++++++++++++++++++++++++
2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6f09d506b2f6..911b2dbcd0ac 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
return 0;
}

-/*
- * On 64bit we can mmap the data file in one go. No need for tiny mmap
- * slices. On 32bit we use 32MB.
- */
-#if BITS_PER_LONG == 64
-#define MMAP_SIZE ULLONG_MAX
-#define NUM_MMAPS 1
-#else
-#define MMAP_SIZE (32 * 1024 * 1024ULL)
-#define NUM_MMAPS 128
-#endif
-
-struct reader;
-
-typedef s64 (*reader_cb_t)(struct perf_session *session,
- union perf_event *event,
- u64 file_offset,
- const char *file_path);
-
-struct reader {
- int fd;
- const char *path;
- u64 data_size;
- u64 data_offset;
- reader_cb_t process;
-};
-
static int
reader__process_events(struct reader *rd, struct perf_session *session,
struct ui_progress *prog)
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 378ffc3e2809..abdb8518a81f 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -20,6 +20,30 @@ struct thread;
struct auxtrace;
struct itrace_synth_opts;

+/*
+ * On 64bit we can mmap the data file in one go. No need for tiny mmap
+ * slices. On 32bit we use 32MB.
+ */
+#if BITS_PER_LONG == 64
+#define MMAP_SIZE ULLONG_MAX
+#define NUM_MMAPS 1
+#else
+#define MMAP_SIZE (32 * 1024 * 1024ULL)
+#define NUM_MMAPS 128
+#endif
+
+typedef s64 (*reader_cb_t)(struct perf_session *session,
+ union perf_event *event,
+ u64 file_offset, const char *file_path);
+
+struct reader {
+ int fd;
+ const char *path;
+ u64 data_size;
+ u64 data_offset;
+ reader_cb_t process;
+};
+
struct perf_session {
struct perf_header header;
struct machines machines;
@@ -41,6 +65,7 @@ struct perf_session {
struct zstd_data zstd_data;
struct decomp *decomp;
struct decomp *decomp_last;
+ struct reader *reader;
};

struct decomp {
--
2.24.1

2020-10-22 11:17:28

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 06/15] perf session: load data directory into tool process memory


Read trace files located in data directory into tool process memory.
Basic analysis support of data directories is provided for report
mode. Raw dump (-D) and aggregated reports are available for data
directories, still with no memory consumption optimizations. However
data directories collected with --compression-level option enabled
can be analyzed with little less memory because trace files are
unmaped from tool process memory after loading collected data.
The implementation is based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Suggested-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/session.h | 1 +
2 files changed, 49 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6afc670fdf0c..0752eec19813 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
goto more;

out:
+ if (rd->unmap_file) {
+ int i;
+
+ for (i = 0; i < NUM_MMAPS; i++) {
+ if (mmaps[i]) {
+ munmap(mmaps[i], mmap_size);
+ mmaps[i] = NULL;
+ }
+ }
+ }
+
return err;
}

@@ -2231,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
.data_offset = session->header.data_offset,
.process = process_simple,
.path = session->data->file.path,
+ .unmap_file = false,
};
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -2247,6 +2259,42 @@ static int __perf_session__process_events(struct perf_session *session)
err = reader__process_events(&rd, session, &prog);
if (err)
goto out_err;
+
+ if (perf_data__is_dir(session->data)) {
+ int i, nr = session->data->dir.nr;
+ struct reader file_rd[nr];
+ u64 total_size = perf_data__size(session->data);
+
+ total_size -= session->data->file.size;
+ ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+ memset(&file_rd, 0, nr * sizeof(file_rd[0]));
+
+ for (i = 0; i < nr ; i++) {
+ struct perf_data_file *file;
+
+ file = &session->data->dir.files[i];
+ file_rd[i] = (struct reader) {
+ .fd = file->fd,
+ .path = file->path,
+ .data_size = file->size,
+ .data_offset = 0,
+ .process = process_simple,
+ };
+ file_rd[i].unmap_file = perf_header__has_feat(&session->header,
+ HEADER_COMPRESSED);
+ session->reader = &file_rd[i];
+
+ if (zstd_init(&(file_rd[i].zstd_data), 0))
+ goto out_err;
+ err = reader__process_events(&file_rd[i], session, &prog);
+ zstd_fini(&(file_rd[i].zstd_data));
+ session->reader = NULL;
+ if (err)
+ goto out_err;
+ }
+ }
+
/* do the final flush for ordered samples */
err = ordered_events__flush(oe, OE_FLUSH__FINAL);
if (err)
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 4fc9ccdf7970..d428f3eaf7fd 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -43,6 +43,7 @@ struct reader {
u64 data_offset;
reader_cb_t process;
struct zstd_data zstd_data;
+ bool unmap_file;
};

struct perf_session {
--
2.24.1

2020-10-22 11:17:29

by Alexey Budankov

[permalink] [raw]
Subject: [PATCH v2 07/15] perf record: introduce trace file, compressor and stats in mmap object


Introduce trace file and compressor objects into mmap object so
they could be used to process and store data stream from the
corresponding kernel data buffer. Introduce bytes_transferred
and bytes_compressed stats so they would capture statistics for
the related data buffer transfers. Make use of the introduced
per mmap file, compressor and stats when they are initialized
and available.

Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 47 +++++++++++++++++++++++++++----------
tools/perf/util/mmap.c | 6 +++++
tools/perf/util/mmap.h | 5 ++++
3 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index adf311d15d3d..619aaee11231 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -150,11 +150,17 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
{
struct perf_data_file *file = &rec->session->data->file;

+ if (map && map->file)
+ file = map->file;
+
if (perf_data_file__write(file, bf, size) < 0) {
pr_err("failed to write perf data, error: %m\n");
return -1;
}

+ if (file != &rec->session->data->file)
+ return 0;
+
rec->bytes_written += size;

if (record__output_max_size_exceeded(rec) && !done) {
@@ -172,8 +178,8 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,

static int record__aio_enabled(struct record *rec);
static int record__comp_enabled(struct record *rec);
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
- void *src, size_t src_size);
+static size_t zstd_compress(struct zstd_data *data,
+ void *dst, size_t dst_size, void *src, size_t src_size);

#ifdef HAVE_AIO_SUPPORT
static int record__aio_write(struct aiocb *cblock, int trace_fd,
@@ -291,6 +297,7 @@ struct record_aio {
static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size)
{
struct record_aio *aio = to;
+ size_t compressed = size;

/*
* map->core.base data pointed by buf is copied into free map->aio.data[] buffer
@@ -307,9 +314,14 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
*/

if (record__comp_enabled(aio->rec)) {
- size = zstd_compress(aio->rec->session, aio->data + aio->size,
- mmap__mmap_len(map) - aio->size,
+ struct zstd_data *zstd_data = &(aio->rec->session->zstd_data);
+
+ compressed = zstd_compress(zstd_data,
+ aio->data + aio->size, mmap__mmap_len(map) - aio->size,
buf, size);
+
+ aio->rec->session->bytes_transferred += size;
+ aio->rec->session->bytes_compressed += compressed;
} else {
memcpy(aio->data + aio->size, buf, size);
}
@@ -328,7 +340,7 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
perf_mmap__get(&map->core);
}

- aio->size += size;
+ aio->size += compressed;

return size;
}
@@ -532,14 +544,28 @@ static int process_locked_synthesized_event(struct perf_tool *tool,
static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
{
struct record *rec = to;
+ size_t compressed = size;

if (record__comp_enabled(rec)) {
- size = zstd_compress(rec->session, map->data, mmap__mmap_len(map), bf, size);
+ struct zstd_data *zstd_data = &rec->session->zstd_data;
+
+ if (map->file)
+ zstd_data = &map->zstd_data;
+
+ compressed = zstd_compress(zstd_data, map->data, mmap__mmap_len(map), bf, size);
bf = map->data;
+
+ if (map->file) {
+ map->bytes_transferred += size;
+ map->bytes_compressed += compressed;
+ } else {
+ rec->session->bytes_transferred += size;
+ rec->session->bytes_compressed += compressed;
+ }
}

rec->samples++;
- return record__write(rec, map, bf, size);
+ return record__write(rec, map, bf, compressed);
}

static volatile int signr = -1;
@@ -1079,18 +1105,15 @@ static size_t process_comp_header(void *record, size_t increment)
return size;
}

-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+static size_t zstd_compress(struct zstd_data *zstd_data, void *dst, size_t dst_size,
void *src, size_t src_size)
{
size_t compressed;
size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct perf_record_compressed) - 1;

- compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
+ compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
max_record_size, process_comp_header);

- session->bytes_transferred += src_size;
- session->bytes_compressed += compressed;
-
return compressed;
}

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ab7108d22428..a2c5e4237592 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
{
bitmap_free(map->affinity_mask.bits);

+ zstd_fini(&map->zstd_data);
+
perf_mmap__aio_munmap(map);
if (map->data != NULL) {
munmap(map->data, mmap__mmap_len(map));
@@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
map->core.flush = mp->flush;

map->comp_level = mp->comp_level;
+ if (zstd_init(&map->zstd_data, map->comp_level)) {
+ pr_debug2("failed to init mmap commpressor, error %d\n", errno);
+ return -1;
+ }

if (map->comp_level && !perf_mmap__aio_enabled(map)) {
map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 9d5f589f02ae..d98b0bdedeac 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -13,6 +13,7 @@
#endif
#include "auxtrace.h"
#include "event.h"
+#include "util/compress.h"

struct aiocb;

@@ -43,6 +44,10 @@ struct mmap {
struct mmap_cpu_mask affinity_mask;
void *data;
int comp_level;
+ struct perf_data_file *file;
+ struct zstd_data zstd_data;
+ u64 bytes_transferred;
+ u64 bytes_compressed;
};

struct mmap_params {
--
2.24.1


2020-10-24 15:48:28

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation

On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>
> Changes in v2:
> - explicitly added credit tags to patches 6/15 and 15/15,
> additionally to cites [1], [2]
> - updated description of 3/15 to explicitly mention the reason
> to open data directories in read access mode (e.g. for perf report)
> - implemented fix for compilation error of 2/15
> - explicitly elaborated on found issues to be resolved for
> threaded AUX trace capture
>
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".

so what I liked about the previous code was that you could
configure how the threads would be created

default --threads options created thread for each cpu like
in your change:

$ perf record -v --threads ...
...
thread 0 monitor: 0 allowed: 0
thread 1 monitor: 1 allowed: 1
thread 2 monitor: 2 allowed: 2
thread 3 monitor: 3 allowed: 3
thread 4 monitor: 4 allowed: 4
thread 5 monitor: 5 allowed: 5
thread 6 monitor: 6 allowed: 6
thread 7 monitor: 7 allowed: 7


then numa based:

$ perf record -v --threads=numa ...
...
thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23


socket based:

$ perf record -v --threads=socket ...
...
thread 0 monitor: 0-7 allowed: 0-7


core based:

$ perf record -v --threads=core ...
...
thread 0 monitor: 0,4 allowed: 0,4
thread 1 monitor: 1,5 allowed: 1,5
thread 2 monitor: 2,6 allowed: 2,6
thread 3 monitor: 3,7 allowed: 3,7


and user configurable:

$ perf record -v --threads=0-3/0:4-7/4 ...
...
threads: 0. monitor 0-3, allowed 0
threads: 1. monitor 4-7, allowed 4


so this way you could easily pin threads to cpu/core/socket/numa,
or to some other cpu of your choice, because this will be always
game of try and check where I'm not getting LOST events and not
creating 1000 threads

perf record: Add support for threads numa option value
perf record: Add support for threads socket option value
perf record: Add support for threads core option value
perf record: Add support for threads user option value

jirka

2020-10-24 15:49:16

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 10/15] perf record: manage thread specific data array

On Wed, Oct 21, 2020 at 07:04:26PM +0300, Alexey Budankov wrote:
>
> Provide allocation, initialization, finalization and releasing of
> thread specific objects at thread specific data array. Allocate
> thread specific object for every data buffer making one-to-one
> relation between data buffer and a thread processing the buffer.
> Deliver event fd related signals to thread's pollfd object.
> Deliver thread control commands to ctlfd_pos fd of thread 1+.
> Deliver tool external control commands via ctlfd_pos fd of thread 0.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
> 1 file changed, 98 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 8e512096a060..89cb8e913fb3 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
> return kcore_copy(from_dir, kcore_dir);
> }
>
> +static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
> + struct fdarray *evlist_pollfd, int ctlfd_pos)
> +{
> + int i, j, k, nr_thread_data;
> + struct thread_data *thread_data;
> +
> + rec->nr_thread_data = nr_thread_data = nr_mmaps;
> + rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
> + if (!thread_data) {
> + pr_err("Failed to allocate thread data\n");
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < nr_thread_data; i++) {
> + short revents;
> + int pos, fd;
> +
> + thread_data[i].tid = -1;
> +
> + if (pipe(thread_data[i].comm.msg) ||
> + pipe(thread_data[i].comm.ack)) {
> + pr_err("Failed to create thread comm pipes, errno %d\n", errno);
> + return -ENOMEM;
> + }

the original code was using state flag and pthread_cond,
which I think is more readable

https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=a7da527ff8be69572c6d17525c03c6fe394503c8
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=eb85ce4da64a885fdb6c77cfc5bd71312fe02e2a

> +
> + thread_data[i].maps = &mmaps[i];
> + thread_data[i].nr_mmaps = 1;
> +
> + thread_data[i].rec = rec;
> +
> + fdarray__init(&(thread_data[i].pollfd), 64);
> +
> + for (j = 0; j < thread_data[i].nr_mmaps; j++) {
> + struct mmap *map = &(thread_data[i].maps[j]);
> +
> + for (k = 0; k < evlist_pollfd->nr; k++) {
> + if (evlist_pollfd->priv[k].ptr != map)
> + continue;
> +
> + fd = evlist_pollfd->entries[k].fd;
> + revents = evlist_pollfd->entries[k].events;
> + pos = fdarray__add(&(thread_data[i].pollfd),
> + fd, revents | POLLERR | POLLHUP,
> + fdarray_flag__default);
> + if (pos >= 0)
> + thread_data[i].pollfd.priv[pos].ptr = map;
> + else
> + return -ENOMEM;

I added function for that:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=8aa6e68a7471b9d25af1a9eebfa9321433366a17

jirka

> + }
> + }
> +
> + if (i) {
> + fd = thread_data[i].comm.msg[0];
> + revents = POLLIN | POLLERR | POLLHUP;
> + } else {
> + if (ctlfd_pos == -1)
> + continue;
> + fd = evlist_pollfd->entries[ctlfd_pos].fd;
> + revents = evlist_pollfd->entries[ctlfd_pos].events;
> + }
> + thread_data[i].ctlfd_pos =
> + fdarray__add(&(thread_data[i].pollfd),
> + fd, revents, fdarray_flag__nonfilterable);
> + if (thread_data[i].ctlfd_pos < 0)
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static int record__free_thread_data(struct record *rec)
> +{
> + int i;
> +
> + if (rec->thread_data) {
> + for (i = 0; i < rec->nr_thread_data; i++) {
> + close(rec->thread_data[i].comm.msg[0]);
> + close(rec->thread_data[i].comm.msg[1]);
> + close(rec->thread_data[i].comm.ack[0]);
> + close(rec->thread_data[i].comm.ack[1]);
> + fdarray__exit(&(rec->thread_data[i].pollfd));
> + }
> + zfree(&rec->thread_data);
> + }
> +
> + return 0;
> +}
> +
> static int record__mmap_evlist(struct record *rec,
> struct evlist *evlist)
> {
> @@ -918,6 +1006,9 @@ static int record__mmap_evlist(struct record *rec,
> }
> }
>
> + if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
> + return -1;
> +
> if (record__threads_enabled(rec)) {
> int i, ret, nr = evlist->core.nr_mmaps;
> struct mmap *mmaps = rec->opts.overwrite ?
> @@ -929,6 +1020,12 @@ static int record__mmap_evlist(struct record *rec,
>
> for (i = 0; i < nr; i++)
> mmaps[i].file = &rec->data.dir.files[i];
> +
> + ret = record__alloc_thread_data(rec, mmaps, nr,
> + &evlist->core.pollfd,
> + evlist->ctl_fd.pos);
> + if (ret)
> + return ret;
> }
>
> return 0;
> @@ -1910,9 +2007,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> perf_evlist__start_workload(rec->evlist);
> }
>
> - if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
> - goto out_child;
> -
> if (opts->initial_delay) {
> pr_info(EVLIST_DISABLED_MSG);
> if (opts->initial_delay > 0) {
> @@ -2063,6 +2157,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> record__synthesize_workload(rec, true);
>
> out_child:
> + record__free_thread_data(rec);
> evlist__finalize_ctlfd(rec->evlist);
> record__mmap_read_all(rec, true);
> record__aio_mmap_read_sync(rec);
> --
> 2.24.1
>

2020-10-24 15:49:40

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming

On Wed, Oct 21, 2020 at 07:10:09PM +0300, Alexey Budankov wrote:
>
> Start threads in detached state because its management is possible
> via messaging. Block signals prior the threads start so only main
> tool thread would be notified on external async signals during data
> collection. Streaming threads connect one-to-one to mapped data
> buffers and write into per-CPU trace files located at data directory.
> Data buffers and threads are affined to local NUMA nodes and monitored
> CPUs according to system topology. --cpu option can be used to specify
> CPUs to be monitored.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
> 1 file changed, 120 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index a15642656066..1d41e996a994 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -56,6 +56,7 @@
> #include <poll.h>
> #include <pthread.h>
> #include <unistd.h>
> +#include <sys/syscall.h>
> #include <sched.h>
> #include <signal.h>
> #ifdef HAVE_EVENTFD_SUPPORT
> @@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
> perf_mmap__put(map);
> }
>
> +static void *record__thread(void *arg)
> +{
> + enum thread_msg msg = THREAD_MSG__READY;
> + bool terminate = false;
> + struct fdarray *pollfd;
> + int err, ctlfd_pos;
> +
> + thread = arg;
> + thread->tid = syscall(SYS_gettid);
> +
> + err = write(thread->comm.ack[1], &msg, sizeof(msg));
> + if (err == -1)
> + pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
> +
> + pollfd = &(thread->pollfd);

I don't think braces are necessary in here

jirka

2020-10-24 15:50:31

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming

On Wed, Oct 21, 2020 at 07:03:48PM +0300, Alexey Budankov wrote:
>
> Introduce thread local data object and its array to be used for
> threaded trace streaming.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/builtin-record.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index ba26d75c51d6..8e512096a060 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -85,11 +85,29 @@ struct switch_output {
> int cur_file;
> };
>
> +struct thread_data {
> + pid_t tid;
> + struct {
> + int msg[2];
> + int ack[2];
> + } comm;
> + struct fdarray pollfd;
> + int ctlfd_pos;
> + struct mmap *maps;
> + int nr_mmaps;
> + struct record *rec;
> + unsigned long long samples;
> + unsigned long waking;
> + u64 bytes_written;
> +};

please merge the struct with the code that's using it

jirka

> +
> struct record {
> struct perf_tool tool;
> struct record_opts opts;
> u64 bytes_written;
> struct perf_data data;
> + struct thread_data *thread_data;
> + int nr_thread_data;
> struct auxtrace_record *itr;
> struct evlist *evlist;
> struct perf_session *session;
> --
> 2.24.1
>

2020-10-24 15:50:51

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 04/15] perf session: move reader object definition to header file

On Wed, Oct 21, 2020 at 06:59:48PM +0300, Alexey Budankov wrote:
>
> Move definition of reader to session header file to be shared
> among different source files. Introduce reference to active
> reader object from session object.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/util/session.c | 27 ---------------------------
> tools/perf/util/session.h | 25 +++++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 27 deletions(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6f09d506b2f6..911b2dbcd0ac 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
> return 0;
> }
>
> -/*
> - * On 64bit we can mmap the data file in one go. No need for tiny mmap
> - * slices. On 32bit we use 32MB.
> - */
> -#if BITS_PER_LONG == 64
> -#define MMAP_SIZE ULLONG_MAX
> -#define NUM_MMAPS 1
> -#else
> -#define MMAP_SIZE (32 * 1024 * 1024ULL)
> -#define NUM_MMAPS 128
> -#endif
> -
> -struct reader;
> -
> -typedef s64 (*reader_cb_t)(struct perf_session *session,
> - union perf_event *event,
> - u64 file_offset,
> - const char *file_path);
> -
> -struct reader {
> - int fd;
> - const char *path;
> - u64 data_size;
> - u64 data_offset;
> - reader_cb_t process;
> -};
> -
> static int
> reader__process_events(struct reader *rd, struct perf_session *session,
> struct ui_progress *prog)
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 378ffc3e2809..abdb8518a81f 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -20,6 +20,30 @@ struct thread;
> struct auxtrace;
> struct itrace_synth_opts;
>
> +/*
> + * On 64bit we can mmap the data file in one go. No need for tiny mmap
> + * slices. On 32bit we use 32MB.
> + */
> +#if BITS_PER_LONG == 64
> +#define MMAP_SIZE ULLONG_MAX
> +#define NUM_MMAPS 1
> +#else
> +#define MMAP_SIZE (32 * 1024 * 1024ULL)
> +#define NUM_MMAPS 128
> +#endif
> +
> +typedef s64 (*reader_cb_t)(struct perf_session *session,
> + union perf_event *event,
> + u64 file_offset, const char *file_path);
> +
> +struct reader {
> + int fd;
> + const char *path;
> + u64 data_size;
> + u64 data_offset;
> + reader_cb_t process;
> +};

I wasn't able to find where is this used ouf of session.c ?


> +
> struct perf_session {
> struct perf_header header;
> struct machines machines;
> @@ -41,6 +65,7 @@ struct perf_session {
> struct zstd_data zstd_data;
> struct decomp *decomp;
> struct decomp *decomp_last;
> + struct reader *reader;

please define it in the patch where it's actualy used

thanks,
jirka

2020-10-24 17:57:04

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>
> Read trace files located in data directory into tool process memory.
> Basic analysis support of data directories is provided for report
> mode. Raw dump (-D) and aggregated reports are available for data
> directories, still with no memory consumption optimizations. However
> data directories collected with --compression-level option enabled
> can be analyzed with little less memory because trace files are
> unmaped from tool process memory after loading collected data.
> The implementation is based on the prototype [1], [2].
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Jiri Olsa <[email protected]>

very loosely ;-) so there was a reason for all that reader refactoring,
so we could have __perf_session__process_dir_events function:

https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

when reporting the threaded record data on really big servers,
you will run out of memory, so you need to read and flush all
the files together by smaller pieces

IMO we need to have this change before we allow threaded record

jirka


> Suggested-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
> tools/perf/util/session.h | 1 +
> 2 files changed, 49 insertions(+)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6afc670fdf0c..0752eec19813 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
> goto more;
>
> out:
> + if (rd->unmap_file) {
> + int i;
> +
> + for (i = 0; i < NUM_MMAPS; i++) {
> + if (mmaps[i]) {
> + munmap(mmaps[i], mmap_size);
> + mmaps[i] = NULL;
> + }
> + }
> + }
> +
> return err;
> }
>
> @@ -2231,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
> .data_offset = session->header.data_offset,
> .process = process_simple,
> .path = session->data->file.path,
> + .unmap_file = false,
> };
> struct ordered_events *oe = &session->ordered_events;
> struct perf_tool *tool = session->tool;
> @@ -2247,6 +2259,42 @@ static int __perf_session__process_events(struct perf_session *session)
> err = reader__process_events(&rd, session, &prog);
> if (err)
> goto out_err;
> +
> + if (perf_data__is_dir(session->data)) {
> + int i, nr = session->data->dir.nr;
> + struct reader file_rd[nr];
> + u64 total_size = perf_data__size(session->data);
> +
> + total_size -= session->data->file.size;
> + ui_progress__init_size(&prog, total_size, "Sorting events...");
> +
> + memset(&file_rd, 0, nr * sizeof(file_rd[0]));
> +
> + for (i = 0; i < nr ; i++) {
> + struct perf_data_file *file;
> +
> + file = &session->data->dir.files[i];
> + file_rd[i] = (struct reader) {
> + .fd = file->fd,
> + .path = file->path,
> + .data_size = file->size,
> + .data_offset = 0,
> + .process = process_simple,
> + };
> + file_rd[i].unmap_file = perf_header__has_feat(&session->header,
> + HEADER_COMPRESSED);
> + session->reader = &file_rd[i];
> +
> + if (zstd_init(&(file_rd[i].zstd_data), 0))
> + goto out_err;
> + err = reader__process_events(&file_rd[i], session, &prog);
> + zstd_fini(&(file_rd[i].zstd_data));
> + session->reader = NULL;
> + if (err)
> + goto out_err;
> + }
> + }
> +
> /* do the final flush for ordered samples */
> err = ordered_events__flush(oe, OE_FLUSH__FINAL);
> if (err)
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 4fc9ccdf7970..d428f3eaf7fd 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -43,6 +43,7 @@ struct reader {
> u64 data_offset;
> reader_cb_t process;
> struct zstd_data zstd_data;
> + bool unmap_file;
> };
>
> struct perf_session {
> --
> 2.24.1
>

2020-10-26 09:50:53

by Alexei Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 10/15] perf record: manage thread specific data array


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:04:26PM +0300, Alexey Budankov wrote:
>>
>> Provide allocation, initialization, finalization and releasing of
>> thread specific objects at thread specific data array. Allocate
>> thread specific object for every data buffer making one-to-one
>> relation between data buffer and a thread processing the buffer.
>> Deliver event fd related signals to thread's pollfd object.
>> Deliver thread control commands to ctlfd_pos fd of thread 1+.
>> Deliver tool external control commands via ctlfd_pos fd of thread 0.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
>> 1 file changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 8e512096a060..89cb8e913fb3 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
>> return kcore_copy(from_dir, kcore_dir);
>> }
>>
>> +static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
>> + struct fdarray *evlist_pollfd, int ctlfd_pos)
>> +{
>> + int i, j, k, nr_thread_data;
>> + struct thread_data *thread_data;
>> +
>> + rec->nr_thread_data = nr_thread_data = nr_mmaps;
>> + rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
>> + if (!thread_data) {
>> + pr_err("Failed to allocate thread data\n");
>> + return -ENOMEM;
>> + }
>> +
>> + for (i = 0; i < nr_thread_data; i++) {
>> + short revents;
>> + int pos, fd;
>> +
>> + thread_data[i].tid = -1;
>> +
>> + if (pipe(thread_data[i].comm.msg) ||
>> + pipe(thread_data[i].comm.ack)) {
>> + pr_err("Failed to create thread comm pipes, errno %d\n", errno);
>> + return -ENOMEM;
>> + }
>
> the original code was using state flag and pthread_cond,
> which I think is more readable
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=a7da527ff8be69572c6d17525c03c6fe394503c8
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=eb85ce4da64a885fdb6c77cfc5bd71312fe02e2a

Yes, right, but messaging scales better and that critical section
to just increment global counter in memory looked as over complication.

>
>> +
>> + thread_data[i].maps = &mmaps[i];
>> + thread_data[i].nr_mmaps = 1;
>> +
>> + thread_data[i].rec = rec;
>> +
>> + fdarray__init(&(thread_data[i].pollfd), 64);
>> +
>> + for (j = 0; j < thread_data[i].nr_mmaps; j++) {
>> + struct mmap *map = &(thread_data[i].maps[j]);
>> +
>> + for (k = 0; k < evlist_pollfd->nr; k++) {
>> + if (evlist_pollfd->priv[k].ptr != map)
>> + continue;
>> +
>> + fd = evlist_pollfd->entries[k].fd;
>> + revents = evlist_pollfd->entries[k].events;
>> + pos = fdarray__add(&(thread_data[i].pollfd),
>> + fd, revents | POLLERR | POLLHUP,
>> + fdarray_flag__default);
>> + if (pos >= 0)
>> + thread_data[i].pollfd.priv[pos].ptr = map;
>> + else
>> + return -ENOMEM;
>
> I added function for that:
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=8aa6e68a7471b9d25af1a9eebfa9321433366a17

Ok, but it is not clone operation it is more like ordinary adding of
fd into another pollfd object. It could be wrapped into a function
e.g. fdarray__dup(pollfd_dst, fd_src, fd_revents)

Alexei

2020-10-26 09:52:40

by Alexei Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:10:09PM +0300, Alexey Budankov wrote:
>>
>> Start threads in detached state because its management is possible
>> via messaging. Block signals prior the threads start so only main
>> tool thread would be notified on external async signals during data
>> collection. Streaming threads connect one-to-one to mapped data
>> buffers and write into per-CPU trace files located at data directory.
>> Data buffers and threads are affined to local NUMA nodes and monitored
>> CPUs according to system topology. --cpu option can be used to specify
>> CPUs to be monitored.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
>> 1 file changed, 120 insertions(+), 8 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index a15642656066..1d41e996a994 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -56,6 +56,7 @@
>> #include <poll.h>
>> #include <pthread.h>
>> #include <unistd.h>
>> +#include <sys/syscall.h>
>> #include <sched.h>
>> #include <signal.h>
>> #ifdef HAVE_EVENTFD_SUPPORT
>> @@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
>> perf_mmap__put(map);
>> }
>>
>> +static void *record__thread(void *arg)
>> +{
>> + enum thread_msg msg = THREAD_MSG__READY;
>> + bool terminate = false;
>> + struct fdarray *pollfd;
>> + int err, ctlfd_pos;
>> +
>> + thread = arg;
>> + thread->tid = syscall(SYS_gettid);
>> +
>> + err = write(thread->comm.ack[1], &msg, sizeof(msg));
>> + if (err == -1)
>> + pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
>> +
>> + pollfd = &(thread->pollfd);
>
> I don't think braces are necessary in here

Corrected in v3.

Alexei

2020-10-26 11:15:14

by Alexei Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming



On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:03:48PM +0300, Alexey Budankov wrote:
>>
>> Introduce thread local data object and its array to be used for
>> threaded trace streaming.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/builtin-record.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index ba26d75c51d6..8e512096a060 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -85,11 +85,29 @@ struct switch_output {
>> int cur_file;
>> };
>>
>> +struct thread_data {
>> + pid_t tid;
>> + struct {
>> + int msg[2];
>> + int ack[2];
>> + } comm;
>> + struct fdarray pollfd;
>> + int ctlfd_pos;
>> + struct mmap *maps;
>> + int nr_mmaps;
>> + struct record *rec;
>> + unsigned long long samples;
>> + unsigned long waking;
>> + u64 bytes_written;
>> +};
>
> please merge the struct with the code that's using it

Corrected in v3.

Thanks,
Alexei

>
> jirka
>
>> +
>> struct record {
>> struct perf_tool tool;
>> struct record_opts opts;
>> u64 bytes_written;
>> struct perf_data data;
>> + struct thread_data *thread_data;
>> + int nr_thread_data;
>> struct auxtrace_record *itr;
>> struct evlist *evlist;
>> struct perf_session *session;
>> --
>> 2.24.1
>>
>

2020-10-26 23:35:30

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 04/15] perf session: move reader object definition to header file


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 06:59:48PM +0300, Alexey Budankov wrote:
>>
>> Move definition of reader to session header file to be shared
>> among different source files. Introduce reference to active
>> reader object from session object.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/util/session.c | 27 ---------------------------
>> tools/perf/util/session.h | 25 +++++++++++++++++++++++++
>> 2 files changed, 25 insertions(+), 27 deletions(-)
>>
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 6f09d506b2f6..911b2dbcd0ac 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>> return 0;
>> }
>>
>> -/*
>> - * On 64bit we can mmap the data file in one go. No need for tiny mmap
>> - * slices. On 32bit we use 32MB.
>> - */
>> -#if BITS_PER_LONG == 64
>> -#define MMAP_SIZE ULLONG_MAX
>> -#define NUM_MMAPS 1
>> -#else
>> -#define MMAP_SIZE (32 * 1024 * 1024ULL)
>> -#define NUM_MMAPS 128
>> -#endif
>> -
>> -struct reader;
>> -
>> -typedef s64 (*reader_cb_t)(struct perf_session *session,
>> - union perf_event *event,
>> - u64 file_offset,
>> - const char *file_path);
>> -
>> -struct reader {
>> - int fd;
>> - const char *path;
>> - u64 data_size;
>> - u64 data_offset;
>> - reader_cb_t process;
>> -};
>> -
>> static int
>> reader__process_events(struct reader *rd, struct perf_session *session,
>> struct ui_progress *prog)
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index 378ffc3e2809..abdb8518a81f 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -20,6 +20,30 @@ struct thread;
>> struct auxtrace;
>> struct itrace_synth_opts;
>>
>> +/*
>> + * On 64bit we can mmap the data file in one go. No need for tiny mmap
>> + * slices. On 32bit we use 32MB.
>> + */
>> +#if BITS_PER_LONG == 64
>> +#define MMAP_SIZE ULLONG_MAX
>> +#define NUM_MMAPS 1
>> +#else
>> +#define MMAP_SIZE (32 * 1024 * 1024ULL)
>> +#define NUM_MMAPS 128
>> +#endif
>> +
>> +typedef s64 (*reader_cb_t)(struct perf_session *session,
>> + union perf_event *event,
>> + u64 file_offset, const char *file_path);
>> +
>> +struct reader {
>> + int fd;
>> + const char *path;
>> + u64 data_size;
>> + u64 data_offset;
>> + reader_cb_t process;
>> +};
>
> I wasn't able to find where is this used ouf of session.c ?

I will double check and adjust appropriately.

>
>
>> +
>> struct perf_session {
>> struct perf_header header;
>> struct machines machines;
>> @@ -41,6 +65,7 @@ struct perf_session {
>> struct zstd_data zstd_data;
>> struct decomp *decomp;
>> struct decomp *decomp_last;
>> + struct reader *reader;
>
> please define it in the patch where it's actualy used

Corrected in v3.

Alexei

>
> thanks,
> jirka
>

2020-10-26 23:46:57

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>
>> Changes in v2:
>> - explicitly added credit tags to patches 6/15 and 15/15,
>> additionally to cites [1], [2]
>> - updated description of 3/15 to explicitly mention the reason
>> to open data directories in read access mode (e.g. for perf report)
>> - implemented fix for compilation error of 2/15
>> - explicitly elaborated on found issues to be resolved for
>> threaded AUX trace capture
>>
>> v1: https://lore.kernel.org/lkml/[email protected]/
>>
>> Patch set provides threaded trace streaming for base perf record
>> operation. Provided streaming mode (--threads) mitigates profiling
>> data losses and resolves scalability issues of serial and asynchronous
>> (--aio) trace streaming modes on multicore server systems. The patch
>> set is based on the prototype [1], [2] and the most closely relates
>> to mode 3) "mode that creates thread for every monitored memory map".
>
> so what I liked about the previous code was that you could
> configure how the threads would be created
>
> default --threads options created thread for each cpu like
> in your change:
>
> $ perf record -v --threads ...
> ...
> thread 0 monitor: 0 allowed: 0
> thread 1 monitor: 1 allowed: 1
> thread 2 monitor: 2 allowed: 2
> thread 3 monitor: 3 allowed: 3
> thread 4 monitor: 4 allowed: 4
> thread 5 monitor: 5 allowed: 5
> thread 6 monitor: 6 allowed: 6
> thread 7 monitor: 7 allowed: 7

Yes, it is configurable in the prototype. Even though this patch set
doesn't implement that parameters for --thread option, just because
VTune doesn't have use cases for that yet, it has still been designed
and implemented with that possible extension in mind so it could then
be easily added on top of it.

>
>
> then numa based:
>
> $ perf record -v --threads=numa ...
> ...
> thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
> thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
>
>
> socket based:
>
> $ perf record -v --threads=socket ...
> ...
> thread 0 monitor: 0-7 allowed: 0-7
>
>
> core based:
>
> $ perf record -v --threads=core ...
> ...
> thread 0 monitor: 0,4 allowed: 0,4
> thread 1 monitor: 1,5 allowed: 1,5
> thread 2 monitor: 2,6 allowed: 2,6
> thread 3 monitor: 3,7 allowed: 3,7
>
>
> and user configurable:
>
> $ perf record -v --threads=0-3/0:4-7/4 ...
> ...
> threads: 0. monitor 0-3, allowed 0
> threads: 1. monitor 4-7, allowed 4
>
>
> so this way you could easily pin threads to cpu/core/socket/numa,
> or to some other cpu of your choice, because this will be always
> game of try and check where I'm not getting LOST events and not
> creating 1000 threads
>
> perf record: Add support for threads numa option value
> perf record: Add support for threads socket option value
> perf record: Add support for threads core option value
> perf record: Add support for threads user option value

Makes sense.

Alexei

2020-10-28 01:41:04

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>
>> Read trace files located in data directory into tool process memory.
>> Basic analysis support of data directories is provided for report
>> mode. Raw dump (-D) and aggregated reports are available for data
>> directories, still with no memory consumption optimizations. However
>> data directories collected with --compression-level option enabled
>> can be analyzed with little less memory because trace files are
>> unmaped from tool process memory after loading collected data.
>> The implementation is based on the prototype [1], [2].
>>
>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>> [2] https://lore.kernel.org/lkml/[email protected]/
>>
>> Suggested-by: Jiri Olsa <[email protected]>
>
> very loosely ;-) so there was a reason for all that reader refactoring,
> so we could have __perf_session__process_dir_events function:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

Nonetheless. All that are necessary parts to make threaded data streaming
and analysis eventually merged into the mainline as joint Perf developers
community effort.

>
> when reporting the threaded record data on really big servers,
> you will run out of memory, so you need to read and flush all
> the files together by smaller pieces

Yes, handling all that _big_ data after collection to make it
helpful for analysis of performance issues is the other part
of this story so that possible OOM should be somehow avoided.

>
> IMO we need to have this change before we allow threaded record

There are use cases of perf tool as a data provider, btw VTune is not
the only one of them, and for those use cases threaded trace streaming
lets its users get to their data that the users just were loosing before.
This is huge difference and whole new level of support for such users.
Post-process scripting around perf (e.g. Python based) will benefit
from threaded trace streaming. Pipe mode can be extended to stream into
open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
VTune-like tools can get performance data, load it into a (relational)
DB files and provide analysis. And all that uses perf tool at its core.

I agree perf report OOM issue can exist on really-big servers but data
directories support for report mode for not-so-big servers and desktops
is already enabled with this smaller change. Also really-big-servers
come with really-big amount of memory and collection could possibly be
limited to only interesting phases of execution so the issue could likely
be avoided. At the same time threaded trace streaming could clarify on
real use cases that are blocked by perf report OOM issue and that would
clarify on exact required solution. So perf report OOM issue shouldn't
be the showstopper for upstream of threaded trace streaming.

Alexei

2020-10-28 07:21:18

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation

On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
>
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>
> >> Changes in v2:
> >> - explicitly added credit tags to patches 6/15 and 15/15,
> >> additionally to cites [1], [2]
> >> - updated description of 3/15 to explicitly mention the reason
> >> to open data directories in read access mode (e.g. for perf report)
> >> - implemented fix for compilation error of 2/15
> >> - explicitly elaborated on found issues to be resolved for
> >> threaded AUX trace capture
> >>
> >> v1: https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Patch set provides threaded trace streaming for base perf record
> >> operation. Provided streaming mode (--threads) mitigates profiling
> >> data losses and resolves scalability issues of serial and asynchronous
> >> (--aio) trace streaming modes on multicore server systems. The patch
> >> set is based on the prototype [1], [2] and the most closely relates
> >> to mode 3) "mode that creates thread for every monitored memory map".
> >
> > so what I liked about the previous code was that you could
> > configure how the threads would be created
> >
> > default --threads options created thread for each cpu like
> > in your change:
> >
> > $ perf record -v --threads ...
> > ...
> > thread 0 monitor: 0 allowed: 0
> > thread 1 monitor: 1 allowed: 1
> > thread 2 monitor: 2 allowed: 2
> > thread 3 monitor: 3 allowed: 3
> > thread 4 monitor: 4 allowed: 4
> > thread 5 monitor: 5 allowed: 5
> > thread 6 monitor: 6 allowed: 6
> > thread 7 monitor: 7 allowed: 7
>
> Yes, it is configurable in the prototype. Even though this patch set
> doesn't implement that parameters for --thread option, just because
> VTune doesn't have use cases for that yet, it has still been designed
> and implemented with that possible extension in mind so it could then
> be easily added on top of it.

I'm not sure about vtune extensions, but if we are going to
have --threads option I believe we should make it configurable
at least to the extend descibed below

jirka

>
> >
> >
> > then numa based:
> >
> > $ perf record -v --threads=numa ...
> > ...
> > thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
> > thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
> >
> >
> > socket based:
> >
> > $ perf record -v --threads=socket ...
> > ...
> > thread 0 monitor: 0-7 allowed: 0-7
> >
> >
> > core based:
> >
> > $ perf record -v --threads=core ...
> > ...
> > thread 0 monitor: 0,4 allowed: 0,4
> > thread 1 monitor: 1,5 allowed: 1,5
> > thread 2 monitor: 2,6 allowed: 2,6
> > thread 3 monitor: 3,7 allowed: 3,7
> >
> >
> > and user configurable:
> >
> > $ perf record -v --threads=0-3/0:4-7/4 ...
> > ...
> > threads: 0. monitor 0-3, allowed 0
> > threads: 1. monitor 4-7, allowed 4
> >
> >
> > so this way you could easily pin threads to cpu/core/socket/numa,
> > or to some other cpu of your choice, because this will be always
> > game of try and check where I'm not getting LOST events and not
> > creating 1000 threads
> >
> > perf record: Add support for threads numa option value
> > perf record: Add support for threads socket option value
> > perf record: Add support for threads core option value
> > perf record: Add support for threads user option value
>
> Makes sense.
>
> Alexei
>

2020-10-28 07:31:22

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
>
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
> >>
> >> Read trace files located in data directory into tool process memory.
> >> Basic analysis support of data directories is provided for report
> >> mode. Raw dump (-D) and aggregated reports are available for data
> >> directories, still with no memory consumption optimizations. However
> >> data directories collected with --compression-level option enabled
> >> can be analyzed with little less memory because trace files are
> >> unmaped from tool process memory after loading collected data.
> >> The implementation is based on the prototype [1], [2].
> >>
> >> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> >> [2] https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Suggested-by: Jiri Olsa <[email protected]>
> >
> > very loosely ;-) so there was a reason for all that reader refactoring,
> > so we could have __perf_session__process_dir_events function:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
>
> Nonetheless. All that are necessary parts to make threaded data streaming
> and analysis eventually merged into the mainline as joint Perf developers
> community effort.
>
> >
> > when reporting the threaded record data on really big servers,
> > you will run out of memory, so you need to read and flush all
> > the files together by smaller pieces
>
> Yes, handling all that _big_ data after collection to make it
> helpful for analysis of performance issues is the other part
> of this story so that possible OOM should be somehow avoided.
>
> >
> > IMO we need to have this change before we allow threaded record
>
> There are use cases of perf tool as a data provider, btw VTune is not
> the only one of them, and for those use cases threaded trace streaming
> lets its users get to their data that the users just were loosing before.
> This is huge difference and whole new level of support for such users.
> Post-process scripting around perf (e.g. Python based) will benefit
> from threaded trace streaming. Pipe mode can be extended to stream into
> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
> VTune-like tools can get performance data, load it into a (relational)
> DB files and provide analysis. And all that uses perf tool at its core.
>
> I agree perf report OOM issue can exist on really-big servers but data
> directories support for report mode for not-so-big servers and desktops
> is already enabled with this smaller change. Also really-big-servers
> come with really-big amount of memory and collection could possibly be
> limited to only interesting phases of execution so the issue could likely
> be avoided. At the same time threaded trace streaming could clarify on
> real use cases that are blocked by perf report OOM issue and that would
> clarify on exact required solution. So perf report OOM issue shouldn't
> be the showstopper for upstream of threaded trace streaming.

so the short answer is no, right? ;-)

I understand all the excuses, but from my point of view we are
adding another pain point (and there's already few ;-) ) that
will make perf (even more) not user friendly

if we allow really friendly way to create huge data, we should
do our best to be able to process it as best as we can

jirka

2020-10-28 11:09:06

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory


On 27.10.2020 15:21, Jiri Olsa wrote:
> On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>>>
>>>> Read trace files located in data directory into tool process memory.
>>>> Basic analysis support of data directories is provided for report
>>>> mode. Raw dump (-D) and aggregated reports are available for data
>>>> directories, still with no memory consumption optimizations. However
>>>> data directories collected with --compression-level option enabled
>>>> can be analyzed with little less memory because trace files are
>>>> unmaped from tool process memory after loading collected data.
>>>> The implementation is based on the prototype [1], [2].
>>>>
>>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>>>> [2] https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Suggested-by: Jiri Olsa <[email protected]>
>>>
>>> very loosely ;-) so there was a reason for all that reader refactoring,
>>> so we could have __perf_session__process_dir_events function:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
>>
>> Nonetheless. All that are necessary parts to make threaded data streaming
>> and analysis eventually merged into the mainline as joint Perf developers
>> community effort.
>>
>>>
>>> when reporting the threaded record data on really big servers,
>>> you will run out of memory, so you need to read and flush all
>>> the files together by smaller pieces
>>
>> Yes, handling all that _big_ data after collection to make it
>> helpful for analysis of performance issues is the other part
>> of this story so that possible OOM should be somehow avoided.
>>
>>>
>>> IMO we need to have this change before we allow threaded record
>>
>> There are use cases of perf tool as a data provider, btw VTune is not
>> the only one of them, and for those use cases threaded trace streaming
>> lets its users get to their data that the users just were loosing before.
>> This is huge difference and whole new level of support for such users.
>> Post-process scripting around perf (e.g. Python based) will benefit
>> from threaded trace streaming. Pipe mode can be extended to stream into
>> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
>> VTune-like tools can get performance data, load it into a (relational)
>> DB files and provide analysis. And all that uses perf tool at its core.
>>
>> I agree perf report OOM issue can exist on really-big servers but data
>> directories support for report mode for not-so-big servers and desktops
>> is already enabled with this smaller change. Also really-big-servers
>> come with really-big amount of memory and collection could possibly be
>> limited to only interesting phases of execution so the issue could likely
>> be avoided. At the same time threaded trace streaming could clarify on
>> real use cases that are blocked by perf report OOM issue and that would
>> clarify on exact required solution. So perf report OOM issue shouldn't
>> be the showstopper for upstream of threaded trace streaming.
>
> so the short answer is no, right? ;-)

Answer to what question? Resolve OOM in perf report for data directories?
I don't see a simple solution for that. The next issue after OOM is resolved
is a very long processing of data directories. And again there is no simple
solution for that as well. But it still need progress in order to be resolved
eventually.

>
> I understand all the excuses, but from my point of view we are
> adding another pain point (and there's already few ;-) ) that
> will make perf (even more) not user friendly

I would not name it a paint point but instead a growth opportunity.
Now --threads can't be and is not enabled by default. When a user
asks --threads the tool can print warning in advance about lots of
data and possible perf report OOM limitation so confusion and
disappointment for users of perf report can be avoided in advance.

>
> if we allow really friendly way to create huge data, we should
> do our best to be able to process it as best as we can

It is just little to no more friendly as it is already now.
Everyone can grab patches apply and get threaded streaming.
Inclusion into mainline will standardize solution to build
and evolve upon and this is necessary step towards complete
support of data directories in perf tool suite.

Alexei

>
> jirka
>

2020-10-28 11:34:48

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory


On 27.10.2020 15:21, Jiri Olsa wrote:
> On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>>>
>>>> Read trace files located in data directory into tool process memory.
>>>> Basic analysis support of data directories is provided for report
>>>> mode. Raw dump (-D) and aggregated reports are available for data
>>>> directories, still with no memory consumption optimizations. However
>>>> data directories collected with --compression-level option enabled
>>>> can be analyzed with little less memory because trace files are
>>>> unmaped from tool process memory after loading collected data.
>>>> The implementation is based on the prototype [1], [2].
>>>>
>>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>>>> [2] https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Suggested-by: Jiri Olsa <[email protected]>
>>>
>>> very loosely ;-) so there was a reason for all that reader refactoring,
>>> so we could have __perf_session__process_dir_events function:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
>>
>> Nonetheless. All that are necessary parts to make threaded data streaming
>> and analysis eventually merged into the mainline as joint Perf developers
>> community effort.
>>
>>>
>>> when reporting the threaded record data on really big servers,
>>> you will run out of memory, so you need to read and flush all
>>> the files together by smaller pieces
>>
>> Yes, handling all that _big_ data after collection to make it
>> helpful for analysis of performance issues is the other part
>> of this story so that possible OOM should be somehow avoided.
>>
>>>
>>> IMO we need to have this change before we allow threaded record
>>
>> There are use cases of perf tool as a data provider, btw VTune is not
>> the only one of them, and for those use cases threaded trace streaming
>> lets its users get to their data that the users just were loosing before.
>> This is huge difference and whole new level of support for such users.
>> Post-process scripting around perf (e.g. Python based) will benefit
>> from threaded trace streaming. Pipe mode can be extended to stream into
>> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
>> VTune-like tools can get performance data, load it into a (relational)
>> DB files and provide analysis. And all that uses perf tool at its core.
>>
>> I agree perf report OOM issue can exist on really-big servers but data
>> directories support for report mode for not-so-big servers and desktops
>> is already enabled with this smaller change. Also really-big-servers
>> come with really-big amount of memory and collection could possibly be
>> limited to only interesting phases of execution so the issue could likely
>> be avoided. At the same time threaded trace streaming could clarify on
>> real use cases that are blocked by perf report OOM issue and that would
>> clarify on exact required solution. So perf report OOM issue shouldn't
>> be the showstopper for upstream of threaded trace streaming.
>
> so the short answer is no, right? ;-)
>
> I understand all the excuses, but from my point of view we are
> adding another pain point (and there's already few ;-) ) that

BTW what are those a few pain points that are 'unfriendly' in perf?
Possibly users could be warned about the points in advance to avoid
confusion and disappointment by the fact.

Alexei

> will make perf (even more) not user friendly
>
> if we allow really friendly way to create huge data, we should
> do our best to be able to process it as best as we can
>
> jirka
>

2020-10-28 14:59:15

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation


On 27.10.2020 15:10, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>>>
>>>> Changes in v2:
>>>> - explicitly added credit tags to patches 6/15 and 15/15,
>>>> additionally to cites [1], [2]
>>>> - updated description of 3/15 to explicitly mention the reason
>>>> to open data directories in read access mode (e.g. for perf report)
>>>> - implemented fix for compilation error of 2/15
>>>> - explicitly elaborated on found issues to be resolved for
>>>> threaded AUX trace capture
>>>>
>>>> v1: https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Patch set provides threaded trace streaming for base perf record
>>>> operation. Provided streaming mode (--threads) mitigates profiling
>>>> data losses and resolves scalability issues of serial and asynchronous
>>>> (--aio) trace streaming modes on multicore server systems. The patch
>>>> set is based on the prototype [1], [2] and the most closely relates
>>>> to mode 3) "mode that creates thread for every monitored memory map".
>>>
>>> so what I liked about the previous code was that you could
>>> configure how the threads would be created
>>>
>>> default --threads options created thread for each cpu like
>>> in your change:
>>>
>>> $ perf record -v --threads ...
>>> ...
>>> thread 0 monitor: 0 allowed: 0
>>> thread 1 monitor: 1 allowed: 1
>>> thread 2 monitor: 2 allowed: 2
>>> thread 3 monitor: 3 allowed: 3
>>> thread 4 monitor: 4 allowed: 4
>>> thread 5 monitor: 5 allowed: 5
>>> thread 6 monitor: 6 allowed: 6
>>> thread 7 monitor: 7 allowed: 7
>>
>> Yes, it is configurable in the prototype. Even though this patch set
>> doesn't implement that parameters for --thread option, just because
>> VTune doesn't have use cases for that yet, it has still been designed
>> and implemented with that possible extension in mind so it could then
>> be easily added on top of it.
>
> I'm not sure about vtune extensions, but if we are going to
> have --threads option I believe we should make it configurable
> at least to the extend descibed below

It employs --threads mode only and there are no use cases
observed so far beyond this mode. Do you have or see such
use cases?

Alexei

>
> jirka
>
>>
>>>
>>>
>>> then numa based:
>>>
>>> $ perf record -v --threads=numa ...
>>> ...
>>> thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
>>> thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
>>>
>>>
>>> socket based:
>>>
>>> $ perf record -v --threads=socket ...
>>> ...
>>> thread 0 monitor: 0-7 allowed: 0-7
>>>
>>>
>>> core based:
>>>
>>> $ perf record -v --threads=core ...
>>> ...
>>> thread 0 monitor: 0,4 allowed: 0,4
>>> thread 1 monitor: 1,5 allowed: 1,5
>>> thread 2 monitor: 2,6 allowed: 2,6
>>> thread 3 monitor: 3,7 allowed: 3,7
>>>
>>>
>>> and user configurable:
>>>
>>> $ perf record -v --threads=0-3/0:4-7/4 ...
>>> ...
>>> threads: 0. monitor 0-3, allowed 0
>>> threads: 1. monitor 4-7, allowed 4
>>>
>>>
>>> so this way you could easily pin threads to cpu/core/socket/numa,
>>> or to some other cpu of your choice, because this will be always
>>> game of try and check where I'm not getting LOST events and not
>>> creating 1000 threads
>>>
>>> perf record: Add support for threads numa option value
>>> perf record: Add support for threads socket option value
>>> perf record: Add support for threads core option value
>>> perf record: Add support for threads user option value
>>
>> Makes sense.
>>
>> Alexei
>>
>

2020-10-28 17:38:15

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation


On 27.10.2020 15:10, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>>>
>>>> Changes in v2:
>>>> - explicitly added credit tags to patches 6/15 and 15/15,
>>>> additionally to cites [1], [2]
>>>> - updated description of 3/15 to explicitly mention the reason
>>>> to open data directories in read access mode (e.g. for perf report)
>>>> - implemented fix for compilation error of 2/15
>>>> - explicitly elaborated on found issues to be resolved for
>>>> threaded AUX trace capture
>>>>
>>>> v1: https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Patch set provides threaded trace streaming for base perf record
>>>> operation. Provided streaming mode (--threads) mitigates profiling
>>>> data losses and resolves scalability issues of serial and asynchronous
>>>> (--aio) trace streaming modes on multicore server systems. The patch
>>>> set is based on the prototype [1], [2] and the most closely relates
>>>> to mode 3) "mode that creates thread for every monitored memory map".
>>>
>>> so what I liked about the previous code was that you could
>>> configure how the threads would be created
>>>
>>> default --threads options created thread for each cpu like
>>> in your change:
>>>
>>> $ perf record -v --threads ...
>>> ...
>>> thread 0 monitor: 0 allowed: 0
>>> thread 1 monitor: 1 allowed: 1
>>> thread 2 monitor: 2 allowed: 2
>>> thread 3 monitor: 3 allowed: 3
>>> thread 4 monitor: 4 allowed: 4
>>> thread 5 monitor: 5 allowed: 5
>>> thread 6 monitor: 6 allowed: 6
>>> thread 7 monitor: 7 allowed: 7
>>
>> Yes, it is configurable in the prototype. Even though this patch set
>> doesn't implement that parameters for --thread option, just because
>> VTune doesn't have use cases for that yet, it has still been designed
>> and implemented with that possible extension in mind so it could then
>> be easily added on top of it.
>
> I'm not sure about vtune extensions, but if we are going to
> have --threads option I believe we should make it configurable
> at least to the extend descibed below

vtune employs --threads mode only and there are no use cases
observed so far beyond this mode. Do you have such use cases?

Alexei

>
> jirka
>
>>
>>>
>>>
>>> then numa based:
>>>
>>> $ perf record -v --threads=numa ...
>>> ...
>>> thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
>>> thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
>>>
>>>
>>> socket based:
>>>
>>> $ perf record -v --threads=socket ...
>>> ...
>>> thread 0 monitor: 0-7 allowed: 0-7
>>>
>>>
>>> core based:
>>>
>>> $ perf record -v --threads=core ...
>>> ...
>>> thread 0 monitor: 0,4 allowed: 0,4
>>> thread 1 monitor: 1,5 allowed: 1,5
>>> thread 2 monitor: 2,6 allowed: 2,6
>>> thread 3 monitor: 3,7 allowed: 3,7
>>>
>>>
>>> and user configurable:
>>>
>>> $ perf record -v --threads=0-3/0:4-7/4 ...
>>> ...
>>> threads: 0. monitor 0-3, allowed 0
>>> threads: 1. monitor 4-7, allowed 4
>>>
>>>
>>> so this way you could easily pin threads to cpu/core/socket/numa,
>>> or to some other cpu of your choice, because this will be always
>>> game of try and check where I'm not getting LOST events and not
>>> creating 1000 threads
>>>
>>> perf record: Add support for threads numa option value
>>> perf record: Add support for threads socket option value
>>> perf record: Add support for threads core option value
>>> perf record: Add support for threads user option value
>>
>> Makes sense.
>>
>> Alexei
>>
>

2020-10-28 23:28:49

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
<[email protected]> wrote:
>
>
> On 27.10.2020 15:21, Jiri Olsa wrote:
> > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> >> I agree perf report OOM issue can exist on really-big servers but data
> >> directories support for report mode for not-so-big servers and desktops
> >> is already enabled with this smaller change. Also really-big-servers
> >> come with really-big amount of memory and collection could possibly be
> >> limited to only interesting phases of execution so the issue could likely
> >> be avoided. At the same time threaded trace streaming could clarify on
> >> real use cases that are blocked by perf report OOM issue and that would
> >> clarify on exact required solution. So perf report OOM issue shouldn't
> >> be the showstopper for upstream of threaded trace streaming.
> >
> > so the short answer is no, right? ;-)
>
> Answer to what question? Resolve OOM in perf report for data directories?
> I don't see a simple solution for that. The next issue after OOM is resolved
> is a very long processing of data directories. And again there is no simple
> solution for that as well. But it still need progress in order to be resolved
> eventually.

I think we should find a better way than just adding all events to the
ordered events queue in memory then processing them one by one.

Separating tracking events (FORK/MMAP/...) might be the first step.

Thanks
Namhyung

2020-10-29 00:53:53

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation

On Wed, Oct 28, 2020 at 10:35:08AM +0300, Alexey Budankov wrote:
>
> On 28.10.2020 10:08, Namhyung Kim wrote:
> > Hi,
> >
> > On Wed, Oct 28, 2020 at 1:02 AM Alexey Budankov
> > <[email protected]> wrote:
> >>
> >>
> >> On 27.10.2020 15:10, Jiri Olsa wrote:
> >>> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>>>>>
> >>>>>> Changes in v2:
> >>>>>> - explicitly added credit tags to patches 6/15 and 15/15,
> >>>>>> additionally to cites [1], [2]
> >>>>>> - updated description of 3/15 to explicitly mention the reason
> >>>>>> to open data directories in read access mode (e.g. for perf report)
> >>>>>> - implemented fix for compilation error of 2/15
> >>>>>> - explicitly elaborated on found issues to be resolved for
> >>>>>> threaded AUX trace capture
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/lkml/[email protected]/
> >>>>>>
> >>>>>> Patch set provides threaded trace streaming for base perf record
> >>>>>> operation. Provided streaming mode (--threads) mitigates profiling
> >>>>>> data losses and resolves scalability issues of serial and asynchronous
> >>>>>> (--aio) trace streaming modes on multicore server systems. The patch
> >>>>>> set is based on the prototype [1], [2] and the most closely relates
> >>>>>> to mode 3) "mode that creates thread for every monitored memory map".
> >>>>>
> >>>>> so what I liked about the previous code was that you could
> >>>>> configure how the threads would be created
> >>>>>
> >>>>> default --threads options created thread for each cpu like
> >>>>> in your change:
> >>>>>
> >>>>> $ perf record -v --threads ...
> >>>>> ...
> >>>>> thread 0 monitor: 0 allowed: 0
> >>>>> thread 1 monitor: 1 allowed: 1
> >>>>> thread 2 monitor: 2 allowed: 2
> >>>>> thread 3 monitor: 3 allowed: 3
> >>>>> thread 4 monitor: 4 allowed: 4
> >>>>> thread 5 monitor: 5 allowed: 5
> >>>>> thread 6 monitor: 6 allowed: 6
> >>>>> thread 7 monitor: 7 allowed: 7
> >>>>
> >>>> Yes, it is configurable in the prototype. Even though this patch set
> >>>> doesn't implement that parameters for --thread option, just because
> >>>> VTune doesn't have use cases for that yet, it has still been designed
> >>>> and implemented with that possible extension in mind so it could then
> >>>> be easily added on top of it.
> >>>
> >>> I'm not sure about vtune extensions, but if we are going to
> >>> have --threads option I believe we should make it configurable
> >>> at least to the extend descibed below
> >>
> >> It employs --threads mode only and there are no use cases
> >> observed so far beyond this mode. Do you have or see such
> >> use cases?
> >
> > I don't know about vtune and other users, but it's an important
> > feature for better performance so I agree with Jiri's opinion to
> > make it flexible for the system requirement.
>
> For sure, vtune is not the only one for this threaded streaming
> and it should be well suited for perf tool use cases equally.
> And for perf it would be beneficial to document some examples in
> perf-record.txt as a part of this configuration implementation.
> I am not just aware of such examples and that is why I am asking
> you guys.

I saw no LOST events on big servers for some tests with --threads=numa
option, so there was no reason to spawn 200+ threads

jirka

2020-10-29 00:54:30

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

On Wed, Oct 28, 2020 at 04:22:49PM +0900, Namhyung Kim wrote:
> On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
> <[email protected]> wrote:
> >
> >
> > On 27.10.2020 15:21, Jiri Olsa wrote:
> > > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> > >> I agree perf report OOM issue can exist on really-big servers but data
> > >> directories support for report mode for not-so-big servers and desktops
> > >> is already enabled with this smaller change. Also really-big-servers
> > >> come with really-big amount of memory and collection could possibly be
> > >> limited to only interesting phases of execution so the issue could likely
> > >> be avoided. At the same time threaded trace streaming could clarify on
> > >> real use cases that are blocked by perf report OOM issue and that would
> > >> clarify on exact required solution. So perf report OOM issue shouldn't
> > >> be the showstopper for upstream of threaded trace streaming.
> > >
> > > so the short answer is no, right? ;-)
> >
> > Answer to what question? Resolve OOM in perf report for data directories?
> > I don't see a simple solution for that. The next issue after OOM is resolved
> > is a very long processing of data directories. And again there is no simple
> > solution for that as well. But it still need progress in order to be resolved
> > eventually.
>
> I think we should find a better way than just adding all events to the
> ordered events queue in memory then processing them one by one.
>
> Separating tracking events (FORK/MMAP/...) might be the first step.

I recall seeing this change before for threaded perf report,
maybe even from you, right? ;-)

jirka

>
> Thanks
> Namhyung
>

2020-10-29 07:46:54

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation

Hi,

On Wed, Oct 28, 2020 at 1:02 AM Alexey Budankov
<[email protected]> wrote:
>
>
> On 27.10.2020 15:10, Jiri Olsa wrote:
> > On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
> >>
> >> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> Changes in v2:
> >>>> - explicitly added credit tags to patches 6/15 and 15/15,
> >>>> additionally to cites [1], [2]
> >>>> - updated description of 3/15 to explicitly mention the reason
> >>>> to open data directories in read access mode (e.g. for perf report)
> >>>> - implemented fix for compilation error of 2/15
> >>>> - explicitly elaborated on found issues to be resolved for
> >>>> threaded AUX trace capture
> >>>>
> >>>> v1: https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>> Patch set provides threaded trace streaming for base perf record
> >>>> operation. Provided streaming mode (--threads) mitigates profiling
> >>>> data losses and resolves scalability issues of serial and asynchronous
> >>>> (--aio) trace streaming modes on multicore server systems. The patch
> >>>> set is based on the prototype [1], [2] and the most closely relates
> >>>> to mode 3) "mode that creates thread for every monitored memory map".
> >>>
> >>> so what I liked about the previous code was that you could
> >>> configure how the threads would be created
> >>>
> >>> default --threads options created thread for each cpu like
> >>> in your change:
> >>>
> >>> $ perf record -v --threads ...
> >>> ...
> >>> thread 0 monitor: 0 allowed: 0
> >>> thread 1 monitor: 1 allowed: 1
> >>> thread 2 monitor: 2 allowed: 2
> >>> thread 3 monitor: 3 allowed: 3
> >>> thread 4 monitor: 4 allowed: 4
> >>> thread 5 monitor: 5 allowed: 5
> >>> thread 6 monitor: 6 allowed: 6
> >>> thread 7 monitor: 7 allowed: 7
> >>
> >> Yes, it is configurable in the prototype. Even though this patch set
> >> doesn't implement that parameters for --thread option, just because
> >> VTune doesn't have use cases for that yet, it has still been designed
> >> and implemented with that possible extension in mind so it could then
> >> be easily added on top of it.
> >
> > I'm not sure about vtune extensions, but if we are going to
> > have --threads option I believe we should make it configurable
> > at least to the extend descibed below
>
> It employs --threads mode only and there are no use cases
> observed so far beyond this mode. Do you have or see such
> use cases?

I don't know about vtune and other users, but it's an important
feature for better performance so I agree with Jiri's opinion to
make it flexible for the system requirement.

Thanks
Namhyung

2020-10-29 08:54:09

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

On Tue, Oct 27, 2020 at 05:43:20PM +0300, Alexey Budankov wrote:
>
> On 27.10.2020 15:21, Jiri Olsa wrote:
> > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> >>
> >> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> Read trace files located in data directory into tool process memory.
> >>>> Basic analysis support of data directories is provided for report
> >>>> mode. Raw dump (-D) and aggregated reports are available for data
> >>>> directories, still with no memory consumption optimizations. However
> >>>> data directories collected with --compression-level option enabled
> >>>> can be analyzed with little less memory because trace files are
> >>>> unmaped from tool process memory after loading collected data.
> >>>> The implementation is based on the prototype [1], [2].
> >>>>
> >>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> >>>> [2] https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>> Suggested-by: Jiri Olsa <[email protected]>
> >>>
> >>> very loosely ;-) so there was a reason for all that reader refactoring,
> >>> so we could have __perf_session__process_dir_events function:
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
> >>
> >> Nonetheless. All that are necessary parts to make threaded data streaming
> >> and analysis eventually merged into the mainline as joint Perf developers
> >> community effort.
> >>
> >>>
> >>> when reporting the threaded record data on really big servers,
> >>> you will run out of memory, so you need to read and flush all
> >>> the files together by smaller pieces
> >>
> >> Yes, handling all that _big_ data after collection to make it
> >> helpful for analysis of performance issues is the other part
> >> of this story so that possible OOM should be somehow avoided.
> >>
> >>>
> >>> IMO we need to have this change before we allow threaded record
> >>
> >> There are use cases of perf tool as a data provider, btw VTune is not
> >> the only one of them, and for those use cases threaded trace streaming
> >> lets its users get to their data that the users just were loosing before.
> >> This is huge difference and whole new level of support for such users.
> >> Post-process scripting around perf (e.g. Python based) will benefit
> >> from threaded trace streaming. Pipe mode can be extended to stream into
> >> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
> >> VTune-like tools can get performance data, load it into a (relational)
> >> DB files and provide analysis. And all that uses perf tool at its core.
> >>
> >> I agree perf report OOM issue can exist on really-big servers but data
> >> directories support for report mode for not-so-big servers and desktops
> >> is already enabled with this smaller change. Also really-big-servers
> >> come with really-big amount of memory and collection could possibly be
> >> limited to only interesting phases of execution so the issue could likely
> >> be avoided. At the same time threaded trace streaming could clarify on
> >> real use cases that are blocked by perf report OOM issue and that would
> >> clarify on exact required solution. So perf report OOM issue shouldn't
> >> be the showstopper for upstream of threaded trace streaming.
> >
> > so the short answer is no, right? ;-)
>
> Answer to what question? Resolve OOM in perf report for data directories?
> I don't see a simple solution for that. The next issue after OOM is resolved
> is a very long processing of data directories. And again there is no simple
> solution for that as well. But it still need progress in order to be resolved
> eventually.

it's right here:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

jirka

>
> >
> > I understand all the excuses, but from my point of view we are
> > adding another pain point (and there's already few ;-) ) that
> > will make perf (even more) not user friendly
>
> I would not name it a paint point but instead a growth opportunity.
> Now --threads can't be and is not enabled by default. When a user
> asks --threads the tool can print warning in advance about lots of
> data and possible perf report OOM limitation so confusion and
> disappointment for users of perf report can be avoided in advance.
>
> >
> > if we allow really friendly way to create huge data, we should
> > do our best to be able to process it as best as we can
>
> It is just little to no more friendly as it is already now.
> Everyone can grab patches apply and get threaded streaming.
> Inclusion into mainline will standardize solution to build
> and evolve upon and this is necessary step towards complete
> support of data directories in perf tool suite.
>
> Alexei
>
> >
> > jirka
> >
>

2020-10-29 11:15:01

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] perf session: load data directory into tool process memory

Hi Jiri,

On Thu, Oct 29, 2020 at 12:40 AM Jiri Olsa <[email protected]> wrote:
>
> On Wed, Oct 28, 2020 at 04:22:49PM +0900, Namhyung Kim wrote:
> > On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
> > <[email protected]> wrote:
> > >
> > >
> > > On 27.10.2020 15:21, Jiri Olsa wrote:
> > > > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> > > >> I agree perf report OOM issue can exist on really-big servers but data
> > > >> directories support for report mode for not-so-big servers and desktops
> > > >> is already enabled with this smaller change. Also really-big-servers
> > > >> come with really-big amount of memory and collection could possibly be
> > > >> limited to only interesting phases of execution so the issue could likely
> > > >> be avoided. At the same time threaded trace streaming could clarify on
> > > >> real use cases that are blocked by perf report OOM issue and that would
> > > >> clarify on exact required solution. So perf report OOM issue shouldn't
> > > >> be the showstopper for upstream of threaded trace streaming.
> > > >
> > > > so the short answer is no, right? ;-)
> > >
> > > Answer to what question? Resolve OOM in perf report for data directories?
> > > I don't see a simple solution for that. The next issue after OOM is resolved
> > > is a very long processing of data directories. And again there is no simple
> > > solution for that as well. But it still need progress in order to be resolved
> > > eventually.
> >
> > I think we should find a better way than just adding all events to the
> > ordered events queue in memory then processing them one by one.
> >
> > Separating tracking events (FORK/MMAP/...) might be the first step.
>
> I recall seeing this change before for threaded perf report,
> maybe even from you, right? ;-)

Yes, I did it a couple of years ago. The last version is here:

https://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git/log/?h=perf/threaded-v5

Thanks
Namhyung