The patch set implements runtime trace compression (-z option) in
record mode and trace auto decompression in report and inject modes.
Streaming Zstd API [1] is used for compression and decompression of
data that come from kernel mmaped data buffers.
Usage of implemented -z,--compression_level=n option provides ~3-5x
avg. trace file size reduction on variety of tested workloads what
saves storage space on larger server systems where trace file size
can easily reach several tens or even hundreds of GiBs, especially
when profiling with dwarf-based stacks and tracing of context switches.
Default option value is 1 (fastest compression).
Implemented --mmap-flush option can be used to specify minimal size
of data chunk that is extracted from mmaped kernel buffer to store
into a trace. The option is independent from -z setting and doesn't
vary with compression level. The default option value is 1 byte what
means every time trace writing thread finds some new data in the
mmaped buffer the data is extracted, possibly compressed and written
to a trace. The option serves two purposes the first one is to increase
the compression ratio of trace data and the second one is to avoid
live-lock self tool process monitoring in system wide (-a) profiling
mode. Profiling in system wide mode with compression (-a -z) can
additionally induce data into the kernel buffers along with the data
from monitored processes. If performance data rate and volume from
the monitored processes is high then trace streaming and compression
activity in the tool is also high. It can lead to subtle live-lock
effect of endless activity when compression of single new byte from
some of mmaped kernel buffer induces the next single byte at some
mmaped buffer. So perf tool thread never stops on polling event file
descriptors. Varying data chunk size to be extracted from mmap buffers
allows avoiding live-locking self monitoring in system wide mode and
makes mmap buffers polling loop manageable. Possible usage examples:
$ tools/perf/perf record -z -e cycles -- matrix.gcc
$ tools/perf/perf record --aio -z -e cycles -- matrix.gcc
$ tools/perf/perf record -z --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio -z --mmap-flush 1K -e cycles -- matrix.gcc
Runtime compression overhead has been measured for serial and AIO
trace writing modes when profiling matrix multiplication workload:
-------------------------------------------------------------
| SERIAL | AIO-1 |
----|-----------------------------|-----------------------------|
|-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
|---|--------|--------------------|--------|--------------------|
| 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
| 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
| 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
| 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
| 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
| 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
-----------------------------------------------------------------
OVH = (Execution time with -z N) / (Execution time with -z 0)
ratio - compression ratio
size - number of bytes that was compressed
size ~= trace file x ratio
See complete description of measurement conditions with details below.
Introduced compression functionality can be disabled or configured from
the command line using NO_LIBZSTD and LIBZSTD_DIR defines:
$ make -C tools/perf NO_LIBZSTD=1 clean all
$ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
If your system has some version of the zstd package preinstalled then
the build system finds and uses it during the build. Auto detection
feature status is reported just before compilation starts, as usual.
If you still prefer to compile with some other version of zstd you have
capability to refer the compilation to that version using LIBZSTD_DIR
define.
See 'perf test' results below for enabled and disabled (NO_LIBZSTD=1)
feature configurations.
---
Alexey Budankov (12):
feature: implement libzstd check, LIBZSTD_DIR and NO_LIBZSTD defines
perf record: implement --mmap-flush=<number> option
perf session: define bytes_transferred and bytes_compressed metrics
perf record: implement COMPRESSED event record and its attributes
perf mmap: implement dedicated memory buffer for data compression
perf util: introduce Zstd streaming based compression API
perf record: implement compression for serial trace streaming
perf record: implement compression for AIO trace streaming
perf record: implement -z,--compression_level[=<n>] option
perf report: implement record trace decompression
perf inject: enable COMPRESSED records decompression
perf tests: implement Zstd comp/decomp integration test
tools/build/Makefile.feature | 6 +-
tools/build/feature/Makefile | 6 +-
tools/build/feature/test-all.c | 5 +
tools/build/feature/test-libzstd.c | 12 +
tools/perf/Documentation/perf-record.txt | 17 ++
.../Documentation/perf.data-file-format.txt | 24 ++
tools/perf/Makefile.config | 20 ++
tools/perf/Makefile.perf | 3 +
tools/perf/builtin-inject.c | 4 +
tools/perf/builtin-record.c | 285 +++++++++++++++---
tools/perf/builtin-report.c | 5 +-
tools/perf/builtin-version.c | 2 +
tools/perf/perf.h | 2 +
.../tests/shell/record+zstd_comp_decomp.sh | 35 +++
tools/perf/util/Build | 2 +
tools/perf/util/compress.h | 54 ++++
tools/perf/util/env.h | 11 +
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 7 +
tools/perf/util/evlist.c | 8 +-
tools/perf/util/evlist.h | 3 +-
tools/perf/util/header.c | 55 +++-
tools/perf/util/header.h | 1 +
tools/perf/util/mmap.c | 106 ++-----
tools/perf/util/mmap.h | 17 +-
tools/perf/util/session.c | 129 +++++++-
tools/perf/util/session.h | 14 +
tools/perf/util/tool.h | 2 +
tools/perf/util/zstd.c | 111 +++++++
29 files changed, 813 insertions(+), 134 deletions(-)
create mode 100644 tools/build/feature/test-libzstd.c
create mode 100755 tools/perf/tests/shell/record+zstd_comp_decomp.sh
create mode 100644 tools/perf/util/zstd.c
---
Changes in v10:
- separated decomp list deallocation into perf_session__release_decomp_events
- extended the test with suggested decompression validation
Changes in v9:
- fixed issue with improper max COMPRESSED record size calculation
- moved up calculation of ratio variable in 03/12
- made minor corrections in changelogs
- corrected several checkpatch.pl warnings and errors
Changes in v8:
- avoid using -f for --mmap-flush option
- move stubs to compress.h and avoid unconditional compiling of zstd.c
- fixed silent interruption for perf record collection
- implemented -z 1 as default
Changes in v7:
- rebased to Arnaldo's perf/core tip
- implemented B/K/M/G suffixes for -f option
- reworked record__mmap_read_evlist() to replace perf_mmap__aio_push()
by perf_mmap__push() in AIO case
- extended "[ perf record: Captured ... ]" message with compression statistics
- extended changelog for v5 06/10
- used PERF_SAMPLE_MAX_SIZE for compressed record size calculations
- renamed record__zstd_compress to zstd_compress and
record__process_comp_header to process_comp_header
- separated nr_cblocks_max applying
Changes in v6:
- extended docs with description of PERF_RECORD_COMPRESSED record and
HEADER_COMPRESSED feature layouts
Changes in v5:
- implemented perf version --build-options extension for aio and zstd - see TESTING below
- adjusted commit message and perf-record.txt content for -f option
- fixed build errors in case of NO_AIO=1 and NO_LIBZSTD=1
Changes in v4:
- implemented integration tests
- adjusted zstd_ stub functions
- rebased on tip of Arnaldo's perf/core
Changes in v3:
- moved -f,--mmap-flush option implementation into a separate patch
- moved definition and printing of bytes_transferred and bytes_compressed into a separate patch
- moved COMPRESSED feature into a separate patch
- added versioning and stored COMPRESSED feature attributes as u32
- implemented dedicated memory buffer for compression in case of serial streaming
- moved low level Zstd based compression functions into util/{compress.h,zstd.c}
- made compress function to be a param of __push(), __aio_push() functions
- enabled perf inject to decompress COMPRESSED records
- measured compression overhead for serial and AIO streaming using
basic matrix multiplication workload on 8 core skylake
Changes in v2:
- moved compression/decompression code to session layer
- enabled allocation aio data buffers for compression
- enabled trace compression for serial trace streaming
---
[1] https://github.com/facebook/zstd
---
OVERHEAD MEASUREMENTS:
uname -a
Linux localhost 4.20.7-200.fc29.x86_64 #1 SMP Wed Feb 6 19:16:42 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/cpuinfo
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping : 3
microcode : 0xc6
cpu MHz : 4021.884
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 8016.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
-----------------------------------------------------------------
#!/bin/bash -xv
echo 0 > /proc/sys/kernel/perf_event_paranoid
+ echo 0
cat /proc/sys/kernel/perf_event_paranoid
+ cat /proc/sys/kernel/perf_event_paranoid
0
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+ echo performance
+ tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
performance
for i in 0 1 2 3 5 8
do
/usr/bin/time tools/perf/perf record -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
done
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 0 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7fe36de5c010
Offs of buf1 = 0x7fe36de5c180
Addr of buf2 = 0x7fe36be5b010
Offs of buf2 = 0x7fe36be5b1c0
Addr of buf3 = 0x7fe369e5a010
Offs of buf3 = 0x7fe369e5a100
Addr of buf4 = 0x7fe367e59010
Offs of buf4 = 0x7fe367e59140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 16.949 seconds
[ perf record: Woken up 309 times to write data ]
[ perf record: Captured and wrote 179.424 MB perf.data ]
133.67user 0.35system 0:17.08elapsed 784%CPU (0avgtext+0avgdata 100580maxresident)k
0inputs+367480outputs (0major+34737minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 1 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7fcaec334010
Offs of buf1 = 0x7fcaec334180
Addr of buf2 = 0x7fcaea333010
Offs of buf2 = 0x7fcaea3331c0
Addr of buf3 = 0x7fcae8332010
Offs of buf3 = 0x7fcae8332100
Addr of buf4 = 0x7fcae6331010
Offs of buf4 = 0x7fcae6331140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 17.608 seconds
[ perf record: Woken up 595 times to write data ]
[ perf record: Compressed 181.148 MB to 21.497 MB, ratio is 8.427 ]
[ perf record: Captured and wrote 21.527 MB perf.data ]
135.69user 0.24system 0:17.73elapsed 766%CPU (0avgtext+0avgdata 100500maxresident)k
0inputs+44112outputs (0major+35033minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 2 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f1336f8d010
Offs of buf1 = 0x7f1336f8d180
Addr of buf2 = 0x7f1334f8c010
Offs of buf2 = 0x7f1334f8c1c0
Addr of buf3 = 0x7f1332f8b010
Offs of buf3 = 0x7f1332f8b100
Addr of buf4 = 0x7f1330f8a010
Offs of buf4 = 0x7f1330f8a140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.175 seconds
[ perf record: Woken up 521 times to write data ]
[ perf record: Compressed 186.953 MB to 23.210 MB, ratio is 8.055 ]
[ perf record: Captured and wrote 23.239 MB perf.data ]
140.21user 0.25system 0:18.32elapsed 766%CPU (0avgtext+0avgdata 100560maxresident)k
0inputs+47608outputs (0major+35263minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 3 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f97060e3010
Offs of buf1 = 0x7f97060e3180
Addr of buf2 = 0x7f97040e2010
Offs of buf2 = 0x7f97040e21c0
Addr of buf3 = 0x7f97020e1010
Offs of buf3 = 0x7f97020e1100
Addr of buf4 = 0x7f97000e0010
Offs of buf4 = 0x7f97000e0140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 17.688 seconds
[ perf record: Woken up 485 times to write data ]
[ perf record: Compressed 181.908 MB to 21.962 MB, ratio is 8.283 ]
[ perf record: Captured and wrote 21.991 MB perf.data ]
136.87user 0.23system 0:17.81elapsed 769%CPU (0avgtext+0avgdata 100616maxresident)k
0inputs+45064outputs (0major+35773minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 5 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f477b444010
Offs of buf1 = 0x7f477b444180
Addr of buf2 = 0x7f4779443010
Offs of buf2 = 0x7f47794431c0
Addr of buf3 = 0x7f4777442010
Offs of buf3 = 0x7f4777442100
Addr of buf4 = 0x7f4775441010
Offs of buf4 = 0x7f4775441140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.406 seconds
[ perf record: Woken up 416 times to write data ]
[ perf record: Compressed 187.705 MB to 23.170 MB, ratio is 8.101 ]
[ perf record: Captured and wrote 23.200 MB perf.data ]
142.72user 0.26system 0:18.53elapsed 771%CPU (0avgtext+0avgdata 100520maxresident)k
0inputs+47528outputs (0major+36928minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record -z 8 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7fb5bf032010
Offs of buf1 = 0x7fb5bf032180
Addr of buf2 = 0x7fb5bd031010
Offs of buf2 = 0x7fb5bd0311c0
Addr of buf3 = 0x7fb5bb030010
Offs of buf3 = 0x7fb5bb030100
Addr of buf4 = 0x7fb5b902f010
Offs of buf4 = 0x7fb5b902f140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 17.751 seconds
[ perf record: Woken up 391 times to write data ]
[ perf record: Compressed 179.191 MB to 19.441 MB, ratio is 9.217 ]
[ perf record: Captured and wrote 19.502 MB perf.data ]
138.90user 0.29system 0:17.88elapsed 778%CPU (0avgtext+0avgdata 100612maxresident)k
0inputs+39968outputs (0major+37436minor)pagefaults 0swaps
for i in 0 1 2 3 5 8
do
/usr/bin/time tools/perf/perf record --aio=1 -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
done
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 0 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7feee4519010
Offs of buf1 = 0x7feee4519180
Addr of buf2 = 0x7feee2518010
Offs of buf2 = 0x7feee25181c0
Addr of buf3 = 0x7feee0517010
Offs of buf3 = 0x7feee0517100
Addr of buf4 = 0x7feede516010
Offs of buf4 = 0x7feede516140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 17.912 seconds
[ perf record: Woken up 390 times to write data ]
[ perf record: Captured and wrote 187.527 MB perf.data ]
139.70user 0.39system 0:18.04elapsed 776%CPU (0avgtext+0avgdata 100624maxresident)k
0inputs+384072outputs (0major+35257minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 1 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f72b93ac010
Offs of buf1 = 0x7f72b93ac180
Addr of buf2 = 0x7f72b73ab010
Offs of buf2 = 0x7f72b73ab1c0
Addr of buf3 = 0x7f72b53aa010
Offs of buf3 = 0x7f72b53aa100
Addr of buf4 = 0x7f72b33a9010
Offs of buf4 = 0x7f72b33a9140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.198 seconds
[ perf record: Woken up 416 times to write data ]
[ perf record: Compressed 188.562 MB to 22.252 MB, ratio is 8.474 ]
[ perf record: Captured and wrote 22.284 MB perf.data ]
141.12user 0.32system 0:18.32elapsed 771%CPU (0avgtext+0avgdata 100576maxresident)k
0inputs+45664outputs (0major+35040minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 2 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7ffb9caf3010
Offs of buf1 = 0x7ffb9caf3180
Addr of buf2 = 0x7ffb9aaf2010
Offs of buf2 = 0x7ffb9aaf21c0
Addr of buf3 = 0x7ffb98af1010
Offs of buf3 = 0x7ffb98af1100
Addr of buf4 = 0x7ffb96af0010
Offs of buf4 = 0x7ffb96af0140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.360 seconds
[ perf record: Woken up 442 times to write data ]
[ perf record: Compressed 191.773 MB to 24.238 MB, ratio is 7.912 ]
[ perf record: Captured and wrote 24.290 MB perf.data ]
143.76user 0.49system 0:18.50elapsed 779%CPU (0avgtext+0avgdata 100596maxresident)k
0inputs+49760outputs (0major+35276minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 3 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f13f2df2010
Offs of buf1 = 0x7f13f2df2180
Addr of buf2 = 0x7f13f0df1010
Offs of buf2 = 0x7f13f0df11c0
Addr of buf3 = 0x7f13eedf0010
Offs of buf3 = 0x7f13eedf0100
Addr of buf4 = 0x7f13ecdef010
Offs of buf4 = 0x7f13ecdef140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.383 seconds
[ perf record: Woken up 499 times to write data ]
[ perf record: Compressed 191.078 MB to 23.246 MB, ratio is 8.220 ]
[ perf record: Captured and wrote 23.282 MB perf.data ]
143.72user 0.34system 0:18.51elapsed 778%CPU (0avgtext+0avgdata 100616maxresident)k
0inputs+47704outputs (0major+35783minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 5 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7fca0d091010
Offs of buf1 = 0x7fca0d091180
Addr of buf2 = 0x7fca0b090010
Offs of buf2 = 0x7fca0b0901c0
Addr of buf3 = 0x7fca0908f010
Offs of buf3 = 0x7fca0908f100
Addr of buf4 = 0x7fca0708e010
Offs of buf4 = 0x7fca0708e140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 18.758 seconds
[ perf record: Woken up 535 times to write data ]
[ perf record: Compressed 190.065 MB to 24.430 MB, ratio is 7.780 ]
[ perf record: Captured and wrote 24.519 MB perf.data ]
144.62user 0.66system 0:18.88elapsed 769%CPU (0avgtext+0avgdata 100528maxresident)k
0inputs+50232outputs (0major+36942minor)pagefaults 0swaps
+ for i in 0 1 2 3 5 8
+ /usr/bin/time tools/perf/perf record --aio=1 -z 8 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
Addr of buf1 = 0x7f7e1f449010
Offs of buf1 = 0x7f7e1f449180
Addr of buf2 = 0x7f7e1d448010
Offs of buf2 = 0x7f7e1d4481c0
Addr of buf3 = 0x7f7e1b447010
Offs of buf3 = 0x7f7e1b447100
Addr of buf4 = 0x7f7e19446010
Offs of buf4 = 0x7f7e19446140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 20.103 seconds
[ perf record: Woken up 260 times to write data ]
[ perf record: Compressed 193.024 MB to 31.588 MB, ratio is 6.111 ]
[ perf record: Captured and wrote 32.139 MB perf.data ]
151.73user 4.21system 0:20.23elapsed 770%CPU (0avgtext+0avgdata 100616maxresident)k
0inputs+65848outputs (0major+37431minor)pagefaults 0swaps
---
TESTING:
$ tools/perf/perf version --build-options
perf version 4.13.rc5.gd8d056b
dwarf: [ on ] # HAVE_DWARF_SUPPORT
dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
glibc: [ on ] # HAVE_GLIBC_SUPPORT
gtk2: [ on ] # HAVE_GTK2_SUPPORT
syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
lzma: [ on ] # HAVE_LZMA_SUPPORT
get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
aio: [ OFF ] # HAVE_AIO_SUPPORT
zstd: [ OFF ] # HAVE_ZSTD_SUPPORT
$ tools/perf/perf version --build-options
perf version 4.13.rc5.gd8d056b
dwarf: [ on ] # HAVE_DWARF_SUPPORT
dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
glibc: [ on ] # HAVE_GLIBC_SUPPORT
gtk2: [ on ] # HAVE_GTK2_SUPPORT
syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
lzma: [ on ] # HAVE_LZMA_SUPPORT
get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
aio: [ on ] # HAVE_AIO_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
$ make -C tools/perf clean all
...
$ pushd tools/perf/ && ./perf test && popd
~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
1: vmlinux symtab matches kallsyms : Skip
2: Detect openat syscall event : Ok
3: Detect openat syscall event on all cpus : Ok
4: Read samples using the mmap interface : Ok
5: Test data source output : Ok
6: Parse event definition strings : Ok
7: Simple expression parser : Ok
8: PERF_RECORD_* events & perf_sample fields : Ok
9: Parse perf pmu format : Ok
10: DSO data read : Ok
11: DSO data cache : Ok
12: DSO data reopen : Ok
13: Roundtrip evsel->name : Ok
14: Parse sched tracepoints fields : Ok
15: syscalls:sys_enter_openat event fields : Ok
16: Setup struct perf_event_attr : Ok
17: Match and link multiple hists : Ok
18: 'import perf' in python : Ok
19: Breakpoint overflow signal handler : Ok
20: Breakpoint overflow sampling : Ok
21: Breakpoint accounting : Ok
22: Watchpoint :
22.1: Read Only Watchpoint : Skip
22.2: Write Only Watchpoint : Ok
22.3: Read / Write Watchpoint : Ok
22.4: Modify Watchpoint : Ok
23: Number of exit events of a simple workload : Ok
24: Software clock events period values : Ok
25: Object code reading : Ok
26: Sample parsing : Ok
27: Use a dummy software event to keep tracking : Ok
28: Parse with no sample_id_all bit set : Ok
29: Filter hist entries : Ok
30: Lookup mmap thread : Ok
31: Share thread mg : Ok
32: Sort output of hist entries : Ok
33: Cumulate child hist entries : Ok
34: Track with sched_switch : Ok
35: Filter fds with revents mask in a fdarray : Ok
36: Add fd to a fdarray, making it autogrow : Ok
37: kmod_path__parse : Ok
38: Thread map : Ok
39: LLVM search and compile :
39.1: Basic BPF llvm compile : Skip
39.2: kbuild searching : Skip
39.3: Compile source for BPF prologue generation : Skip
39.4: Compile source for BPF relocation : Skip
40: Session topology : Ok
41: BPF filter :
41.1: Basic BPF filtering : Skip
41.2: BPF pinning : Skip
41.3: BPF prologue generation : Skip
41.4: BPF relocation checker : Skip
42: Synthesize thread map : Ok
43: Remove thread map : Ok
44: Synthesize cpu map : Ok
45: Synthesize stat config : Ok
46: Synthesize stat : Ok
47: Synthesize stat round : Ok
48: Synthesize attr update : Ok
49: Event times : Ok
50: Read backward ring buffer : Ok
51: Print cpu map : Ok
52: Probe SDT events : Ok
53: is_printable_array : Ok
54: Print bitmap : Ok
55: perf hooks : Ok
56: builtin clang support : Skip (not compiled in)
57: unit_number__scnprintf : Ok
58: mem2node : Ok
59: x86 rdpmc : Ok
60: Convert perf time to TSC : Ok
61: DWARF unwind : Ok
62: x86 instruction decoder - new instructions : Ok
63: x86 bp modify : Ok
64: Check open filename arg using perf trace + vfs_getname: Skip
65: Add vfs_getname probe to get syscall args filenames : Skip
66: probe libc's inet_pton & backtrace it with ping : Ok
67: Use vfs_getname probe to get syscall args filenames : Skip
68: record trace Zstd compression/decompression : Ok
~/abudanko/kernel/acme
$ make -C tools/perf NO_LIBZSTD=1 clean all
...
$ pushd tools/perf/ && ./perf test && popd
~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
...
68: record trace Zstd compression/decompression : Skip
~/abudanko/kernel/acme
Implement libzstd feature check, NO_LIBZSTD and LIBZSTD_DIR defines
to override Zstd library sources or disable the feature from the
command line:
$ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
$ make -C tools/perf NO_LIBZSTD=1 clean all
Auto detection feature status is reported just before compilation starts.
If your system has some version of the zstd library preinstalled then
the build system finds and uses it during the build.
If you still prefer to compile with some other version of zstd library
you have capability to refer the compilation to that version using
LIBZSTD_DIR define.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/build/Makefile.feature | 6 ++++--
tools/build/feature/Makefile | 6 +++++-
tools/build/feature/test-all.c | 5 +++++
tools/build/feature/test-libzstd.c | 12 ++++++++++++
tools/perf/Makefile.config | 20 ++++++++++++++++++++
tools/perf/Makefile.perf | 3 +++
tools/perf/builtin-version.c | 2 ++
7 files changed, 51 insertions(+), 3 deletions(-)
create mode 100644 tools/build/feature/test-libzstd.c
diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index 61e46d54a67c..adf791cbd726 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -66,7 +66,8 @@ FEATURE_TESTS_BASIC := \
sched_getcpu \
sdt \
setns \
- libaio
+ libaio \
+ libzstd
# FEATURE_TESTS_BASIC + FEATURE_TESTS_EXTRA is the complete list
# of all feature tests
@@ -118,7 +119,8 @@ FEATURE_DISPLAY ?= \
lzma \
get_cpuid \
bpf \
- libaio
+ libaio \
+ libzstd
# Set FEATURE_CHECK_(C|LD)FLAGS-all for all FEATURE_TESTS features.
# If in the future we need per-feature checks/flags for features not
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 7ceb4441b627..4b8244ee65ce 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -62,7 +62,8 @@ FILES= \
test-clang.bin \
test-llvm.bin \
test-llvm-version.bin \
- test-libaio.bin
+ test-libaio.bin \
+ test-libzstd.bin
FILES := $(addprefix $(OUTPUT),$(FILES))
@@ -301,6 +302,9 @@ $(OUTPUT)test-clang.bin:
$(OUTPUT)test-libaio.bin:
$(BUILD) -lrt
+$(OUTPUT)test-libzstd.bin:
+ $(BUILD) -lzstd
+
###############################
clean:
diff --git a/tools/build/feature/test-all.c b/tools/build/feature/test-all.c
index e903b86b742f..b0dda7db2a17 100644
--- a/tools/build/feature/test-all.c
+++ b/tools/build/feature/test-all.c
@@ -178,6 +178,10 @@
# include "test-reallocarray.c"
#undef main
+#define main main_test_zstd
+# include "test-libzstd.c"
+#undef main
+
int main(int argc, char *argv[])
{
main_test_libpython();
@@ -219,6 +223,7 @@ int main(int argc, char *argv[])
main_test_setns();
main_test_libaio();
main_test_reallocarray();
+ main_test_libzstd();
return 0;
}
diff --git a/tools/build/feature/test-libzstd.c b/tools/build/feature/test-libzstd.c
new file mode 100644
index 000000000000..55268c01b84d
--- /dev/null
+++ b/tools/build/feature/test-libzstd.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <zstd.h>
+
+int main(void)
+{
+ ZSTD_CStream *cstream;
+
+ cstream = ZSTD_createCStream();
+ ZSTD_freeCStream(cstream);
+
+ return 0;
+}
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index 0f11d5891301..4949bdb16a66 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -152,6 +152,13 @@ endif
FEATURE_CHECK_CFLAGS-libbabeltrace := $(LIBBABELTRACE_CFLAGS)
FEATURE_CHECK_LDFLAGS-libbabeltrace := $(LIBBABELTRACE_LDFLAGS) -lbabeltrace-ctf
+ifdef LIBZSTD_DIR
+ LIBZSTD_CFLAGS := -I$(LIBZSTD_DIR)/lib
+ LIBZSTD_LDFLAGS := -L$(LIBZSTD_DIR)/lib
+endif
+FEATURE_CHECK_CFLAGS-libzstd := $(LIBZSTD_CFLAGS)
+FEATURE_CHECK_LDFLAGS-libzstd := $(LIBZSTD_LDFLAGS)
+
FEATURE_CHECK_CFLAGS-bpf = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(SRCARCH)/include/uapi -I$(srctree)/tools/include/uapi
# include ARCH specific config
-include $(src-perf)/arch/$(SRCARCH)/Makefile
@@ -782,6 +789,19 @@ ifndef NO_LZMA
endif
endif
+ifndef NO_LIBZSTD
+ ifeq ($(feature-libzstd), 1)
+ CFLAGS += -DHAVE_ZSTD_SUPPORT
+ CFLAGS += $(LIBZSTD_CFLAGS)
+ LDFLAGS += $(LIBZSTD_LDFLAGS)
+ EXTLIBS += -lzstd
+ $(call detected,CONFIG_ZSTD)
+ else
+ msg := $(warning No libzstd found, disables trace compression, please install libzstd-dev[el] and/or set LIBZSTD_DIR);
+ NO_LIBZSTD := 1
+ endif
+endif
+
ifndef NO_BACKTRACE
ifeq ($(feature-backtrace), 1)
CFLAGS += -DHAVE_BACKTRACE_SUPPORT
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 01f7555fd933..06b927ee6ee3 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -108,6 +108,9 @@ include ../scripts/utilities.mak
# streaming for record mode. Currently Posix AIO trace streaming is
# supported only when linking with glibc.
#
+# Define NO_LIBZSTD if you do not want support of Zstandard based runtime
+# trace compression in record mode.
+#
# As per kernel Makefile, avoid funny character set dependencies
unexport LC_ALL
diff --git a/tools/perf/builtin-version.c b/tools/perf/builtin-version.c
index 50df168be326..f470144d1a70 100644
--- a/tools/perf/builtin-version.c
+++ b/tools/perf/builtin-version.c
@@ -78,6 +78,8 @@ static void library_status(void)
STATUS(HAVE_LZMA_SUPPORT, lzma);
STATUS(HAVE_AUXTRACE_SUPPORT, get_cpuid);
STATUS(HAVE_LIBBPF_SUPPORT, bpf);
+ STATUS(HAVE_AIO_SUPPORT, aio);
+ STATUS(HAVE_ZSTD_SUPPORT, zstd);
}
int cmd_version(int argc, const char **argv)
--
2.20.1
Implemented --mmap-flush option that specifies minimal number of bytes
that is extracted from mmaped kernel buffer to store into a trace. The
default option value is 1 byte what means every time trace writing
thread finds some new data in the mmaped buffer the data is extracted,
possibly compressed and written to a trace.
$ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
The option is independent from -z setting, doesn't vary with compression
level and can serve two purposes.
The first purpose is to increase the compression ratio of a trace data.
Larger data chunks are compressed more effectively so the implemented
option allows specifying data chunk size to compress. Also at some cases
executing more write syscalls with smaller data size can take longer
than executing less write syscalls with bigger data size due to syscall
overhead so extracting bigger data chunks specified by the option value
could additionally decrease runtime overhead.
The second purpose is to avoid self monitoring live-lock issue in system
wide (-a) profiling mode. Profiling in system wide mode with compression
(-a -z) can additionally induce data into the kernel buffers along with
the data from monitored processes. If performance data rate and volume
from the monitored processes is high then trace streaming and compression
activity in the tool is also high. High tool process activity can lead
to subtle live-lock effect when compression of single new byte from some
of mmaped kernel buffer leads to generation of the next single byte at
some mmaped buffer. So perf tool process ends up in endless self
monitoring.
Implemented sync parameter is the mean to force data move independently
from the specified flush threshold value. Despite the provided flush
value the tool needs capability to unconditionally drain memory buffers,
at least in the end of the collection.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 12 +++++
tools/perf/builtin-record.c | 65 +++++++++++++++++++++---
tools/perf/perf.h | 1 +
tools/perf/util/evlist.c | 6 +--
tools/perf/util/evlist.h | 3 +-
tools/perf/util/mmap.c | 4 +-
tools/perf/util/mmap.h | 3 +-
7 files changed, 82 insertions(+), 12 deletions(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 8f0c2be34848..18fceb49434e 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -459,6 +459,18 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
cpu - thread affinity mask is set to cpu of the processed mmap buffer
+--mmap-flush=number::
+Specify minimal number of bytes that is extracted from mmap data pages and stored
+into a trace. The number specification is possible using B/K/M/G suffixes. Maximal allowed
+value is a quarter of the size of mmaped data pages. The default option value is 1 byte
+what means that every time trace writing thread finds some new data in the mmaped buffer
+the data is extracted, possibly compressed (-z) and written to a trace. Larger data chunks
+are compressed more effectively in comparison to smaller chunks so extraction of larger
+chunks from the mmap data pages is preferable from perspective of trace size reduction.
+Also at some cases executing less trace write syscalls with bigger data size can take
+shorter than executing more trace write syscalls with smaller data size thus lowering
+runtime profiling overhead.
+
--all-kernel::
Configure all used events to run in kernel space.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a468d882e74f..f55302dec440 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -334,6 +334,41 @@ static int record__aio_enabled(struct record *rec)
return rec->opts.nr_cblocks > 0;
}
+#define MMAP_FLUSH_DEFAULT 1
+static int record__mmap_flush_parse(const struct option *opt,
+ const char *str,
+ int unset)
+{
+ int flush_max;
+ struct record_opts *opts = (struct record_opts *)opt->value;
+ static struct parse_tag tags[] = {
+ { .tag = 'B', .mult = 1 },
+ { .tag = 'K', .mult = 1 << 10 },
+ { .tag = 'M', .mult = 1 << 20 },
+ { .tag = 'G', .mult = 1 << 30 },
+ { .tag = 0 },
+ };
+
+ if (unset)
+ return 0;
+
+ if (str) {
+ opts->mmap_flush = parse_tag_value(str, tags);
+ if (opts->mmap_flush == (int)-1)
+ opts->mmap_flush = strtol(str, NULL, 0);
+ }
+
+ if (!opts->mmap_flush)
+ opts->mmap_flush = MMAP_FLUSH_DEFAULT;
+
+ flush_max = perf_evlist__mmap_size(opts->mmap_pages);
+ flush_max /= 4;
+ if (opts->mmap_flush > flush_max)
+ opts->mmap_flush = flush_max;
+
+ return 0;
+}
+
static int process_synthesized_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
@@ -543,7 +578,8 @@ static int record__mmap_evlist(struct record *rec,
if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
opts->auxtrace_mmap_pages,
opts->auxtrace_snapshot_mode,
- opts->nr_cblocks, opts->affinity) < 0) {
+ opts->nr_cblocks, opts->affinity,
+ opts->mmap_flush) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
@@ -733,7 +769,7 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
}
static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
- bool overwrite)
+ bool overwrite, bool sync)
{
u64 bytes_written = rec->bytes_written;
int i;
@@ -756,12 +792,19 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
off = record__aio_get_pos(trace_fd);
for (i = 0; i < evlist->nr_mmaps; i++) {
+ u64 flush = 0;
struct perf_mmap *map = &maps[i];
if (map->base) {
record__adjust_affinity(rec, map);
+ if (sync) {
+ flush = map->flush;
+ map->flush = 1;
+ }
if (!record__aio_enabled(rec)) {
if (perf_mmap__push(map, rec, record__pushfn) != 0) {
+ if (sync)
+ map->flush = flush;
rc = -1;
goto out;
}
@@ -774,10 +817,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
idx = record__aio_sync(map, false);
if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
record__aio_set_pos(trace_fd, off);
+ if (sync)
+ map->flush = flush;
rc = -1;
goto out;
}
}
+ if (sync)
+ map->flush = flush;
}
if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
@@ -803,15 +850,15 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
return rc;
}
-static int record__mmap_read_all(struct record *rec)
+static int record__mmap_read_all(struct record *rec, bool sync)
{
int err;
- err = record__mmap_read_evlist(rec, rec->evlist, false);
+ err = record__mmap_read_evlist(rec, rec->evlist, false, sync);
if (err)
return err;
- return record__mmap_read_evlist(rec, rec->evlist, true);
+ return record__mmap_read_evlist(rec, rec->evlist, true, sync);
}
static void record__init_features(struct record *rec)
@@ -1312,7 +1359,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (trigger_is_hit(&switch_output_trigger) || done || draining)
perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
- if (record__mmap_read_all(rec) < 0) {
+ if (record__mmap_read_all(rec, false) < 0) {
trigger_error(&auxtrace_snapshot_trigger);
trigger_error(&switch_output_trigger);
err = -1;
@@ -1413,6 +1460,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);
out_child:
+ record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
if (forks) {
@@ -1815,6 +1863,7 @@ static struct record record = {
.uses_mmap = true,
.default_per_cpu = true,
},
+ .mmap_flush = MMAP_FLUSH_DEFAULT,
},
.tool = {
.sample = process_sample_event,
@@ -1881,6 +1930,9 @@ static struct option __record_options[] = {
OPT_CALLBACK('m', "mmap-pages", &record.opts, "pages[,pages]",
"number of mmap data pages and AUX area tracing mmap pages",
record__parse_mmap_pages),
+ OPT_CALLBACK(0, "mmap-flush", &record.opts, "number",
+ "Minimal number of bytes that is extracted from mmap data pages (default: 1)",
+ record__mmap_flush_parse),
OPT_BOOLEAN(0, "group", &record.opts.group,
"put the counters into a counter group"),
OPT_CALLBACK_NOOPT('g', NULL, &callchain_param,
@@ -2184,6 +2236,7 @@ int cmd_record(int argc, const char **argv)
pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
+ pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
err = __cmd_record(&record, argc, argv);
out:
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index b120e547ddc7..7886cc9771cf 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -85,6 +85,7 @@ struct record_opts {
u64 clockid_res_ns;
int nr_cblocks;
int affinity;
+ int mmap_flush;
};
enum perf_affinity {
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index ed20f4379956..8858d829983b 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1037,7 +1037,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
*/
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity)
+ bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
{
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
@@ -1047,7 +1047,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
* Its value is decided by evsel's write_backward.
* So &mp should not be passed through const pointer.
*/
- struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
+ struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
if (!evlist->mmap)
evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1079,7 +1079,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
{
- return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
+ return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
}
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 744906dd4887..edf18811e39f 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -165,7 +165,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity);
+ bool auxtrace_overwrite, int nr_cblocks,
+ int affinity, int flush);
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
void perf_evlist__munmap(struct perf_evlist *evlist);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index cdc7740fc181..ef3d79b2c90b 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -440,6 +440,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
perf_mmap__setup_affinity_mask(map, mp);
+ map->flush = mp->flush;
+
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
&mp->auxtrace_mp, map->base, fd))
return -1;
@@ -492,7 +494,7 @@ static int __perf_mmap__read_init(struct perf_mmap *md)
md->start = md->overwrite ? head : old;
md->end = md->overwrite ? old : head;
- if (md->start == md->end)
+ if ((md->end - md->start) < md->flush)
return -EAGAIN;
size = md->end - md->start;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index e566c19b242b..b82f8c2d55c4 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -39,6 +39,7 @@ struct perf_mmap {
} aio;
#endif
cpu_set_t affinity_mask;
+ u64 flush;
};
/*
@@ -70,7 +71,7 @@ enum bkw_mmap_state {
};
struct mmap_params {
- int prot, mask, nr_cblocks, affinity;
+ int prot, mask, nr_cblocks, affinity, flush;
struct auxtrace_mmap_params auxtrace_mp;
};
--
2.20.1
Define bytes_transferred and bytes_compressed metrics to calculate ratio
in the end of the data collection:
compression ratio = bytes_transferred / bytes_compressed
bytes_transferred accumulates the amount of bytes that was extracted from
the mmaped kernel buffers for compression. bytes_compressed accumulates
the amount of bytes that was received after applying compression.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 14 +++++++++++++-
tools/perf/util/env.h | 1 +
tools/perf/util/session.h | 2 ++
3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f55302dec440..51b7f23a0c7a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1166,6 +1166,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
struct perf_session *session;
bool disabled = false, draining = false;
int fd;
+ float ratio = 0;
atexit(record__sig_exit);
signal(SIGCHLD, sig_handler);
@@ -1463,6 +1464,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
+ if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
+ ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
+ session->header.env.comp_ratio = ratio + 0.5;
+ }
+
if (forks) {
int exit_status;
@@ -1509,9 +1515,15 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
else
samples[0] = '\0';
- fprintf(stderr, "[ perf record: Captured and wrote %.3f MB %s%s%s ]\n",
+ fprintf(stderr, "[ perf record: Captured and wrote %.3f MB %s%s%s",
perf_data__size(data) / 1024.0 / 1024.0,
data->path, postfix, samples);
+ if (ratio) {
+ fprintf(stderr, ", compressed (original %.3f MB, ratio is %.3f)",
+ rec->session->bytes_transferred / 1024.0 / 1024.0,
+ ratio);
+ }
+ fprintf(stderr, " ]\n");
}
out_delete_session:
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index d01b8355f4ca..fb39e9af128f 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -64,6 +64,7 @@ struct perf_env {
struct memory_node *memory_nodes;
unsigned long long memory_bsize;
u64 clockid_res_ns;
+ u32 comp_ratio;
};
extern struct perf_env perf_env;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index d96eccd7d27f..0e14884f28b2 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -35,6 +35,8 @@ struct perf_session {
struct ordered_events ordered_events;
struct perf_data *data;
struct perf_tool *tool;
+ u64 bytes_transferred;
+ u64 bytes_compressed;
};
struct perf_tool;
--
2.20.1
Implemented PERF_RECORD_COMPRESSED event, related data types, header
feature and functions to write, read and print feature attributes
from the trace header section.
comp_mmap_len preserves the size of mmaped kernel buffer that was used
during collection. comp_mmap_len size is used on loading stage as the
size of decomp buffer for decompression of COMPRESSED events content.
Signed-off-by: Alexey Budankov <[email protected]>
---
.../Documentation/perf.data-file-format.txt | 24 ++++++++
tools/perf/builtin-record.c | 8 +++
tools/perf/perf.h | 1 +
tools/perf/util/env.h | 10 ++++
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 7 +++
tools/perf/util/header.c | 55 ++++++++++++++++++-
tools/perf/util/header.h | 1 +
8 files changed, 106 insertions(+), 1 deletion(-)
diff --git a/tools/perf/Documentation/perf.data-file-format.txt b/tools/perf/Documentation/perf.data-file-format.txt
index 593ef49b273c..418fa0bce52e 100644
--- a/tools/perf/Documentation/perf.data-file-format.txt
+++ b/tools/perf/Documentation/perf.data-file-format.txt
@@ -272,6 +272,19 @@ struct {
Two uint64_t for the time of first sample and the time of last sample.
+ HEADER_COMPRESSED = 24,
+
+struct {
+ u32 version;
+ u32 type;
+ u32 level;
+ u32 ratio;
+ u32 mmap_len;
+};
+
+Indicates that trace contains records of PERF_RECORD_COMPRESSED type
+that have perf_events records in compressed form.
+
other bits are reserved and should ignored for now
HEADER_FEAT_BITS = 256,
@@ -437,6 +450,17 @@ struct auxtrace_error_event {
Describes a header feature. These are records used in pipe-mode that
contain information that otherwise would be in perf.data file's header.
+ PERF_RECORD_COMPRESSED = 81,
+
+struct compressed_event {
+ struct perf_event_header header;
+ char data[];
+};
+
+The header is followed by compressed data frame that can be decompressed
+into array of perf trace records. The size of the entire compressed event
+record including the header is limited by the max value of header.size.
+
Event types
Define the event attributes with their IDs.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 51b7f23a0c7a..7125b780c4f4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -369,6 +369,11 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+static int record__comp_enabled(struct record *rec)
+{
+ return rec->opts.comp_level > 0;
+}
+
static int process_synthesized_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
@@ -885,6 +890,8 @@ static void record__init_features(struct record *rec)
perf_header__clear_feat(&session->header, HEADER_CLOCKID);
perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__comp_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_COMPRESSED);
perf_header__clear_feat(&session->header, HEADER_STAT);
}
@@ -1225,6 +1232,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
err = -1;
goto out_child;
}
+ session->header.env.comp_mmap_len = session->evlist->mmap_len;
err = bpf__apply_obj_config();
if (err) {
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 7886cc9771cf..2c6caad45b10 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -86,6 +86,7 @@ struct record_opts {
int nr_cblocks;
int affinity;
int mmap_flush;
+ unsigned int comp_level;
};
enum perf_affinity {
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index fb39e9af128f..7990d63ab764 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -65,6 +65,16 @@ struct perf_env {
unsigned long long memory_bsize;
u64 clockid_res_ns;
u32 comp_ratio;
+ u32 comp_ver;
+ u32 comp_type;
+ u32 comp_level;
+ u32 comp_mmap_len;
+};
+
+enum perf_compress_type {
+ PERF_COMP_NONE = 0,
+ PERF_COMP_ZSTD,
+ PERF_COMP_MAX
};
extern struct perf_env perf_env;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index ba7be74fad6e..d1ad6c419724 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -68,6 +68,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
+ [PERF_RECORD_COMPRESSED] = "COMPRESSED",
};
static const char *perf_ns__names[] = {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 36ae7e92dab1..8a13aefe734e 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -254,6 +254,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
PERF_RECORD_HEADER_FEATURE = 80,
+ PERF_RECORD_COMPRESSED = 81,
PERF_RECORD_HEADER_MAX
};
@@ -626,6 +627,11 @@ struct feature_event {
char data[];
};
+struct compressed_event {
+ struct perf_event_header header;
+ char data[];
+};
+
union perf_event {
struct perf_event_header header;
struct mmap_event mmap;
@@ -659,6 +665,7 @@ union perf_event {
struct feature_event feat;
struct ksymbol_event ksymbol_event;
struct bpf_event bpf_event;
+ struct compressed_event pack;
};
void perf_event__print_totals(void);
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index b0683bf4d9f3..ee5dd3befa4b 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1259,6 +1259,30 @@ static int write_mem_topology(struct feat_fd *ff __maybe_unused,
return ret;
}
+static int write_compressed(struct feat_fd *ff __maybe_unused,
+ struct perf_evlist *evlist __maybe_unused)
+{
+ int ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_ver), sizeof(ff->ph->env.comp_ver));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_type), sizeof(ff->ph->env.comp_type));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_level), sizeof(ff->ph->env.comp_level));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_ratio), sizeof(ff->ph->env.comp_ratio));
+ if (ret)
+ return ret;
+
+ return do_write(ff, &(ff->ph->env.comp_mmap_len), sizeof(ff->ph->env.comp_mmap_len));
+}
+
static void print_hostname(struct feat_fd *ff, FILE *fp)
{
fprintf(fp, "# hostname : %s\n", ff->ph->env.hostname);
@@ -1557,6 +1581,13 @@ static void print_cache(struct feat_fd *ff, FILE *fp __maybe_unused)
}
}
+static void print_compressed(struct feat_fd *ff, FILE *fp)
+{
+ fprintf(fp, "# compressed : %s, level = %d, ratio = %d\n",
+ ff->ph->env.comp_type == PERF_COMP_ZSTD ? "Zstd" : "Unknown",
+ ff->ph->env.comp_level, ff->ph->env.comp_ratio);
+}
+
static void print_pmu_mappings(struct feat_fd *ff, FILE *fp)
{
const char *delimiter = "# pmu mappings: ";
@@ -2414,6 +2445,27 @@ static int process_dir_format(struct feat_fd *ff,
return do_read_u64(ff, &data->dir.version);
}
+static int process_compressed(struct feat_fd *ff,
+ void *data __maybe_unused)
+{
+ if (do_read_u32(ff, &(ff->ph->env.comp_ver)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_type)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_level)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_ratio)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_mmap_len)))
+ return -1;
+
+ return 0;
+}
+
struct feature_ops {
int (*write)(struct feat_fd *ff, struct perf_evlist *evlist);
void (*print)(struct feat_fd *ff, FILE *fp);
@@ -2474,7 +2526,8 @@ static const struct feature_ops feat_ops[HEADER_LAST_FEATURE] = {
FEAT_OPR(SAMPLE_TIME, sample_time, false),
FEAT_OPR(MEM_TOPOLOGY, mem_topology, true),
FEAT_OPR(CLOCKID, clockid, false),
- FEAT_OPN(DIR_FORMAT, dir_format, false)
+ FEAT_OPN(DIR_FORMAT, dir_format, false),
+ FEAT_OPR(COMPRESSED, compressed, false)
};
struct header_print_data {
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index 6a231340238d..9ccfb204bd2c 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -40,6 +40,7 @@ enum {
HEADER_MEM_TOPOLOGY,
HEADER_CLOCKID,
HEADER_DIR_FORMAT,
+ HEADER_COMPRESSED,
HEADER_LAST_FEATURE,
HEADER_FEAT_BITS = 256,
};
--
2.20.1
Implemented mmap data buffer that is used as the memory to operate
on when compressing data in case of serial trace streaming.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 8 +++++++-
tools/perf/util/evlist.c | 8 +++++---
tools/perf/util/evlist.h | 2 +-
tools/perf/util/mmap.c | 30 ++++++++++++++++++++++++++++--
tools/perf/util/mmap.h | 4 +++-
5 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7125b780c4f4..948489cb6ff0 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -369,6 +369,8 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+static unsigned int comp_level_max = 22;
+
static int record__comp_enabled(struct record *rec)
{
return rec->opts.comp_level > 0;
@@ -584,7 +586,7 @@ static int record__mmap_evlist(struct record *rec,
opts->auxtrace_mmap_pages,
opts->auxtrace_snapshot_mode,
opts->nr_cblocks, opts->affinity,
- opts->mmap_flush) < 0) {
+ opts->mmap_flush, opts->comp_level) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
@@ -2258,6 +2260,10 @@ int cmd_record(int argc, const char **argv)
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
+ if (rec->opts.comp_level > comp_level_max)
+ rec->opts.comp_level = comp_level_max;
+ pr_debug("comp level: %d\n", rec->opts.comp_level);
+
err = __cmd_record(&record, argc, argv);
out:
perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8858d829983b..4d8a25a12430 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1037,7 +1037,8 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
*/
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
+ bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush,
+ int comp_level)
{
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
@@ -1047,7 +1048,8 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
* Its value is decided by evsel's write_backward.
* So &mp should not be passed through const pointer.
*/
- struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
+ struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush,
+ .comp_level = comp_level };
if (!evlist->mmap)
evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1079,7 +1081,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
{
- return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
+ return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1, 0);
}
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index edf18811e39f..77c11dac4a63 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -166,7 +166,7 @@ unsigned long perf_event_mlock_kb_in_pages(void);
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
bool auxtrace_overwrite, int nr_cblocks,
- int affinity, int flush);
+ int affinity, int flush, int comp_level);
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
void perf_evlist__munmap(struct perf_evlist *evlist);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ef3d79b2c90b..d85e73fc82e2 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -157,6 +157,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
}
#ifdef HAVE_AIO_SUPPORT
+static int perf_mmap__aio_enabled(struct perf_mmap *map)
+{
+ return map->aio.nr_cblocks > 0;
+}
#ifdef HAVE_LIBNUMA_SUPPORT
static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
@@ -198,7 +202,7 @@ static int perf_mmap__aio_bind(struct perf_mmap *map, int idx, int cpu, int affi
return 0;
}
-#else
+#else /* !HAVE_LIBNUMA_SUPPORT */
static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
{
map->aio.data[idx] = malloc(perf_mmap__mmap_len(map));
@@ -359,7 +363,12 @@ int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
return rc;
}
-#else
+#else /* !HAVE_AIO_SUPPORT */
+static int perf_mmap__aio_enabled(struct perf_mmap *map __maybe_unused)
+{
+ return 0;
+}
+
static int perf_mmap__aio_mmap(struct perf_mmap *map __maybe_unused,
struct mmap_params *mp __maybe_unused)
{
@@ -374,6 +383,10 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map __maybe_unused)
void perf_mmap__munmap(struct perf_mmap *map)
{
perf_mmap__aio_munmap(map);
+ if (map->data != NULL) {
+ munmap(map->data, perf_mmap__mmap_len(map));
+ map->data = NULL;
+ }
if (map->base != NULL) {
munmap(map->base, perf_mmap__mmap_len(map));
map->base = NULL;
@@ -442,6 +455,19 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
map->flush = mp->flush;
+ map->comp_level = mp->comp_level;
+
+ if (map->comp_level && !perf_mmap__aio_enabled(map)) {
+ map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
+ MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+ if (map->data == MAP_FAILED) {
+ pr_debug2("failed to mmap data buffer, error %d\n",
+ errno);
+ map->data = NULL;
+ return -1;
+ }
+ }
+
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
&mp->auxtrace_mp, map->base, fd))
return -1;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index b82f8c2d55c4..4e2f58d95c1f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -40,6 +40,8 @@ struct perf_mmap {
#endif
cpu_set_t affinity_mask;
u64 flush;
+ void *data;
+ int comp_level;
};
/*
@@ -71,7 +73,7 @@ enum bkw_mmap_state {
};
struct mmap_params {
- int prot, mask, nr_cblocks, affinity, flush;
+ int prot, mask, nr_cblocks, affinity, flush, comp_level;
struct auxtrace_mmap_params auxtrace_mp;
};
--
2.20.1
Implemented functions are based on Zstd streaming compression API.
The functions are used in runtime to compress data that come from
mmaped kernel buffer. zstd_init(), zstd_fini() are used for
initialization and finalization to allocate and deallocate internal
zstd objects. zstd_compress_stream_to_records() is used to convert
parts of mmaped kernel buffer into an array of PERF_RECORD_COMPRESSED
records.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/util/Build | 2 ++
tools/perf/util/compress.h | 43 +++++++++++++++++++++++
tools/perf/util/zstd.c | 71 ++++++++++++++++++++++++++++++++++++++
3 files changed, 116 insertions(+)
create mode 100644 tools/perf/util/zstd.c
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 8dd3102301ea..6d5bbc8b589b 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -145,6 +145,8 @@ perf-y += scripting-engines/
perf-$(CONFIG_ZLIB) += zlib.o
perf-$(CONFIG_LZMA) += lzma.o
+perf-$(CONFIG_ZSTD) += zstd.o
+
perf-y += demangle-java.o
perf-y += demangle-rust.o
diff --git a/tools/perf/util/compress.h b/tools/perf/util/compress.h
index 892e92e7e7fc..d00d7cb095aa 100644
--- a/tools/perf/util/compress.h
+++ b/tools/perf/util/compress.h
@@ -2,6 +2,11 @@
#ifndef PERF_COMPRESS_H
#define PERF_COMPRESS_H
+#include <stdbool.h>
+#ifdef HAVE_ZSTD_SUPPORT
+#include <zstd.h>
+#endif
+
#ifdef HAVE_ZLIB_SUPPORT
int gzip_decompress_to_file(const char *input, int output_fd);
bool gzip_is_compressed(const char *input);
@@ -12,4 +17,42 @@ int lzma_decompress_to_file(const char *input, int output_fd);
bool lzma_is_compressed(const char *input);
#endif
+struct zstd_data {
+#ifdef HAVE_ZSTD_SUPPORT
+ ZSTD_CStream *cstream;
+#endif
+};
+
+#ifdef HAVE_ZSTD_SUPPORT
+
+int zstd_init(struct zstd_data *data, int level);
+int zstd_fini(struct zstd_data *data);
+
+size_t zstd_compress_stream_to_records(struct zstd_data *data,
+ void *dst, size_t dst_size, void *src, size_t src_size, size_t max_record_size,
+ size_t process_header(void *record, size_t increment));
+
+#else /* !HAVE_ZSTD_SUPPORT */
+
+static inline int zstd_init(struct zstd_data *data __maybe_unused, int level __maybe_unused)
+{
+ return 0;
+}
+
+static inline int zstd_fini(struct zstd_data *data __maybe_unused)
+{
+ return 0;
+}
+
+static inline size_t zstd_compress_stream_to_records(struct zstd_data *data __maybe_unused,
+ void *dst __maybe_unused, size_t dst_size __maybe_unused,
+ void *src __maybe_unused, size_t src_size __maybe_unused,
+ size_t max_record_size __maybe_unused,
+ size_t process_header(void *record, size_t increment) __maybe_unused)
+{
+ return 0;
+}
+
+#endif
+
#endif /* PERF_COMPRESS_H */
diff --git a/tools/perf/util/zstd.c b/tools/perf/util/zstd.c
new file mode 100644
index 000000000000..6d4f69d57567
--- /dev/null
+++ b/tools/perf/util/zstd.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <string.h>
+
+#include "util/compress.h"
+#include "util/debug.h"
+
+int zstd_init(struct zstd_data *data, int level)
+{
+ size_t ret;
+
+ data->cstream = ZSTD_createCStream();
+ if (data->cstream == NULL) {
+ pr_err("Couldn't create compression stream.\n");
+ return -1;
+ }
+
+ ret = ZSTD_initCStream(data->cstream, level);
+ if (ZSTD_isError(ret)) {
+ pr_err("Failed to initialize compression stream: %s\n", ZSTD_getErrorName(ret));
+ return -1;
+ }
+
+ return 0;
+}
+
+int zstd_fini(struct zstd_data *data)
+{
+ if (data->cstream) {
+ ZSTD_freeCStream(data->cstream);
+ data->cstream = NULL;
+ }
+
+ return 0;
+}
+
+size_t zstd_compress_stream_to_records(struct zstd_data *data,
+ void *dst, size_t dst_size, void *src, size_t src_size, size_t max_record_size,
+ size_t process_header(void *record, size_t increment))
+{
+ size_t ret, size, compressed = 0;
+ ZSTD_inBuffer input = { src, src_size, 0 };
+ ZSTD_outBuffer output;
+ void *record;
+
+ while (input.pos < input.size) {
+ record = dst;
+ size = process_header(record, 0);
+ compressed += size;
+ dst += size;
+ dst_size -= size;
+ output = (ZSTD_outBuffer){ dst, (dst_size > max_record_size) ?
+ max_record_size : dst_size, 0 };
+ ret = ZSTD_compressStream(data->cstream, &output, &input);
+ ZSTD_flushStream(data->cstream, &output);
+ if (ZSTD_isError(ret)) {
+ pr_err("failed to compress %ld bytes: %s\n",
+ (long)src_size, ZSTD_getErrorName(ret));
+ memcpy(dst, src, src_size);
+ return src_size;
+ }
+ size = output.pos;
+ size = process_header(record, size);
+ compressed += size;
+ dst += size;
+ dst_size -= size;
+ }
+
+ return compressed;
+}
+
--
2.20.1
Compression is implemented using the functions from zstd.c. As the memory
to operate on the compression uses mmap->data buffer. If Zstd streaming
compression API fails for some reason the data to be compressed are just
copied into the memory buffers using plain memcpy().
Compressed trace frame consists of an array of PERF_RECORD_COMPRESSED
records. Each element of the array is not longer that PERF_SAMPLE_MAX_SIZE
and consists of perf_event_header followed by the compressed chunk
that is decompressed on the loading stage.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 53 +++++++++++++++++++++++++++++++++++--
tools/perf/util/session.h | 2 ++
2 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 948489cb6ff0..c22e65f6b8e6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -130,6 +130,9 @@ static int record__write(struct record *rec, struct perf_mmap *map __maybe_unuse
return 0;
}
+static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+ void *src, size_t src_size);
+
#ifdef HAVE_AIO_SUPPORT
static int record__aio_write(struct aiocb *cblock, int trace_fd,
void *buf, size_t size, off_t off)
@@ -389,6 +392,12 @@ static int record__pushfn(struct perf_mmap *map, void *to, void *bf, size_t size
{
struct record *rec = to;
+ if (record__comp_enabled(rec)) {
+ size = zstd_compress(rec->session, map->data,
+ perf_mmap__mmap_len(map), bf, size);
+ bf = map->data;
+ }
+
rec->samples++;
return record__write(rec, map, bf, size);
}
@@ -775,6 +784,38 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
}
}
+static size_t process_comp_header(void *record, size_t increment)
+{
+ struct compressed_event *event = record;
+ size_t size = sizeof(struct compressed_event);
+
+ if (increment) {
+ event->header.size += increment;
+ return increment;
+ }
+
+ event->header.type = PERF_RECORD_COMPRESSED;
+ event->header.size = size;
+
+ return size;
+}
+
+static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+ void *src, size_t src_size)
+{
+ size_t compressed;
+ size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct compressed_event) - 1;
+
+ compressed = zstd_compress_stream_to_records(&(session->zstd_data),
+ dst, dst_size, src, src_size, max_record_size,
+ process_comp_header);
+
+ session->bytes_transferred += src_size;
+ session->bytes_compressed += compressed;
+
+ return compressed;
+}
+
static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
bool overwrite, bool sync)
{
@@ -1205,6 +1246,14 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
fd = perf_data__fd(data);
rec->session = session;
+ if (zstd_init(&(session->zstd_data), rec->opts.comp_level) < 0) {
+ pr_err("Compression initialization failed.\n");
+ return -1;
+ }
+
+ session->header.env.comp_type = PERF_COMP_ZSTD;
+ session->header.env.comp_level = rec->opts.comp_level;
+
record__init_features(rec);
if (rec->opts.use_clockid && rec->opts.clockid_res_ns)
@@ -1537,6 +1586,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
}
out_delete_session:
+ zstd_fini(&(session->zstd_data));
perf_session__delete(session);
return status;
}
@@ -2254,8 +2304,7 @@ int cmd_record(int argc, const char **argv)
if (rec->opts.nr_cblocks > nr_cblocks_max)
rec->opts.nr_cblocks = nr_cblocks_max;
- if (verbose > 0)
- pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
+ pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 0e14884f28b2..6c984c895924 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -8,6 +8,7 @@
#include "machine.h"
#include "data.h"
#include "ordered-events.h"
+#include "util/compress.h"
#include <linux/kernel.h>
#include <linux/rbtree.h>
#include <linux/perf_event.h>
@@ -37,6 +38,7 @@ struct perf_session {
struct perf_tool *tool;
u64 bytes_transferred;
u64 bytes_compressed;
+ struct zstd_data zstd_data;
};
struct perf_tool;
--
2.20.1
Compression is implemented using the functions from zstd.c. As the memory
to operate on the compression uses mmap->aio.data[] buffers. If Zstd
streaming compression API fails for some reason the data to be compressed
are just copied into the memory buffers using plain memcpy().
Compressed trace frame consists of an array of PERF_RECORD_COMPRESSED
records. Each element of the array is not longer that PERF_SAMPLE_MAX_SIZE
and consists of perf_event_header followed by the compressed chunk
that is decompressed on the loading stage.
perf_mmap__aio_push() is replaced by perf_mmap__push() which is now used
in the both serial and AIO streaming cases. perf_mmap__push() is extended
with positive return values to signify absence of data ready for
processing.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-record.c | 114 ++++++++++++++++++++++++++++--------
tools/perf/util/mmap.c | 76 +-----------------------
tools/perf/util/mmap.h | 12 ----
3 files changed, 89 insertions(+), 113 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index c22e65f6b8e6..2e083891affa 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -130,6 +130,8 @@ static int record__write(struct record *rec, struct perf_mmap *map __maybe_unuse
return 0;
}
+static int record__aio_enabled(struct record *rec);
+static int record__comp_enabled(struct record *rec);
static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
void *src, size_t src_size);
@@ -183,9 +185,9 @@ static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
if (rem_size == 0) {
cblock->aio_fildes = -1;
/*
- * md->refcount is incremented in perf_mmap__push() for
- * every enqueued aio write request so decrement it because
- * the request is now complete.
+ * md->refcount is incremented in record__aio_pushfn() for
+ * every aio write request started in record__aio_push() so
+ * decrement it because the request is now complete.
*/
perf_mmap__put(md);
rc = 1;
@@ -240,18 +242,89 @@ static int record__aio_sync(struct perf_mmap *md, bool sync_all)
} while (1);
}
-static int record__aio_pushfn(void *to, struct aiocb *cblock, void *bf, size_t size, off_t off)
+struct record_aio {
+ struct record *rec;
+ void *data;
+ size_t size;
+};
+
+static int record__aio_pushfn(struct perf_mmap *map, void *to, void *buf, size_t size)
{
- struct record *rec = to;
- int ret, trace_fd = rec->session->data->file.fd;
+ struct record_aio *aio = to;
- rec->samples++;
+ /*
+ * map->base data pointed by buf is copied into free map->aio.data[] buffer
+ * to release space in the kernel buffer as fast as possible, calling
+ * perf_mmap__consume() from perf_mmap__push() function.
+ *
+ * That lets the kernel to proceed with storing more profiling data into
+ * the kernel buffer earlier than other per-cpu kernel buffers are handled.
+ *
+ * Coping can be done in two steps in case the chunk of profiling data
+ * crosses the upper bound of the kernel buffer. In this case we first move
+ * part of data from map->start till the upper bound and then the reminder
+ * from the beginning of the kernel buffer till the end of the data chunk.
+ */
- ret = record__aio_write(cblock, trace_fd, bf, size, off);
+ if (record__comp_enabled(aio->rec)) {
+ size = zstd_compress(aio->rec->session, aio->data + aio->size,
+ perf_mmap__mmap_len(map) - aio->size,
+ buf, size);
+ } else {
+ memcpy(aio->data + aio->size, buf, size);
+ }
+
+ if (!aio->size) {
+ /*
+ * Increment map->refcount to guard map->aio.data[] buffer
+ * from premature deallocation because map object can be
+ * released earlier than aio write request started on
+ * map->aio.data[] buffer is complete.
+ *
+ * perf_mmap__put() is done at record__aio_complete()
+ * after started aio request completion or at record__aio_push()
+ * if the request failed to start.
+ */
+ perf_mmap__get(map);
+ }
+
+ aio->size += size;
+
+ return size;
+}
+
+static int record__aio_push(struct record *rec, struct perf_mmap *map, off_t *off)
+{
+ int ret, idx;
+ int trace_fd = rec->session->data->file.fd;
+ struct record_aio aio = { .rec = rec, .size = 0 };
+
+ /*
+ * Call record__aio_sync() to wait till map->aio.data[] buffer
+ * becomes available after previous aio write operation.
+ */
+
+ idx = record__aio_sync(map, false);
+ aio.data = map->aio.data[idx];
+ ret = perf_mmap__push(map, &aio, record__aio_pushfn);
+ if (ret != 0) /* ret > 0 - no data, ret < 0 - error */
+ return ret;
+
+ rec->samples++;
+ ret = record__aio_write(&(map->aio.cblocks[idx]), trace_fd, aio.data, aio.size, *off);
if (!ret) {
- rec->bytes_written += size;
+ *off += aio.size;
+ rec->bytes_written += aio.size;
if (switch_output_size(rec))
trigger_hit(&switch_output_trigger);
+ } else {
+ /*
+ * Decrement map->refcount incremented in record__aio_pushfn()
+ * back if record__aio_write() operation failed to start, otherwise
+ * map->refcount is decremented in record__aio_complete() after
+ * aio write operation finishes successfully.
+ */
+ perf_mmap__put(map);
}
return ret;
@@ -273,7 +346,7 @@ static void record__aio_mmap_read_sync(struct record *rec)
struct perf_evlist *evlist = rec->evlist;
struct perf_mmap *maps = evlist->mmap;
- if (!rec->opts.nr_cblocks)
+ if (!record__aio_enabled(rec))
return;
for (i = 0; i < evlist->nr_mmaps; i++) {
@@ -307,13 +380,8 @@ static int record__aio_parse(const struct option *opt,
#else /* HAVE_AIO_SUPPORT */
static int nr_cblocks_max = 0;
-static int record__aio_sync(struct perf_mmap *md __maybe_unused, bool sync_all __maybe_unused)
-{
- return -1;
-}
-
-static int record__aio_pushfn(void *to __maybe_unused, struct aiocb *cblock __maybe_unused,
- void *bf __maybe_unused, size_t size __maybe_unused, off_t off __maybe_unused)
+static int record__aio_push(struct record *rec __maybe_unused, struct perf_mmap *map __maybe_unused,
+ off_t *off __maybe_unused)
{
return -1;
}
@@ -824,7 +892,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
int rc = 0;
struct perf_mmap *maps;
int trace_fd = rec->data.file.fd;
- off_t off;
+ off_t off = 0;
if (!evlist)
return 0;
@@ -850,20 +918,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
map->flush = 1;
}
if (!record__aio_enabled(rec)) {
- if (perf_mmap__push(map, rec, record__pushfn) != 0) {
+ if (perf_mmap__push(map, rec, record__pushfn) < 0) {
if (sync)
map->flush = flush;
rc = -1;
goto out;
}
} else {
- int idx;
- /*
- * Call record__aio_sync() to wait till map->data buffer
- * becomes available after previous aio write request.
- */
- idx = record__aio_sync(map, false);
- if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
+ if (record__aio_push(rec, map, &off) < 0) {
record__aio_set_pos(trace_fd, off);
if (sync)
map->flush = flush;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index d85e73fc82e2..868c0b0e909c 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -289,80 +289,6 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
zfree(&map->aio.cblocks);
zfree(&map->aio.aiocb);
}
-
-int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off),
- off_t *off)
-{
- u64 head = perf_mmap__read_head(md);
- unsigned char *data = md->base + page_size;
- unsigned long size, size0 = 0;
- void *buf;
- int rc = 0;
-
- rc = perf_mmap__read_init(md);
- if (rc < 0)
- return (rc == -EAGAIN) ? 0 : -1;
-
- /*
- * md->base data is copied into md->data[idx] buffer to
- * release space in the kernel buffer as fast as possible,
- * thru perf_mmap__consume() below.
- *
- * That lets the kernel to proceed with storing more
- * profiling data into the kernel buffer earlier than other
- * per-cpu kernel buffers are handled.
- *
- * Coping can be done in two steps in case the chunk of
- * profiling data crosses the upper bound of the kernel buffer.
- * In this case we first move part of data from md->start
- * till the upper bound and then the reminder from the
- * beginning of the kernel buffer till the end of
- * the data chunk.
- */
-
- size = md->end - md->start;
-
- if ((md->start & md->mask) + size != (md->end & md->mask)) {
- buf = &data[md->start & md->mask];
- size = md->mask + 1 - (md->start & md->mask);
- md->start += size;
- memcpy(md->aio.data[idx], buf, size);
- size0 = size;
- }
-
- buf = &data[md->start & md->mask];
- size = md->end - md->start;
- md->start += size;
- memcpy(md->aio.data[idx] + size0, buf, size);
-
- /*
- * Increment md->refcount to guard md->data[idx] buffer
- * from premature deallocation because md object can be
- * released earlier than aio write request started
- * on mmap->data[idx] is complete.
- *
- * perf_mmap__put() is done at record__aio_complete()
- * after started request completion.
- */
- perf_mmap__get(md);
-
- md->prev = head;
- perf_mmap__consume(md);
-
- rc = push(to, &md->aio.cblocks[idx], md->aio.data[idx], size0 + size, *off);
- if (!rc) {
- *off += size0 + size;
- } else {
- /*
- * Decrement md->refcount back if aio write
- * operation failed to start.
- */
- perf_mmap__put(md);
- }
-
- return rc;
-}
#else /* !HAVE_AIO_SUPPORT */
static int perf_mmap__aio_enabled(struct perf_mmap *map __maybe_unused)
{
@@ -566,7 +492,7 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
rc = perf_mmap__read_init(md);
if (rc < 0)
- return (rc == -EAGAIN) ? 0 : -1;
+ return (rc == -EAGAIN) ? 1 : -1;
size = md->end - md->start;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 4e2f58d95c1f..274ce389cd84 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -101,18 +101,6 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map);
int perf_mmap__push(struct perf_mmap *md, void *to,
int push(struct perf_mmap *map, void *to, void *buf, size_t size));
-#ifdef HAVE_AIO_SUPPORT
-int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off),
- off_t *off);
-#else
-static inline int perf_mmap__aio_push(struct perf_mmap *md __maybe_unused, void *to __maybe_unused, int idx __maybe_unused,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off) __maybe_unused,
- off_t *off __maybe_unused)
-{
- return 0;
-}
-#endif
size_t perf_mmap__mmap_len(struct perf_mmap *map);
--
2.20.1
Initialized decompression part of Zstd based API so COMPRESSED records
would be decompressed into the resulting output data file.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-inject.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 24086b7f1b14..8e0e06d3edfc 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -837,6 +837,9 @@ int cmd_inject(int argc, const char **argv)
if (inject.session == NULL)
return -1;
+ if (zstd_init(&(inject.session->zstd_data), 0) < 0)
+ pr_warning("Decompression initialization failed.\n");
+
if (inject.build_ids) {
/*
* to make sure the mmap records are ordered correctly
@@ -867,6 +870,7 @@ int cmd_inject(int argc, const char **argv)
ret = __cmd_inject(&inject);
out_delete:
+ zstd_fini(&(inject.session->zstd_data));
perf_session__delete(inject.session);
return ret;
}
--
2.20.1
Implemented basic integration test for Zstd based trace
compression/decompression in record and report modes.
Signed-off-by: Alexey Budankov <[email protected]>
---
.../tests/shell/record+zstd_comp_decomp.sh | 35 +++++++++++++++++++
1 file changed, 35 insertions(+)
create mode 100755 tools/perf/tests/shell/record+zstd_comp_decomp.sh
diff --git a/tools/perf/tests/shell/record+zstd_comp_decomp.sh b/tools/perf/tests/shell/record+zstd_comp_decomp.sh
new file mode 100755
index 000000000000..c0ff142bf66a
--- /dev/null
+++ b/tools/perf/tests/shell/record+zstd_comp_decomp.sh
@@ -0,0 +1,35 @@
+#!/bin/sh
+# record trace Zstd compression/decompression
+
+trace_file=$(mktemp /tmp/perf.data.XXX)
+perf_tool=perf
+output=/dev/null
+
+skip_if_no_z_record() {
+ $perf_tool record -h 2>&1 | grep '\-z, \-\-compression\-level'
+}
+
+collect_z_record() {
+ echo "Collecting compressed record trace file:"
+ $perf_tool record -o $trace_file -g -z -F 25000 -- \
+ dd count=1000 if=/dev/random of=/dev/null > $output 2>&1
+}
+
+check_compressed_stats() {
+ echo "Checking compressed events stats:"
+ $perf_tool report -i $trace_file --header --stats | \
+ grep -E "(# compressed : Zstd,)|(COMPRESSED events:)" > $output 2>&1
+}
+
+check_compressed_output() {
+ $perf_tool inject -i $trace_file -o $trace_file.decomp &&
+ $perf_tool report -i $trace_file --stdio | head -n -3 > $trace_file.comp.output &&
+ $perf_tool report -i $trace_file.decomp --stdio | head -n -3 > $trace_file.decomp.output &&
+ diff $trace_file.comp.output $trace_file.decomp.output > $output 2>&1
+}
+
+skip_if_no_z_record || exit 2
+collect_z_record && check_compressed_stats && check_compressed_output
+err=$?
+rm -f $trace_file*
+exit $err
--
2.20.1
Implemented -z,--compression_level[=<n>] option that enables compression
of mmaped kernel data buffers content in runtime during perf record
mode collection. Default option value is 1 (fastest compression).
Compression overhead has been measured for serial and AIO streaming
when profiling matrix multiplication workload:
-------------------------------------------------------------
| SERIAL | AIO-1 |
----------------------------------------------------------------|
|-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
|---------------------------------------------------------------|
| 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
| 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
| 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
| 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
| 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
| 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
-----------------------------------------------------------------
OVH = (Execution time with -z N) / (Execution time with -z 0)
ratio - compression ratio
size - number of bytes that was compressed
size ~= trace size x ratio
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 5 +++++
tools/perf/builtin-record.c | 25 ++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 18fceb49434e..0567bacc2ae6 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -471,6 +471,11 @@ Also at some cases executing less trace write syscalls with bigger data size can
shorter than executing more trace write syscalls with smaller data size thus lowering
runtime profiling overhead.
+-z::
+--compression-level[=n]::
+Produce compressed trace using specified level n (default: 1 - fastest compression,
+22 - smallest trace)
+
--all-kernel::
Configure all used events to run in kernel space.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2e083891affa..7258f2964a3b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -440,6 +440,26 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+#ifdef HAVE_ZSTD_SUPPORT
+static unsigned int comp_level_default = 1;
+static int record__parse_comp_level(const struct option *opt,
+ const char *str,
+ int unset)
+{
+ struct record_opts *opts = (struct record_opts *)opt->value;
+
+ if (unset) {
+ opts->comp_level = 0;
+ } else {
+ if (str)
+ opts->comp_level = strtol(str, NULL, 0);
+ if (!opts->comp_level)
+ opts->comp_level = comp_level_default;
+ }
+
+ return 0;
+}
+#endif
static unsigned int comp_level_max = 22;
static int record__comp_enabled(struct record *rec)
@@ -2169,6 +2189,11 @@ static struct option __record_options[] = {
OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
"Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
record__parse_affinity),
+#ifdef HAVE_ZSTD_SUPPORT
+ OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
+ "n", "Produce compressed trace using specified level (default: 1 - fastest compression, 22 - smallest trace)",
+ record__parse_comp_level),
+#endif
OPT_END()
};
--
2.20.1
zstd_init(, comp_level = 0) initializes decompression part of API only
that now consists of zstd_decompress_stream() function.
Trace frames containing PERF_RECORD_COMPRESSED records are decompressed
using zstd_decompress_stream() function into a linked list of mmaped
memory regions of mmap_comp_len size (struct decomp).
After decompression of one COMPRESSED record its content is iterated and
fetched for usual processing. The mmaped memory regions with decompressed
events are kept in the linked list till the tool process termination.
When dumping raw trace (e.g., perf report -D --header) file offsets of
events from compressed records are printed as zero.
Signed-off-by: Alexey Budankov <[email protected]>
---
tools/perf/builtin-report.c | 5 +-
tools/perf/util/compress.h | 11 +++
tools/perf/util/session.c | 129 +++++++++++++++++++++++++++++++++++-
tools/perf/util/session.h | 10 +++
tools/perf/util/tool.h | 2 +
tools/perf/util/zstd.c | 40 +++++++++++
6 files changed, 195 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 1921aaa9cece..f8f899245289 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1260,6 +1260,9 @@ int cmd_report(int argc, const char **argv)
if (session == NULL)
return -1;
+ if (zstd_init(&(session->zstd_data), 0) < 0)
+ pr_warning("Decompression initialization failed. Reported data may be incomplete.\n");
+
if (report.queue_size) {
ordered_events__set_alloc_size(&session->ordered_events,
report.queue_size);
@@ -1450,7 +1453,7 @@ int cmd_report(int argc, const char **argv)
error:
if (report.ptime_range)
zfree(&report.ptime_range);
-
+ zstd_fini(&(session->zstd_data));
perf_session__delete(session);
return ret;
}
diff --git a/tools/perf/util/compress.h b/tools/perf/util/compress.h
index d00d7cb095aa..46127f7e4563 100644
--- a/tools/perf/util/compress.h
+++ b/tools/perf/util/compress.h
@@ -20,6 +20,7 @@ bool lzma_is_compressed(const char *input);
struct zstd_data {
#ifdef HAVE_ZSTD_SUPPORT
ZSTD_CStream *cstream;
+ ZSTD_DStream *dstream;
#endif
};
@@ -32,6 +33,9 @@ size_t zstd_compress_stream_to_records(struct zstd_data *data,
void *dst, size_t dst_size, void *src, size_t src_size, size_t max_record_size,
size_t process_header(void *record, size_t increment));
+size_t zstd_decompress_stream(struct zstd_data *data,
+ void *src, size_t src_size, void *dst, size_t dst_size);
+
#else /* !HAVE_ZSTD_SUPPORT */
static inline int zstd_init(struct zstd_data *data __maybe_unused, int level __maybe_unused)
@@ -53,6 +57,13 @@ static inline size_t zstd_compress_stream_to_records(struct zstd_data *data __ma
return 0;
}
+static inline size_t zstd_decompress_stream(struct zstd_data *data __maybe_unused, void *src __maybe_unused,
+ size_t src_size __maybe_unused, void *dst __maybe_unused,
+ size_t dst_size __maybe_unused)
+{
+ return 0;
+}
+
#endif
#endif /* PERF_COMPRESS_H */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 0ec34227bd60..81b7c09a97d1 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -29,6 +29,67 @@
#include "stat.h"
#include "arch/common.h"
+#ifdef HAVE_ZSTD_SUPPORT
+static int perf_session__process_compressed_event(struct perf_session *session,
+ union perf_event *event, u64 file_offset)
+{
+ void *src;
+ size_t decomp_size, src_size;
+ u64 decomp_last_rem = 0;
+ size_t decomp_len = session->header.env.comp_mmap_len;
+ struct decomp *decomp, *decomp_last = session->decomp_last;
+
+ decomp = mmap(NULL, sizeof(struct decomp) + decomp_len, PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ if (decomp == MAP_FAILED) {
+ pr_err("Couldn't allocate memory for decompression\n");
+ return -1;
+ }
+
+ decomp->file_pos = file_offset;
+ decomp->head = 0;
+
+ if (decomp_last) {
+ decomp_last_rem = decomp_last->size - decomp_last->head;
+ memcpy(decomp->data, &(decomp_last->data[decomp_last->head]), decomp_last_rem);
+ decomp->size = decomp_last_rem;
+ }
+
+ src = (void *)event + sizeof(struct compressed_event);
+ src_size = event->pack.header.size - sizeof(struct compressed_event);
+
+ decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
+ &(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
+ if (!decomp_size) {
+ munmap(decomp, sizeof(struct decomp) + decomp_len);
+ pr_err("Couldn't decompress data\n");
+ return -1;
+ }
+
+ decomp->size += decomp_size;
+
+ if (session->decomp == NULL) {
+ session->decomp = decomp;
+ session->decomp_last = decomp;
+ } else {
+ session->decomp_last->next = decomp;
+ session->decomp_last = decomp;
+ }
+
+ pr_debug("decomp (B): %ld to %ld\n", src_size, decomp_size);
+
+ return 0;
+}
+#else /* !HAVE_ZSTD_SUPPORT */
+static int perf_session__process_compressed_event(struct perf_session *session __maybe_unused,
+ union perf_event *event __maybe_unused,
+ u64 file_offset __maybe_unused)
+{
+ dump_printf(": unhandled!\n");
+ return 0;
+}
+#endif
+
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
@@ -196,6 +257,21 @@ static void perf_session__delete_threads(struct perf_session *session)
machine__delete_threads(&session->machines.host);
}
+static void perf_session__release_decomp_events(struct perf_session *session)
+{
+ struct decomp *next, *decomp;
+ size_t decomp_len;
+ next = session->decomp;
+ decomp_len = session->header.env.comp_mmap_len;
+ do {
+ decomp = next;
+ if (decomp == NULL)
+ break;
+ next = decomp->next;
+ munmap(decomp, decomp_len + sizeof(struct decomp));
+ } while (1);
+}
+
void perf_session__delete(struct perf_session *session)
{
if (session == NULL)
@@ -204,6 +280,7 @@ void perf_session__delete(struct perf_session *session)
auxtrace_index__free(&session->auxtrace_index);
perf_session__destroy_kernel_maps(session);
perf_session__delete_threads(session);
+ perf_session__release_decomp_events(session);
perf_env__exit(&session->header.env);
machines__exit(&session->machines);
if (session->data)
@@ -429,6 +506,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->time_conv = process_event_op2_stub;
if (tool->feature == NULL)
tool->feature = process_event_op2_stub;
+ if (tool->compressed == NULL)
+ tool->compressed = perf_session__process_compressed_event;
}
static void swap_sample_id_all(union perf_event *event, void *data)
@@ -1372,7 +1451,8 @@ static s64 perf_session__process_user_event(struct perf_session *session,
int fd = perf_data__fd(session->data);
int err;
- dump_event(session->evlist, event, file_offset, &sample);
+ if (event->header.type != PERF_RECORD_COMPRESSED)
+ dump_event(session->evlist, event, file_offset, &sample);
/* These events are processed right away */
switch (event->header.type) {
@@ -1425,6 +1505,11 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return tool->time_conv(session, event);
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
+ case PERF_RECORD_COMPRESSED:
+ err = tool->compressed(session, event, file_offset);
+ if (err)
+ dump_event(session->evlist, event, file_offset, &sample);
+ return 0;
default:
return -EINVAL;
}
@@ -1707,6 +1792,8 @@ static int perf_session__flush_thread_stacks(struct perf_session *session)
volatile int session_done;
+static int __perf_session__process_decomp_events(struct perf_session *session);
+
static int __perf_session__process_pipe_events(struct perf_session *session)
{
struct ordered_events *oe = &session->ordered_events;
@@ -1787,6 +1874,10 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
if (skip > 0)
head += skip;
+ err = __perf_session__process_decomp_events(session);
+ if (err)
+ goto out_err;
+
if (!session_done())
goto more;
done:
@@ -1835,6 +1926,38 @@ fetch_mmaped_event(struct perf_session *session,
return event;
}
+static int __perf_session__process_decomp_events(struct perf_session *session)
+{
+ s64 skip;
+ u64 size, file_pos = 0;
+ union perf_event *event;
+ struct decomp *decomp = session->decomp_last;
+
+ if (!decomp)
+ return 0;
+
+ while (decomp->head < decomp->size && !session_done()) {
+ event = fetch_mmaped_event(session, decomp->head, decomp->size, decomp->data);
+ if (!event)
+ break;
+
+ size = event->header.size;
+ if (size < sizeof(struct perf_event_header) ||
+ (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
+ decomp->file_pos + decomp->head, event->header.size, event->header.type);
+ return -EINVAL;
+ }
+
+ if (skip)
+ size += skip;
+
+ decomp->head += size;
+ }
+
+ return 0;
+}
+
/*
* On 64bit we can mmap the data file in one go. No need for tiny mmap
* slices. On 32bit we use 32MB.
@@ -1942,6 +2065,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,
head += size;
file_pos += size;
+ err = __perf_session__process_decomp_events(session);
+ if (err)
+ goto out;
+
ui_progress__update(prog, size);
if (session_done())
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 6c984c895924..dd8920b745bc 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -39,6 +39,16 @@ struct perf_session {
u64 bytes_transferred;
u64 bytes_compressed;
struct zstd_data zstd_data;
+ struct decomp *decomp;
+ struct decomp *decomp_last;
+};
+
+struct decomp {
+ struct decomp *next;
+ u64 file_pos;
+ u64 head;
+ size_t size;
+ char data[];
};
struct perf_tool;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 250391672f9f..9096a6e3de59 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,6 +28,7 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
@@ -72,6 +73,7 @@ struct perf_tool {
stat,
stat_round,
feature;
+ event_op4 compressed;
event_op3 auxtrace;
bool ordered_events;
bool ordering_requires_timestamps;
diff --git a/tools/perf/util/zstd.c b/tools/perf/util/zstd.c
index 6d4f69d57567..15aa02c933ef 100644
--- a/tools/perf/util/zstd.c
+++ b/tools/perf/util/zstd.c
@@ -9,6 +9,21 @@ int zstd_init(struct zstd_data *data, int level)
{
size_t ret;
+ data->dstream = ZSTD_createDStream();
+ if (data->dstream == NULL) {
+ pr_err("Couldn't create decompression stream.\n");
+ return -1;
+ }
+
+ ret = ZSTD_initDStream(data->dstream);
+ if (ZSTD_isError(ret)) {
+ pr_err("Failed to initialize decompression stream: %s\n", ZSTD_getErrorName(ret));
+ return -1;
+ }
+
+ if (!level)
+ return 0;
+
data->cstream = ZSTD_createCStream();
if (data->cstream == NULL) {
pr_err("Couldn't create compression stream.\n");
@@ -26,6 +41,11 @@ int zstd_init(struct zstd_data *data, int level)
int zstd_fini(struct zstd_data *data)
{
+ if (data->dstream) {
+ ZSTD_freeDStream(data->dstream);
+ data->dstream = NULL;
+ }
+
if (data->cstream) {
ZSTD_freeCStream(data->cstream);
data->cstream = NULL;
@@ -69,3 +89,23 @@ size_t zstd_compress_stream_to_records(struct zstd_data *data,
return compressed;
}
+size_t zstd_decompress_stream(struct zstd_data *data,
+ void *src, size_t src_size, void *dst, size_t dst_size)
+{
+ size_t ret;
+ ZSTD_inBuffer input = { src, src_size, 0 };
+ ZSTD_outBuffer output = { dst, dst_size, 0 };
+
+ while (input.pos < input.size) {
+ ret = ZSTD_decompressStream(data->dstream, &output, &input);
+ if (ZSTD_isError(ret)) {
+ pr_err("failed to decompress (B): %ld -> %ld : %s\n",
+ src_size, output.size, ZSTD_getErrorName(ret));
+ break;
+ }
+ output.dst = dst + output.pos;
+ output.size = dst_size - output.pos;
+ }
+
+ return output.pos;
+}
--
2.20.1
Hi,
This is a gentle reminder regarding the patch set below.
Thanks,
Alexey
On 18.03.2019 20:36, Alexey Budankov wrote:
>
> The patch set implements runtime trace compression (-z option) in
> record mode and trace auto decompression in report and inject modes.
> Streaming Zstd API [1] is used for compression and decompression of
> data that come from kernel mmaped data buffers.
>
> Usage of implemented -z,--compression_level=n option provides ~3-5x
> avg. trace file size reduction on variety of tested workloads what
> saves storage space on larger server systems where trace file size
> can easily reach several tens or even hundreds of GiBs, especially
> when profiling with dwarf-based stacks and tracing of context switches.
> Default option value is 1 (fastest compression).
>
> Implemented --mmap-flush option can be used to specify minimal size
> of data chunk that is extracted from mmaped kernel buffer to store
> into a trace. The option is independent from -z setting and doesn't
> vary with compression level. The default option value is 1 byte what
> means every time trace writing thread finds some new data in the
> mmaped buffer the data is extracted, possibly compressed and written
> to a trace. The option serves two purposes the first one is to increase
> the compression ratio of trace data and the second one is to avoid
> live-lock self tool process monitoring in system wide (-a) profiling
> mode. Profiling in system wide mode with compression (-a -z) can
> additionally induce data into the kernel buffers along with the data
> from monitored processes. If performance data rate and volume from
> the monitored processes is high then trace streaming and compression
> activity in the tool is also high. It can lead to subtle live-lock
> effect of endless activity when compression of single new byte from
> some of mmaped kernel buffer induces the next single byte at some
> mmaped buffer. So perf tool thread never stops on polling event file
> descriptors. Varying data chunk size to be extracted from mmap buffers
> allows avoiding live-locking self monitoring in system wide mode and
> makes mmap buffers polling loop manageable. Possible usage examples:
>
> $ tools/perf/perf record -z -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio -z -e cycles -- matrix.gcc
> $ tools/perf/perf record -z --mmap-flush 1024 -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio -z --mmap-flush 1K -e cycles -- matrix.gcc
>
> Runtime compression overhead has been measured for serial and AIO
> trace writing modes when profiling matrix multiplication workload:
>
> -------------------------------------------------------------
> | SERIAL | AIO-1 |
> ----|-----------------------------|-----------------------------|
> |-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
> |---|--------|--------------------|--------|--------------------|
> | 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
> | 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
> | 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
> | 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
> | 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
> | 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
> -----------------------------------------------------------------
>
> OVH = (Execution time with -z N) / (Execution time with -z 0)
>
> ratio - compression ratio
> size - number of bytes that was compressed
>
> size ~= trace file x ratio
>
> See complete description of measurement conditions with details below.
>
> Introduced compression functionality can be disabled or configured from
> the command line using NO_LIBZSTD and LIBZSTD_DIR defines:
>
> $ make -C tools/perf NO_LIBZSTD=1 clean all
> $ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
>
> If your system has some version of the zstd package preinstalled then
> the build system finds and uses it during the build. Auto detection
> feature status is reported just before compilation starts, as usual.
> If you still prefer to compile with some other version of zstd you have
> capability to refer the compilation to that version using LIBZSTD_DIR
> define.
>
> See 'perf test' results below for enabled and disabled (NO_LIBZSTD=1)
> feature configurations.
>
> ---
> Alexey Budankov (12):
> feature: implement libzstd check, LIBZSTD_DIR and NO_LIBZSTD defines
> perf record: implement --mmap-flush=<number> option
> perf session: define bytes_transferred and bytes_compressed metrics
> perf record: implement COMPRESSED event record and its attributes
> perf mmap: implement dedicated memory buffer for data compression
> perf util: introduce Zstd streaming based compression API
> perf record: implement compression for serial trace streaming
> perf record: implement compression for AIO trace streaming
> perf record: implement -z,--compression_level[=<n>] option
> perf report: implement record trace decompression
> perf inject: enable COMPRESSED records decompression
> perf tests: implement Zstd comp/decomp integration test
>
> tools/build/Makefile.feature | 6 +-
> tools/build/feature/Makefile | 6 +-
> tools/build/feature/test-all.c | 5 +
> tools/build/feature/test-libzstd.c | 12 +
> tools/perf/Documentation/perf-record.txt | 17 ++
> .../Documentation/perf.data-file-format.txt | 24 ++
> tools/perf/Makefile.config | 20 ++
> tools/perf/Makefile.perf | 3 +
> tools/perf/builtin-inject.c | 4 +
> tools/perf/builtin-record.c | 285 +++++++++++++++---
> tools/perf/builtin-report.c | 5 +-
> tools/perf/builtin-version.c | 2 +
> tools/perf/perf.h | 2 +
> .../tests/shell/record+zstd_comp_decomp.sh | 35 +++
> tools/perf/util/Build | 2 +
> tools/perf/util/compress.h | 54 ++++
> tools/perf/util/env.h | 11 +
> tools/perf/util/event.c | 1 +
> tools/perf/util/event.h | 7 +
> tools/perf/util/evlist.c | 8 +-
> tools/perf/util/evlist.h | 3 +-
> tools/perf/util/header.c | 55 +++-
> tools/perf/util/header.h | 1 +
> tools/perf/util/mmap.c | 106 ++-----
> tools/perf/util/mmap.h | 17 +-
> tools/perf/util/session.c | 129 +++++++-
> tools/perf/util/session.h | 14 +
> tools/perf/util/tool.h | 2 +
> tools/perf/util/zstd.c | 111 +++++++
> 29 files changed, 813 insertions(+), 134 deletions(-)
> create mode 100644 tools/build/feature/test-libzstd.c
> create mode 100755 tools/perf/tests/shell/record+zstd_comp_decomp.sh
> create mode 100644 tools/perf/util/zstd.c
>
> ---
> Changes in v10:
> - separated decomp list deallocation into perf_session__release_decomp_events
> - extended the test with suggested decompression validation
>
> Changes in v9:
> - fixed issue with improper max COMPRESSED record size calculation
> - moved up calculation of ratio variable in 03/12
> - made minor corrections in changelogs
> - corrected several checkpatch.pl warnings and errors
>
> Changes in v8:
> - avoid using -f for --mmap-flush option
> - move stubs to compress.h and avoid unconditional compiling of zstd.c
> - fixed silent interruption for perf record collection
> - implemented -z 1 as default
>
> Changes in v7:
> - rebased to Arnaldo's perf/core tip
> - implemented B/K/M/G suffixes for -f option
> - reworked record__mmap_read_evlist() to replace perf_mmap__aio_push()
> by perf_mmap__push() in AIO case
> - extended "[ perf record: Captured ... ]" message with compression statistics
> - extended changelog for v5 06/10
> - used PERF_SAMPLE_MAX_SIZE for compressed record size calculations
> - renamed record__zstd_compress to zstd_compress and
> record__process_comp_header to process_comp_header
> - separated nr_cblocks_max applying
>
> Changes in v6:
> - extended docs with description of PERF_RECORD_COMPRESSED record and
> HEADER_COMPRESSED feature layouts
>
> Changes in v5:
> - implemented perf version --build-options extension for aio and zstd - see TESTING below
> - adjusted commit message and perf-record.txt content for -f option
> - fixed build errors in case of NO_AIO=1 and NO_LIBZSTD=1
>
> Changes in v4:
> - implemented integration tests
> - adjusted zstd_ stub functions
> - rebased on tip of Arnaldo's perf/core
>
> Changes in v3:
> - moved -f,--mmap-flush option implementation into a separate patch
> - moved definition and printing of bytes_transferred and bytes_compressed into a separate patch
> - moved COMPRESSED feature into a separate patch
> - added versioning and stored COMPRESSED feature attributes as u32
> - implemented dedicated memory buffer for compression in case of serial streaming
> - moved low level Zstd based compression functions into util/{compress.h,zstd.c}
> - made compress function to be a param of __push(), __aio_push() functions
> - enabled perf inject to decompress COMPRESSED records
> - measured compression overhead for serial and AIO streaming using
> basic matrix multiplication workload on 8 core skylake
>
> Changes in v2:
> - moved compression/decompression code to session layer
> - enabled allocation aio data buffers for compression
> - enabled trace compression for serial trace streaming
>
> ---
> [1] https://github.com/facebook/zstd
>
> ---
> OVERHEAD MEASUREMENTS:
>
> uname -a
> Linux localhost 4.20.7-200.fc29.x86_64 #1 SMP Wed Feb 6 19:16:42 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> cat /proc/cpuinfo
> processor : 7
> vendor_id : GenuineIntel
> cpu family : 6
> model : 94
> model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> stepping : 3
> microcode : 0xc6
> cpu MHz : 4021.884
> cache size : 8192 KB
> physical id : 0
> siblings : 8
> core id : 3
> cpu cores : 4
> apicid : 7
> initial apicid : 7
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
> bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> bogomips : 8016.00
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:
>
> -----------------------------------------------------------------
> #!/bin/bash -xv
>
> echo 0 > /proc/sys/kernel/perf_event_paranoid
> + echo 0
> cat /proc/sys/kernel/perf_event_paranoid
> + cat /proc/sys/kernel/perf_event_paranoid
> 0
>
> echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> + echo performance
> + tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
> performance
>
> for i in 0 1 2 3 5 8
> do
> /usr/bin/time tools/perf/perf record -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 0 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fe36de5c010
> Offs of buf1 = 0x7fe36de5c180
> Addr of buf2 = 0x7fe36be5b010
> Offs of buf2 = 0x7fe36be5b1c0
> Addr of buf3 = 0x7fe369e5a010
> Offs of buf3 = 0x7fe369e5a100
> Addr of buf4 = 0x7fe367e59010
> Offs of buf4 = 0x7fe367e59140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 16.949 seconds
> [ perf record: Woken up 309 times to write data ]
> [ perf record: Captured and wrote 179.424 MB perf.data ]
> 133.67user 0.35system 0:17.08elapsed 784%CPU (0avgtext+0avgdata 100580maxresident)k
> 0inputs+367480outputs (0major+34737minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 1 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fcaec334010
> Offs of buf1 = 0x7fcaec334180
> Addr of buf2 = 0x7fcaea333010
> Offs of buf2 = 0x7fcaea3331c0
> Addr of buf3 = 0x7fcae8332010
> Offs of buf3 = 0x7fcae8332100
> Addr of buf4 = 0x7fcae6331010
> Offs of buf4 = 0x7fcae6331140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.608 seconds
> [ perf record: Woken up 595 times to write data ]
> [ perf record: Compressed 181.148 MB to 21.497 MB, ratio is 8.427 ]
> [ perf record: Captured and wrote 21.527 MB perf.data ]
> 135.69user 0.24system 0:17.73elapsed 766%CPU (0avgtext+0avgdata 100500maxresident)k
> 0inputs+44112outputs (0major+35033minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 2 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f1336f8d010
> Offs of buf1 = 0x7f1336f8d180
> Addr of buf2 = 0x7f1334f8c010
> Offs of buf2 = 0x7f1334f8c1c0
> Addr of buf3 = 0x7f1332f8b010
> Offs of buf3 = 0x7f1332f8b100
> Addr of buf4 = 0x7f1330f8a010
> Offs of buf4 = 0x7f1330f8a140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.175 seconds
> [ perf record: Woken up 521 times to write data ]
> [ perf record: Compressed 186.953 MB to 23.210 MB, ratio is 8.055 ]
> [ perf record: Captured and wrote 23.239 MB perf.data ]
> 140.21user 0.25system 0:18.32elapsed 766%CPU (0avgtext+0avgdata 100560maxresident)k
> 0inputs+47608outputs (0major+35263minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 3 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f97060e3010
> Offs of buf1 = 0x7f97060e3180
> Addr of buf2 = 0x7f97040e2010
> Offs of buf2 = 0x7f97040e21c0
> Addr of buf3 = 0x7f97020e1010
> Offs of buf3 = 0x7f97020e1100
> Addr of buf4 = 0x7f97000e0010
> Offs of buf4 = 0x7f97000e0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.688 seconds
> [ perf record: Woken up 485 times to write data ]
> [ perf record: Compressed 181.908 MB to 21.962 MB, ratio is 8.283 ]
> [ perf record: Captured and wrote 21.991 MB perf.data ]
> 136.87user 0.23system 0:17.81elapsed 769%CPU (0avgtext+0avgdata 100616maxresident)k
> 0inputs+45064outputs (0major+35773minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 5 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f477b444010
> Offs of buf1 = 0x7f477b444180
> Addr of buf2 = 0x7f4779443010
> Offs of buf2 = 0x7f47794431c0
> Addr of buf3 = 0x7f4777442010
> Offs of buf3 = 0x7f4777442100
> Addr of buf4 = 0x7f4775441010
> Offs of buf4 = 0x7f4775441140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.406 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 187.705 MB to 23.170 MB, ratio is 8.101 ]
> [ perf record: Captured and wrote 23.200 MB perf.data ]
> 142.72user 0.26system 0:18.53elapsed 771%CPU (0avgtext+0avgdata 100520maxresident)k
> 0inputs+47528outputs (0major+36928minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 8 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fb5bf032010
> Offs of buf1 = 0x7fb5bf032180
> Addr of buf2 = 0x7fb5bd031010
> Offs of buf2 = 0x7fb5bd0311c0
> Addr of buf3 = 0x7fb5bb030010
> Offs of buf3 = 0x7fb5bb030100
> Addr of buf4 = 0x7fb5b902f010
> Offs of buf4 = 0x7fb5b902f140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.751 seconds
> [ perf record: Woken up 391 times to write data ]
> [ perf record: Compressed 179.191 MB to 19.441 MB, ratio is 9.217 ]
> [ perf record: Captured and wrote 19.502 MB perf.data ]
> 138.90user 0.29system 0:17.88elapsed 778%CPU (0avgtext+0avgdata 100612maxresident)k
> 0inputs+39968outputs (0major+37436minor)pagefaults 0swaps
>
> for i in 0 1 2 3 5 8
> do
> /usr/bin/time tools/perf/perf record --aio=1 -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 0 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7feee4519010
> Offs of buf1 = 0x7feee4519180
> Addr of buf2 = 0x7feee2518010
> Offs of buf2 = 0x7feee25181c0
> Addr of buf3 = 0x7feee0517010
> Offs of buf3 = 0x7feee0517100
> Addr of buf4 = 0x7feede516010
> Offs of buf4 = 0x7feede516140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.912 seconds
> [ perf record: Woken up 390 times to write data ]
> [ perf record: Captured and wrote 187.527 MB perf.data ]
> 139.70user 0.39system 0:18.04elapsed 776%CPU (0avgtext+0avgdata 100624maxresident)k
> 0inputs+384072outputs (0major+35257minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 1 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f72b93ac010
> Offs of buf1 = 0x7f72b93ac180
> Addr of buf2 = 0x7f72b73ab010
> Offs of buf2 = 0x7f72b73ab1c0
> Addr of buf3 = 0x7f72b53aa010
> Offs of buf3 = 0x7f72b53aa100
> Addr of buf4 = 0x7f72b33a9010
> Offs of buf4 = 0x7f72b33a9140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.198 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 188.562 MB to 22.252 MB, ratio is 8.474 ]
> [ perf record: Captured and wrote 22.284 MB perf.data ]
> 141.12user 0.32system 0:18.32elapsed 771%CPU (0avgtext+0avgdata 100576maxresident)k
> 0inputs+45664outputs (0major+35040minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 2 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7ffb9caf3010
> Offs of buf1 = 0x7ffb9caf3180
> Addr of buf2 = 0x7ffb9aaf2010
> Offs of buf2 = 0x7ffb9aaf21c0
> Addr of buf3 = 0x7ffb98af1010
> Offs of buf3 = 0x7ffb98af1100
> Addr of buf4 = 0x7ffb96af0010
> Offs of buf4 = 0x7ffb96af0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.360 seconds
> [ perf record: Woken up 442 times to write data ]
> [ perf record: Compressed 191.773 MB to 24.238 MB, ratio is 7.912 ]
> [ perf record: Captured and wrote 24.290 MB perf.data ]
> 143.76user 0.49system 0:18.50elapsed 779%CPU (0avgtext+0avgdata 100596maxresident)k
> 0inputs+49760outputs (0major+35276minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 3 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f13f2df2010
> Offs of buf1 = 0x7f13f2df2180
> Addr of buf2 = 0x7f13f0df1010
> Offs of buf2 = 0x7f13f0df11c0
> Addr of buf3 = 0x7f13eedf0010
> Offs of buf3 = 0x7f13eedf0100
> Addr of buf4 = 0x7f13ecdef010
> Offs of buf4 = 0x7f13ecdef140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.383 seconds
> [ perf record: Woken up 499 times to write data ]
> [ perf record: Compressed 191.078 MB to 23.246 MB, ratio is 8.220 ]
> [ perf record: Captured and wrote 23.282 MB perf.data ]
> 143.72user 0.34system 0:18.51elapsed 778%CPU (0avgtext+0avgdata 100616maxresident)k
> 0inputs+47704outputs (0major+35783minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 5 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fca0d091010
> Offs of buf1 = 0x7fca0d091180
> Addr of buf2 = 0x7fca0b090010
> Offs of buf2 = 0x7fca0b0901c0
> Addr of buf3 = 0x7fca0908f010
> Offs of buf3 = 0x7fca0908f100
> Addr of buf4 = 0x7fca0708e010
> Offs of buf4 = 0x7fca0708e140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.758 seconds
> [ perf record: Woken up 535 times to write data ]
> [ perf record: Compressed 190.065 MB to 24.430 MB, ratio is 7.780 ]
> [ perf record: Captured and wrote 24.519 MB perf.data ]
> 144.62user 0.66system 0:18.88elapsed 769%CPU (0avgtext+0avgdata 100528maxresident)k
> 0inputs+50232outputs (0major+36942minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 8 -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f7e1f449010
> Offs of buf1 = 0x7f7e1f449180
> Addr of buf2 = 0x7f7e1d448010
> Offs of buf2 = 0x7f7e1d4481c0
> Addr of buf3 = 0x7f7e1b447010
> Offs of buf3 = 0x7f7e1b447100
> Addr of buf4 = 0x7f7e19446010
> Offs of buf4 = 0x7f7e19446140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 20.103 seconds
> [ perf record: Woken up 260 times to write data ]
> [ perf record: Compressed 193.024 MB to 31.588 MB, ratio is 6.111 ]
> [ perf record: Captured and wrote 32.139 MB perf.data ]
> 151.73user 4.21system 0:20.23elapsed 770%CPU (0avgtext+0avgdata 100616maxresident)k
> 0inputs+65848outputs (0major+37431minor)pagefaults 0swaps
>
> ---
> TESTING:
>
> $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
> dwarf: [ on ] # HAVE_DWARF_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
> glibc: [ on ] # HAVE_GLIBC_SUPPORT
> gtk2: [ on ] # HAVE_GTK2_SUPPORT
> syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
> libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> aio: [ OFF ] # HAVE_AIO_SUPPORT
> zstd: [ OFF ] # HAVE_ZSTD_SUPPORT
>
> $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
> dwarf: [ on ] # HAVE_DWARF_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
> glibc: [ on ] # HAVE_GLIBC_SUPPORT
> gtk2: [ on ] # HAVE_GTK2_SUPPORT
> syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
> libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> aio: [ on ] # HAVE_AIO_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
>
> $ make -C tools/perf clean all
> ...
> $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
> 1: vmlinux symtab matches kallsyms : Skip
> 2: Detect openat syscall event : Ok
> 3: Detect openat syscall event on all cpus : Ok
> 4: Read samples using the mmap interface : Ok
> 5: Test data source output : Ok
> 6: Parse event definition strings : Ok
> 7: Simple expression parser : Ok
> 8: PERF_RECORD_* events & perf_sample fields : Ok
> 9: Parse perf pmu format : Ok
> 10: DSO data read : Ok
> 11: DSO data cache : Ok
> 12: DSO data reopen : Ok
> 13: Roundtrip evsel->name : Ok
> 14: Parse sched tracepoints fields : Ok
> 15: syscalls:sys_enter_openat event fields : Ok
> 16: Setup struct perf_event_attr : Ok
> 17: Match and link multiple hists : Ok
> 18: 'import perf' in python : Ok
> 19: Breakpoint overflow signal handler : Ok
> 20: Breakpoint overflow sampling : Ok
> 21: Breakpoint accounting : Ok
> 22: Watchpoint :
> 22.1: Read Only Watchpoint : Skip
> 22.2: Write Only Watchpoint : Ok
> 22.3: Read / Write Watchpoint : Ok
> 22.4: Modify Watchpoint : Ok
> 23: Number of exit events of a simple workload : Ok
> 24: Software clock events period values : Ok
> 25: Object code reading : Ok
> 26: Sample parsing : Ok
> 27: Use a dummy software event to keep tracking : Ok
> 28: Parse with no sample_id_all bit set : Ok
> 29: Filter hist entries : Ok
> 30: Lookup mmap thread : Ok
> 31: Share thread mg : Ok
> 32: Sort output of hist entries : Ok
> 33: Cumulate child hist entries : Ok
> 34: Track with sched_switch : Ok
> 35: Filter fds with revents mask in a fdarray : Ok
> 36: Add fd to a fdarray, making it autogrow : Ok
> 37: kmod_path__parse : Ok
> 38: Thread map : Ok
> 39: LLVM search and compile :
> 39.1: Basic BPF llvm compile : Skip
> 39.2: kbuild searching : Skip
> 39.3: Compile source for BPF prologue generation : Skip
> 39.4: Compile source for BPF relocation : Skip
> 40: Session topology : Ok
> 41: BPF filter :
> 41.1: Basic BPF filtering : Skip
> 41.2: BPF pinning : Skip
> 41.3: BPF prologue generation : Skip
> 41.4: BPF relocation checker : Skip
> 42: Synthesize thread map : Ok
> 43: Remove thread map : Ok
> 44: Synthesize cpu map : Ok
> 45: Synthesize stat config : Ok
> 46: Synthesize stat : Ok
> 47: Synthesize stat round : Ok
> 48: Synthesize attr update : Ok
> 49: Event times : Ok
> 50: Read backward ring buffer : Ok
> 51: Print cpu map : Ok
> 52: Probe SDT events : Ok
> 53: is_printable_array : Ok
> 54: Print bitmap : Ok
> 55: perf hooks : Ok
> 56: builtin clang support : Skip (not compiled in)
> 57: unit_number__scnprintf : Ok
> 58: mem2node : Ok
> 59: x86 rdpmc : Ok
> 60: Convert perf time to TSC : Ok
> 61: DWARF unwind : Ok
> 62: x86 instruction decoder - new instructions : Ok
> 63: x86 bp modify : Ok
> 64: Check open filename arg using perf trace + vfs_getname: Skip
> 65: Add vfs_getname probe to get syscall args filenames : Skip
> 66: probe libc's inet_pton & backtrace it with ping : Ok
> 67: Use vfs_getname probe to get syscall args filenames : Skip
> 68: record trace Zstd compression/decompression : Ok
> ~/abudanko/kernel/acme
>
> $ make -C tools/perf NO_LIBZSTD=1 clean all
> ...
> $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
> ...
> 68: record trace Zstd compression/decompression : Skip
> ~/abudanko/kernel/acme
>
Em Mon, Mar 18, 2019 at 08:40:26PM +0300, Alexey Budankov escreveu:
>
> Implemented --mmap-flush option that specifies minimal number of bytes
> that is extracted from mmaped kernel buffer to store into a trace. The
> default option value is 1 byte what means every time trace writing
> thread finds some new data in the mmaped buffer the data is extracted,
> possibly compressed and written to a trace.
>
> $ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
>
> The option is independent from -z setting, doesn't vary with compression
> level and can serve two purposes.
>
> The first purpose is to increase the compression ratio of a trace data.
> Larger data chunks are compressed more effectively so the implemented
> option allows specifying data chunk size to compress. Also at some cases
> executing more write syscalls with smaller data size can take longer
> than executing less write syscalls with bigger data size due to syscall
> overhead so extracting bigger data chunks specified by the option value
> could additionally decrease runtime overhead.
>
> The second purpose is to avoid self monitoring live-lock issue in system
> wide (-a) profiling mode. Profiling in system wide mode with compression
> (-a -z) can additionally induce data into the kernel buffers along with
> the data from monitored processes. If performance data rate and volume
> from the monitored processes is high then trace streaming and compression
> activity in the tool is also high. High tool process activity can lead
> to subtle live-lock effect when compression of single new byte from some
> of mmaped kernel buffer leads to generation of the next single byte at
> some mmaped buffer. So perf tool process ends up in endless self
> monitoring.
>
> Implemented sync parameter is the mean to force data move independently
> from the specified flush threshold value. Despite the provided flush
> value the tool needs capability to unconditionally drain memory buffers,
> at least in the end of the collection.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/Documentation/perf-record.txt | 12 +++++
> tools/perf/builtin-record.c | 65 +++++++++++++++++++++---
> tools/perf/perf.h | 1 +
> tools/perf/util/evlist.c | 6 +--
> tools/perf/util/evlist.h | 3 +-
> tools/perf/util/mmap.c | 4 +-
> tools/perf/util/mmap.h | 3 +-
> 7 files changed, 82 insertions(+), 12 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 8f0c2be34848..18fceb49434e 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -459,6 +459,18 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
> node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
> cpu - thread affinity mask is set to cpu of the processed mmap buffer
>
> +--mmap-flush=number::
> +Specify minimal number of bytes that is extracted from mmap data pages and stored
> +into a trace. The number specification is possible using B/K/M/G suffixes. Maximal allowed
> +value is a quarter of the size of mmaped data pages. The default option value is 1 byte
I found this annoying, I tried first with the default value:
perf trace -m 2048 --call-graph dwarf -e write -- perf record --mmap-flush
<SNIP> the first writes for the synthesized data:
107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
record__pushfn (/home/acme/bin/perf)
perf_mmap__push (/home/acme/bin/perf)
record__mmap_read_evlist (inlined)
record__mmap_read_all (inlined)
__cmd_record (inlined)
cmd_record (/home/acme/bin/perf)
12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
<SNIP same backtrace as in the 107.561 timestamp>
12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
<SNIP same backtrace as in the 107.561 timestamp>
12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
<SNIP same backtrace as in the 107.561 timestamp>
Then with --mmap-flush 16M, and then the writes to perf.data were always
more than 132096, which is the limit that it silently set, I think we
should warn this record__mmap_flush_parse, something like:
"max flush is a quarter of the mmap size, if wanting to bump the mmap
flush further, bump the mmap size as well using -m/--mmap-pages"
Found this using -v, which shows the mmap size twice, one line after the
next one:
mmap flush: 132096
mmap size 528384B
mmap size 528384B
I reflowed a bit the man page and added committer notes testing it, end
result is at the bottom of this message, I also had to rename 'sync' to
'synch' to get it to build with other glibcs:
CC /tmp/build/perf/builtin-kmem.o
cc1: warnings being treated as errors
builtin-record.c: In function 'record__mmap_read_evlist':
builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
builtin-record.c: In function 'record__mmap_read_all':
builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
mv: cannot stat `/tmp/build/perf/.builtin-record.o.tmp': No such file or directory
commit 221771de64b6bd0422f451e2c808d75eb3721814
Author: Alexey Budankov <[email protected]>
Date: Mon Mar 18 20:40:26 2019 +0300
perf record: Implement --mmap-flush=<number> option
Implement a --mmap-flush option that specifies minimal number of bytes
that is extracted from mmaped kernel buffer to store into a trace. The
default option value is 1 byte what means every time trace writing
thread finds some new data in the mmaped buffer the data is extracted,
possibly compressed and written to a trace.
$ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
The option is independent from -z setting, doesn't vary with compression
level and can serve two purposes.
The first purpose is to increase the compression ratio of a trace data.
Larger data chunks are compressed more effectively so the implemented
option allows specifying data chunk size to compress. Also at some cases
executing more write syscalls with smaller data size can take longer
than executing less write syscalls with bigger data size due to syscall
overhead so extracting bigger data chunks specified by the option value
could additionally decrease runtime overhead.
The second purpose is to avoid self monitoring live-lock issue in system
wide (-a) profiling mode. Profiling in system wide mode with compression
(-a -z) can additionally induce data into the kernel buffers along with
the data from monitored processes. If performance data rate and volume
from the monitored processes is high then trace streaming and
compression activity in the tool is also high. High tool process
activity can lead to subtle live-lock effect when compression of single
new byte from some of mmaped kernel buffer leads to generation of the
next single byte at some mmaped buffer. So perf tool process ends up in
endless self monitoring.
Implemented synch parameter is the mean to force data move independently
from the specified flush threshold value. Despite the provided flush
value the tool needs capability to unconditionally drain memory buffers,
at least in the end of the collection.
Committer testing:
Running with the default value, i.e. as soon as there is something to
read go on consuming, we first write the synthesized events, small
chunks of about 128 bytes:
# perf trace -m 2048 --call-graph dwarf -e write -- perf record
<SNIP>
101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
process_synthesized_event (/home/acme/bin/perf)
perf_tool__process_synth_event (inlined)
perf_event__synthesize_mmap_events (/home/acme/bin/perf)
Then we move to reading the mmap buffers consuming the events put there
by the kernel perf infrastructure:
107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
record__pushfn (/home/acme/bin/perf)
perf_mmap__push (/home/acme/bin/perf)
record__mmap_read_evlist (inlined)
record__mmap_read_all (inlined)
__cmd_record (inlined)
cmd_record (/home/acme/bin/perf)
12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
<SNIP same backtrace as in the 107.561 timestamp>
12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
<SNIP same backtrace as in the 107.561 timestamp>
12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
<SNIP same backtrace as in the 107.561 timestamp>
If we limit it to write only when more than 16MB are available for
reading, it throttles that to a quarter of the --mmap-pages set for
'perf record', which by default get to 528384 bytes, found out using
'record -v':
mmap flush: 132096
mmap size 528384B
With that in place all the writes coming from
record__mmap_read_evlist(), i.e. from the mmap buffers setup by the
kernel perf infrastructure were at least 132096 bytes long.
Trying with a bigger mmap size:
perf trace -e write perf record -v -m 2048 --mmap-flush 16M
74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888
74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256
74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232
74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032
74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520
74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688
74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760
Was again limited to a quarter of the mmap size:
mmap flush: 2098176
mmap size 8392704B
A warning about that would be good to have but can be added later,
something like:
"max flush is a quarter of the mmap size, if wanting to bump the mmap
flush further, bump the mmap size as well using -m/--mmap-pages"
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 8fe4dffcadd0..58986f4cc190 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -459,6 +459,25 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
cpu - thread affinity mask is set to cpu of the processed mmap buffer
+--mmap-flush=number::
+
+Specify minimal number of bytes that is extracted from mmap data pages and
+processed for output. One can specify the number using B/K/M/G suffixes.
+
+The maximal allowed value is a quarter of the size of mmaped data pages.
+
+The default option value is 1 byte which means that every time that the output
+writing thread finds some new data in the mmaped buffer the data is extracted,
+possibly compressed (-z) and written to the output, perf.data or pipe.
+
+Larger data chunks are compressed more effectively in comparison to smaller
+chunks so extraction of larger chunks from the mmap data pages is preferable
+from the perspective of output size reduction.
+
+Also at some cases executing less output write syscalls with bigger data size
+can take less time than executing more output write syscalls with smaller data
+size thus lowering runtime profiling overhead.
+
--all-kernel::
Configure all used events to run in kernel space.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4e2d953d4bc5..e344232c2ac6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -337,6 +337,41 @@ static int record__aio_enabled(struct record *rec)
return rec->opts.nr_cblocks > 0;
}
+#define MMAP_FLUSH_DEFAULT 1
+static int record__mmap_flush_parse(const struct option *opt,
+ const char *str,
+ int unset)
+{
+ int flush_max;
+ struct record_opts *opts = (struct record_opts *)opt->value;
+ static struct parse_tag tags[] = {
+ { .tag = 'B', .mult = 1 },
+ { .tag = 'K', .mult = 1 << 10 },
+ { .tag = 'M', .mult = 1 << 20 },
+ { .tag = 'G', .mult = 1 << 30 },
+ { .tag = 0 },
+ };
+
+ if (unset)
+ return 0;
+
+ if (str) {
+ opts->mmap_flush = parse_tag_value(str, tags);
+ if (opts->mmap_flush == (int)-1)
+ opts->mmap_flush = strtol(str, NULL, 0);
+ }
+
+ if (!opts->mmap_flush)
+ opts->mmap_flush = MMAP_FLUSH_DEFAULT;
+
+ flush_max = perf_evlist__mmap_size(opts->mmap_pages);
+ flush_max /= 4;
+ if (opts->mmap_flush > flush_max)
+ opts->mmap_flush = flush_max;
+
+ return 0;
+}
+
static int process_synthesized_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
@@ -546,7 +581,8 @@ static int record__mmap_evlist(struct record *rec,
if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
opts->auxtrace_mmap_pages,
opts->auxtrace_snapshot_mode,
- opts->nr_cblocks, opts->affinity) < 0) {
+ opts->nr_cblocks, opts->affinity,
+ opts->mmap_flush) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
@@ -736,7 +772,7 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
}
static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
- bool overwrite)
+ bool overwrite, bool synch)
{
u64 bytes_written = rec->bytes_written;
int i;
@@ -759,12 +795,19 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
off = record__aio_get_pos(trace_fd);
for (i = 0; i < evlist->nr_mmaps; i++) {
+ u64 flush = 0;
struct perf_mmap *map = &maps[i];
if (map->base) {
record__adjust_affinity(rec, map);
+ if (synch) {
+ flush = map->flush;
+ map->flush = 1;
+ }
if (!record__aio_enabled(rec)) {
if (perf_mmap__push(map, rec, record__pushfn) != 0) {
+ if (synch)
+ map->flush = flush;
rc = -1;
goto out;
}
@@ -777,10 +820,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
idx = record__aio_sync(map, false);
if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
record__aio_set_pos(trace_fd, off);
+ if (synch)
+ map->flush = flush;
rc = -1;
goto out;
}
}
+ if (synch)
+ map->flush = flush;
}
if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
@@ -806,15 +853,15 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
return rc;
}
-static int record__mmap_read_all(struct record *rec)
+static int record__mmap_read_all(struct record *rec, bool synch)
{
int err;
- err = record__mmap_read_evlist(rec, rec->evlist, false);
+ err = record__mmap_read_evlist(rec, rec->evlist, false, synch);
if (err)
return err;
- return record__mmap_read_evlist(rec, rec->evlist, true);
+ return record__mmap_read_evlist(rec, rec->evlist, true, synch);
}
static void record__init_features(struct record *rec)
@@ -1340,7 +1387,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (trigger_is_hit(&switch_output_trigger) || done || draining)
perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
- if (record__mmap_read_all(rec) < 0) {
+ if (record__mmap_read_all(rec, false) < 0) {
trigger_error(&auxtrace_snapshot_trigger);
trigger_error(&switch_output_trigger);
err = -1;
@@ -1441,6 +1488,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);
out_child:
+ record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
if (forks) {
@@ -1846,6 +1894,7 @@ static struct record record = {
.uses_mmap = true,
.default_per_cpu = true,
},
+ .mmap_flush = MMAP_FLUSH_DEFAULT,
},
.tool = {
.sample = process_sample_event,
@@ -1912,6 +1961,9 @@ static struct option __record_options[] = {
OPT_CALLBACK('m', "mmap-pages", &record.opts, "pages[,pages]",
"number of mmap data pages and AUX area tracing mmap pages",
record__parse_mmap_pages),
+ OPT_CALLBACK(0, "mmap-flush", &record.opts, "number",
+ "Minimal number of bytes that is extracted from mmap data pages (default: 1)",
+ record__mmap_flush_parse),
OPT_BOOLEAN(0, "group", &record.opts.group,
"put the counters into a counter group"),
OPT_CALLBACK_NOOPT('g', NULL, &callchain_param,
@@ -2224,6 +2276,7 @@ int cmd_record(int argc, const char **argv)
pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
+ pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
err = __cmd_record(&record, argc, argv);
out:
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index c59743def8d3..369eae61068d 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -85,6 +85,7 @@ struct record_opts {
u64 clockid_res_ns;
int nr_cblocks;
int affinity;
+ int mmap_flush;
};
enum perf_affinity {
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index ec78e93085de..54ef0b596134 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1038,7 +1038,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
*/
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity)
+ bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
{
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
@@ -1048,7 +1048,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
* Its value is decided by evsel's write_backward.
* So &mp should not be passed through const pointer.
*/
- struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
+ struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
if (!evlist->mmap)
evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1080,7 +1080,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
{
- return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
+ return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
}
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dcb68f34d2cd..ad705bb1d3d1 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -177,7 +177,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity);
+ bool auxtrace_overwrite, int nr_cblocks,
+ int affinity, int flush);
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
void perf_evlist__munmap(struct perf_evlist *evlist);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index cdc7740fc181..ef3d79b2c90b 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -440,6 +440,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
perf_mmap__setup_affinity_mask(map, mp);
+ map->flush = mp->flush;
+
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
&mp->auxtrace_mp, map->base, fd))
return -1;
@@ -492,7 +494,7 @@ static int __perf_mmap__read_init(struct perf_mmap *md)
md->start = md->overwrite ? head : old;
md->end = md->overwrite ? old : head;
- if (md->start == md->end)
+ if ((md->end - md->start) < md->flush)
return -EAGAIN;
size = md->end - md->start;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index e566c19b242b..b82f8c2d55c4 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -39,6 +39,7 @@ struct perf_mmap {
} aio;
#endif
cpu_set_t affinity_mask;
+ u64 flush;
};
/*
@@ -70,7 +71,7 @@ enum bkw_mmap_state {
};
struct mmap_params {
- int prot, mask, nr_cblocks, affinity;
+ int prot, mask, nr_cblocks, affinity, flush;
struct auxtrace_mmap_params auxtrace_mp;
};
On 29.03.2019 22:02, Arnaldo Carvalho de Melo wrote:
> Em Mon, Mar 18, 2019 at 08:40:26PM +0300, Alexey Budankov escreveu:
>>
>> Implemented --mmap-flush option that specifies minimal number of bytes
>> that is extracted from mmaped kernel buffer to store into a trace. The
>> default option value is 1 byte what means every time trace writing
>> thread finds some new data in the mmaped buffer the data is extracted,
>> possibly compressed and written to a trace.
>>
>> $ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
>> $ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
>>
>> The option is independent from -z setting, doesn't vary with compression
>> level and can serve two purposes.
>>
>> The first purpose is to increase the compression ratio of a trace data.
>> Larger data chunks are compressed more effectively so the implemented
>> option allows specifying data chunk size to compress. Also at some cases
>> executing more write syscalls with smaller data size can take longer
>> than executing less write syscalls with bigger data size due to syscall
>> overhead so extracting bigger data chunks specified by the option value
>> could additionally decrease runtime overhead.
>>
>> The second purpose is to avoid self monitoring live-lock issue in system
>> wide (-a) profiling mode. Profiling in system wide mode with compression
>> (-a -z) can additionally induce data into the kernel buffers along with
>> the data from monitored processes. If performance data rate and volume
>> from the monitored processes is high then trace streaming and compression
>> activity in the tool is also high. High tool process activity can lead
>> to subtle live-lock effect when compression of single new byte from some
>> of mmaped kernel buffer leads to generation of the next single byte at
>> some mmaped buffer. So perf tool process ends up in endless self
>> monitoring.
>>
>> Implemented sync parameter is the mean to force data move independently
>> from the specified flush threshold value. Despite the provided flush
>> value the tool needs capability to unconditionally drain memory buffers,
>> at least in the end of the collection.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/Documentation/perf-record.txt | 12 +++++
>> tools/perf/builtin-record.c | 65 +++++++++++++++++++++---
>> tools/perf/perf.h | 1 +
>> tools/perf/util/evlist.c | 6 +--
>> tools/perf/util/evlist.h | 3 +-
>> tools/perf/util/mmap.c | 4 +-
>> tools/perf/util/mmap.h | 3 +-
>> 7 files changed, 82 insertions(+), 12 deletions(-)
>>
>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>> index 8f0c2be34848..18fceb49434e 100644
>> --- a/tools/perf/Documentation/perf-record.txt
>> +++ b/tools/perf/Documentation/perf-record.txt
>> @@ -459,6 +459,18 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
>> node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
>> cpu - thread affinity mask is set to cpu of the processed mmap buffer
>>
>> +--mmap-flush=number::
>> +Specify minimal number of bytes that is extracted from mmap data pages and stored
>> +into a trace. The number specification is possible using B/K/M/G suffixes. Maximal allowed
>> +value is a quarter of the size of mmaped data pages. The default option value is 1 byte
>
> I found this annoying, I tried first with the default value:
>
> perf trace -m 2048 --call-graph dwarf -e write -- perf record --mmap-flush
> <SNIP> the first writes for the synthesized data:
> 107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
> __libc_write (/usr/lib64/libpthread-2.28.so)
> ion (/home/acme/bin/perf)
> record__write (inlined)
> record__pushfn (/home/acme/bin/perf)
> perf_mmap__push (/home/acme/bin/perf)
> record__mmap_read_evlist (inlined)
> record__mmap_read_all (inlined)
> __cmd_record (inlined)
> cmd_record (/home/acme/bin/perf)
> 12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
> <SNIP same backtrace as in the 107.561 timestamp>
> 12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
> <SNIP same backtrace as in the 107.561 timestamp>
> 12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
> <SNIP same backtrace as in the 107.561 timestamp>
>
>
> Then with --mmap-flush 16M, and then the writes to perf.data were always
> more than 132096, which is the limit that it silently set, I think we
> should warn this record__mmap_flush_parse, something like:
>
> "max flush is a quarter of the mmap size, if wanting to bump the mmap
> flush further, bump the mmap size as well using -m/--mmap-pages"
Makes sense.
>
> Found this using -v, which shows the mmap size twice, one line after the
> next one:
>
> mmap flush: 132096
> mmap size 528384B
> mmap size 528384B
>
> I reflowed a bit the man page and added committer notes testing it, end
> result is at the bottom of this message, I also had to rename 'sync' to
> 'synch' to get it to build with other glibcs:
>
> CC /tmp/build/perf/builtin-kmem.o
> cc1: warnings being treated as errors
> builtin-record.c: In function 'record__mmap_read_evlist':
> builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration
> /usr/include/unistd.h:933: warning: shadowed declaration is here
> builtin-record.c: In function 'record__mmap_read_all':
> builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration
> /usr/include/unistd.h:933: warning: shadowed declaration is here
> mv: cannot stat `/tmp/build/perf/.builtin-record.o.tmp': No such file or directory
Thanks for applied corrections.
~Alexey
>
>
> commit 221771de64b6bd0422f451e2c808d75eb3721814
> Author: Alexey Budankov <[email protected]>
> Date: Mon Mar 18 20:40:26 2019 +0300
>
> perf record: Implement --mmap-flush=<number> option
>
> Implement a --mmap-flush option that specifies minimal number of bytes
> that is extracted from mmaped kernel buffer to store into a trace. The
> default option value is 1 byte what means every time trace writing
> thread finds some new data in the mmaped buffer the data is extracted,
> possibly compressed and written to a trace.
>
> $ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
>
> The option is independent from -z setting, doesn't vary with compression
> level and can serve two purposes.
>
> The first purpose is to increase the compression ratio of a trace data.
> Larger data chunks are compressed more effectively so the implemented
> option allows specifying data chunk size to compress. Also at some cases
> executing more write syscalls with smaller data size can take longer
> than executing less write syscalls with bigger data size due to syscall
> overhead so extracting bigger data chunks specified by the option value
> could additionally decrease runtime overhead.
>
> The second purpose is to avoid self monitoring live-lock issue in system
> wide (-a) profiling mode. Profiling in system wide mode with compression
> (-a -z) can additionally induce data into the kernel buffers along with
> the data from monitored processes. If performance data rate and volume
> from the monitored processes is high then trace streaming and
> compression activity in the tool is also high. High tool process
> activity can lead to subtle live-lock effect when compression of single
> new byte from some of mmaped kernel buffer leads to generation of the
> next single byte at some mmaped buffer. So perf tool process ends up in
> endless self monitoring.
>
> Implemented synch parameter is the mean to force data move independently
> from the specified flush threshold value. Despite the provided flush
> value the tool needs capability to unconditionally drain memory buffers,
> at least in the end of the collection.
>
> Committer testing:
>
> Running with the default value, i.e. as soon as there is something to
> read go on consuming, we first write the synthesized events, small
> chunks of about 128 bytes:
>
> # perf trace -m 2048 --call-graph dwarf -e write -- perf record
> <SNIP>
> 101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120
> __libc_write (/usr/lib64/libpthread-2.28.so)
> ion (/home/acme/bin/perf)
> record__write (inlined)
> process_synthesized_event (/home/acme/bin/perf)
> perf_tool__process_synth_event (inlined)
> perf_event__synthesize_mmap_events (/home/acme/bin/perf)
>
> Then we move to reading the mmap buffers consuming the events put there
> by the kernel perf infrastructure:
>
> 107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
> __libc_write (/usr/lib64/libpthread-2.28.so)
> ion (/home/acme/bin/perf)
> record__write (inlined)
> record__pushfn (/home/acme/bin/perf)
> perf_mmap__push (/home/acme/bin/perf)
> record__mmap_read_evlist (inlined)
> record__mmap_read_all (inlined)
> __cmd_record (inlined)
> cmd_record (/home/acme/bin/perf)
> 12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
> <SNIP same backtrace as in the 107.561 timestamp>
> 12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
> <SNIP same backtrace as in the 107.561 timestamp>
> 12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
> <SNIP same backtrace as in the 107.561 timestamp>
>
> If we limit it to write only when more than 16MB are available for
> reading, it throttles that to a quarter of the --mmap-pages set for
> 'perf record', which by default get to 528384 bytes, found out using
> 'record -v':
>
> mmap flush: 132096
> mmap size 528384B
>
> With that in place all the writes coming from
> record__mmap_read_evlist(), i.e. from the mmap buffers setup by the
> kernel perf infrastructure were at least 132096 bytes long.
>
> Trying with a bigger mmap size:
>
> perf trace -e write perf record -v -m 2048 --mmap-flush 16M
> 74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888
> 74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256
> 74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232
> 74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032
> 74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520
> 74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688
> 74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760
>
> Was again limited to a quarter of the mmap size:
>
> mmap flush: 2098176
> mmap size 8392704B
>
> A warning about that would be good to have but can be added later,
> something like:
>
> "max flush is a quarter of the mmap size, if wanting to bump the mmap
> flush further, bump the mmap size as well using -m/--mmap-pages"
>
> Signed-off-by: Alexey Budankov <[email protected]>
> Reviewed-by: Jiri Olsa <[email protected]>
> Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 8fe4dffcadd0..58986f4cc190 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -459,6 +459,25 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
> node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
> cpu - thread affinity mask is set to cpu of the processed mmap buffer
>
> +--mmap-flush=number::
> +
> +Specify minimal number of bytes that is extracted from mmap data pages and
> +processed for output. One can specify the number using B/K/M/G suffixes.
> +
> +The maximal allowed value is a quarter of the size of mmaped data pages.
> +
> +The default option value is 1 byte which means that every time that the output
> +writing thread finds some new data in the mmaped buffer the data is extracted,
> +possibly compressed (-z) and written to the output, perf.data or pipe.
> +
> +Larger data chunks are compressed more effectively in comparison to smaller
> +chunks so extraction of larger chunks from the mmap data pages is preferable
> +from the perspective of output size reduction.
> +
> +Also at some cases executing less output write syscalls with bigger data size
> +can take less time than executing more output write syscalls with smaller data
> +size thus lowering runtime profiling overhead.
> +
> --all-kernel::
> Configure all used events to run in kernel space.
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 4e2d953d4bc5..e344232c2ac6 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -337,6 +337,41 @@ static int record__aio_enabled(struct record *rec)
> return rec->opts.nr_cblocks > 0;
> }
>
> +#define MMAP_FLUSH_DEFAULT 1
> +static int record__mmap_flush_parse(const struct option *opt,
> + const char *str,
> + int unset)
> +{
> + int flush_max;
> + struct record_opts *opts = (struct record_opts *)opt->value;
> + static struct parse_tag tags[] = {
> + { .tag = 'B', .mult = 1 },
> + { .tag = 'K', .mult = 1 << 10 },
> + { .tag = 'M', .mult = 1 << 20 },
> + { .tag = 'G', .mult = 1 << 30 },
> + { .tag = 0 },
> + };
> +
> + if (unset)
> + return 0;
> +
> + if (str) {
> + opts->mmap_flush = parse_tag_value(str, tags);
> + if (opts->mmap_flush == (int)-1)
> + opts->mmap_flush = strtol(str, NULL, 0);
> + }
> +
> + if (!opts->mmap_flush)
> + opts->mmap_flush = MMAP_FLUSH_DEFAULT;
> +
> + flush_max = perf_evlist__mmap_size(opts->mmap_pages);
> + flush_max /= 4;
> + if (opts->mmap_flush > flush_max)
> + opts->mmap_flush = flush_max;
> +
> + return 0;
> +}
> +
> static int process_synthesized_event(struct perf_tool *tool,
> union perf_event *event,
> struct perf_sample *sample __maybe_unused,
> @@ -546,7 +581,8 @@ static int record__mmap_evlist(struct record *rec,
> if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
> opts->auxtrace_mmap_pages,
> opts->auxtrace_snapshot_mode,
> - opts->nr_cblocks, opts->affinity) < 0) {
> + opts->nr_cblocks, opts->affinity,
> + opts->mmap_flush) < 0) {
> if (errno == EPERM) {
> pr_err("Permission error mapping pages.\n"
> "Consider increasing "
> @@ -736,7 +772,7 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
> }
>
> static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
> - bool overwrite)
> + bool overwrite, bool synch)
> {
> u64 bytes_written = rec->bytes_written;
> int i;
> @@ -759,12 +795,19 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> off = record__aio_get_pos(trace_fd);
>
> for (i = 0; i < evlist->nr_mmaps; i++) {
> + u64 flush = 0;
> struct perf_mmap *map = &maps[i];
>
> if (map->base) {
> record__adjust_affinity(rec, map);
> + if (synch) {
> + flush = map->flush;
> + map->flush = 1;
> + }
> if (!record__aio_enabled(rec)) {
> if (perf_mmap__push(map, rec, record__pushfn) != 0) {
> + if (synch)
> + map->flush = flush;
> rc = -1;
> goto out;
> }
> @@ -777,10 +820,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> idx = record__aio_sync(map, false);
> if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
> record__aio_set_pos(trace_fd, off);
> + if (synch)
> + map->flush = flush;
> rc = -1;
> goto out;
> }
> }
> + if (synch)
> + map->flush = flush;
> }
>
> if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
> @@ -806,15 +853,15 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> return rc;
> }
>
> -static int record__mmap_read_all(struct record *rec)
> +static int record__mmap_read_all(struct record *rec, bool synch)
> {
> int err;
>
> - err = record__mmap_read_evlist(rec, rec->evlist, false);
> + err = record__mmap_read_evlist(rec, rec->evlist, false, synch);
> if (err)
> return err;
>
> - return record__mmap_read_evlist(rec, rec->evlist, true);
> + return record__mmap_read_evlist(rec, rec->evlist, true, synch);
> }
>
> static void record__init_features(struct record *rec)
> @@ -1340,7 +1387,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> if (trigger_is_hit(&switch_output_trigger) || done || draining)
> perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
>
> - if (record__mmap_read_all(rec) < 0) {
> + if (record__mmap_read_all(rec, false) < 0) {
> trigger_error(&auxtrace_snapshot_trigger);
> trigger_error(&switch_output_trigger);
> err = -1;
> @@ -1441,6 +1488,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> record__synthesize_workload(rec, true);
>
> out_child:
> + record__mmap_read_all(rec, true);
> record__aio_mmap_read_sync(rec);
>
> if (forks) {
> @@ -1846,6 +1894,7 @@ static struct record record = {
> .uses_mmap = true,
> .default_per_cpu = true,
> },
> + .mmap_flush = MMAP_FLUSH_DEFAULT,
> },
> .tool = {
> .sample = process_sample_event,
> @@ -1912,6 +1961,9 @@ static struct option __record_options[] = {
> OPT_CALLBACK('m', "mmap-pages", &record.opts, "pages[,pages]",
> "number of mmap data pages and AUX area tracing mmap pages",
> record__parse_mmap_pages),
> + OPT_CALLBACK(0, "mmap-flush", &record.opts, "number",
> + "Minimal number of bytes that is extracted from mmap data pages (default: 1)",
> + record__mmap_flush_parse),
> OPT_BOOLEAN(0, "group", &record.opts.group,
> "put the counters into a counter group"),
> OPT_CALLBACK_NOOPT('g', NULL, &callchain_param,
> @@ -2224,6 +2276,7 @@ int cmd_record(int argc, const char **argv)
> pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
>
> pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
> + pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
>
> err = __cmd_record(&record, argc, argv);
> out:
> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> index c59743def8d3..369eae61068d 100644
> --- a/tools/perf/perf.h
> +++ b/tools/perf/perf.h
> @@ -85,6 +85,7 @@ struct record_opts {
> u64 clockid_res_ns;
> int nr_cblocks;
> int affinity;
> + int mmap_flush;
> };
>
> enum perf_affinity {
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index ec78e93085de..54ef0b596134 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1038,7 +1038,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
> */
> int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
> unsigned int auxtrace_pages,
> - bool auxtrace_overwrite, int nr_cblocks, int affinity)
> + bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
> {
> struct perf_evsel *evsel;
> const struct cpu_map *cpus = evlist->cpus;
> @@ -1048,7 +1048,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
> * Its value is decided by evsel's write_backward.
> * So &mp should not be passed through const pointer.
> */
> - struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
> + struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
>
> if (!evlist->mmap)
> evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
> @@ -1080,7 +1080,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
>
> int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
> {
> - return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
> + return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
> }
>
> int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index dcb68f34d2cd..ad705bb1d3d1 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -177,7 +177,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
>
> int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
> unsigned int auxtrace_pages,
> - bool auxtrace_overwrite, int nr_cblocks, int affinity);
> + bool auxtrace_overwrite, int nr_cblocks,
> + int affinity, int flush);
> int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
> void perf_evlist__munmap(struct perf_evlist *evlist);
>
> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index cdc7740fc181..ef3d79b2c90b 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -440,6 +440,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
>
> perf_mmap__setup_affinity_mask(map, mp);
>
> + map->flush = mp->flush;
> +
> if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
> &mp->auxtrace_mp, map->base, fd))
> return -1;
> @@ -492,7 +494,7 @@ static int __perf_mmap__read_init(struct perf_mmap *md)
> md->start = md->overwrite ? head : old;
> md->end = md->overwrite ? old : head;
>
> - if (md->start == md->end)
> + if ((md->end - md->start) < md->flush)
> return -EAGAIN;
>
> size = md->end - md->start;
> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
> index e566c19b242b..b82f8c2d55c4 100644
> --- a/tools/perf/util/mmap.h
> +++ b/tools/perf/util/mmap.h
> @@ -39,6 +39,7 @@ struct perf_mmap {
> } aio;
> #endif
> cpu_set_t affinity_mask;
> + u64 flush;
> };
>
> /*
> @@ -70,7 +71,7 @@ enum bkw_mmap_state {
> };
>
> struct mmap_params {
> - int prot, mask, nr_cblocks, affinity;
> + int prot, mask, nr_cblocks, affinity, flush;
> struct auxtrace_mmap_params auxtrace_mp;
> };
>
>
Commit-ID: 3b1c5d9659718263c7f9c93af82f98221c58f171
Gitweb: https://git.kernel.org/tip/3b1c5d9659718263c7f9c93af82f98221c58f171
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:39:49 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Mon, 1 Apr 2019 15:18:10 -0300
tools build: Implement libzstd feature check, LIBZSTD_DIR and NO_LIBZSTD defines
Implement libzstd feature check, NO_LIBZSTD and LIBZSTD_DIR defines to
override Zstd library sources or disable the feature from the command
line:
$ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
$ make -C tools/perf NO_LIBZSTD=1 clean all
Auto detection feature status is reported just before compilation
starts. If your system has some version of the zstd library
preinstalled then the build system finds and uses it during the build.
If you still prefer to compile with some other version of zstd library
you have capability to refer the compilation to that version using
LIBZSTD_DIR define.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/build/Makefile.feature | 2 ++
tools/build/feature/Makefile | 6 +++++-
tools/build/feature/test-all.c | 5 +++++
tools/build/feature/test-libzstd.c | 12 ++++++++++++
tools/perf/Makefile.config | 20 ++++++++++++++++++++
tools/perf/Makefile.perf | 3 +++
tools/perf/builtin-version.c | 2 ++
7 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index 8d3864b061f3..361207387b1b 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -67,6 +67,7 @@ FEATURE_TESTS_BASIC := \
sdt \
setns \
libaio \
+ libzstd \
disassembler-four-args
# FEATURE_TESTS_BASIC + FEATURE_TESTS_EXTRA is the complete list
@@ -120,6 +121,7 @@ FEATURE_DISPLAY ?= \
get_cpuid \
bpf \
libaio \
+ libzstd \
disassembler-four-args
# Set FEATURE_CHECK_(C|LD)FLAGS-all for all FEATURE_TESTS features.
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 7ceb4441b627..4b8244ee65ce 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -62,7 +62,8 @@ FILES= \
test-clang.bin \
test-llvm.bin \
test-llvm-version.bin \
- test-libaio.bin
+ test-libaio.bin \
+ test-libzstd.bin
FILES := $(addprefix $(OUTPUT),$(FILES))
@@ -301,6 +302,9 @@ $(OUTPUT)test-clang.bin:
$(OUTPUT)test-libaio.bin:
$(BUILD) -lrt
+$(OUTPUT)test-libzstd.bin:
+ $(BUILD) -lzstd
+
###############################
clean:
diff --git a/tools/build/feature/test-all.c b/tools/build/feature/test-all.c
index 7853e6d91090..a59c53705093 100644
--- a/tools/build/feature/test-all.c
+++ b/tools/build/feature/test-all.c
@@ -182,6 +182,10 @@
# include "test-disassembler-four-args.c"
#undef main
+#define main main_test_zstd
+# include "test-libzstd.c"
+#undef main
+
int main(int argc, char *argv[])
{
main_test_libpython();
@@ -224,6 +228,7 @@ int main(int argc, char *argv[])
main_test_libaio();
main_test_reallocarray();
main_test_disassembler_four_args();
+ main_test_libzstd();
return 0;
}
diff --git a/tools/build/feature/test-libzstd.c b/tools/build/feature/test-libzstd.c
new file mode 100644
index 000000000000..55268c01b84d
--- /dev/null
+++ b/tools/build/feature/test-libzstd.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <zstd.h>
+
+int main(void)
+{
+ ZSTD_CStream *cstream;
+
+ cstream = ZSTD_createCStream();
+ ZSTD_freeCStream(cstream);
+
+ return 0;
+}
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index fe3f97e342fa..beb8b48b44e6 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -152,6 +152,13 @@ endif
FEATURE_CHECK_CFLAGS-libbabeltrace := $(LIBBABELTRACE_CFLAGS)
FEATURE_CHECK_LDFLAGS-libbabeltrace := $(LIBBABELTRACE_LDFLAGS) -lbabeltrace-ctf
+ifdef LIBZSTD_DIR
+ LIBZSTD_CFLAGS := -I$(LIBZSTD_DIR)/lib
+ LIBZSTD_LDFLAGS := -L$(LIBZSTD_DIR)/lib
+endif
+FEATURE_CHECK_CFLAGS-libzstd := $(LIBZSTD_CFLAGS)
+FEATURE_CHECK_LDFLAGS-libzstd := $(LIBZSTD_LDFLAGS)
+
FEATURE_CHECK_CFLAGS-bpf = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(SRCARCH)/include/uapi -I$(srctree)/tools/include/uapi
# include ARCH specific config
-include $(src-perf)/arch/$(SRCARCH)/Makefile
@@ -787,6 +794,19 @@ ifndef NO_LZMA
endif
endif
+ifndef NO_LIBZSTD
+ ifeq ($(feature-libzstd), 1)
+ CFLAGS += -DHAVE_ZSTD_SUPPORT
+ CFLAGS += $(LIBZSTD_CFLAGS)
+ LDFLAGS += $(LIBZSTD_LDFLAGS)
+ EXTLIBS += -lzstd
+ $(call detected,CONFIG_ZSTD)
+ else
+ msg := $(warning No libzstd found, disables trace compression, please install libzstd-dev[el] and/or set LIBZSTD_DIR);
+ NO_LIBZSTD := 1
+ endif
+endif
+
ifndef NO_BACKTRACE
ifeq ($(feature-backtrace), 1)
CFLAGS += -DHAVE_BACKTRACE_SUPPORT
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index e8c9f77e9010..c706548d5b10 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -108,6 +108,9 @@ include ../scripts/utilities.mak
# streaming for record mode. Currently Posix AIO trace streaming is
# supported only when linking with glibc.
#
+# Define NO_LIBZSTD if you do not want support of Zstandard based runtime
+# trace compression in record mode.
+#
# As per kernel Makefile, avoid funny character set dependencies
unexport LC_ALL
diff --git a/tools/perf/builtin-version.c b/tools/perf/builtin-version.c
index 50df168be326..f470144d1a70 100644
--- a/tools/perf/builtin-version.c
+++ b/tools/perf/builtin-version.c
@@ -78,6 +78,8 @@ static void library_status(void)
STATUS(HAVE_LZMA_SUPPORT, lzma);
STATUS(HAVE_AUXTRACE_SUPPORT, get_cpuid);
STATUS(HAVE_LIBBPF_SUPPORT, bpf);
+ STATUS(HAVE_AIO_SUPPORT, aio);
+ STATUS(HAVE_ZSTD_SUPPORT, zstd);
}
int cmd_version(int argc, const char **argv)
Commit-ID: 470530bbb8fbbf2a09bd1d7150bb92501c5c54e6
Gitweb: https://git.kernel.org/tip/470530bbb8fbbf2a09bd1d7150bb92501c5c54e6
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:40:26 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Mon, 1 Apr 2019 15:18:10 -0300
perf record: Implement --mmap-flush=<number> option
Implement a --mmap-flush option that specifies minimal number of bytes
that is extracted from mmaped kernel buffer to store into a trace. The
default option value is 1 byte what means every time trace writing
thread finds some new data in the mmaped buffer the data is extracted,
possibly compressed and written to a trace.
$ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
The option is independent from -z setting, doesn't vary with compression
level and can serve two purposes.
The first purpose is to increase the compression ratio of a trace data.
Larger data chunks are compressed more effectively so the implemented
option allows specifying data chunk size to compress. Also at some cases
executing more write syscalls with smaller data size can take longer
than executing less write syscalls with bigger data size due to syscall
overhead so extracting bigger data chunks specified by the option value
could additionally decrease runtime overhead.
The second purpose is to avoid self monitoring live-lock issue in system
wide (-a) profiling mode. Profiling in system wide mode with compression
(-a -z) can additionally induce data into the kernel buffers along with
the data from monitored processes. If performance data rate and volume
from the monitored processes is high then trace streaming and
compression activity in the tool is also high. High tool process
activity can lead to subtle live-lock effect when compression of single
new byte from some of mmaped kernel buffer leads to generation of the
next single byte at some mmaped buffer. So perf tool process ends up in
endless self monitoring.
Implemented synch parameter is the mean to force data move independently
from the specified flush threshold value. Despite the provided flush
value the tool needs capability to unconditionally drain memory buffers,
at least in the end of the collection.
Committer testing:
Running with the default value, i.e. as soon as there is something to
read go on consuming, we first write the synthesized events, small
chunks of about 128 bytes:
# perf trace -m 2048 --call-graph dwarf -e write -- perf record
<SNIP>
101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
process_synthesized_event (/home/acme/bin/perf)
perf_tool__process_synth_event (inlined)
perf_event__synthesize_mmap_events (/home/acme/bin/perf)
Then we move to reading the mmap buffers consuming the events put there
by the kernel perf infrastructure:
107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
record__pushfn (/home/acme/bin/perf)
perf_mmap__push (/home/acme/bin/perf)
record__mmap_read_evlist (inlined)
record__mmap_read_all (inlined)
__cmd_record (inlined)
cmd_record (/home/acme/bin/perf)
12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
<SNIP same backtrace as in the 107.561 timestamp>
12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
<SNIP same backtrace as in the 107.561 timestamp>
12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
<SNIP same backtrace as in the 107.561 timestamp>
If we limit it to write only when more than 16MB are available for
reading, it throttles that to a quarter of the --mmap-pages set for
'perf record', which by default get to 528384 bytes, found out using
'record -v':
mmap flush: 132096
mmap size 528384B
With that in place all the writes coming from
record__mmap_read_evlist(), i.e. from the mmap buffers setup by the
kernel perf infrastructure were at least 132096 bytes long.
Trying with a bigger mmap size:
perf trace -e write perf record -v -m 2048 --mmap-flush 16M
74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888
74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256
74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232
74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032
74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520
74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688
74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760
Was again limited to a quarter of the mmap size:
mmap flush: 2098176
mmap size 8392704B
A warning about that would be good to have but can be added later,
something like:
"max flush is a quarter of the mmap size, if wanting to bump the mmap
flush further, bump the mmap size as well using -m/--mmap-pages"
Also rename the 'sync' parameters to 'synch' to keep tools/perf building
with older glibcs:
cc1: warnings being treated as errors
builtin-record.c: In function 'record__mmap_read_evlist':
builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
builtin-record.c: In function 'record__mmap_read_all':
builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 19 ++++++++++
tools/perf/builtin-record.c | 65 +++++++++++++++++++++++++++++---
tools/perf/perf.h | 1 +
tools/perf/util/evlist.c | 6 +--
tools/perf/util/evlist.h | 3 +-
tools/perf/util/mmap.c | 4 +-
tools/perf/util/mmap.h | 3 +-
7 files changed, 89 insertions(+), 12 deletions(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 8fe4dffcadd0..58986f4cc190 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -459,6 +459,25 @@ Set affinity mask of trace reading thread according to the policy defined by 'mo
node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
cpu - thread affinity mask is set to cpu of the processed mmap buffer
+--mmap-flush=number::
+
+Specify minimal number of bytes that is extracted from mmap data pages and
+processed for output. One can specify the number using B/K/M/G suffixes.
+
+The maximal allowed value is a quarter of the size of mmaped data pages.
+
+The default option value is 1 byte which means that every time that the output
+writing thread finds some new data in the mmaped buffer the data is extracted,
+possibly compressed (-z) and written to the output, perf.data or pipe.
+
+Larger data chunks are compressed more effectively in comparison to smaller
+chunks so extraction of larger chunks from the mmap data pages is preferable
+from the perspective of output size reduction.
+
+Also at some cases executing less output write syscalls with bigger data size
+can take less time than executing more output write syscalls with smaller data
+size thus lowering runtime profiling overhead.
+
--all-kernel::
Configure all used events to run in kernel space.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4e2d953d4bc5..c5e10552776a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -337,6 +337,41 @@ static int record__aio_enabled(struct record *rec)
return rec->opts.nr_cblocks > 0;
}
+#define MMAP_FLUSH_DEFAULT 1
+static int record__mmap_flush_parse(const struct option *opt,
+ const char *str,
+ int unset)
+{
+ int flush_max;
+ struct record_opts *opts = (struct record_opts *)opt->value;
+ static struct parse_tag tags[] = {
+ { .tag = 'B', .mult = 1 },
+ { .tag = 'K', .mult = 1 << 10 },
+ { .tag = 'M', .mult = 1 << 20 },
+ { .tag = 'G', .mult = 1 << 30 },
+ { .tag = 0 },
+ };
+
+ if (unset)
+ return 0;
+
+ if (str) {
+ opts->mmap_flush = parse_tag_value(str, tags);
+ if (opts->mmap_flush == (int)-1)
+ opts->mmap_flush = strtol(str, NULL, 0);
+ }
+
+ if (!opts->mmap_flush)
+ opts->mmap_flush = MMAP_FLUSH_DEFAULT;
+
+ flush_max = perf_evlist__mmap_size(opts->mmap_pages);
+ flush_max /= 4;
+ if (opts->mmap_flush > flush_max)
+ opts->mmap_flush = flush_max;
+
+ return 0;
+}
+
static int process_synthesized_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
@@ -546,7 +581,8 @@ static int record__mmap_evlist(struct record *rec,
if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
opts->auxtrace_mmap_pages,
opts->auxtrace_snapshot_mode,
- opts->nr_cblocks, opts->affinity) < 0) {
+ opts->nr_cblocks, opts->affinity,
+ opts->mmap_flush) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
@@ -736,7 +772,7 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
}
static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
- bool overwrite)
+ bool overwrite, bool synch)
{
u64 bytes_written = rec->bytes_written;
int i;
@@ -759,12 +795,19 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
off = record__aio_get_pos(trace_fd);
for (i = 0; i < evlist->nr_mmaps; i++) {
+ u64 flush = 0;
struct perf_mmap *map = &maps[i];
if (map->base) {
record__adjust_affinity(rec, map);
+ if (synch) {
+ flush = map->flush;
+ map->flush = 1;
+ }
if (!record__aio_enabled(rec)) {
if (perf_mmap__push(map, rec, record__pushfn) != 0) {
+ if (synch)
+ map->flush = flush;
rc = -1;
goto out;
}
@@ -777,10 +820,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
idx = record__aio_sync(map, false);
if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
record__aio_set_pos(trace_fd, off);
+ if (synch)
+ map->flush = flush;
rc = -1;
goto out;
}
}
+ if (synch)
+ map->flush = flush;
}
if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
@@ -806,15 +853,15 @@ out:
return rc;
}
-static int record__mmap_read_all(struct record *rec)
+static int record__mmap_read_all(struct record *rec, bool synch)
{
int err;
- err = record__mmap_read_evlist(rec, rec->evlist, false);
+ err = record__mmap_read_evlist(rec, rec->evlist, false, synch);
if (err)
return err;
- return record__mmap_read_evlist(rec, rec->evlist, true);
+ return record__mmap_read_evlist(rec, rec->evlist, true, synch);
}
static void record__init_features(struct record *rec)
@@ -1340,7 +1387,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (trigger_is_hit(&switch_output_trigger) || done || draining)
perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
- if (record__mmap_read_all(rec) < 0) {
+ if (record__mmap_read_all(rec, false) < 0) {
trigger_error(&auxtrace_snapshot_trigger);
trigger_error(&switch_output_trigger);
err = -1;
@@ -1441,6 +1488,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);
out_child:
+ record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
if (forks) {
@@ -1846,6 +1894,7 @@ static struct record record = {
.uses_mmap = true,
.default_per_cpu = true,
},
+ .mmap_flush = MMAP_FLUSH_DEFAULT,
},
.tool = {
.sample = process_sample_event,
@@ -1912,6 +1961,9 @@ static struct option __record_options[] = {
OPT_CALLBACK('m', "mmap-pages", &record.opts, "pages[,pages]",
"number of mmap data pages and AUX area tracing mmap pages",
record__parse_mmap_pages),
+ OPT_CALLBACK(0, "mmap-flush", &record.opts, "number",
+ "Minimal number of bytes that is extracted from mmap data pages (default: 1)",
+ record__mmap_flush_parse),
OPT_BOOLEAN(0, "group", &record.opts.group,
"put the counters into a counter group"),
OPT_CALLBACK_NOOPT('g', NULL, &callchain_param,
@@ -2224,6 +2276,7 @@ int cmd_record(int argc, const char **argv)
pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
+ pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
err = __cmd_record(&record, argc, argv);
out:
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index c59743def8d3..369eae61068d 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -85,6 +85,7 @@ struct record_opts {
u64 clockid_res_ns;
int nr_cblocks;
int affinity;
+ int mmap_flush;
};
enum perf_affinity {
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 6689378ee577..f2bbae38278d 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1009,7 +1009,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
*/
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity)
+ bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
{
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
@@ -1019,7 +1019,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
* Its value is decided by evsel's write_backward.
* So &mp should not be passed through const pointer.
*/
- struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
+ struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
if (!evlist->mmap)
evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1051,7 +1051,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
{
- return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
+ return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
}
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 6a94785b9100..c9a0f72677fd 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -177,7 +177,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity);
+ bool auxtrace_overwrite, int nr_cblocks,
+ int affinity, int flush);
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
void perf_evlist__munmap(struct perf_evlist *evlist);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index cdc7740fc181..ef3d79b2c90b 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -440,6 +440,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
perf_mmap__setup_affinity_mask(map, mp);
+ map->flush = mp->flush;
+
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
&mp->auxtrace_mp, map->base, fd))
return -1;
@@ -492,7 +494,7 @@ static int __perf_mmap__read_init(struct perf_mmap *md)
md->start = md->overwrite ? head : old;
md->end = md->overwrite ? old : head;
- if (md->start == md->end)
+ if ((md->end - md->start) < md->flush)
return -EAGAIN;
size = md->end - md->start;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index e566c19b242b..b82f8c2d55c4 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -39,6 +39,7 @@ struct perf_mmap {
} aio;
#endif
cpu_set_t affinity_mask;
+ u64 flush;
};
/*
@@ -70,7 +71,7 @@ enum bkw_mmap_state {
};
struct mmap_params {
- int prot, mask, nr_cblocks, affinity;
+ int prot, mask, nr_cblocks, affinity, flush;
struct auxtrace_mmap_params auxtrace_mp;
};
Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>
> Implemented -z,--compression_level[=<n>] option that enables compression
> of mmaped kernel data buffers content in runtime during perf record
> mode collection. Default option value is 1 (fastest compression).
>
> Compression overhead has been measured for serial and AIO streaming
> when profiling matrix multiplication workload:
>
> -------------------------------------------------------------
> | SERIAL | AIO-1 |
> ----------------------------------------------------------------|
Please don't have lines starting with --- in the cset comment log
message, breaks scripts, fixing it up now.
- Arnaldo
> |-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
> |---------------------------------------------------------------|
> | 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
> | 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
> | 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
> | 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
> | 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
> | 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
> -----------------------------------------------------------------
>
> OVH = (Execution time with -z N) / (Execution time with -z 0)
>
> ratio - compression ratio
> size - number of bytes that was compressed
>
> size ~= trace size x ratio
>
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/Documentation/perf-record.txt | 5 +++++
> tools/perf/builtin-record.c | 25 ++++++++++++++++++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 18fceb49434e..0567bacc2ae6 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -471,6 +471,11 @@ Also at some cases executing less trace write syscalls with bigger data size can
> shorter than executing more trace write syscalls with smaller data size thus lowering
> runtime profiling overhead.
>
> +-z::
> +--compression-level[=n]::
> +Produce compressed trace using specified level n (default: 1 - fastest compression,
> +22 - smallest trace)
> +
> --all-kernel::
> Configure all used events to run in kernel space.
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 2e083891affa..7258f2964a3b 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -440,6 +440,26 @@ static int record__mmap_flush_parse(const struct option *opt,
> return 0;
> }
>
> +#ifdef HAVE_ZSTD_SUPPORT
> +static unsigned int comp_level_default = 1;
> +static int record__parse_comp_level(const struct option *opt,
> + const char *str,
> + int unset)
> +{
> + struct record_opts *opts = (struct record_opts *)opt->value;
> +
> + if (unset) {
> + opts->comp_level = 0;
> + } else {
> + if (str)
> + opts->comp_level = strtol(str, NULL, 0);
> + if (!opts->comp_level)
> + opts->comp_level = comp_level_default;
> + }
> +
> + return 0;
> +}
> +#endif
> static unsigned int comp_level_max = 22;
>
> static int record__comp_enabled(struct record *rec)
> @@ -2169,6 +2189,11 @@ static struct option __record_options[] = {
> OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
> "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
> record__parse_affinity),
> +#ifdef HAVE_ZSTD_SUPPORT
> + OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
> + "n", "Produce compressed trace using specified level (default: 1 - fastest compression, 22 - smallest trace)",
> + record__parse_comp_level),
> +#endif
> OPT_END()
> };
>
> --
> 2.20.1
--
- Arnaldo
Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>
> Implemented -z,--compression_level[=<n>] option that enables compression
> of mmaped kernel data buffers content in runtime during perf record
> mode collection. Default option value is 1 (fastest compression).
>
> Compression overhead has been measured for serial and AIO streaming
> when profiling matrix multiplication workload:
>
> -------------------------------------------------------------
> | SERIAL | AIO-1 |
> ----------------------------------------------------------------|
> |-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
> |---------------------------------------------------------------|
> | 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
> | 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
> | 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
> | 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
> | 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
> | 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
> -----------------------------------------------------------------
>
> OVH = (Execution time with -z N) / (Execution time with -z 0)
>
> ratio - compression ratio
> size - number of bytes that was compressed
>
> size ~= trace size x ratio
[root@quaco ~]# perf record -z2
^C[ perf record: Woken up 1 times to write data ]
0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
[ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
[root@quaco ~]#
I've pushed what I have to the tmp.perf/core branch, please try to see
if I made any mistake in fixing up conflicts with BPF_PROG_INFO and
BPF_BTF header features. I'll continue tomorrow with 10-12/12.
- Arnaldo
> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> tools/perf/Documentation/perf-record.txt | 5 +++++
> tools/perf/builtin-record.c | 25 ++++++++++++++++++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 18fceb49434e..0567bacc2ae6 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -471,6 +471,11 @@ Also at some cases executing less trace write syscalls with bigger data size can
> shorter than executing more trace write syscalls with smaller data size thus lowering
> runtime profiling overhead.
>
> +-z::
> +--compression-level[=n]::
> +Produce compressed trace using specified level n (default: 1 - fastest compression,
> +22 - smallest trace)
> +
> --all-kernel::
> Configure all used events to run in kernel space.
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 2e083891affa..7258f2964a3b 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -440,6 +440,26 @@ static int record__mmap_flush_parse(const struct option *opt,
> return 0;
> }
>
> +#ifdef HAVE_ZSTD_SUPPORT
> +static unsigned int comp_level_default = 1;
> +static int record__parse_comp_level(const struct option *opt,
> + const char *str,
> + int unset)
> +{
> + struct record_opts *opts = (struct record_opts *)opt->value;
> +
> + if (unset) {
> + opts->comp_level = 0;
> + } else {
> + if (str)
> + opts->comp_level = strtol(str, NULL, 0);
> + if (!opts->comp_level)
> + opts->comp_level = comp_level_default;
> + }
> +
> + return 0;
> +}
> +#endif
> static unsigned int comp_level_max = 22;
>
> static int record__comp_enabled(struct record *rec)
> @@ -2169,6 +2189,11 @@ static struct option __record_options[] = {
> OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
> "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
> record__parse_affinity),
> +#ifdef HAVE_ZSTD_SUPPORT
> + OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
> + "n", "Produce compressed trace using specified level (default: 1 - fastest compression, 22 - smallest trace)",
> + record__parse_comp_level),
> +#endif
> OPT_END()
> };
>
> --
> 2.20.1
--
- Arnaldo
Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
> >
> > Implemented -z,--compression_level[=<n>] option that enables compression
> > of mmaped kernel data buffers content in runtime during perf record
> > mode collection. Default option value is 1 (fastest compression).
<SNIP>
> [root@quaco ~]# perf record -z2
> ^C[ perf record: Woken up 1 times to write data ]
> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
>
> [root@quaco ~]#
So, its the buildid processing at the end, so we can't do build-id
processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
etc.
[root@quaco ~]# perf record -z2 --no-buildid sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
[root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
Error:
failed to process sample
0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
[root@quaco ~]#
I'll play with it tomorrow.
- Arnaldo
> I've pushed what I have to the tmp.perf/core branch, please try to see
> if I made any mistake in fixing up conflicts with BPF_PROG_INFO and
> BPF_BTF header features. I'll continue tomorrow with 10-12/12.
On 14.05.2019 23:04, Arnaldo Carvalho de Melo wrote:
> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>>
>> Implemented -z,--compression_level[=<n>] option that enables compression
>> of mmaped kernel data buffers content in runtime during perf record
>> mode collection. Default option value is 1 (fastest compression).
>>
>> Compression overhead has been measured for serial and AIO streaming
>> when profiling matrix multiplication workload:
>>
>> -------------------------------------------------------------
>> | SERIAL | AIO-1 |
>> ----------------------------------------------------------------|
>
> Please don't have lines starting with --- in the cset comment log
> message, breaks scripts, fixing it up now.
Oops, will do my best about that. Thanks.
~Alexey
>
> - Arnaldo
>
>> |-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
>> |---------------------------------------------------------------|
>> | 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
>> | 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
>> | 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
>> | 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
>> | 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
>> | 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
>> -----------------------------------------------------------------
>>
>> OVH = (Execution time with -z N) / (Execution time with -z 0)
>>
>> ratio - compression ratio
>> size - number of bytes that was compressed
>>
>> size ~= trace size x ratio
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> tools/perf/Documentation/perf-record.txt | 5 +++++
>> tools/perf/builtin-record.c | 25 ++++++++++++++++++++++++
>> 2 files changed, 30 insertions(+)
>>
>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>> index 18fceb49434e..0567bacc2ae6 100644
>> --- a/tools/perf/Documentation/perf-record.txt
>> +++ b/tools/perf/Documentation/perf-record.txt
>> @@ -471,6 +471,11 @@ Also at some cases executing less trace write syscalls with bigger data size can
>> shorter than executing more trace write syscalls with smaller data size thus lowering
>> runtime profiling overhead.
>>
>> +-z::
>> +--compression-level[=n]::
>> +Produce compressed trace using specified level n (default: 1 - fastest compression,
>> +22 - smallest trace)
>> +
>> --all-kernel::
>> Configure all used events to run in kernel space.
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 2e083891affa..7258f2964a3b 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -440,6 +440,26 @@ static int record__mmap_flush_parse(const struct option *opt,
>> return 0;
>> }
>>
>> +#ifdef HAVE_ZSTD_SUPPORT
>> +static unsigned int comp_level_default = 1;
>> +static int record__parse_comp_level(const struct option *opt,
>> + const char *str,
>> + int unset)
>> +{
>> + struct record_opts *opts = (struct record_opts *)opt->value;
>> +
>> + if (unset) {
>> + opts->comp_level = 0;
>> + } else {
>> + if (str)
>> + opts->comp_level = strtol(str, NULL, 0);
>> + if (!opts->comp_level)
>> + opts->comp_level = comp_level_default;
>> + }
>> +
>> + return 0;
>> +}
>> +#endif
>> static unsigned int comp_level_max = 22;
>>
>> static int record__comp_enabled(struct record *rec)
>> @@ -2169,6 +2189,11 @@ static struct option __record_options[] = {
>> OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
>> "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
>> record__parse_affinity),
>> +#ifdef HAVE_ZSTD_SUPPORT
>> + OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
>> + "n", "Produce compressed trace using specified level (default: 1 - fastest compression, 22 - smallest trace)",
>> + record__parse_comp_level),
>> +#endif
>> OPT_END()
>> };
>>
>> --
>> 2.20.1
>
On 15.05.2019 0:46, Arnaldo Carvalho de Melo wrote:
> Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>>>
>>> Implemented -z,--compression_level[=<n>] option that enables compression
>>> of mmaped kernel data buffers content in runtime during perf record
>>> mode collection. Default option value is 1 (fastest compression).
>
> <SNIP>
>
>> [root@quaco ~]# perf record -z2
>> ^C[ perf record: Woken up 1 times to write data ]
>> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
>> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
>>
>> [root@quaco ~]#
>
> So, its the buildid processing at the end, so we can't do build-id
> processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
> uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
> etc.
>
> [root@quaco ~]# perf record -z2 --no-buildid sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
> [root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
> Error:
> failed to process sample
> 0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
> [root@quaco ~]#
>
> I'll play with it tomorrow.
Applied the whole patch set on top of the current perf/core
and the whole thing functions as expected.
~Alexey
>
> - Arnaldo
>
>> I've pushed what I have to the tmp.perf/core branch, please try to see
>> if I made any mistake in fixing up conflicts with BPF_PROG_INFO and
>> BPF_BTF header features. I'll continue tomorrow with 10-12/12.
>
Em Wed, May 15, 2019 at 11:43:30AM +0300, Alexey Budankov escreveu:
> On 15.05.2019 0:46, Arnaldo Carvalho de Melo wrote:
> > Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
> >> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
> >>> Implemented -z,--compression_level[=<n>] option that enables compression
> >>> of mmaped kernel data buffers content in runtime during perf record
> >>> mode collection. Default option value is 1 (fastest compression).
> > <SNIP>
> >> [root@quaco ~]# perf record -z2
> >> ^C[ perf record: Woken up 1 times to write data ]
> >> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
> >> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
> >> [root@quaco ~]#
> > So, its the buildid processing at the end, so we can't do build-id
> > processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
> > uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
> > etc.
> > [root@quaco ~]# perf record -z2 --no-buildid sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
> > [root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
> > 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
> > Error:
> > failed to process sample
> > 0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
> > [root@quaco ~]#
> > I'll play with it tomorrow.
> Applied the whole patch set on top of the current perf/core
> and the whole thing functions as expected.
It doesn't, see the reported error above, these three lines, that
shouldn't be there:
0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
Error:
failed to process sample
That is because at this point in the patch series a record was
introduced that is not being handled by the build id processing done, by
default, at the end of the 'perf record' session, and, as explained
above, needs fixing so that when we do 'git bisect' looking for a non
expected "failed to process type: 81" kind of error, this doesn't
appear.
I added the changes below to this cset and will continue from there:
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index d84a4885e341..f8d21991f94c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2284,6 +2284,12 @@ int cmd_record(int argc, const char **argv)
"cgroup monitoring only available in system-wide mode");
}
+
+ if (rec->opts.comp_level != 0) {
+ pr_debug("Compression enabled, disabling build id collection at the end of the session\n");
+ rec->no_buildid = true;
+ }
+
if (rec->opts.record_switch_events &&
!perf_can_record_switch_events()) {
ui__error("kernel does not support recording context switch events\n");
---------------------------------------------------------------------------
[acme@quaco perf]$ perf record -z2 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
[acme@quaco perf]$ perf record -v -z2 sleep 1
Compression enabled, disabling build id collection at the end of the session
Using CPUID GenuineIntel-6-8E-A
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 2
mmap size 528384B
Couldn't start the BPF side band thread:
BPF programs starting from now on won't be annotatable
perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
[acme@quaco perf]$
Will check if its possible to get rid of the following in this patch, to
keep bisection working for this case as well:
[acme@quaco perf]$ perf report -D | grep COMPRESS
0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
Error:
failed to process sample
0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
[acme@quaco perf]$
- Arnaldo
On 15.05.2019 15:59, Arnaldo Carvalho de Melo wrote:
> Em Wed, May 15, 2019 at 11:43:30AM +0300, Alexey Budankov escreveu:
>> On 15.05.2019 0:46, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
>>>> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>
>>>>> Implemented -z,--compression_level[=<n>] option that enables compression
>>>>> of mmaped kernel data buffers content in runtime during perf record
>>>>> mode collection. Default option value is 1 (fastest compression).
>
>>> <SNIP>
>
>>>> [root@quaco ~]# perf record -z2
>>>> ^C[ perf record: Woken up 1 times to write data ]
>>>> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
>>>> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
>
>>>> [root@quaco ~]#
>
>>> So, its the buildid processing at the end, so we can't do build-id
>>> processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
>>> uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
>>> etc.
>
>>> [root@quaco ~]# perf record -z2 --no-buildid sleep 1
>>> [ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
>>> [root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
>>> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
>>> Error:
>>> failed to process sample
>>> 0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
>>> [root@quaco ~]#
>
>>> I'll play with it tomorrow.
>
>> Applied the whole patch set on top of the current perf/core
>> and the whole thing functions as expected.
>
> It doesn't, see the reported error above, these three lines, that
> shouldn't be there:
>
> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
> Error:
> failed to process sample
>
> That is because at this point in the patch series a record was
> introduced that is not being handled by the build id processing done, by
> default, at the end of the 'perf record' session, and, as explained
> above, needs fixing so that when we do 'git bisect' looking for a non
> expected "failed to process type: 81" kind of error, this doesn't
> appear.
>
> I added the changes below to this cset and will continue from there:
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index d84a4885e341..f8d21991f94c 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -2284,6 +2284,12 @@ int cmd_record(int argc, const char **argv)
> "cgroup monitoring only available in system-wide mode");
>
> }
> +
> + if (rec->opts.comp_level != 0) {
> + pr_debug("Compression enabled, disabling build id collection at the end of the session\n");
> + rec->no_buildid = true;
> + }
> +
> if (rec->opts.record_switch_events &&
> !perf_can_record_switch_events()) {
> ui__error("kernel does not support recording context switch events\n");
>
> ---------------------------------------------------------------------------
>
> [acme@quaco perf]$ perf record -z2 sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
> [acme@quaco perf]$ perf record -v -z2 sleep 1
> Compression enabled, disabling build id collection at the end of the session
> Using CPUID GenuineIntel-6-8E-A
> nr_cblocks: 0
> affinity: SYS
> mmap flush: 1
> comp level: 2
> mmap size 528384B
> Couldn't start the BPF side band thread:
> BPF programs starting from now on won't be annotatable
> perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
> [acme@quaco perf]$
>
> Will check if its possible to get rid of the following in this patch, to
> keep bisection working for this case as well:
>
> [acme@quaco perf]$ perf report -D | grep COMPRESS
> 0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
> Error:
> failed to process sample
> 0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
> [acme@quaco perf]$
Makes sense. Thanks.
~Alexey
>
> - Arnaldo
>
Em Wed, May 15, 2019 at 06:44:29PM +0300, Alexey Budankov escreveu:
> On 15.05.2019 15:59, Arnaldo Carvalho de Melo wrote:
> > Em Wed, May 15, 2019 at 11:43:30AM +0300, Alexey Budankov escreveu:
> >> On 15.05.2019 0:46, Arnaldo Carvalho de Melo wrote:
> >>> Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
> >>>> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
> >
> >>>>> Implemented -z,--compression_level[=<n>] option that enables compression
> >>>>> of mmaped kernel data buffers content in runtime during perf record
> >>>>> mode collection. Default option value is 1 (fastest compression).
> >
> >>> <SNIP>
> >
> >>>> [root@quaco ~]# perf record -z2
> >>>> ^C[ perf record: Woken up 1 times to write data ]
> >>>> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
> >>>> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
> >
> >>>> [root@quaco ~]#
> >
> >>> So, its the buildid processing at the end, so we can't do build-id
> >>> processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
> >>> uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
> >>> etc.
> >
> >>> [root@quaco ~]# perf record -z2 --no-buildid sleep 1
> >>> [ perf record: Woken up 1 times to write data ]
> >>> [ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
> >>> [root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
> >>> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
> >>> Error:
> >>> failed to process sample
> >>> 0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
> >>> [root@quaco ~]#
> >
> >>> I'll play with it tomorrow.
> >
> >> Applied the whole patch set on top of the current perf/core
> >> and the whole thing functions as expected.
> >
> > It doesn't, see the reported error above, these three lines, that
> > shouldn't be there:
> >
> > 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
> > Error:
> > failed to process sample
> >
> > That is because at this point in the patch series a record was
> > introduced that is not being handled by the build id processing done, by
> > default, at the end of the 'perf record' session, and, as explained
> > above, needs fixing so that when we do 'git bisect' looking for a non
> > expected "failed to process type: 81" kind of error, this doesn't
> > appear.
> >
> > I added the changes below to this cset and will continue from there:
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index d84a4885e341..f8d21991f94c 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -2284,6 +2284,12 @@ int cmd_record(int argc, const char **argv)
> > "cgroup monitoring only available in system-wide mode");
> >
> > }
> > +
> > + if (rec->opts.comp_level != 0) {
> > + pr_debug("Compression enabled, disabling build id collection at the end of the session\n");
> > + rec->no_buildid = true;
> > + }
> > +
> > if (rec->opts.record_switch_events &&
> > !perf_can_record_switch_events()) {
> > ui__error("kernel does not support recording context switch events\n");
> >
> > ---------------------------------------------------------------------------
> >
> > [acme@quaco perf]$ perf record -z2 sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
> > [acme@quaco perf]$ perf record -v -z2 sleep 1
> > Compression enabled, disabling build id collection at the end of the session
> > Using CPUID GenuineIntel-6-8E-A
> > nr_cblocks: 0
> > affinity: SYS
> > mmap flush: 1
> > comp level: 2
> > mmap size 528384B
> > Couldn't start the BPF side band thread:
> > BPF programs starting from now on won't be annotatable
> > perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
> > [acme@quaco perf]$
> >
> > Will check if its possible to get rid of the following in this patch, to
> > keep bisection working for this case as well:
> >
> > [acme@quaco perf]$ perf report -D | grep COMPRESS
> > 0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
> > Error:
> > failed to process sample
> > 0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
> > [acme@quaco perf]$
>
> Makes sense. Thanks.
I did it yesterday, all is in my acme/perf/core branch, now testing it
together with the large pile of patches there accumulated while I was in
LSF/MM + vacations :-)
All have already passed through most of my test build containers, with
most of the distros that have libzstd being updated to include it, and
the make_minimal test build target was updated to build explicitely
disabling zstd, i.e. with NO_LIBZSTD=1, so that we test with/without it
in systems where it is installed and also in systems where zstd is not
even available.
- Arnaldo
On 17.05.2019 18:01, Arnaldo Carvalho de Melo wrote:
> Em Wed, May 15, 2019 at 06:44:29PM +0300, Alexey Budankov escreveu:
>> On 15.05.2019 15:59, Arnaldo Carvalho de Melo wrote:
<SNIP>
>>> Em Wed, May 15, 2019 at 11:43:30AM +0300, Alexey Budankov escreveu:
>>>> On 15.05.2019 0:46, Arnaldo Carvalho de Melo wrote:
>>>>> Em Tue, May 14, 2019 at 05:20:41PM -0300, Arnaldo Carvalho de Melo escreveu:
>>>>>> Em Mon, Mar 18, 2019 at 08:44:42PM +0300, Alexey Budankov escreveu:
>>>
>>>>>>> Implemented -z,--compression_level[=<n>] option that enables compression
>>>>>>> of mmaped kernel data buffers content in runtime during perf record
>>>>>>> mode collection. Default option value is 1 (fastest compression).
>>>
>>>>> <SNIP>
>>>
>>>>>> [root@quaco ~]# perf record -z2
>>>>>> ^C[ perf record: Woken up 1 times to write data ]
>>>>>> 0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
>>>>>> [ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
>>>
>>>>>> [root@quaco ~]#
>>>
>>>>> So, its the buildid processing at the end, so we can't do build-id
>>>>> processing when using PERF_RECORD_COMPRESSED, otherwise we'd have to
>>>>> uncompress at the end to find the PERF_RECORD_FORK/PERF_RECORD_MMAP,
>>>>> etc.
>>>
>>>>> [root@quaco ~]# perf record -z2 --no-buildid sleep 1
>>>>> [ perf record: Woken up 1 times to write data ]
>>>>> [ perf record: Captured and wrote 0.020 MB perf.data, compressed (original 0.001 MB, ratio is 2.153) ]
>>>>> [root@quaco ~]# perf report -D | grep PERF_RECORD_COMP
>>>>> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
>>>>> Error:
>>>>> failed to process sample
>>>>> 0 0x4f40 [0x195]: PERF_RECORD_COMPRESSED
>>>>> [root@quaco ~]#
>>>
>>>>> I'll play with it tomorrow.
>>>
>>>> Applied the whole patch set on top of the current perf/core
>>>> and the whole thing functions as expected.
>>>
>>> It doesn't, see the reported error above, these three lines, that
>>> shouldn't be there:
>>>
>>> 0x4f40 [0x195]: failed to process type: 81 [Invalid argument]
>>> Error:
>>> failed to process sample
>>>
>>> That is because at this point in the patch series a record was
>>> introduced that is not being handled by the build id processing done, by
>>> default, at the end of the 'perf record' session, and, as explained
>>> above, needs fixing so that when we do 'git bisect' looking for a non
>>> expected "failed to process type: 81" kind of error, this doesn't
>>> appear.
>>>
>>> I added the changes below to this cset and will continue from there:
>>>
>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>> index d84a4885e341..f8d21991f94c 100644
>>> --- a/tools/perf/builtin-record.c
>>> +++ b/tools/perf/builtin-record.c
>>> @@ -2284,6 +2284,12 @@ int cmd_record(int argc, const char **argv)
>>> "cgroup monitoring only available in system-wide mode");
>>>
>>> }
>>> +
>>> + if (rec->opts.comp_level != 0) {
>>> + pr_debug("Compression enabled, disabling build id collection at the end of the session\n");
>>> + rec->no_buildid = true;
>>> + }
>>> +
>>> if (rec->opts.record_switch_events &&
>>> !perf_can_record_switch_events()) {
>>> ui__error("kernel does not support recording context switch events\n");
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> [acme@quaco perf]$ perf record -z2 sleep 1
>>> [ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
>>> [acme@quaco perf]$ perf record -v -z2 sleep 1
>>> Compression enabled, disabling build id collection at the end of the session
>>> Using CPUID GenuineIntel-6-8E-A
>>> nr_cblocks: 0
>>> affinity: SYS
>>> mmap flush: 1
>>> comp level: 2
>>> mmap size 528384B
>>> Couldn't start the BPF side band thread:
>>> BPF programs starting from now on won't be annotatable
>>> perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
>>> [ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
>>> [acme@quaco perf]$
>>>
>>> Will check if its possible to get rid of the following in this patch, to
>>> keep bisection working for this case as well:
>>>
>>> [acme@quaco perf]$ perf report -D | grep COMPRESS
>>> 0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
>>> Error:
>>> failed to process sample
>>> 0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
>>> [acme@quaco perf]$
>>
>> Makes sense. Thanks.
>
> I did it yesterday, all is in my acme/perf/core branch, now testing it
> together with the large pile of patches there accumulated while I was in
> LSF/MM + vacations :-)
>
> All have already passed through most of my test build containers, with
> most of the distros that have libzstd being updated to include it, and
> the make_minimal test build target was updated to build explicitely
> disabling zstd, i.e. with NO_LIBZSTD=1, so that we test with/without it
> in systems where it is installed and also in systems where zstd is not
> even available.
Good news. Thanks!
~Alexey
>
> - Arnaldo
>
Commit-ID: 61a7773ca88f32ef7e185fdf9fc0d44e8ec18a66
Gitweb: https://git.kernel.org/tip/61a7773ca88f32ef7e185fdf9fc0d44e8ec18a66
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:45:11 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf report: Add stub processing of compressed events for -D
Committer note:
Split from a larger patch, this only dumps PERF_RECORD_COMPRESSED as
unhandled, so that when we introduce the record part in the next patch,
we don't see unhandled events when using 'perf record -D'.
Changed it so that we dump the event if the handler is just a stub, i.e.
for the case where we don't have ZSTD linked but we're processing a
perf.data file generated by a tool with that linked.
Also when failing to decompress we can't just dump the uncompressed
event and return 0, we have to propagate the error.
Signed-off-by: Alexey Budankov <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/session.c | 19 ++++++++++++++++++-
tools/perf/util/tool.h | 2 ++
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index bad5f87ae001..ec1dec86d0e1 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -358,6 +358,14 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu
return 0;
}
+static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
+ union perf_event *event __maybe_unused,
+ u64 file_offset __maybe_unused)
+{
+ dump_printf(": unhandled!\n");
+ return 0;
+}
+
void perf_tool__fill_defaults(struct perf_tool *tool)
{
if (tool->sample == NULL)
@@ -430,6 +438,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->time_conv = process_event_op2_stub;
if (tool->feature == NULL)
tool->feature = process_event_op2_stub;
+ if (tool->compressed == NULL)
+ tool->compressed = perf_session__process_compressed_event_stub;
}
static void swap_sample_id_all(union perf_event *event, void *data)
@@ -1373,7 +1383,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
int fd = perf_data__fd(session->data);
int err;
- dump_event(session->evlist, event, file_offset, &sample);
+ if (event->header.type != PERF_RECORD_COMPRESSED ||
+ tool->compressed == perf_session__process_compressed_event_stub)
+ dump_event(session->evlist, event, file_offset, &sample);
/* These events are processed right away */
switch (event->header.type) {
@@ -1426,6 +1438,11 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return tool->time_conv(session, event);
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
+ case PERF_RECORD_COMPRESSED:
+ err = tool->compressed(session, event, file_offset);
+ if (err)
+ dump_event(session->evlist, event, file_offset, &sample);
+ return err;
default:
return -EINVAL;
}
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 250391672f9f..9096a6e3de59 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,6 +28,7 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
@@ -72,6 +73,7 @@ struct perf_tool {
stat,
stat_round,
feature;
+ event_op4 compressed;
event_op3 auxtrace;
bool ordered_events;
bool ordering_requires_timestamps;
Commit-ID: cb62c6f1f59232457414ecbbf2337a1cb67b4ce2
Gitweb: https://git.kernel.org/tip/cb62c6f1f59232457414ecbbf2337a1cb67b4ce2
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:45:11 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf report: Implement perf.data record decompression
zstd_init(, comp_level = 0) initializes decompression part of API only
hat now consists of zstd_decompress_stream() function.
The perf.data PERF_RECORD_COMPRESSED records are decompressed using
zstd_decompress_stream() function into a linked list of mmaped memory
regions of mmap_comp_len size (struct decomp).
After decompression of one COMPRESSED record its content is iterated and
fetched for usual processing. The mmaped memory regions with
decompressed events are kept in the linked list till the tool process
termination.
When dumping raw records (e.g., perf report -D --header) file offsets of
events from compressed records are printed as zero.
Committer notes:
Since now we have support for processing PERF_RECORD_COMPRESSED, we see
none, in raw form, like we saw in the previous patch commiter notes,
they were decompressed into the usual PERF_RECORD_{FORK,MMAP,COMM,etc}
records, we only see the stats for those PERF_RECORD_COMPRESSED events,
and since I used the file generated in the commiter notes for the
previous patch, there they are, 2 compressed records:
$ perf report --header-only | grep cmdline
# cmdline : /home/acme/bin/perf record -z2 sleep 1
$ perf report -D | grep COMPRESS
COMPRESSED events: 2
COMPRESSED events: 0
$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 15 of event 'cycles:u'
# Event count (approx.): 962227
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ...........................
#
46.99% sleep libc-2.28.so [.] _dl_addr
29.24% sleep [unknown] [k] 0xffffffffaea00a67
16.45% sleep libc-2.28.so [.] __GI__IO_un_link.part.1
5.92% sleep ld-2.28.so [.] _dl_setup_hash
1.40% sleep libc-2.28.so [.] __nanosleep
0.00% sleep [unknown] [k] 0xffffffffaea00163
#
# (Tip: To see callchains in a more compact form: perf report -g folded)
#
$
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-report.c | 5 +-
tools/perf/util/compress.h | 11 +++++
tools/perf/util/session.c | 116 +++++++++++++++++++++++++++++++++++++++++++-
tools/perf/util/session.h | 10 ++++
tools/perf/util/zstd.c | 41 ++++++++++++++++
5 files changed, 181 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 91e27ac297c2..1ca533f06a4c 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1258,6 +1258,9 @@ repeat:
if (session == NULL)
return -1;
+ if (zstd_init(&(session->zstd_data), 0) < 0)
+ pr_warning("Decompression initialization failed. Reported data may be incomplete.\n");
+
if (report.queue_size) {
ordered_events__set_alloc_size(&session->ordered_events,
report.queue_size);
@@ -1448,7 +1451,7 @@ repeat:
error:
if (report.ptime_range)
zfree(&report.ptime_range);
-
+ zstd_fini(&(session->zstd_data));
perf_session__delete(session);
return ret;
}
diff --git a/tools/perf/util/compress.h b/tools/perf/util/compress.h
index 1041a4fd81e2..0cd3369af2a4 100644
--- a/tools/perf/util/compress.h
+++ b/tools/perf/util/compress.h
@@ -20,6 +20,7 @@ bool lzma_is_compressed(const char *input);
struct zstd_data {
#ifdef HAVE_ZSTD_SUPPORT
ZSTD_CStream *cstream;
+ ZSTD_DStream *dstream;
#endif
};
@@ -31,6 +32,9 @@ int zstd_fini(struct zstd_data *data);
size_t zstd_compress_stream_to_records(struct zstd_data *data, void *dst, size_t dst_size,
void *src, size_t src_size, size_t max_record_size,
size_t process_header(void *record, size_t increment));
+
+size_t zstd_decompress_stream(struct zstd_data *data, void *src, size_t src_size,
+ void *dst, size_t dst_size);
#else /* !HAVE_ZSTD_SUPPORT */
static inline int zstd_init(struct zstd_data *data __maybe_unused, int level __maybe_unused)
@@ -52,6 +56,13 @@ size_t zstd_compress_stream_to_records(struct zstd_data *data __maybe_unused,
{
return 0;
}
+
+static inline size_t zstd_decompress_stream(struct zstd_data *data __maybe_unused, void *src __maybe_unused,
+ size_t src_size __maybe_unused, void *dst __maybe_unused,
+ size_t dst_size __maybe_unused)
+{
+ return 0;
+}
#endif
#endif /* PERF_COMPRESS_H */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ec1dec86d0e1..2310a1752983 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -29,6 +29,61 @@
#include "stat.h"
#include "arch/common.h"
+#ifdef HAVE_ZSTD_SUPPORT
+static int perf_session__process_compressed_event(struct perf_session *session,
+ union perf_event *event, u64 file_offset)
+{
+ void *src;
+ size_t decomp_size, src_size;
+ u64 decomp_last_rem = 0;
+ size_t decomp_len = session->header.env.comp_mmap_len;
+ struct decomp *decomp, *decomp_last = session->decomp_last;
+
+ decomp = mmap(NULL, sizeof(struct decomp) + decomp_len, PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ if (decomp == MAP_FAILED) {
+ pr_err("Couldn't allocate memory for decompression\n");
+ return -1;
+ }
+
+ decomp->file_pos = file_offset;
+ decomp->head = 0;
+
+ if (decomp_last) {
+ decomp_last_rem = decomp_last->size - decomp_last->head;
+ memcpy(decomp->data, &(decomp_last->data[decomp_last->head]), decomp_last_rem);
+ decomp->size = decomp_last_rem;
+ }
+
+ src = (void *)event + sizeof(struct compressed_event);
+ src_size = event->pack.header.size - sizeof(struct compressed_event);
+
+ decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
+ &(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
+ if (!decomp_size) {
+ munmap(decomp, sizeof(struct decomp) + decomp_len);
+ pr_err("Couldn't decompress data\n");
+ return -1;
+ }
+
+ decomp->size += decomp_size;
+
+ if (session->decomp == NULL) {
+ session->decomp = decomp;
+ session->decomp_last = decomp;
+ } else {
+ session->decomp_last->next = decomp;
+ session->decomp_last = decomp;
+ }
+
+ pr_debug("decomp (B): %ld to %ld\n", src_size, decomp_size);
+
+ return 0;
+}
+#else /* !HAVE_ZSTD_SUPPORT */
+#define perf_session__process_compressed_event perf_session__process_compressed_event_stub
+#endif
+
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
@@ -197,6 +252,21 @@ static void perf_session__delete_threads(struct perf_session *session)
machine__delete_threads(&session->machines.host);
}
+static void perf_session__release_decomp_events(struct perf_session *session)
+{
+ struct decomp *next, *decomp;
+ size_t decomp_len;
+ next = session->decomp;
+ decomp_len = session->header.env.comp_mmap_len;
+ do {
+ decomp = next;
+ if (decomp == NULL)
+ break;
+ next = decomp->next;
+ munmap(decomp, decomp_len + sizeof(struct decomp));
+ } while (1);
+}
+
void perf_session__delete(struct perf_session *session)
{
if (session == NULL)
@@ -205,6 +275,7 @@ void perf_session__delete(struct perf_session *session)
auxtrace_index__free(&session->auxtrace_index);
perf_session__destroy_kernel_maps(session);
perf_session__delete_threads(session);
+ perf_session__release_decomp_events(session);
perf_env__exit(&session->header.env);
machines__exit(&session->machines);
if (session->data)
@@ -439,7 +510,7 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
if (tool->feature == NULL)
tool->feature = process_event_op2_stub;
if (tool->compressed == NULL)
- tool->compressed = perf_session__process_compressed_event_stub;
+ tool->compressed = perf_session__process_compressed_event;
}
static void swap_sample_id_all(union perf_event *event, void *data)
@@ -1725,6 +1796,8 @@ static int perf_session__flush_thread_stacks(struct perf_session *session)
volatile int session_done;
+static int __perf_session__process_decomp_events(struct perf_session *session);
+
static int __perf_session__process_pipe_events(struct perf_session *session)
{
struct ordered_events *oe = &session->ordered_events;
@@ -1805,6 +1878,10 @@ more:
if (skip > 0)
head += skip;
+ err = __perf_session__process_decomp_events(session);
+ if (err)
+ goto out_err;
+
if (!session_done())
goto more;
done:
@@ -1853,6 +1930,39 @@ fetch_mmaped_event(struct perf_session *session,
return event;
}
+static int __perf_session__process_decomp_events(struct perf_session *session)
+{
+ s64 skip;
+ u64 size, file_pos = 0;
+ struct decomp *decomp = session->decomp_last;
+
+ if (!decomp)
+ return 0;
+
+ while (decomp->head < decomp->size && !session_done()) {
+ union perf_event *event = fetch_mmaped_event(session, decomp->head, decomp->size, decomp->data);
+
+ if (!event)
+ break;
+
+ size = event->header.size;
+
+ if (size < sizeof(struct perf_event_header) ||
+ (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
+ decomp->file_pos + decomp->head, event->header.size, event->header.type);
+ return -EINVAL;
+ }
+
+ if (skip)
+ size += skip;
+
+ decomp->head += size;
+ }
+
+ return 0;
+}
+
/*
* On 64bit we can mmap the data file in one go. No need for tiny mmap
* slices. On 32bit we use 32MB.
@@ -1962,6 +2072,10 @@ more:
head += size;
file_pos += size;
+ err = __perf_session__process_decomp_events(session);
+ if (err)
+ goto out;
+
ui_progress__update(prog, size);
if (session_done())
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 6c984c895924..dd8920b745bc 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -39,6 +39,16 @@ struct perf_session {
u64 bytes_transferred;
u64 bytes_compressed;
struct zstd_data zstd_data;
+ struct decomp *decomp;
+ struct decomp *decomp_last;
+};
+
+struct decomp {
+ struct decomp *next;
+ u64 file_pos;
+ u64 head;
+ size_t size;
+ char data[];
};
struct perf_tool;
diff --git a/tools/perf/util/zstd.c b/tools/perf/util/zstd.c
index 359ec9a9d306..23bdb9884576 100644
--- a/tools/perf/util/zstd.c
+++ b/tools/perf/util/zstd.c
@@ -9,6 +9,21 @@ int zstd_init(struct zstd_data *data, int level)
{
size_t ret;
+ data->dstream = ZSTD_createDStream();
+ if (data->dstream == NULL) {
+ pr_err("Couldn't create decompression stream.\n");
+ return -1;
+ }
+
+ ret = ZSTD_initDStream(data->dstream);
+ if (ZSTD_isError(ret)) {
+ pr_err("Failed to initialize decompression stream: %s\n", ZSTD_getErrorName(ret));
+ return -1;
+ }
+
+ if (!level)
+ return 0;
+
data->cstream = ZSTD_createCStream();
if (data->cstream == NULL) {
pr_err("Couldn't create compression stream.\n");
@@ -26,6 +41,11 @@ int zstd_init(struct zstd_data *data, int level)
int zstd_fini(struct zstd_data *data)
{
+ if (data->dstream) {
+ ZSTD_freeDStream(data->dstream);
+ data->dstream = NULL;
+ }
+
if (data->cstream) {
ZSTD_freeCStream(data->cstream);
data->cstream = NULL;
@@ -68,3 +88,24 @@ size_t zstd_compress_stream_to_records(struct zstd_data *data, void *dst, size_t
return compressed;
}
+
+size_t zstd_decompress_stream(struct zstd_data *data, void *src, size_t src_size,
+ void *dst, size_t dst_size)
+{
+ size_t ret;
+ ZSTD_inBuffer input = { src, src_size, 0 };
+ ZSTD_outBuffer output = { dst, dst_size, 0 };
+
+ while (input.pos < input.size) {
+ ret = ZSTD_decompressStream(data->dstream, &output, &input);
+ if (ZSTD_isError(ret)) {
+ pr_err("failed to decompress (B): %ld -> %ld : %s\n",
+ src_size, output.size, ZSTD_getErrorName(ret));
+ break;
+ }
+ output.dst = dst + output.pos;
+ output.size = dst_size - output.pos;
+ }
+
+ return output.pos;
+}
Commit-ID: 371a3378d83a755add84b2dca730a3a641002f3a
Gitweb: https://git.kernel.org/tip/371a3378d83a755add84b2dca730a3a641002f3a
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:45:44 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf inject: Enable COMPRESSED record decompression
Initialized decompression part of Zstd based API so COMPRESSED records
would be decompressed into the resulting output data file.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-inject.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 24086b7f1b14..8e0e06d3edfc 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -837,6 +837,9 @@ int cmd_inject(int argc, const char **argv)
if (inject.session == NULL)
return -1;
+ if (zstd_init(&(inject.session->zstd_data), 0) < 0)
+ pr_warning("Decompression initialization failed.\n");
+
if (inject.build_ids) {
/*
* to make sure the mmap records are ordered correctly
@@ -867,6 +870,7 @@ int cmd_inject(int argc, const char **argv)
ret = __cmd_inject(&inject);
out_delete:
+ zstd_fini(&(inject.session->zstd_data));
perf_session__delete(inject.session);
return ret;
}
Commit-ID: bdc35cbc35c0b33428922503c7c85259510911a6
Gitweb: https://git.kernel.org/tip/bdc35cbc35c0b33428922503c7c85259510911a6
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:46:17 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf tests: Implement Zstd comp/decomp integration test
Introduce a basic integration test for Zstd based record
compression/decompression using 'perf record' and 'perf report'.
Committer notes:
Reduce a bit the freq (from 25 kHz to 5 kHz) and the number of /dev/null
records read (from 1000 to 500), reducing the time it takes to something
more in line with the time existing 'perf test' entries take to run.
With that in place:
$ time perf test zstd
68: Zstd perf.data compression/decompression : Ok
real 0m10.376s
user 0m0.105s
sys 0m0.440s
$ grep "model name" /proc/cpuinfo | head -1
model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
$
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/tests/shell/record+zstd_comp_decomp.sh | 35 +++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/tools/perf/tests/shell/record+zstd_comp_decomp.sh b/tools/perf/tests/shell/record+zstd_comp_decomp.sh
new file mode 100755
index 000000000000..93a26a87b1f2
--- /dev/null
+++ b/tools/perf/tests/shell/record+zstd_comp_decomp.sh
@@ -0,0 +1,35 @@
+#!/bin/sh
+# Zstd perf.data compression/decompression
+
+trace_file=$(mktemp /tmp/perf.data.XXX)
+perf_tool=perf
+output=/dev/null
+
+skip_if_no_z_record() {
+ $perf_tool record -h 2>&1 | grep '\-z, \-\-compression\-level'
+}
+
+collect_z_record() {
+ echo "Collecting compressed record file:"
+ $perf_tool record -o $trace_file -g -z -F 5000 -- \
+ dd count=500 if=/dev/random of=/dev/null > $output 2>&1
+}
+
+check_compressed_stats() {
+ echo "Checking compressed events stats:"
+ $perf_tool report -i $trace_file --header --stats | \
+ grep -E "(# compressed : Zstd,)|(COMPRESSED events:)" > $output 2>&1
+}
+
+check_compressed_output() {
+ $perf_tool inject -i $trace_file -o $trace_file.decomp &&
+ $perf_tool report -i $trace_file --stdio | head -n -3 > $trace_file.comp.output &&
+ $perf_tool report -i $trace_file.decomp --stdio | head -n -3 > $trace_file.decomp.output &&
+ diff $trace_file.comp.output $trace_file.decomp.output > $output 2>&1
+}
+
+skip_if_no_z_record || exit 2
+collect_z_record && check_compressed_stats && check_compressed_output
+err=$?
+rm -f $trace_file*
+exit $err
Commit-ID: 42e1fd80a5b8bf9188ddb502b788433ece189aae
Gitweb: https://git.kernel.org/tip/42e1fd80a5b8bf9188ddb502b788433ece189aae
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:41:33 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf record: Implement COMPRESSED event record and its attributes
Implemented PERF_RECORD_COMPRESSED event, related data types, header
feature and functions to write, read and print feature attributes from
the trace header section.
comp_mmap_len preserves the size of mmaped kernel buffer that was used
during collection. comp_mmap_len size is used on loading stage as the
size of decomp buffer for decompression of COMPRESSED events content.
Committer notes:
Fixed up conflict with BPF_PROG_INFO and BTF_BTF header features.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf.data-file-format.txt | 24 ++++++++++
tools/perf/builtin-record.c | 8 ++++
tools/perf/perf.h | 1 +
tools/perf/util/env.h | 10 ++++
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 7 +++
tools/perf/util/header.c | 53 ++++++++++++++++++++++
tools/perf/util/header.h | 1 +
8 files changed, 105 insertions(+)
diff --git a/tools/perf/Documentation/perf.data-file-format.txt b/tools/perf/Documentation/perf.data-file-format.txt
index 593ef49b273c..6967e9b02be5 100644
--- a/tools/perf/Documentation/perf.data-file-format.txt
+++ b/tools/perf/Documentation/perf.data-file-format.txt
@@ -272,6 +272,19 @@ struct {
Two uint64_t for the time of first sample and the time of last sample.
+ HEADER_COMPRESSED = 27,
+
+struct {
+ u32 version;
+ u32 type;
+ u32 level;
+ u32 ratio;
+ u32 mmap_len;
+};
+
+Indicates that trace contains records of PERF_RECORD_COMPRESSED type
+that have perf_events records in compressed form.
+
other bits are reserved and should ignored for now
HEADER_FEAT_BITS = 256,
@@ -437,6 +450,17 @@ struct auxtrace_error_event {
Describes a header feature. These are records used in pipe-mode that
contain information that otherwise would be in perf.data file's header.
+ PERF_RECORD_COMPRESSED = 81,
+
+struct compressed_event {
+ struct perf_event_header header;
+ char data[];
+};
+
+The header is followed by compressed data frame that can be decompressed
+into array of perf trace records. The size of the entire compressed event
+record including the header is limited by the max value of header.size.
+
Event types
Define the event attributes with their IDs.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 386e665a166f..45a80b3584ad 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -372,6 +372,11 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+static int record__comp_enabled(struct record *rec)
+{
+ return rec->opts.comp_level > 0;
+}
+
static int process_synthesized_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample __maybe_unused,
@@ -888,6 +893,8 @@ static void record__init_features(struct record *rec)
perf_header__clear_feat(&session->header, HEADER_CLOCKID);
perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__comp_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_COMPRESSED);
perf_header__clear_feat(&session->header, HEADER_STAT);
}
@@ -1245,6 +1252,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
err = -1;
goto out_child;
}
+ session->header.env.comp_mmap_len = session->evlist->mmap_len;
err = bpf__apply_obj_config();
if (err) {
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 369eae61068d..d59dee61b64d 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -86,6 +86,7 @@ struct record_opts {
int nr_cblocks;
int affinity;
int mmap_flush;
+ unsigned int comp_level;
};
enum perf_affinity {
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index 34868ca7efd1..271a90b326c4 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -63,6 +63,10 @@ struct perf_env {
struct cpu_cache_level *caches;
int caches_cnt;
u32 comp_ratio;
+ u32 comp_ver;
+ u32 comp_type;
+ u32 comp_level;
+ u32 comp_mmap_len;
struct numa_node *numa_nodes;
struct memory_node *memory_nodes;
unsigned long long memory_bsize;
@@ -81,6 +85,12 @@ struct perf_env {
} bpf_progs;
};
+enum perf_compress_type {
+ PERF_COMP_NONE = 0,
+ PERF_COMP_ZSTD,
+ PERF_COMP_MAX
+};
+
struct bpf_prog_info_node;
struct btf_node;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index ba7be74fad6e..d1ad6c419724 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -68,6 +68,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
+ [PERF_RECORD_COMPRESSED] = "COMPRESSED",
};
static const char *perf_ns__names[] = {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 4e908ec1ef64..9e999550f247 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -255,6 +255,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
PERF_RECORD_HEADER_FEATURE = 80,
+ PERF_RECORD_COMPRESSED = 81,
PERF_RECORD_HEADER_MAX
};
@@ -627,6 +628,11 @@ struct feature_event {
char data[];
};
+struct compressed_event {
+ struct perf_event_header header;
+ char data[];
+};
+
union perf_event {
struct perf_event_header header;
struct mmap_event mmap;
@@ -660,6 +666,7 @@ union perf_event {
struct feature_event feat;
struct ksymbol_event ksymbol_event;
struct bpf_event bpf_event;
+ struct compressed_event pack;
};
void perf_event__print_totals(void);
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 2d2af2ac2b1e..847ae51a524b 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1344,6 +1344,30 @@ out:
return ret;
}
+static int write_compressed(struct feat_fd *ff __maybe_unused,
+ struct perf_evlist *evlist __maybe_unused)
+{
+ int ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_ver), sizeof(ff->ph->env.comp_ver));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_type), sizeof(ff->ph->env.comp_type));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_level), sizeof(ff->ph->env.comp_level));
+ if (ret)
+ return ret;
+
+ ret = do_write(ff, &(ff->ph->env.comp_ratio), sizeof(ff->ph->env.comp_ratio));
+ if (ret)
+ return ret;
+
+ return do_write(ff, &(ff->ph->env.comp_mmap_len), sizeof(ff->ph->env.comp_mmap_len));
+}
+
static void print_hostname(struct feat_fd *ff, FILE *fp)
{
fprintf(fp, "# hostname : %s\n", ff->ph->env.hostname);
@@ -1688,6 +1712,13 @@ static void print_cache(struct feat_fd *ff, FILE *fp __maybe_unused)
}
}
+static void print_compressed(struct feat_fd *ff, FILE *fp)
+{
+ fprintf(fp, "# compressed : %s, level = %d, ratio = %d\n",
+ ff->ph->env.comp_type == PERF_COMP_ZSTD ? "Zstd" : "Unknown",
+ ff->ph->env.comp_level, ff->ph->env.comp_ratio);
+}
+
static void print_pmu_mappings(struct feat_fd *ff, FILE *fp)
{
const char *delimiter = "# pmu mappings: ";
@@ -2667,6 +2698,27 @@ out:
return err;
}
+static int process_compressed(struct feat_fd *ff,
+ void *data __maybe_unused)
+{
+ if (do_read_u32(ff, &(ff->ph->env.comp_ver)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_type)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_level)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_ratio)))
+ return -1;
+
+ if (do_read_u32(ff, &(ff->ph->env.comp_mmap_len)))
+ return -1;
+
+ return 0;
+}
+
struct feature_ops {
int (*write)(struct feat_fd *ff, struct perf_evlist *evlist);
void (*print)(struct feat_fd *ff, FILE *fp);
@@ -2730,6 +2782,7 @@ static const struct feature_ops feat_ops[HEADER_LAST_FEATURE] = {
FEAT_OPN(DIR_FORMAT, dir_format, false),
FEAT_OPR(BPF_PROG_INFO, bpf_prog_info, false),
FEAT_OPR(BPF_BTF, bpf_btf, false),
+ FEAT_OPR(COMPRESSED, compressed, false),
};
struct header_print_data {
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index 386da49e1bfa..5b3abe4172e2 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -42,6 +42,7 @@ enum {
HEADER_DIR_FORMAT,
HEADER_BPF_PROG_INFO,
HEADER_BPF_BTF,
+ HEADER_COMPRESSED,
HEADER_LAST_FEATURE,
HEADER_FEAT_BITS = 256,
};
Commit-ID: 51255a8af7c41c876c2d715a35ab03c13302a607
Gitweb: https://git.kernel.org/tip/51255a8af7c41c876c2d715a35ab03c13302a607
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:42:19 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf mmap: Implement dedicated memory buffer for data compression
Implemented mmap data buffer that is used as the memory to operate
on when compressing data in case of serial trace streaming.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-record.c | 8 +++++++-
tools/perf/util/evlist.c | 8 +++++---
tools/perf/util/evlist.h | 2 +-
tools/perf/util/mmap.c | 30 ++++++++++++++++++++++++++++--
tools/perf/util/mmap.h | 4 +++-
5 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 45a80b3584ad..ca6d7488e34b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -372,6 +372,8 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+static unsigned int comp_level_max = 22;
+
static int record__comp_enabled(struct record *rec)
{
return rec->opts.comp_level > 0;
@@ -587,7 +589,7 @@ static int record__mmap_evlist(struct record *rec,
opts->auxtrace_mmap_pages,
opts->auxtrace_snapshot_mode,
opts->nr_cblocks, opts->affinity,
- opts->mmap_flush) < 0) {
+ opts->mmap_flush, opts->comp_level) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
@@ -2298,6 +2300,10 @@ int cmd_record(int argc, const char **argv)
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
+ if (rec->opts.comp_level > comp_level_max)
+ rec->opts.comp_level = comp_level_max;
+ pr_debug("comp level: %d\n", rec->opts.comp_level);
+
err = __cmd_record(&record, argc, argv);
out:
perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 4b6783ff5813..69d0fa8ab16f 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1009,7 +1009,8 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
*/
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
- bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush)
+ bool auxtrace_overwrite, int nr_cblocks, int affinity, int flush,
+ int comp_level)
{
struct perf_evsel *evsel;
const struct cpu_map *cpus = evlist->cpus;
@@ -1019,7 +1020,8 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
* Its value is decided by evsel's write_backward.
* So &mp should not be passed through const pointer.
*/
- struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush };
+ struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity, .flush = flush,
+ .comp_level = comp_level };
if (!evlist->mmap)
evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1051,7 +1053,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
{
- return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1);
+ return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS, 1, 0);
}
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index c9a0f72677fd..49354fe24d5f 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -178,7 +178,7 @@ unsigned long perf_event_mlock_kb_in_pages(void);
int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
unsigned int auxtrace_pages,
bool auxtrace_overwrite, int nr_cblocks,
- int affinity, int flush);
+ int affinity, int flush, int comp_level);
int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
void perf_evlist__munmap(struct perf_evlist *evlist);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ef3d79b2c90b..d85e73fc82e2 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -157,6 +157,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
}
#ifdef HAVE_AIO_SUPPORT
+static int perf_mmap__aio_enabled(struct perf_mmap *map)
+{
+ return map->aio.nr_cblocks > 0;
+}
#ifdef HAVE_LIBNUMA_SUPPORT
static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
@@ -198,7 +202,7 @@ static int perf_mmap__aio_bind(struct perf_mmap *map, int idx, int cpu, int affi
return 0;
}
-#else
+#else /* !HAVE_LIBNUMA_SUPPORT */
static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
{
map->aio.data[idx] = malloc(perf_mmap__mmap_len(map));
@@ -359,7 +363,12 @@ int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
return rc;
}
-#else
+#else /* !HAVE_AIO_SUPPORT */
+static int perf_mmap__aio_enabled(struct perf_mmap *map __maybe_unused)
+{
+ return 0;
+}
+
static int perf_mmap__aio_mmap(struct perf_mmap *map __maybe_unused,
struct mmap_params *mp __maybe_unused)
{
@@ -374,6 +383,10 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map __maybe_unused)
void perf_mmap__munmap(struct perf_mmap *map)
{
perf_mmap__aio_munmap(map);
+ if (map->data != NULL) {
+ munmap(map->data, perf_mmap__mmap_len(map));
+ map->data = NULL;
+ }
if (map->base != NULL) {
munmap(map->base, perf_mmap__mmap_len(map));
map->base = NULL;
@@ -442,6 +455,19 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
map->flush = mp->flush;
+ map->comp_level = mp->comp_level;
+
+ if (map->comp_level && !perf_mmap__aio_enabled(map)) {
+ map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
+ MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+ if (map->data == MAP_FAILED) {
+ pr_debug2("failed to mmap data buffer, error %d\n",
+ errno);
+ map->data = NULL;
+ return -1;
+ }
+ }
+
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
&mp->auxtrace_mp, map->base, fd))
return -1;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index b82f8c2d55c4..4e2f58d95c1f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -40,6 +40,8 @@ struct perf_mmap {
#endif
cpu_set_t affinity_mask;
u64 flush;
+ void *data;
+ int comp_level;
};
/*
@@ -71,7 +73,7 @@ enum bkw_mmap_state {
};
struct mmap_params {
- int prot, mask, nr_cblocks, affinity, flush;
+ int prot, mask, nr_cblocks, affinity, flush, comp_level;
struct auxtrace_mmap_params auxtrace_mp;
};
Commit-ID: f24c1d7523e6db26ec2115a308750c875927741b
Gitweb: https://git.kernel.org/tip/f24c1d7523e6db26ec2115a308750c875927741b
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:42:55 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf tools: Introduce Zstd streaming based compression API
Implemented functions are based on Zstd streaming compression API.
The functions are used in runtime to compress data that come from mmaped
kernel buffer. zstd_init(), zstd_fini() are used for initialization and
finalization to allocate and deallocate internal zstd objects.
zstd_compress_stream_to_records() is used to convert parts of mmaped
kernel buffer into an array of PERF_RECORD_COMPRESSED records.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/Build | 2 ++
tools/perf/util/compress.h | 42 ++++++++++++++++++++++++++++
tools/perf/util/zstd.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 114 insertions(+)
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 8dd3102301ea..6d5bbc8b589b 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -145,6 +145,8 @@ perf-y += scripting-engines/
perf-$(CONFIG_ZLIB) += zlib.o
perf-$(CONFIG_LZMA) += lzma.o
+perf-$(CONFIG_ZSTD) += zstd.o
+
perf-y += demangle-java.o
perf-y += demangle-rust.o
diff --git a/tools/perf/util/compress.h b/tools/perf/util/compress.h
index 892e92e7e7fc..1041a4fd81e2 100644
--- a/tools/perf/util/compress.h
+++ b/tools/perf/util/compress.h
@@ -2,6 +2,11 @@
#ifndef PERF_COMPRESS_H
#define PERF_COMPRESS_H
+#include <stdbool.h>
+#ifdef HAVE_ZSTD_SUPPORT
+#include <zstd.h>
+#endif
+
#ifdef HAVE_ZLIB_SUPPORT
int gzip_decompress_to_file(const char *input, int output_fd);
bool gzip_is_compressed(const char *input);
@@ -12,4 +17,41 @@ int lzma_decompress_to_file(const char *input, int output_fd);
bool lzma_is_compressed(const char *input);
#endif
+struct zstd_data {
+#ifdef HAVE_ZSTD_SUPPORT
+ ZSTD_CStream *cstream;
+#endif
+};
+
+#ifdef HAVE_ZSTD_SUPPORT
+
+int zstd_init(struct zstd_data *data, int level);
+int zstd_fini(struct zstd_data *data);
+
+size_t zstd_compress_stream_to_records(struct zstd_data *data, void *dst, size_t dst_size,
+ void *src, size_t src_size, size_t max_record_size,
+ size_t process_header(void *record, size_t increment));
+#else /* !HAVE_ZSTD_SUPPORT */
+
+static inline int zstd_init(struct zstd_data *data __maybe_unused, int level __maybe_unused)
+{
+ return 0;
+}
+
+static inline int zstd_fini(struct zstd_data *data __maybe_unused)
+{
+ return 0;
+}
+
+static inline
+size_t zstd_compress_stream_to_records(struct zstd_data *data __maybe_unused,
+ void *dst __maybe_unused, size_t dst_size __maybe_unused,
+ void *src __maybe_unused, size_t src_size __maybe_unused,
+ size_t max_record_size __maybe_unused,
+ size_t process_header(void *record, size_t increment) __maybe_unused)
+{
+ return 0;
+}
+#endif
+
#endif /* PERF_COMPRESS_H */
diff --git a/tools/perf/util/zstd.c b/tools/perf/util/zstd.c
new file mode 100644
index 000000000000..359ec9a9d306
--- /dev/null
+++ b/tools/perf/util/zstd.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <string.h>
+
+#include "util/compress.h"
+#include "util/debug.h"
+
+int zstd_init(struct zstd_data *data, int level)
+{
+ size_t ret;
+
+ data->cstream = ZSTD_createCStream();
+ if (data->cstream == NULL) {
+ pr_err("Couldn't create compression stream.\n");
+ return -1;
+ }
+
+ ret = ZSTD_initCStream(data->cstream, level);
+ if (ZSTD_isError(ret)) {
+ pr_err("Failed to initialize compression stream: %s\n", ZSTD_getErrorName(ret));
+ return -1;
+ }
+
+ return 0;
+}
+
+int zstd_fini(struct zstd_data *data)
+{
+ if (data->cstream) {
+ ZSTD_freeCStream(data->cstream);
+ data->cstream = NULL;
+ }
+
+ return 0;
+}
+
+size_t zstd_compress_stream_to_records(struct zstd_data *data, void *dst, size_t dst_size,
+ void *src, size_t src_size, size_t max_record_size,
+ size_t process_header(void *record, size_t increment))
+{
+ size_t ret, size, compressed = 0;
+ ZSTD_inBuffer input = { src, src_size, 0 };
+ ZSTD_outBuffer output;
+ void *record;
+
+ while (input.pos < input.size) {
+ record = dst;
+ size = process_header(record, 0);
+ compressed += size;
+ dst += size;
+ dst_size -= size;
+ output = (ZSTD_outBuffer){ dst, (dst_size > max_record_size) ?
+ max_record_size : dst_size, 0 };
+ ret = ZSTD_compressStream(data->cstream, &output, &input);
+ ZSTD_flushStream(data->cstream, &output);
+ if (ZSTD_isError(ret)) {
+ pr_err("failed to compress %ld bytes: %s\n",
+ (long)src_size, ZSTD_getErrorName(ret));
+ memcpy(dst, src, src_size);
+ return src_size;
+ }
+ size = output.pos;
+ size = process_header(record, size);
+ compressed += size;
+ dst += size;
+ dst_size -= size;
+ }
+
+ return compressed;
+}
Commit-ID: 5d7f41164930ecc1797702b7f9728ac702609ef3
Gitweb: https://git.kernel.org/tip/5d7f41164930ecc1797702b7f9728ac702609ef3
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:43:35 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf record: Implement compression for serial trace streaming
Compression is implemented using the functions from zstd.c. As the
memory to operate on the compression uses mmap->data buffer.
If Zstd streaming compression API fails for some reason the data to be
compressed are just copied into the memory buffers using plain memcpy().
Compressed trace frame consists of an array of PERF_RECORD_COMPRESSED
records. Each element of the array is not longer that
PERF_SAMPLE_MAX_SIZE and consists of perf_event_header followed by the
compressed chunk that is decompressed on the loading stage.
Comitter notes:
Undo some unnecessary line breaks, remove some unnecessary () around
zstd_data to then just get its address, and fix conflicts with
BPF_PROG_INFO/BPF_BTF patchkits.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-record.c | 51 +++++++++++++++++++++++++++++++++++++++++++--
tools/perf/util/session.h | 2 ++
2 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ca6d7488e34b..de9632c69852 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -133,6 +133,9 @@ static int record__write(struct record *rec, struct perf_mmap *map __maybe_unuse
return 0;
}
+static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+ void *src, size_t src_size);
+
#ifdef HAVE_AIO_SUPPORT
static int record__aio_write(struct aiocb *cblock, int trace_fd,
void *buf, size_t size, off_t off)
@@ -392,6 +395,11 @@ static int record__pushfn(struct perf_mmap *map, void *to, void *bf, size_t size
{
struct record *rec = to;
+ if (record__comp_enabled(rec)) {
+ size = zstd_compress(rec->session, map->data, perf_mmap__mmap_len(map), bf, size);
+ bf = map->data;
+ }
+
rec->samples++;
return record__write(rec, map, bf, size);
}
@@ -778,6 +786,37 @@ static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
}
}
+static size_t process_comp_header(void *record, size_t increment)
+{
+ struct compressed_event *event = record;
+ size_t size = sizeof(*event);
+
+ if (increment) {
+ event->header.size += increment;
+ return increment;
+ }
+
+ event->header.type = PERF_RECORD_COMPRESSED;
+ event->header.size = size;
+
+ return size;
+}
+
+static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+ void *src, size_t src_size)
+{
+ size_t compressed;
+ size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct compressed_event) - 1;
+
+ compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
+ max_record_size, process_comp_header);
+
+ session->bytes_transferred += src_size;
+ session->bytes_compressed += compressed;
+
+ return compressed;
+}
+
static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
bool overwrite, bool synch)
{
@@ -1225,6 +1264,14 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
fd = perf_data__fd(data);
rec->session = session;
+ if (zstd_init(&session->zstd_data, rec->opts.comp_level) < 0) {
+ pr_err("Compression initialization failed.\n");
+ return -1;
+ }
+
+ session->header.env.comp_type = PERF_COMP_ZSTD;
+ session->header.env.comp_level = rec->opts.comp_level;
+
record__init_features(rec);
if (rec->opts.use_clockid && rec->opts.clockid_res_ns)
@@ -1565,6 +1612,7 @@ out_child:
}
out_delete_session:
+ zstd_fini(&session->zstd_data);
perf_session__delete(session);
if (!opts->no_bpf_event)
@@ -2294,8 +2342,7 @@ int cmd_record(int argc, const char **argv)
if (rec->opts.nr_cblocks > nr_cblocks_max)
rec->opts.nr_cblocks = nr_cblocks_max;
- if (verbose > 0)
- pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
+ pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
pr_debug("mmap flush: %d\n", rec->opts.mmap_flush);
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 0e14884f28b2..6c984c895924 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -8,6 +8,7 @@
#include "machine.h"
#include "data.h"
#include "ordered-events.h"
+#include "util/compress.h"
#include <linux/kernel.h>
#include <linux/rbtree.h>
#include <linux/perf_event.h>
@@ -37,6 +38,7 @@ struct perf_session {
struct perf_tool *tool;
u64 bytes_transferred;
u64 bytes_compressed;
+ struct zstd_data zstd_data;
};
struct perf_tool;
Commit-ID: ef781128e47e73f0e5b2ad385cfa685a0719456a
Gitweb: https://git.kernel.org/tip/ef781128e47e73f0e5b2ad385cfa685a0719456a
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:44:12 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf record: Implement compression for AIO trace streaming
Compression is implemented using the functions from zstd.c. As the memory
to operate on the compression uses mmap->aio.data[] buffers. If Zstd
streaming compression API fails for some reason the data to be compressed
are just copied into the memory buffers using plain memcpy().
Compressed trace frame consists of an array of PERF_RECORD_COMPRESSED
records. Each element of the array is not longer that PERF_SAMPLE_MAX_SIZE
and consists of perf_event_header followed by the compressed chunk
that is decompressed on the loading stage.
perf_mmap__aio_push() is replaced by perf_mmap__push() which is now used
in the both serial and AIO streaming cases. perf_mmap__push() is extended
with positive return values to signify absence of data ready for
processing.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-record.c | 114 ++++++++++++++++++++++++++++++++++----------
tools/perf/util/mmap.c | 76 +----------------------------
tools/perf/util/mmap.h | 12 -----
3 files changed, 89 insertions(+), 113 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index de9632c69852..a0bd9104fae6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -133,6 +133,8 @@ static int record__write(struct record *rec, struct perf_mmap *map __maybe_unuse
return 0;
}
+static int record__aio_enabled(struct record *rec);
+static int record__comp_enabled(struct record *rec);
static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
void *src, size_t src_size);
@@ -186,9 +188,9 @@ static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
if (rem_size == 0) {
cblock->aio_fildes = -1;
/*
- * md->refcount is incremented in perf_mmap__push() for
- * every enqueued aio write request so decrement it because
- * the request is now complete.
+ * md->refcount is incremented in record__aio_pushfn() for
+ * every aio write request started in record__aio_push() so
+ * decrement it because the request is now complete.
*/
perf_mmap__put(md);
rc = 1;
@@ -243,18 +245,89 @@ static int record__aio_sync(struct perf_mmap *md, bool sync_all)
} while (1);
}
-static int record__aio_pushfn(void *to, struct aiocb *cblock, void *bf, size_t size, off_t off)
+struct record_aio {
+ struct record *rec;
+ void *data;
+ size_t size;
+};
+
+static int record__aio_pushfn(struct perf_mmap *map, void *to, void *buf, size_t size)
{
- struct record *rec = to;
- int ret, trace_fd = rec->session->data->file.fd;
+ struct record_aio *aio = to;
- rec->samples++;
+ /*
+ * map->base data pointed by buf is copied into free map->aio.data[] buffer
+ * to release space in the kernel buffer as fast as possible, calling
+ * perf_mmap__consume() from perf_mmap__push() function.
+ *
+ * That lets the kernel to proceed with storing more profiling data into
+ * the kernel buffer earlier than other per-cpu kernel buffers are handled.
+ *
+ * Coping can be done in two steps in case the chunk of profiling data
+ * crosses the upper bound of the kernel buffer. In this case we first move
+ * part of data from map->start till the upper bound and then the reminder
+ * from the beginning of the kernel buffer till the end of the data chunk.
+ */
- ret = record__aio_write(cblock, trace_fd, bf, size, off);
+ if (record__comp_enabled(aio->rec)) {
+ size = zstd_compress(aio->rec->session, aio->data + aio->size,
+ perf_mmap__mmap_len(map) - aio->size,
+ buf, size);
+ } else {
+ memcpy(aio->data + aio->size, buf, size);
+ }
+
+ if (!aio->size) {
+ /*
+ * Increment map->refcount to guard map->aio.data[] buffer
+ * from premature deallocation because map object can be
+ * released earlier than aio write request started on
+ * map->aio.data[] buffer is complete.
+ *
+ * perf_mmap__put() is done at record__aio_complete()
+ * after started aio request completion or at record__aio_push()
+ * if the request failed to start.
+ */
+ perf_mmap__get(map);
+ }
+
+ aio->size += size;
+
+ return size;
+}
+
+static int record__aio_push(struct record *rec, struct perf_mmap *map, off_t *off)
+{
+ int ret, idx;
+ int trace_fd = rec->session->data->file.fd;
+ struct record_aio aio = { .rec = rec, .size = 0 };
+
+ /*
+ * Call record__aio_sync() to wait till map->aio.data[] buffer
+ * becomes available after previous aio write operation.
+ */
+
+ idx = record__aio_sync(map, false);
+ aio.data = map->aio.data[idx];
+ ret = perf_mmap__push(map, &aio, record__aio_pushfn);
+ if (ret != 0) /* ret > 0 - no data, ret < 0 - error */
+ return ret;
+
+ rec->samples++;
+ ret = record__aio_write(&(map->aio.cblocks[idx]), trace_fd, aio.data, aio.size, *off);
if (!ret) {
- rec->bytes_written += size;
+ *off += aio.size;
+ rec->bytes_written += aio.size;
if (switch_output_size(rec))
trigger_hit(&switch_output_trigger);
+ } else {
+ /*
+ * Decrement map->refcount incremented in record__aio_pushfn()
+ * back if record__aio_write() operation failed to start, otherwise
+ * map->refcount is decremented in record__aio_complete() after
+ * aio write operation finishes successfully.
+ */
+ perf_mmap__put(map);
}
return ret;
@@ -276,7 +349,7 @@ static void record__aio_mmap_read_sync(struct record *rec)
struct perf_evlist *evlist = rec->evlist;
struct perf_mmap *maps = evlist->mmap;
- if (!rec->opts.nr_cblocks)
+ if (!record__aio_enabled(rec))
return;
for (i = 0; i < evlist->nr_mmaps; i++) {
@@ -310,13 +383,8 @@ static int record__aio_parse(const struct option *opt,
#else /* HAVE_AIO_SUPPORT */
static int nr_cblocks_max = 0;
-static int record__aio_sync(struct perf_mmap *md __maybe_unused, bool sync_all __maybe_unused)
-{
- return -1;
-}
-
-static int record__aio_pushfn(void *to __maybe_unused, struct aiocb *cblock __maybe_unused,
- void *bf __maybe_unused, size_t size __maybe_unused, off_t off __maybe_unused)
+static int record__aio_push(struct record *rec __maybe_unused, struct perf_mmap *map __maybe_unused,
+ off_t *off __maybe_unused)
{
return -1;
}
@@ -825,7 +893,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
int rc = 0;
struct perf_mmap *maps;
int trace_fd = rec->data.file.fd;
- off_t off;
+ off_t off = 0;
if (!evlist)
return 0;
@@ -851,20 +919,14 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
map->flush = 1;
}
if (!record__aio_enabled(rec)) {
- if (perf_mmap__push(map, rec, record__pushfn) != 0) {
+ if (perf_mmap__push(map, rec, record__pushfn) < 0) {
if (synch)
map->flush = flush;
rc = -1;
goto out;
}
} else {
- int idx;
- /*
- * Call record__aio_sync() to wait till map->data buffer
- * becomes available after previous aio write request.
- */
- idx = record__aio_sync(map, false);
- if (perf_mmap__aio_push(map, rec, idx, record__aio_pushfn, &off) != 0) {
+ if (record__aio_push(rec, map, &off) < 0) {
record__aio_set_pos(trace_fd, off);
if (synch)
map->flush = flush;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index d85e73fc82e2..868c0b0e909c 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -289,80 +289,6 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
zfree(&map->aio.cblocks);
zfree(&map->aio.aiocb);
}
-
-int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off),
- off_t *off)
-{
- u64 head = perf_mmap__read_head(md);
- unsigned char *data = md->base + page_size;
- unsigned long size, size0 = 0;
- void *buf;
- int rc = 0;
-
- rc = perf_mmap__read_init(md);
- if (rc < 0)
- return (rc == -EAGAIN) ? 0 : -1;
-
- /*
- * md->base data is copied into md->data[idx] buffer to
- * release space in the kernel buffer as fast as possible,
- * thru perf_mmap__consume() below.
- *
- * That lets the kernel to proceed with storing more
- * profiling data into the kernel buffer earlier than other
- * per-cpu kernel buffers are handled.
- *
- * Coping can be done in two steps in case the chunk of
- * profiling data crosses the upper bound of the kernel buffer.
- * In this case we first move part of data from md->start
- * till the upper bound and then the reminder from the
- * beginning of the kernel buffer till the end of
- * the data chunk.
- */
-
- size = md->end - md->start;
-
- if ((md->start & md->mask) + size != (md->end & md->mask)) {
- buf = &data[md->start & md->mask];
- size = md->mask + 1 - (md->start & md->mask);
- md->start += size;
- memcpy(md->aio.data[idx], buf, size);
- size0 = size;
- }
-
- buf = &data[md->start & md->mask];
- size = md->end - md->start;
- md->start += size;
- memcpy(md->aio.data[idx] + size0, buf, size);
-
- /*
- * Increment md->refcount to guard md->data[idx] buffer
- * from premature deallocation because md object can be
- * released earlier than aio write request started
- * on mmap->data[idx] is complete.
- *
- * perf_mmap__put() is done at record__aio_complete()
- * after started request completion.
- */
- perf_mmap__get(md);
-
- md->prev = head;
- perf_mmap__consume(md);
-
- rc = push(to, &md->aio.cblocks[idx], md->aio.data[idx], size0 + size, *off);
- if (!rc) {
- *off += size0 + size;
- } else {
- /*
- * Decrement md->refcount back if aio write
- * operation failed to start.
- */
- perf_mmap__put(md);
- }
-
- return rc;
-}
#else /* !HAVE_AIO_SUPPORT */
static int perf_mmap__aio_enabled(struct perf_mmap *map __maybe_unused)
{
@@ -566,7 +492,7 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
rc = perf_mmap__read_init(md);
if (rc < 0)
- return (rc == -EAGAIN) ? 0 : -1;
+ return (rc == -EAGAIN) ? 1 : -1;
size = md->end - md->start;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 4e2f58d95c1f..274ce389cd84 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -101,18 +101,6 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map);
int perf_mmap__push(struct perf_mmap *md, void *to,
int push(struct perf_mmap *map, void *to, void *buf, size_t size));
-#ifdef HAVE_AIO_SUPPORT
-int perf_mmap__aio_push(struct perf_mmap *md, void *to, int idx,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off),
- off_t *off);
-#else
-static inline int perf_mmap__aio_push(struct perf_mmap *md __maybe_unused, void *to __maybe_unused, int idx __maybe_unused,
- int push(void *to, struct aiocb *cblock, void *buf, size_t size, off_t off) __maybe_unused,
- off_t *off __maybe_unused)
-{
- return 0;
-}
-#endif
size_t perf_mmap__mmap_len(struct perf_mmap *map);
Commit-ID: d3c8c08e75c4cbb6a940323092b40fcfd1de5380
Gitweb: https://git.kernel.org/tip/d3c8c08e75c4cbb6a940323092b40fcfd1de5380
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:41:02 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf session: Define 'bytes_transferred' and 'bytes_compressed' metrics
Define 'bytes_transferred' and 'bytes_compressed' metrics to calculate
ratio in the end of the data collection:
compression ratio = bytes_transferred / bytes_compressed
The 'bytes_transferred' metric accumulates the amount of bytes that was
extracted from the mmaped kernel buffers for compression, while
'bytes_compressed' accumulates the amount of bytes that was received
after applying compression.
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-record.c | 14 +++++++++++++-
tools/perf/util/env.h | 1 +
tools/perf/util/session.h | 2 ++
3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index d2b5a22b7249..386e665a166f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1186,6 +1186,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
bool disabled = false, draining = false;
struct perf_evlist *sb_evlist = NULL;
int fd;
+ float ratio = 0;
atexit(record__sig_exit);
signal(SIGCHLD, sig_handler);
@@ -1491,6 +1492,11 @@ out_child:
record__mmap_read_all(rec, true);
record__aio_mmap_read_sync(rec);
+ if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
+ ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
+ session->header.env.comp_ratio = ratio + 0.5;
+ }
+
if (forks) {
int exit_status;
@@ -1537,9 +1543,15 @@ out_child:
else
samples[0] = '\0';
- fprintf(stderr, "[ perf record: Captured and wrote %.3f MB %s%s%s ]\n",
+ fprintf(stderr, "[ perf record: Captured and wrote %.3f MB %s%s%s",
perf_data__size(data) / 1024.0 / 1024.0,
data->path, postfix, samples);
+ if (ratio) {
+ fprintf(stderr, ", compressed (original %.3f MB, ratio is %.3f)",
+ rec->session->bytes_transferred / 1024.0 / 1024.0,
+ ratio);
+ }
+ fprintf(stderr, " ]\n");
}
out_delete_session:
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index 4f8e2b485c01..34868ca7efd1 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -62,6 +62,7 @@ struct perf_env {
struct cpu_topology_map *cpu;
struct cpu_cache_level *caches;
int caches_cnt;
+ u32 comp_ratio;
struct numa_node *numa_nodes;
struct memory_node *memory_nodes;
unsigned long long memory_bsize;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index d96eccd7d27f..0e14884f28b2 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -35,6 +35,8 @@ struct perf_session {
struct ordered_events ordered_events;
struct perf_data *data;
struct perf_tool *tool;
+ u64 bytes_transferred;
+ u64 bytes_compressed;
};
struct perf_tool;
Commit-ID: 504c1ad11691d1a16e92285bb961728a80c06014
Gitweb: https://git.kernel.org/tip/504c1ad11691d1a16e92285bb961728a80c06014
Author: Alexey Budankov <[email protected]>
AuthorDate: Mon, 18 Mar 2019 20:44:42 +0300
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Wed, 15 May 2019 16:36:49 -0300
perf record: Implement -z,--compression_level[=<n>] option
Implemented -z,--compression_level[=<n>] option that enables compression
of mmaped kernel data buffers content in runtime during perf record mode
collection. Default option value is 1 (fastest compression).
Compression overhead has been measured for serial and AIO streaming when
profiling matrix multiplication workload:
-------------------------------------------------------------
| SERIAL | AIO-1 |
----------------------------------------------------------------|
|-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
|---------------------------------------------------------------|
| 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
| 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
| 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
| 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
| 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
| 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
-----------------------------------------------------------------
OVH = (Execution time with -z N) / (Execution time with -z 0)
ratio - compression ratio
size - number of bytes that was compressed
size ~= trace size x ratio
Committer notes:
Testing it I noticed that it failed to disable build id processing when
compression is enabled, and as we'd have to uncompress everything to
look for the PERF_RECORD_{MMAP,SAMPLE,etc} to figure out which build ids
to read from DSOs, we better disable build id processing when
compression is enabled, logging with pr_debug() when doing so:
Original patch:
# perf record -z2
^C[ perf record: Woken up 1 times to write data ]
0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
[ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
#
After auto-disabling build id processing when compression is enabled:
$ perf record -z2 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
$ perf record -v -z2 sleep 1
Compression enabled, disabling build id collection at the end of the session.
<SNIP extra -v pr_debug() messages>
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
$
Also, with parts of the patch originally after this one moved to just
before this one we get:
$ perf record -z2 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.371) ]
$ perf report -D | grep COMPRESS
0 0x1b8 [0x155]: PERF_RECORD_COMPRESSED: unhandled!
0 0x30d [0x80]: PERF_RECORD_COMPRESSED: unhandled!
COMPRESSED events: 2
COMPRESSED events: 0
$
I.e. when faced with PERF_RECORD_COMPRESSED that we still have no code
to process, we just show it as not being handled, skip them and
continue, while before we had:
$ perf report -D | grep COMPRESS
0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
Error:
failed to process sample
0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
$
Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 5 +++++
tools/perf/builtin-record.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 58986f4cc190..27b37624c376 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -478,6 +478,11 @@ Also at some cases executing less output write syscalls with bigger data size
can take less time than executing more output write syscalls with smaller data
size thus lowering runtime profiling overhead.
+-z::
+--compression-level[=n]::
+Produce compressed trace using specified level n (default: 1 - fastest compression,
+22 - smallest trace)
+
--all-kernel::
Configure all used events to run in kernel space.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a0bd9104fae6..861395753c25 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -443,6 +443,25 @@ static int record__mmap_flush_parse(const struct option *opt,
return 0;
}
+#ifdef HAVE_ZSTD_SUPPORT
+static unsigned int comp_level_default = 1;
+
+static int record__parse_comp_level(const struct option *opt, const char *str, int unset)
+{
+ struct record_opts *opts = opt->value;
+
+ if (unset) {
+ opts->comp_level = 0;
+ } else {
+ if (str)
+ opts->comp_level = strtol(str, NULL, 0);
+ if (!opts->comp_level)
+ opts->comp_level = comp_level_default;
+ }
+
+ return 0;
+}
+#endif
static unsigned int comp_level_max = 22;
static int record__comp_enabled(struct record *rec)
@@ -2200,6 +2219,11 @@ static struct option __record_options[] = {
OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
"Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
record__parse_affinity),
+#ifdef HAVE_ZSTD_SUPPORT
+ OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
+ "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
+ record__parse_comp_level),
+#endif
OPT_END()
};
@@ -2259,6 +2283,12 @@ int cmd_record(int argc, const char **argv)
"cgroup monitoring only available in system-wide mode");
}
+
+ if (rec->opts.comp_level != 0) {
+ pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
+ rec->no_buildid = true;
+ }
+
if (rec->opts.record_switch_events &&
!perf_can_record_switch_events()) {
ui__error("kernel does not support recording context switch events\n");