2021-05-26 12:47:23

by Bayduraev, Alexey V

[permalink] [raw]
Subject: [PATCH v6 00/20] Introduce threaded trace streaming for basic perf record operation

Changes in v6:
- fixed leaks and possible double free in record__thread_mask_alloc()
- fixed leaks in record__init_thread_user_masks()
- fixed final mmaps flushing for threads id > 0
- merged with origin/perf/core

v5: https://lore.kernel.org/lkml/[email protected]/

Changes in v5:
- fixed leaks in record__init_thread_masks_spec()
- fixed leaks after failed realloc
- replaced "%m" to strerror()
- added masks examples to the documentation
- captured Acked-by: tags by Andi Kleen
- do not allow --thread option for full_auxtrace mode
- split patch 06/12 to 06/20 and 07/20
- split patch 08/12 to 09/20 and 10/20
- split patches 11/12 and 11/12 to 13/20-20/20

v4: https://lore.kernel.org/lkml/[email protected]/

Changes in v4:
- renamed 'comm' structure to 'pipes'
- moved thread fd/maps messages to verbose=2
- fixed leaks during allocation of thread_data structures
- fixed leaks during allocation of thread masks
- fixed possible fails when releasing thread masks

v3: https://lore.kernel.org/lkml/[email protected]/

Changes in v3:
- avoided skipped redundant patch 3/15
- applied "data file" and "data directory" terms allover the patch set
- captured Acked-by: tags by Namhyung Kim
- avoided braces where don't needed
- employed thread local variable for serial trace streaming
- added specs for --thread option - core, socket, numa and user defined
- added parallel loading of data directory files similar to the prototype [1]

v2: https://lore.kernel.org/lkml/[email protected]/

Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
additionally to cites [1], [2]
- updated description of 3/15 to explicitly mention the reason
to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- explicitly elaborated on found issues to be resolved for
threaded AUX trace capture

v1: https://lore.kernel.org/lkml/[email protected]/

Patch set provides parallel threaded trace streaming mode for basic
perf record operation. Provided mode mitigates profiling data losses
and resolves scalability issues of serial and asynchronous (--aio)
trace streaming modes on multicore server systems. The design and
implementation are based on the prototype [1], [2].

Parallel threaded mode executes trace streaming threads that read kernel
data buffers and write captured data into several data files located at
data directory. Layout of trace streaming threads and their mapping to data
buffers to read can be configured using a value of --thread command line
option. Specification value provides masks separated by colon so the masks
define cpus to be monitored by one thread and thread affinity mask is
separated by slash. <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
specifies parallel threads layout that consists of two threads with
corresponding assigned cpus to be monitored. Specification value can be
a string e.g. "cpu", "core" or "socket" meaning creation of data streaming
thread for monitoring every cpu, whole core or socket. The option provided
with no or empty value defaults to "cpu" layout creating data streaming
thread for every cpu being monitored. Specification masks are filtered
by the mask provided via -C option.

Parallel streaming mode is compatible with Zstd compression/decompression
(--compression-level) and external control commands (--control). The mode
is not enabled for pipe mode. The mode is not enabled for AUX area tracing,
related and derived modes like --snapshot or --aux-sample. --switch-output-*
and --timestamp-filename options are not enabled for parallel streaming.
Initial intent to enable AUX area tracing faced the need to define some
optimal way to store index data in data directory. --switch-output-* and
--timestamp-filename use cases are not clear for data directories.
Asynchronous(--aio) trace streaming and affinity (--affinity) modes are
mutually exclusive to parallel streaming mode.

Basic analysis of data directories is provided in perf report mode.
Raw dump and aggregated reports are available for data directories,
still with no memory consumption optimizations.

Tested:

tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -a
tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30

tools/perf/perf report -i prof.data
tools/perf/perf report -i prof.data --call-graph=callee
tools/perf/perf report -i prof.data --stdio --header
tools/perf/perf report -i prof.data -D --header

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Alexey Bayduraev (20):
perf record: Introduce thread affinity and mmap masks
perf record: Introduce thread specific data array
perf record: Introduce thread local variable
perf record: Stop threads in the end of trace streaming
perf record: Start threads in the beginning of trace streaming
perf record: Introduce data file at mmap buffer object
perf record: Introduce data transferred and compressed stats
perf record: Init data file at mmap buffer object
tools lib: Introduce bitmap_intersects() operation
perf record: Introduce --threads=<spec> command line option
perf record: Document parallel data streaming mode
perf report: Output data file name in raw trace dump
perf session: Move reader structure to the top
perf session: Introduce reader_state in reader object
perf session: Introduce reader objects in session object
perf session: Introduce decompressor into trace reader object
perf session: Move init into reader__init function
perf session: Move map/unmap into reader__mmap function
perf session: Load single file for analysis
perf session: Load data directory files for analysis

tools/include/linux/bitmap.h | 11 +
tools/lib/api/fd/array.c | 17 +
tools/lib/api/fd/array.h | 1 +
tools/lib/bitmap.c | 14 +
tools/perf/Documentation/perf-record.txt | 30 +
tools/perf/builtin-inject.c | 3 +-
tools/perf/builtin-record.c | 1082 ++++++++++++++++++++--
tools/perf/util/evlist.c | 16 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/mmap.c | 6 +
tools/perf/util/mmap.h | 6 +
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/record.h | 2 +
tools/perf/util/session.c | 497 +++++++---
tools/perf/util/session.h | 5 +
tools/perf/util/tool.h | 3 +-
16 files changed, 1494 insertions(+), 201 deletions(-)

--
2.19.0


2021-05-26 12:48:40

by Bayduraev, Alexey V

[permalink] [raw]
Subject: [PATCH v6 08/20] perf record: Init data file at mmap buffer object

Initialize data files located at mmap buffer objects so trace data
can be written into several data file located at data directory.

Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 41 ++++++++++++++++++++++++++++++-------
tools/perf/util/record.h | 1 +
2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 19e4f067b20a..e118efe21ba7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -160,6 +160,11 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
"SYS", "NODE", "CPU"
};

+static int record__threads_enabled(struct record *rec)
+{
+ return rec->opts.threads_spec;
+}
+
static bool switch_output_signal(struct record *rec)
{
return rec->switch_output.signal &&
@@ -1070,7 +1075,7 @@ static void record__free_thread_data(struct record *rec)
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
- int ret;
+ int i, ret;
struct record_opts *opts = &rec->opts;
bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
opts->auxtrace_sample_mode;
@@ -1109,6 +1114,18 @@ static int record__mmap_evlist(struct record *rec,
if (ret)
return ret;

+ if (record__threads_enabled(rec)) {
+ ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+ if (ret)
+ return ret;
+ for (i = 0; i < evlist->core.nr_mmaps; i++) {
+ if (evlist->mmap)
+ evlist->mmap[i].file = &rec->data.dir.files[i];
+ if (evlist->overwrite_mmap)
+ evlist->overwrite_mmap[i].file = &rec->data.dir.files[i];
+ }
+ }
+
return 0;
}

@@ -1416,8 +1433,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
/*
* Mark the round finished in case we wrote
* at least one event.
+ *
+ * No need for round events in directory mode,
+ * because per-cpu maps and files have data
+ * sorted by kernel.
*/
- if (bytes_written != rec->bytes_written)
+ if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));

if (overwrite)
@@ -1532,7 +1553,9 @@ static void record__init_features(struct record *rec)
if (!rec->opts.use_clockid)
perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);

- perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__threads_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
if (!record__comp_enabled(rec))
perf_header__clear_feat(&session->header, HEADER_COMPRESSED);

@@ -1543,15 +1566,21 @@ static void
record__finish_output(struct record *rec)
{
struct perf_data *data = &rec->data;
- int fd = perf_data__fd(data);
+ int i, fd = perf_data__fd(data);

if (data->is_pipe)
return;

rec->session->header.data_size += rec->bytes_written;
data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+ if (record__threads_enabled(rec)) {
+ for (i = 0; i < data->dir.nr; i++)
+ data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+ }

if (!rec->no_buildid) {
+ /* this will be recalculated during process_buildids() */
+ rec->samples = 0;
process_buildids(rec);

if (rec->buildid_all)
@@ -2487,8 +2516,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
status = err;

record__synthesize(rec, true);
- /* this will be recalculated during process_buildids() */
- rec->samples = 0;

if (!err) {
if (!rec->timestamp_filename) {
@@ -3275,7 +3302,7 @@ int cmd_record(int argc, const char **argv)
rec->no_buildid = true;
}

- if (rec->opts.kcore)
+ if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

if (rec->opts.comp_level != 0) {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 68f471d9a88b..4d68b7e27272 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -77,6 +77,7 @@ struct record_opts {
int ctl_fd;
int ctl_fd_ack;
bool ctl_fd_close;
+ int threads_spec;
};

extern const char * const *record_usage;
--
2.19.0

2021-05-26 16:55:36

by Bayduraev, Alexey V

[permalink] [raw]
Subject: [PATCH v6 04/20] perf record: Stop threads in the end of trace streaming

Signal thread to terminate by closing write fd of msg pipe.
Receive THREAD_MSG__READY message as the confirmation of the
thread's termination. Stop threads created for parallel trace
streaming prior their stats processing.

Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2027334d70bc..838c1f779849 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -112,6 +112,16 @@ struct thread_data {

static __thread struct thread_data *thread;

+enum thread_msg {
+ THREAD_MSG__UNDEFINED = 0,
+ THREAD_MSG__READY,
+ THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+ "UNDEFINED", "READY"
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -1857,6 +1867,23 @@ static void record__uniquify_name(struct record *rec)
}
}

+static int record__terminate_thread(struct thread_data *thread_data)
+{
+ int res;
+ enum thread_msg ack = THREAD_MSG__UNDEFINED;
+ pid_t tid = thread_data->tid;
+
+ close(thread_data->pipes.msg[1]);
+ res = read(thread_data->pipes.ack[0], &ack, sizeof(ack));
+ if (res != -1)
+ pr_debug2("threads[%d]: sent %s\n", tid, thread_msg_tags[ack]);
+ else
+ pr_err("threads[%d]: failed to recv msg=%s from tid=%d\n",
+ thread->tid, thread_msg_tags[ack], tid);
+
+ return 0;
+}
+
static int record__start_threads(struct record *rec)
{
struct thread_data *thread_data = rec->thread_data;
@@ -1873,6 +1900,9 @@ static int record__stop_threads(struct record *rec, unsigned long *waking)
int t;
struct thread_data *thread_data = rec->thread_data;

+ for (t = 1; t < rec->nr_threads; t++)
+ record__terminate_thread(&thread_data[t]);
+
for (t = 0; t < rec->nr_threads; t++) {
rec->samples += thread_data[t].samples;
*waking += thread_data[t].waking;
--
2.19.0

2021-05-26 16:56:23

by Bayduraev, Alexey V

[permalink] [raw]
Subject: [PATCH v6 10/20] perf record: Introduce --threads=<spec> command line option

Provide --threads option in perf record command line interface.
The option can have a value in the form of masks that specify
cpus to be monitored with data streaming threads and its layout
in system topology. The masks can be filtered using cpu mask
provided via -C option.

The specification value can be user defined list of masks. Masks
separated by colon define cpus to be monitored by one thread and
affinity mask of that thread is separated by slash. For example:
<cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
specifies parallel threads layout that consists of two threads
with corresponding assigned cpus to be monitored.

The specification value can be a string e.g. "cpu", "core" or
"socket" meaning creation of data streaming thread for every
cpu or core or socket to monitor distinct cpus or cpus grouped
by core or socket.

The option provided with no or empty value defaults to per-cpu
parallel threads layout creating data streaming thread for every
cpu being monitored.

Feature design and implementation are based on prototypes [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Suggested-by: Namhyung Kim <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 347 +++++++++++++++++++++++++++++++++++-
tools/perf/util/record.h | 1 +
2 files changed, 346 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index e118efe21ba7..a54d72475629 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -51,6 +51,7 @@
#include "util/evlist-hybrid.h"
#include "asm/bug.h"
#include "perf.h"
+#include "cputopo.h"

#include <errno.h>
#include <inttypes.h>
@@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
"UNDEFINED", "READY"
};

+enum thread_spec {
+ THREAD_SPEC__UNDEFINED = 0,
+ THREAD_SPEC__CPU,
+ THREAD_SPEC__CORE,
+ THREAD_SPEC__SOCKET,
+ THREAD_SPEC__NUMA,
+ THREAD_SPEC__USER,
+ THREAD_SPEC__MAX,
+};
+
+static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
+ "undefined", "cpu", "core", "socket", "numa", "user"
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -2721,6 +2736,70 @@ static void record__thread_mask_free(struct thread_mask *mask)
record__mmap_cpu_mask_free(&mask->affinity);
}

+static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
+ struct thread_mask *src2)
+{
+ if (src1->maps.nbits != src2->maps.nbits ||
+ dest->maps.nbits != src1->maps.nbits ||
+ src1->affinity.nbits != src2->affinity.nbits ||
+ dest->affinity.nbits != src1->affinity.nbits)
+ return -EINVAL;
+
+ bitmap_or(dest->maps.bits, src1->maps.bits,
+ src2->maps.bits, src1->maps.nbits);
+ bitmap_or(dest->affinity.bits, src1->affinity.bits,
+ src2->affinity.bits, src1->affinity.nbits);
+
+ return 0;
+}
+
+static int record__thread_mask_intersects(struct thread_mask *mask_1, struct thread_mask *mask_2)
+{
+ int res1, res2;
+
+ if (mask_1->maps.nbits != mask_2->maps.nbits ||
+ mask_1->affinity.nbits != mask_2->affinity.nbits)
+ return -EINVAL;
+
+ res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
+ mask_1->maps.nbits);
+ res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
+ mask_1->affinity.nbits);
+ if (res1 || res2)
+ return 1;
+
+ return 0;
+}
+
+static int record__parse_threads(const struct option *opt, const char *str, int unset)
+{
+ int s;
+ struct record_opts *opts = opt->value;
+
+ if (unset || !str || !strlen(str)) {
+ opts->threads_spec = THREAD_SPEC__CPU;
+ } else {
+ for (s = 1; s < THREAD_SPEC__MAX; s++) {
+ if (s == THREAD_SPEC__USER) {
+ opts->threads_user_spec = strdup(str);
+ opts->threads_spec = THREAD_SPEC__USER;
+ break;
+ }
+ if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
+ opts->threads_spec = s;
+ break;
+ }
+ }
+ }
+
+ pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
+ if (opts->threads_spec == THREAD_SPEC__USER)
+ pr_debug("=[%s]", opts->threads_user_spec);
+ pr_debug("\n");
+
+ return 0;
+}
+
static int parse_output_max_size(const struct option *opt,
const char *str, int unset)
{
@@ -3164,6 +3243,9 @@ static struct option __record_options[] = {
"\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
"\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
parse_control_option),
+ OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
+ "write collected trace data into several data files using parallel threads",
+ record__parse_threads),
OPT_END()
};

@@ -3177,6 +3259,17 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
set_bit(cpus->map[c], mask->bits);
}

+static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char *mask_spec)
+{
+ struct perf_cpu_map *cpus;
+
+ cpus = perf_cpu_map__new(mask_spec);
+ if (cpus) {
+ record__mmap_cpu_mask_init(mask, cpus);
+ perf_cpu_map__put(cpus);
+ }
+}
+
static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
{
int t, ret;
@@ -3196,6 +3289,229 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr

return 0;
}
+
+static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
+
+ ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
+ if (ret)
+ return ret;
+
+ rec->nr_threads = nr_cpus;
+ pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
+ pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus->map[t]);
+ set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
+ pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus->map[t]);
+ }
+
+ return 0;
+}
+
+static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
+ char **maps_spec, char **affinity_spec, u32 nr_spec)
+{
+ u32 s;
+ int ret = 0, nr_threads = 0;
+ struct mmap_cpu_mask cpus_mask;
+ struct thread_mask thread_mask, full_mask, *prev_masks;
+
+ ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
+ if (ret)
+ goto out;
+ record__mmap_cpu_mask_init(&cpus_mask, cpus);
+ ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_cpu_mask;
+ ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_thread_mask;
+ record__thread_mask_clear(&full_mask);
+
+ for (s = 0; s < nr_spec; s++) {
+ record__thread_mask_clear(&thread_mask);
+
+ record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
+ record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
+
+ if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
+ cpus_mask.bits, thread_mask.maps.nbits) ||
+ !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
+ cpus_mask.bits, thread_mask.affinity.nbits))
+ continue;
+
+ ret = record__thread_mask_intersects(&thread_mask, &full_mask);
+ if (ret)
+ goto out_free_full_mask;
+ record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
+
+ prev_masks = rec->thread_masks;
+ rec->thread_masks = realloc(rec->thread_masks,
+ (nr_threads + 1) * sizeof(struct thread_mask));
+ if (!rec->thread_masks) {
+ pr_err("Failed to allocate thread masks\n");
+ rec->thread_masks = prev_masks;
+ ret = -ENOMEM;
+ goto out_free_full_mask;
+ }
+ rec->thread_masks[nr_threads] = thread_mask;
+ if (verbose) {
+ pr_debug("thread_masks[%d]: addr=", nr_threads);
+ mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
+ pr_debug("thread_masks[%d]: addr=", nr_threads);
+ mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity,
+ "affinity");
+ }
+ nr_threads++;
+ ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_full_mask;
+ }
+
+ rec->nr_threads = nr_threads;
+ pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
+
+out_free_full_mask:
+ record__thread_mask_free(&full_mask);
+out_free_thread_mask:
+ record__thread_mask_free(&thread_mask);
+out_free_cpu_mask:
+ record__mmap_cpu_mask_free(&cpus_mask);
+out:
+ return ret;
+}
+
+static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int ret;
+ struct cpu_topology *topo;
+
+ topo = cpu_topology__new();
+ if (!topo)
+ return -EINVAL;
+
+ ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
+ topo->thread_siblings, topo->thread_sib);
+ cpu_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_socket_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int ret;
+ struct cpu_topology *topo;
+
+ topo = cpu_topology__new();
+ if (!topo)
+ return -EINVAL;
+
+ ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
+ topo->core_siblings, topo->core_sib);
+ cpu_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ u32 s;
+ int ret;
+ char **spec;
+ struct numa_topology *topo;
+
+ topo = numa_topology__new();
+ if (!topo)
+ return -EINVAL;
+ spec = zalloc(topo->nr * sizeof(char *));
+ if (!spec) {
+ ret = -ENOMEM;
+ goto out_delete_topo;
+ }
+ for (s = 0; s < topo->nr; s++)
+ spec[s] = topo->nodes[s].cpus;
+
+ ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
+
+ zfree(&spec);
+
+out_delete_topo:
+ numa_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int t, ret;
+ u32 s, nr_spec = 0;
+ char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
+ char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
+
+ for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
+ spec = strtok_r(user_spec, ":", &spec_ptr);
+ if (spec == NULL)
+ break;
+ pr_debug(" spec[%d]: %s\n", t, spec);
+ mask = strtok_r(spec, "/", &mask_ptr);
+ if (mask == NULL)
+ break;
+ pr_debug(" maps mask: %s\n", mask);
+ prev_spec = maps_spec;
+ maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
+ if (!maps_spec) {
+ pr_err("Failed to realloc maps_spec\n");
+ maps_spec = prev_spec;
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ maps_spec[nr_spec] = strdup(mask);
+ if (!maps_spec[nr_spec]) {
+ pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ mask = strtok_r(NULL, "/", &mask_ptr);
+ if (mask == NULL)
+ break;
+ pr_debug(" affinity mask: %s\n", mask);
+ prev_spec = affinity_spec;
+ affinity_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
+ if (!affinity_spec) {
+ pr_err("Failed to realloc affinity_spec\n");
+ affinity_spec = prev_spec;
+ free(maps_spec[nr_spec]);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ affinity_spec[nr_spec] = strdup(mask);
+ if (!affinity_spec[nr_spec]) {
+ pr_err("Failed to alloc affinity_spec[%d]\n", nr_spec);
+ free(maps_spec[nr_spec]);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ nr_spec++;
+ }
+
+ ret = record__init_thread_masks_spec(rec, cpus, maps_spec, affinity_spec, nr_spec);
+
+out_free_all_specs:
+ for (s = 0; s < nr_spec; s++) {
+ if (maps_spec)
+ free(maps_spec[s]);
+ if (affinity_spec)
+ free(affinity_spec[s]);
+ }
+ free(affinity_spec);
+ free(maps_spec);
+
+ return ret;
+}
+
static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
{
int ret;
@@ -3213,9 +3529,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu

static int record__init_thread_masks(struct record *rec)
{
+ int ret = 0;
struct perf_cpu_map *cpus = rec->evlist->core.cpus;

- return record__init_thread_default_masks(rec, cpus);
+ if (!record__threads_enabled(rec))
+ return record__init_thread_default_masks(rec, cpus);
+
+ switch (rec->opts.threads_spec) {
+ case THREAD_SPEC__CPU:
+ ret = record__init_thread_cpu_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__CORE:
+ ret = record__init_thread_core_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__SOCKET:
+ ret = record__init_thread_socket_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__NUMA:
+ ret = record__init_thread_numa_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__USER:
+ ret = record__init_thread_user_masks(rec, cpus);
+ break;
+ default:
+ break;
+ }
+
+ return ret;
}

static int record__fini_thread_masks(struct record *rec)
@@ -3466,7 +3806,10 @@ int cmd_record(int argc, const char **argv)

err = record__init_thread_masks(rec);
if (err) {
- pr_err("record__init_thread_masks failed, error %d\n", err);
+ if (err > 0)
+ pr_err("ERROR: parallel data streaming masks (--threads) intersect.\n");
+ else
+ pr_err("record__init_thread_masks failed, error %d\n", err);
goto out;
}

diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 4d68b7e27272..3da156498f47 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -78,6 +78,7 @@ struct record_opts {
int ctl_fd_ack;
bool ctl_fd_close;
int threads_spec;
+ const char *threads_user_spec;
};

extern const char * const *record_usage;
--
2.19.0

2021-05-26 16:57:44

by Bayduraev, Alexey V

[permalink] [raw]
Subject: [PATCH v6 12/20] perf report: Output data file name in raw trace dump

Print path and name of a data file into raw dump (-D)
<file_offset>@<path/file>. Print offset of PERF_RECORD_COMPRESSED
record instead of zero for decompressed records:
[email protected] [0x30]: event: 9
or
[email protected]/data.7 [0x30]: event: 9

Acked-by: Namhyung Kim <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-inject.c | 3 +-
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/session.c | 77 +++++++++++++++++++-------------
tools/perf/util/session.h | 1 +
tools/perf/util/tool.h | 3 +-
5 files changed, 52 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 102cafb0c0b3..404eb9b88f6e 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -109,7 +109,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,

static int perf_event__repipe_op4_synth(struct perf_session *session,
union perf_event *event,
- u64 data __maybe_unused)
+ u64 data __maybe_unused,
+ const char *str __maybe_unused)
{
return perf_event__repipe_synth(session->tool, event);
}
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 75345946c4b9..42c9764c6b5b 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -9,6 +9,7 @@ struct perf_sample;
struct ordered_event {
u64 timestamp;
u64 file_offset;
+ const char *file_path;
union perf_event *event;
struct list_head list;
};
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 0dbb4f2628f3..09efd7c02400 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -38,7 +38,8 @@

#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
void *src;
size_t decomp_size, src_size;
@@ -60,6 +61,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
}

decomp->file_pos = file_offset;
+ decomp->file_path = file_path;
decomp->mmap_len = mmap_len;
decomp->head = 0;

@@ -100,7 +102,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

static int perf_session__open(struct perf_session *session)
{
@@ -182,7 +185,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
ordered_events);

return perf_session__deliver_event(session, event->event,
- session->tool, event->file_offset);
+ session->tool, event->file_offset,
+ event->file_path);
}

struct perf_session *perf_session__new(struct perf_data *data,
@@ -464,7 +468,8 @@ static int process_event_time_conv_stub(struct perf_session *perf_session __mayb

static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
union perf_event *event __maybe_unused,
- u64 file_offset __maybe_unused)
+ u64 file_offset __maybe_unused,
+ const char *file_path __maybe_unused)
{
dump_printf(": unhandled!\n");
return 0;
@@ -1267,13 +1272,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
}

static void dump_event(struct evlist *evlist, union perf_event *event,
- u64 file_offset, struct perf_sample *sample)
+ u64 file_offset, struct perf_sample *sample,
+ const char *file_path)
{
if (!dump_trace)
return;

- printf("\n%#" PRIx64 " [%#x]: event: %d\n",
- file_offset, event->header.size, event->header.type);
+ printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+ file_offset, file_path, event->header.size, event->header.type);

trace_event(event);
if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1476,12 +1482,13 @@ static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
- struct perf_tool *tool, u64 file_offset)
+ struct perf_tool *tool, u64 file_offset,
+ const char *file_path)
{
struct evsel *evsel;
struct machine *machine;

- dump_event(evlist, event, file_offset, sample);
+ dump_event(evlist, event, file_offset, sample, file_path);

evsel = evlist__id2evsel(evlist, sample->id);

@@ -1558,7 +1565,8 @@ static int machines__deliver_event(struct machines *machines,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct perf_sample sample;
int ret = evlist__parse_sample(session->evlist, event, &sample);
@@ -1575,7 +1583,7 @@ static int perf_session__deliver_event(struct perf_session *session,
return 0;

ret = machines__deliver_event(&session->machines, session->evlist,
- event, &sample, tool, file_offset);
+ event, &sample, tool, file_offset, file_path);

if (dump_trace && sample.aux_sample.size)
auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1585,7 +1593,8 @@ static int perf_session__deliver_event(struct perf_session *session,

static s64 perf_session__process_user_event(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -1595,7 +1604,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,

if (event->header.type != PERF_RECORD_COMPRESSED ||
tool->compressed == perf_session__process_compressed_event_stub)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);

/* These events are processed right away */
switch (event->header.type) {
@@ -1654,9 +1663,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
case PERF_RECORD_COMPRESSED:
- err = tool->compressed(session, event, file_offset);
+ err = tool->compressed(session, event, file_offset, file_path);
if (err)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);
return err;
default:
return -EINVAL;
@@ -1673,9 +1682,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, 0);
+ return perf_session__process_user_event(session, event, 0, NULL);

- return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+ return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
}

static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1771,7 +1780,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
}

static s64 perf_session__process_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
struct evlist *evlist = session->evlist;
struct perf_tool *tool = session->tool;
@@ -1786,7 +1796,7 @@ static s64 perf_session__process_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, file_offset);
+ return perf_session__process_user_event(session, event, file_offset, file_path);

if (tool->ordered_events) {
u64 timestamp = -1ULL;
@@ -1800,7 +1810,7 @@ static s64 perf_session__process_event(struct perf_session *session,
return ret;
}

- return perf_session__deliver_event(session, event, tool, file_offset);
+ return perf_session__deliver_event(session, event, tool, file_offset, file_path);
}

void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -2020,7 +2030,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
}
}

- if ((skip = perf_session__process_event(session, event, head)) < 0) {
+ skip = perf_session__process_event(session, event, head, "pipe");
+ if (skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
head, event->header.size, event->header.type);
err = -EINVAL;
@@ -2101,7 +2112,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
static int __perf_session__process_decomp_events(struct perf_session *session)
{
s64 skip;
- u64 size, file_pos = 0;
+ u64 size;
struct decomp *decomp = session->decomp_last;

if (!decomp)
@@ -2115,9 +2126,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
break;

size = event->header.size;
-
- if (size < sizeof(struct perf_event_header) ||
- (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ skip = perf_session__process_event(session, event, decomp->file_pos,
+ decomp->file_path);
+ if (size < sizeof(struct perf_event_header) || skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
decomp->file_pos + decomp->head, event->header.size, event->header.type);
return -EINVAL;
@@ -2148,10 +2159,12 @@ struct reader;

typedef s64 (*reader_cb_t)(struct perf_session *session,
union perf_event *event,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

struct reader {
int fd;
+ const char *path;
u64 data_size;
u64 data_offset;
reader_cb_t process;
@@ -2233,9 +2246,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, file_pos)) < 0) {
- pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
- file_offset + head, event->header.size,
+ (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+ pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
+ file_offset + head, rd->path, event->header.size,
event->header.type, strerror(-skip));
err = skip;
goto out;
@@ -2265,9 +2278,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,

static s64 process_simple(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
- return perf_session__process_event(session, event, file_offset);
+ return perf_session__process_event(session, event, file_offset, file_path);
}

static int __perf_session__process_events(struct perf_session *session)
@@ -2277,6 +2291,7 @@ static int __perf_session__process_events(struct perf_session *session)
.data_size = session->header.data_size,
.data_offset = session->header.data_offset,
.process = process_simple,
+ .path = session->data->file.path,
.in_place_update = session->data->in_place_update,
};
struct ordered_events *oe = &session->ordered_events;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index e31ba4c92a6c..6895a22ab5b7 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -46,6 +46,7 @@ struct perf_session {
struct decomp {
struct decomp *next;
u64 file_pos;
+ const char *file_path;
size_t mmap_len;
u64 head;
size_t size;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index bbbc0dcd461f..c966531d3eca 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,

typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+ const char *str);

typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
--
2.19.0

2021-06-03 23:25:22

by Riccardo Mancini

[permalink] [raw]
Subject: Re: [PATCH v6 10/20] perf record: Introduce --threads=<spec> command line option

Hi,

the parameter provided to --thread is not checked. If an invalid value is
inserted, it will crash.

On Wed, 2021-05-26 at 13:53 +0300, Alexey Bayduraev wrote:
> Provide --threads option in perf record command line interface.
> The option can have a value in the form of masks that specify
> cpus to be monitored with data streaming threads and its layout
> in system topology. The masks can be filtered using cpu mask
> provided via -C option.
>
> The specification value can be user defined list of masks. Masks
> separated by colon define cpus to be monitored by one thread and
> affinity mask of that thread is separated by slash. For example:
> <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
> specifies parallel threads layout that consists of two threads
> with corresponding assigned cpus to be monitored.
>
> The specification value can be a string e.g. "cpu", "core" or
> "socket" meaning creation of data streaming thread for every
> cpu or core or socket to monitor distinct cpus or cpus grouped
> by core or socket.
>
> The option provided with no or empty value defaults to per-cpu
> parallel threads layout creating data streaming thread for every
> cpu being monitored.
>
> Feature design and implementation are based on prototypes [1], [2].
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -
> b perf/record_threads
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Jiri Olsa <[email protected]>
> Suggested-by: Namhyung Kim <[email protected]>
> Acked-by: Andi Kleen <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
>  tools/perf/builtin-record.c | 347 +++++++++++++++++++++++++++++++++++-
>  tools/perf/util/record.h    |   1 +
>  2 files changed, 346 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index e118efe21ba7..a54d72475629 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -51,6 +51,7 @@
>  #include "util/evlist-hybrid.h"
>  #include "asm/bug.h"
>  #include "perf.h"
> +#include "cputopo.h"
>  
>  #include <errno.h>
>  #include <inttypes.h>
> @@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
>         "UNDEFINED", "READY"
>  };
>  
> +enum thread_spec {
> +       THREAD_SPEC__UNDEFINED = 0,
> +       THREAD_SPEC__CPU,
> +       THREAD_SPEC__CORE,
> +       THREAD_SPEC__SOCKET,
> +       THREAD_SPEC__NUMA,
> +       THREAD_SPEC__USER,
> +       THREAD_SPEC__MAX,
> +};
> +
> +static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
> +       "undefined", "cpu", "core", "socket", "numa", "user"
> +};
> +
>  struct record {
>         struct perf_tool        tool;
>         struct record_opts      opts;
> @@ -2721,6 +2736,70 @@ static void record__thread_mask_free(struct thread_mask
> *mask)
>         record__mmap_cpu_mask_free(&mask->affinity);
>  }
>  
> +static int record__thread_mask_or(struct thread_mask *dest, struct
> thread_mask *src1,
> +                                  struct thread_mask *src2)
> +{
> +       if (src1->maps.nbits != src2->maps.nbits ||
> +           dest->maps.nbits != src1->maps.nbits ||
> +           src1->affinity.nbits != src2->affinity.nbits ||
> +           dest->affinity.nbits != src1->affinity.nbits)
> +               return -EINVAL;
> +
> +       bitmap_or(dest->maps.bits, src1->maps.bits,
> +                 src2->maps.bits, src1->maps.nbits);
> +       bitmap_or(dest->affinity.bits, src1->affinity.bits,
> +                 src2->affinity.bits, src1->affinity.nbits);
> +
> +       return 0;
> +}
> +
> +static int record__thread_mask_intersects(struct thread_mask *mask_1, struct
> thread_mask *mask_2)
> +{
> +       int res1, res2;
> +
> +       if (mask_1->maps.nbits != mask_2->maps.nbits ||
> +           mask_1->affinity.nbits != mask_2->affinity.nbits)
> +               return -EINVAL;
> +
> +       res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
> +                                mask_1->maps.nbits);
> +       res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
> +                                mask_1->affinity.nbits);
> +       if (res1 || res2)
> +               return 1;
> +
> +       return 0;
> +}
> +
> +static int record__parse_threads(const struct option *opt, const char *str,
> int unset)
> +{
> +       int s;
> +       struct record_opts *opts = opt->value;
> +
> +       if (unset || !str || !strlen(str)) {
> +               opts->threads_spec = THREAD_SPEC__CPU;
> +       } else {
> +               for (s = 1; s < THREAD_SPEC__MAX; s++) {
> +                       if (s == THREAD_SPEC__USER) {
> +                               opts->threads_user_spec = strdup(str);
> +                               opts->threads_spec = THREAD_SPEC__USER;
> +                               break;
> +                       }
> +                       if (!strncasecmp(str, thread_spec_tags[s],
> strlen(thread_spec_tags[s]))) {
> +                               opts->threads_spec = s;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
> +       if (opts->threads_spec == THREAD_SPEC__USER)
> +               pr_debug("=[%s]", opts->threads_user_spec);
> +       pr_debug("\n");
> +
> +       return 0;
> +}
> +
>  static int parse_output_max_size(const struct option *opt,
>                                  const char *str, int unset)
>  {
> @@ -3164,6 +3243,9 @@ static struct option __record_options[] = {
>                      "\t\t\t  Optionally send control command completion
> ('ack\\n') to ack-fd descriptor.\n"
>                      "\t\t\t  Alternatively, ctl-fifo / ack-fifo will be
> opened and used as ctl-fd / ack-fd.",
>                       parse_control_option),
> +       OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> +                           "write collected trace data into several data
> files using parallel threads",
> +                           record__parse_threads),
>         OPT_END()
>  };
>  
> @@ -3177,6 +3259,17 @@ static void record__mmap_cpu_mask_init(struct
> mmap_cpu_mask *mask, struct perf_c
>                 set_bit(cpus->map[c], mask->bits);
>  }
>  
> +static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char
> *mask_spec)
> +{
> +       struct perf_cpu_map *cpus;
> +
> +       cpus = perf_cpu_map__new(mask_spec);
> +       if (cpus) {
> +               record__mmap_cpu_mask_init(mask, cpus);
> +               perf_cpu_map__put(cpus);
> +       }
> +}
> +
>  static int record__alloc_thread_masks(struct record *rec, int nr_threads, int
> nr_bits)
>  {
>         int t, ret;
> @@ -3196,6 +3289,229 @@ static int record__alloc_thread_masks(struct record
> *rec, int nr_threads, int nr
>  
>         return 0;
>  }
> +
> +static int record__init_thread_cpu_masks(struct record *rec, struct
> perf_cpu_map *cpus)
> +{
> +       int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
> +
> +       ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
> +       if (ret)
> +               return ret;
> +
> +       rec->nr_threads = nr_cpus;
> +       pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> +       for (t = 0; t < rec->nr_threads; t++) {
> +               set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
> +               pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus-
> >map[t]);
> +               set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
> +               pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus-
> >map[t]);
> +       }
> +
> +       return 0;
> +}
> +
> +static int record__init_thread_masks_spec(struct record *rec, struct
> perf_cpu_map *cpus,
> +                                         char **maps_spec, char
> **affinity_spec, u32 nr_spec)
> +{
> +       u32 s;
> +       int ret = 0, nr_threads = 0;
> +       struct mmap_cpu_mask cpus_mask;
> +       struct thread_mask thread_mask, full_mask, *prev_masks;
> +
> +       ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
> +       if (ret)
> +               goto out;
> +       record__mmap_cpu_mask_init(&cpus_mask, cpus);
> +       ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +       if (ret)
> +               goto out_free_cpu_mask;
> +       ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
> +       if (ret)
> +               goto out_free_thread_mask;
> +       record__thread_mask_clear(&full_mask);
> +
> +       for (s = 0; s < nr_spec; s++) {
> +               record__thread_mask_clear(&thread_mask);
> +
> +               record__mmap_cpu_mask_init_spec(&thread_mask.maps,
> maps_spec[s]);
> +               record__mmap_cpu_mask_init_spec(&thread_mask.affinity,
> affinity_spec[s]);
> +
> +               if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
> +                               cpus_mask.bits, thread_mask.maps.nbits) ||
> +                   !bitmap_and(thread_mask.affinity.bits,
> thread_mask.affinity.bits,
> +                               cpus_mask.bits, thread_mask.affinity.nbits))
> +                       continue;
> +
> +               ret = record__thread_mask_intersects(&thread_mask,
> &full_mask);
> +               if (ret)
> +                       goto out_free_full_mask;
> +               record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
> +
> +               prev_masks = rec->thread_masks;
> +               rec->thread_masks = realloc(rec->thread_masks,
> +                                           (nr_threads + 1) * sizeof(struct
> thread_mask));
> +               if (!rec->thread_masks) {
> +                       pr_err("Failed to allocate thread masks\n");
> +                       rec->thread_masks = prev_masks;
> +                       ret = -ENOMEM;
> +                       goto out_free_full_mask;
> +               }
> +               rec->thread_masks[nr_threads] = thread_mask;
> +               if (verbose) {
> +                       pr_debug("thread_masks[%d]: addr=", nr_threads);
> +                       mmap_cpu_mask__scnprintf(&rec-
> >thread_masks[nr_threads].maps, "maps");
> +                       pr_debug("thread_masks[%d]: addr=", nr_threads);
> +                       mmap_cpu_mask__scnprintf(&rec-
> >thread_masks[nr_threads].affinity,
> +                                                "affinity");
> +               }
> +               nr_threads++;
> +               ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +               if (ret)
> +                       goto out_free_full_mask;
> +       }
> +

Here it can be checked if nr_threads is > 0. Otherwise, print an error message
epxlaining the reason and return an error.
nr_threads can be 0 if the user inserts out-of-range CPUs in the spec, e.g. 
"--threads=15/17" on a 4 CPU machine.

> +       rec->nr_threads = nr_threads;
> +       pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> +out_free_full_mask:
> +       record__thread_mask_free(&full_mask);
> +out_free_thread_mask:
> +       record__thread_mask_free(&thread_mask);
> +out_free_cpu_mask:
> +       record__mmap_cpu_mask_free(&cpus_mask);
> +out:
> +       return ret;
> +}
> +
> +static int record__init_thread_core_masks(struct record *rec, struct
> perf_cpu_map *cpus)
> +{
> +       int ret;
> +       struct cpu_topology *topo;
> +
> +       topo = cpu_topology__new();
> +       if (!topo)
> +               return -EINVAL;
> +
> +       ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
> +                                            topo->thread_siblings, topo-
> >thread_sib);
> +       cpu_topology__delete(topo);
> +
> +       return ret;
> +}
> +
> +static int record__init_thread_socket_masks(struct record *rec, struct
> perf_cpu_map *cpus)
> +{
> +       int ret;
> +       struct cpu_topology *topo;
> +
> +       topo = cpu_topology__new();
> +       if (!topo)
> +               return -EINVAL;
> +
> +       ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
> +                                            topo->core_siblings, topo-
> >core_sib);
> +       cpu_topology__delete(topo);
> +
> +       return ret;
> +}
> +
> +static int record__init_thread_numa_masks(struct record *rec, struct
> perf_cpu_map *cpus)
> +{
> +       u32 s;
> +       int ret;
> +       char **spec;
> +       struct numa_topology *topo;
> +
> +       topo = numa_topology__new();
> +       if (!topo)
> +               return -EINVAL;
> +       spec = zalloc(topo->nr * sizeof(char *));
> +       if (!spec) {
> +               ret = -ENOMEM;
> +               goto out_delete_topo;
> +       }
> +       for (s = 0; s < topo->nr; s++)
> +               spec[s] = topo->nodes[s].cpus;
> +
> +       ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
> +
> +       zfree(&spec);
> +
> +out_delete_topo:
> +       numa_topology__delete(topo);
> +
> +       return ret;
> +}
> +
> +static int record__init_thread_user_masks(struct record *rec, struct
> perf_cpu_map *cpus)
> +{
> +       int t, ret;
> +       u32 s, nr_spec = 0;
> +       char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
> +       char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
> +
> +       for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++,
> user_spec = NULL) {
> +               spec = strtok_r(user_spec, ":", &spec_ptr);
> +               if (spec == NULL)
> +                       break;
> +               pr_debug(" spec[%d]: %s\n", t, spec);
> +               mask = strtok_r(spec, "/", &mask_ptr);
> +               if (mask == NULL)
> +                       break;
> +               pr_debug("  maps mask: %s\n", mask);
> +               prev_spec = maps_spec;
> +               maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char
> *));
> +               if (!maps_spec) {
> +                       pr_err("Failed to realloc maps_spec\n");
> +                       maps_spec = prev_spec;
> +                       ret = -ENOMEM;
> +                       goto out_free_all_specs;
> +               }
> +               maps_spec[nr_spec] = strdup(mask);
> +               if (!maps_spec[nr_spec]) {
> +                       pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
> +                       ret = -ENOMEM;
> +                       goto out_free_all_specs;
> +               }
> +               mask = strtok_r(NULL, "/", &mask_ptr);
> +               if (mask == NULL)

I think this should be a parse failure and return an error, not just skip to the
next one. 
Furthermore, maps_spec[nr_spec] should be freed before breaking/exiting.

Thanks,
Riccardo

> +                       break;
> +               pr_debug("  affinity mask: %s\n", mask);
> +               prev_spec = affinity_spec;
> +               affinity_spec = realloc(affinity_spec, (nr_spec + 1) *
> sizeof(char *));
> +               if (!affinity_spec) {
> +                       pr_err("Failed to realloc affinity_spec\n");
> +                       affinity_spec = prev_spec;
> +                       free(maps_spec[nr_spec]);
> +                       ret = -ENOMEM;
> +                       goto out_free_all_specs;
> +               }
> +               affinity_spec[nr_spec] = strdup(mask);
> +               if (!affinity_spec[nr_spec]) {
> +                       pr_err("Failed to alloc affinity_spec[%d]\n",
> nr_spec);
> +                       free(maps_spec[nr_spec]);
> +                       ret = -ENOMEM;
> +                       goto out_free_all_specs;
> +               }
> +               nr_spec++;
> +       }
> +
> +       ret = record__init_thread_masks_spec(rec, cpus, maps_spec,
> affinity_spec, nr_spec);
> +
> +out_free_all_specs:
> +       for (s = 0; s < nr_spec; s++) {
> +               if (maps_spec)
> +                       free(maps_spec[s]);
> +               if (affinity_spec)
> +                       free(affinity_spec[s]);
> +       }
> +       free(affinity_spec);
> +       free(maps_spec);
> +
> +       return ret;
> +}
> +
>  static int record__init_thread_default_masks(struct record *rec, struct
> perf_cpu_map *cpus)
>  {
>         int ret;
> @@ -3213,9 +3529,33 @@ static int record__init_thread_default_masks(struct
> record *rec, struct perf_cpu
>  
>  static int record__init_thread_masks(struct record *rec)
>  {
> +       int ret = 0;
>         struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>  
> -       return record__init_thread_default_masks(rec, cpus);
> +       if (!record__threads_enabled(rec))
> +               return record__init_thread_default_masks(rec, cpus);
> +
> +       switch (rec->opts.threads_spec) {
> +       case THREAD_SPEC__CPU:
> +               ret = record__init_thread_cpu_masks(rec, cpus);
> +               break;
> +       case THREAD_SPEC__CORE:
> +               ret = record__init_thread_core_masks(rec, cpus);
> +               break;
> +       case THREAD_SPEC__SOCKET:
> +               ret = record__init_thread_socket_masks(rec, cpus);
> +               break;
> +       case THREAD_SPEC__NUMA:
> +               ret = record__init_thread_numa_masks(rec, cpus);
> +               break;
> +       case THREAD_SPEC__USER:
> +               ret = record__init_thread_user_masks(rec, cpus);
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return ret;
>  }
>  
>  static int record__fini_thread_masks(struct record *rec)
> @@ -3466,7 +3806,10 @@ int cmd_record(int argc, const char **argv)
>  
>         err = record__init_thread_masks(rec);
>         if (err) {
> -               pr_err("record__init_thread_masks failed, error %d\n", err);
> +               if (err > 0)
> +                       pr_err("ERROR: parallel data streaming masks (--
> threads) intersect.\n");
> +               else
> +                       pr_err("record__init_thread_masks failed, error %d\n",
> err);
>                 goto out;
>         }
>  
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index 4d68b7e27272..3da156498f47 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -78,6 +78,7 @@ struct record_opts {
>         int           ctl_fd_ack;
>         bool          ctl_fd_close;
>         int           threads_spec;
> +       const char    *threads_user_spec;
>  };
>  
>  extern const char * const *record_usage;