2021-06-30 15:56:55

by Bayduraev, Alexey V

Subject: [PATCH v8 00/22] Introduce threaded trace streaming for basic perf record operation

Changes in v8:
- captured Acked-by: tags by Namhyung Kim
- merged with origin/perf/core
- added patch 21/22 introducing READER_NODATA state
- added patch 22/22 fixing --max-size option

v7: https://lore.kernel.org/lkml/[email protected]/

Changes in v7:
- fixed possible crash after out_free_threads label
- added missing pthread_attr_destroy() call
- added check of correctness of user masks
- fixed zsts_data finalization

v6: https://lore.kernel.org/lkml/[email protected]/

Changes in v6:
- fixed leaks and possible double free in record__thread_mask_alloc()
- fixed leaks in record__init_thread_user_masks()
- fixed final mmaps flushing for threads id > 0
- merged with origin/perf/core

v5: https://lore.kernel.org/lkml/[email protected]/

Changes in v5:
- fixed leaks in record__init_thread_masks_spec()
- fixed leaks after failed realloc
- replaced "%m" with strerror()
- added masks examples to the documentation
- captured Acked-by: tags by Andi Kleen
- do not allow --thread option for full_auxtrace mode
- split patch 06/12 to 06/20 and 07/20
- split patch 08/12 to 09/20 and 10/20
- split patches 11/12 and 12/12 to 13/20-20/20

v4: https://lore.kernel.org/lkml/[email protected]/

Changes in v4:
- renamed 'comm' structure to 'pipes'
- moved thread fd/maps messages to verbose=2
- fixed leaks during allocation of thread_data structures
- fixed leaks during allocation of thread masks
- fixed possible fails when releasing thread masks

v3: https://lore.kernel.org/lkml/[email protected]/

Changes in v3:
- skipped redundant patch 3/15
- applied "data file" and "data directory" terms throughout the patch set
- captured Acked-by: tags by Namhyung Kim
- avoided braces where not needed
- employed thread local variable for serial trace streaming
- added specs for --thread option - core, socket, numa and user defined
- added parallel loading of data directory files similar to the prototype [1]

v2: https://lore.kernel.org/lkml/[email protected]/

Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
in addition to the citations [1], [2]
- updated description of 3/15 to explicitly mention the reason
to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- explicitly elaborated on found issues to be resolved for
threaded AUX trace capture

v1: https://lore.kernel.org/lkml/[email protected]/

This patch set provides a parallel threaded trace streaming mode for the
basic perf record operation. The mode mitigates profiling data losses and
resolves scalability issues of the serial and asynchronous (--aio) trace
streaming modes on multicore server systems. The design and implementation
are based on the prototype [1], [2].

Parallel threaded mode executes trace streaming threads that read kernel
data buffers and write the captured data into several data files located
in a data directory. The layout of the trace streaming threads, and their
mapping to the data buffers to read, can be configured via the value of
the --threads command line option. The specification value provides masks
separated by colons; each mask defines the cpus to be monitored by one
thread, and the thread affinity mask is separated from it by a slash. So
<cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
specifies a parallel thread layout consisting of two threads with the
corresponding cpus assigned to be monitored. The specification value can
also be a string, e.g. "cpu", "core" or "socket", meaning creation of a
data streaming thread for every monitored cpu, whole core or socket. The
option provided with no value or an empty value defaults to the "cpu"
layout, creating a data streaming thread for every cpu being monitored.
Specification masks are filtered by the mask provided via the -C option.

Parallel streaming mode is compatible with Zstd compression/decompression
(--compression-level) and external control commands (--control). The mode
is not enabled for pipe mode, nor for AUX area tracing and its related and
derived modes like --snapshot or --aux-sample. The --switch-output-* and
--timestamp-filename options are not enabled for parallel streaming.
The initial intent to enable AUX area tracing ran into the need to define
an optimal way to store index data in the data directory, and the
--switch-output-* and --timestamp-filename use cases are not yet clear for
data directories. Asynchronous (--aio) trace streaming and affinity
(--affinity) modes are mutually exclusive with parallel streaming mode.

Basic analysis of data directories is provided in perf report mode.
Raw dump and aggregated reports are available for data directories,
though still without memory consumption optimizations.

Tested:

tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -a
tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30

tools/perf/perf report -i prof.data
tools/perf/perf report -i prof.data --call-graph=callee
tools/perf/perf report -i prof.data --stdio --header
tools/perf/perf report -i prof.data -D --header

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Alexey Bayduraev (22):
perf record: Introduce thread affinity and mmap masks
perf record: Introduce thread specific data array
perf record: Introduce thread local variable
perf record: Stop threads in the end of trace streaming
perf record: Start threads in the beginning of trace streaming
perf record: Introduce data file at mmap buffer object
perf record: Introduce data transferred and compressed stats
perf record: Init data file at mmap buffer object
tools lib: Introduce bitmap_intersects() operation
perf record: Introduce --threads=<spec> command line option
perf record: Document parallel data streaming mode
perf report: Output data file name in raw trace dump
perf session: Move reader structure to the top
perf session: Introduce reader_state in reader object
perf session: Introduce reader objects in session object
perf session: Introduce decompressor into trace reader object
perf session: Move init into reader__init function
perf session: Move map/unmap into reader__mmap function
perf session: Load single file for analysis
perf session: Load data directory files for analysis
perf session: Introduce READER_NODATA state
perf record: Introduce record__bytes_written and fix --max-size option

tools/include/linux/bitmap.h | 11 +
tools/lib/api/fd/array.c | 17 +
tools/lib/api/fd/array.h | 1 +
tools/lib/bitmap.c | 14 +
tools/perf/Documentation/perf-record.txt | 30 +
tools/perf/builtin-inject.c | 3 +-
tools/perf/builtin-record.c | 1119 ++++++++++++++++++++--
tools/perf/util/evlist.c | 16 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/mmap.c | 6 +
tools/perf/util/mmap.h | 6 +
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/record.h | 2 +
tools/perf/util/session.c | 501 +++++++---
tools/perf/util/session.h | 5 +
tools/perf/util/tool.h | 3 +-
16 files changed, 1531 insertions(+), 205 deletions(-)

--
2.19.0


2021-06-30 15:57:03

by Bayduraev, Alexey V

Subject: [PATCH v8 02/22] perf record: Introduce thread specific data array

Introduce a thread specific data object and an array of such objects
to store and manage thread local data. Implement functions to
allocate, initialize, finalize and release thread specific data.

The thread local maps and overwrite_maps arrays keep pointers to the
mmap buffer objects to be served according to the maps thread mask.
The thread local pollfd array keeps the event fds connected to the
mmap buffers, according to the maps thread mask.

Thread control commands are delivered via the thread local comm pipes
and the ctlfd_pos fd. External control commands (--control option)
are delivered via the evlist ctlfd_pos fd and handled by the main
tool thread.

Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/lib/api/fd/array.c | 17 ++++
tools/lib/api/fd/array.h | 1 +
tools/perf/builtin-record.c | 196 +++++++++++++++++++++++++++++++++++-
3 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/tools/lib/api/fd/array.c b/tools/lib/api/fd/array.c
index 5e6cb9debe37..de8bcbaea3f1 100644
--- a/tools/lib/api/fd/array.c
+++ b/tools/lib/api/fd/array.c
@@ -88,6 +88,23 @@ int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags
return pos;
}

+int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base)
+{
+ struct pollfd *entry;
+ int npos;
+
+ if (pos >= base->nr)
+ return -EINVAL;
+
+ entry = &base->entries[pos];
+
+ npos = fdarray__add(fda, entry->fd, entry->events, base->priv[pos].flags);
+ if (npos >= 0)
+ fda->priv[npos] = base->priv[pos];
+
+ return npos;
+}
+
int fdarray__filter(struct fdarray *fda, short revents,
void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
void *arg)
diff --git a/tools/lib/api/fd/array.h b/tools/lib/api/fd/array.h
index 7fcf21a33c0c..4a03da7f1fc1 100644
--- a/tools/lib/api/fd/array.h
+++ b/tools/lib/api/fd/array.h
@@ -42,6 +42,7 @@ struct fdarray *fdarray__new(int nr_alloc, int nr_autogrow);
void fdarray__delete(struct fdarray *fda);

int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags flags);
+int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base);
int fdarray__poll(struct fdarray *fda, int timeout);
int fdarray__filter(struct fdarray *fda, short revents,
void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 31b3a515abc1..11ce64b23db4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -58,6 +58,7 @@
#include <poll.h>
#include <pthread.h>
#include <unistd.h>
+#include <sys/syscall.h>
#include <sched.h>
#include <signal.h>
#ifdef HAVE_EVENTFD_SUPPORT
@@ -92,6 +93,23 @@ struct thread_mask {
struct mmap_cpu_mask affinity;
};

+struct thread_data {
+ pid_t tid;
+ struct thread_mask *mask;
+ struct {
+ int msg[2];
+ int ack[2];
+ } pipes;
+ struct fdarray pollfd;
+ int ctlfd_pos;
+ struct mmap **maps;
+ struct mmap **overwrite_maps;
+ int nr_mmaps;
+ struct record *rec;
+ unsigned long long samples;
+ unsigned long waking;
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -117,6 +135,7 @@ struct record {
struct mmap_cpu_mask affinity_mask;
unsigned long output_max_size; /* = 0: unlimited */
struct thread_mask *thread_masks;
+ struct thread_data *thread_data;
int nr_threads;
};

@@ -847,9 +866,174 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
return kcore_copy(from_dir, kcore_dir);
}

+static int record__thread_data_init_pipes(struct thread_data *thread_data)
+{
+ if (pipe(thread_data->pipes.msg) || pipe(thread_data->pipes.ack)) {
+ pr_err("Failed to create thread communication pipes: %s\n", strerror(errno));
+ return -ENOMEM;
+ }
+
+ pr_debug2("thread_data[%p]: msg=[%d,%d], ack=[%d,%d]\n", thread_data,
+ thread_data->pipes.msg[0], thread_data->pipes.msg[1],
+ thread_data->pipes.ack[0], thread_data->pipes.ack[1]);
+
+ return 0;
+}
+
+static int record__thread_data_init_maps(struct thread_data *thread_data, struct evlist *evlist)
+{
+ int m, tm, nr_mmaps = evlist->core.nr_mmaps;
+ struct mmap *mmap = evlist->mmap;
+ struct mmap *overwrite_mmap = evlist->overwrite_mmap;
+ struct perf_cpu_map *cpus = evlist->core.cpus;
+
+ thread_data->nr_mmaps = bitmap_weight(thread_data->mask->maps.bits,
+ thread_data->mask->maps.nbits);
+ if (mmap) {
+ thread_data->maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
+ if (!thread_data->maps) {
+ pr_err("Failed to allocate maps thread data\n");
+ return -ENOMEM;
+ }
+ }
+ if (overwrite_mmap) {
+ thread_data->overwrite_maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
+ if (!thread_data->overwrite_maps) {
+ pr_err("Failed to allocate overwrite maps thread data\n");
+ return -ENOMEM;
+ }
+ }
+ pr_debug2("thread_data[%p]: nr_mmaps=%d, maps=%p, ow_maps=%p\n", thread_data,
+ thread_data->nr_mmaps, thread_data->maps, thread_data->overwrite_maps);
+
+ for (m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
+ if (test_bit(cpus->map[m], thread_data->mask->maps.bits)) {
+ if (thread_data->maps) {
+ thread_data->maps[tm] = &mmap[m];
+ pr_debug2("thread_data[%p]: maps[%d] -> mmap[%d], cpus[%d]\n",
+ thread_data, tm, m, cpus->map[m]);
+ }
+ if (thread_data->overwrite_maps) {
+ thread_data->overwrite_maps[tm] = &overwrite_mmap[m];
+ pr_debug2("thread_data[%p]: ow_maps[%d] -> ow_mmap[%d], cpus[%d]\n",
+ thread_data, tm, m, cpus->map[m]);
+ }
+ tm++;
+ }
+ }
+
+ return 0;
+}
+
+static int record__thread_data_init_pollfd(struct thread_data *thread_data, struct evlist *evlist)
+{
+ int f, tm, pos;
+ struct mmap *map, *overwrite_map;
+
+ fdarray__init(&thread_data->pollfd, 64);
+
+ for (tm = 0; tm < thread_data->nr_mmaps; tm++) {
+ map = thread_data->maps ? thread_data->maps[tm] : NULL;
+ overwrite_map = thread_data->overwrite_maps ?
+ thread_data->overwrite_maps[tm] : NULL;
+
+ for (f = 0; f < evlist->core.pollfd.nr; f++) {
+ void *ptr = evlist->core.pollfd.priv[f].ptr;
+
+ if ((map && ptr == map) || (overwrite_map && ptr == overwrite_map)) {
+ pos = fdarray__clone(&thread_data->pollfd, f, &evlist->core.pollfd);
+ if (pos < 0)
+ return pos;
+ pr_debug2("thread_data[%p]: pollfd[%d] <- event_fd=%d\n",
+ thread_data, pos, evlist->core.pollfd.entries[f].fd);
+ }
+ }
+ }
+
+ return 0;
+}
+
+static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
+{
+ int t, ret;
+ struct thread_data *thread_data;
+
+ rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
+ if (!rec->thread_data) {
+ pr_err("Failed to allocate thread data\n");
+ return -ENOMEM;
+ }
+ thread_data = rec->thread_data;
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ thread_data[t].rec = rec;
+ thread_data[t].mask = &rec->thread_masks[t];
+ ret = record__thread_data_init_maps(&thread_data[t], evlist);
+ if (ret)
+ return ret;
+ ret = record__thread_data_init_pollfd(&thread_data[t], evlist);
+ if (ret)
+ return ret;
+ if (t) {
+ thread_data[t].tid = -1;
+ ret = record__thread_data_init_pipes(&thread_data[t]);
+ if (ret)
+ return ret;
+ thread_data[t].ctlfd_pos = fdarray__add(&thread_data[t].pollfd,
+ thread_data[t].pipes.msg[0],
+ POLLIN | POLLERR | POLLHUP,
+ fdarray_flag__nonfilterable);
+ if (thread_data[t].ctlfd_pos < 0)
+ return -ENOMEM;
+ pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
+ thread_data, thread_data[t].ctlfd_pos,
+ thread_data[t].pipes.msg[0]);
+ } else {
+ thread_data[t].tid = syscall(SYS_gettid);
+ if (evlist->ctl_fd.pos == -1)
+ continue;
+ thread_data[t].ctlfd_pos = fdarray__clone(&thread_data[t].pollfd,
+ evlist->ctl_fd.pos,
+ &evlist->core.pollfd);
+ if (thread_data[t].ctlfd_pos < 0)
+ return -ENOMEM;
+ pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
+ thread_data, thread_data[t].ctlfd_pos,
+ evlist->core.pollfd.entries[evlist->ctl_fd.pos].fd);
+ }
+ }
+
+ return 0;
+}
+
+static void record__free_thread_data(struct record *rec)
+{
+ int t;
+
+ if (rec->thread_data == NULL)
+ return;
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ if (rec->thread_data[t].pipes.msg[0])
+ close(rec->thread_data[t].pipes.msg[0]);
+ if (rec->thread_data[t].pipes.msg[1])
+ close(rec->thread_data[t].pipes.msg[1]);
+ if (rec->thread_data[t].pipes.ack[0])
+ close(rec->thread_data[t].pipes.ack[0]);
+ if (rec->thread_data[t].pipes.ack[1])
+ close(rec->thread_data[t].pipes.ack[1]);
+ zfree(&rec->thread_data[t].maps);
+ zfree(&rec->thread_data[t].overwrite_maps);
+ fdarray__exit(&rec->thread_data[t].pollfd);
+ }
+
+ zfree(&rec->thread_data);
+}
+
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
+ int ret;
struct record_opts *opts = &rec->opts;
bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
opts->auxtrace_sample_mode;
@@ -880,6 +1064,14 @@ static int record__mmap_evlist(struct record *rec,
return -EINVAL;
}
}
+
+ if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
+ return -1;
+
+ ret = record__alloc_thread_data(rec, evlist);
+ if (ret)
+ return ret;
+
return 0;
}

@@ -1880,9 +2072,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
evlist__start_workload(rec->evlist);
}

- if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
- goto out_child;
-
if (opts->initial_delay) {
pr_info(EVLIST_DISABLED_MSG);
if (opts->initial_delay > 0) {
@@ -2040,6 +2229,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
out_child:
evlist__finalize_ctlfd(rec->evlist);
record__mmap_read_all(rec, true);
+ record__free_thread_data(rec);
record__aio_mmap_read_sync(rec);

if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
--
2.19.0

2021-06-30 15:57:22

by Bayduraev, Alexey V

Subject: [PATCH v8 03/22] perf record: Introduce thread local variable

Introduce a thread local variable and use it for threaded trace streaming.
Use the thread affinity mask instead of the record affinity mask in
affinity modes.
Introduce and use the evlist__ctlfd_update() function to propagate
external control commands to the global evlist object.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 135 ++++++++++++++++++++++++------------
tools/perf/util/evlist.c | 16 +++++
tools/perf/util/evlist.h | 1 +
3 files changed, 107 insertions(+), 45 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 11ce64b23db4..3935c0fabe01 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -110,6 +110,8 @@ struct thread_data {
unsigned long waking;
};

+static __thread struct thread_data *thread;
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -132,7 +134,6 @@ struct record {
bool timestamp_boundary;
struct switch_output switch_output;
unsigned long long samples;
- struct mmap_cpu_mask affinity_mask;
unsigned long output_max_size; /* = 0: unlimited */
struct thread_mask *thread_masks;
struct thread_data *thread_data;
@@ -567,7 +568,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
bf = map->data;
}

- rec->samples++;
+ thread->samples++;
return record__write(rec, map, bf, size);
}

@@ -1260,16 +1261,24 @@ static struct perf_event_header finished_round_event = {

static void record__adjust_affinity(struct record *rec, struct mmap *map)
{
+ int ret = 0;
+
if (rec->opts.affinity != PERF_AFFINITY_SYS &&
- !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
- rec->affinity_mask.nbits)) {
- bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
- bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
- map->affinity_mask.bits, rec->affinity_mask.nbits);
- sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
- (cpu_set_t *)rec->affinity_mask.bits);
- if (verbose == 2)
- mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
+ !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
+ thread->mask->affinity.nbits)) {
+ bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
+ bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
+ map->affinity_mask.bits, thread->mask->affinity.nbits);
+ ret = sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
+ (cpu_set_t *)thread->mask->affinity.bits);
+ if (ret)
+ pr_err("threads[%d]: sched_setaffinity() call failed: %s\n",
+ thread->tid, strerror(errno));
+ if (verbose == 2) {
+ pr_debug("threads[%d]: addr=", thread->tid);
+ mmap_cpu_mask__scnprintf(&thread->mask->affinity, "thread");
+ pr_debug("threads[%d]: on cpu=%d\n", thread->tid, sched_getcpu());
+ }
}
}

@@ -1310,14 +1319,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
u64 bytes_written = rec->bytes_written;
int i;
int rc = 0;
- struct mmap *maps;
+ int nr_mmaps;
+ struct mmap **maps;
int trace_fd = rec->data.file.fd;
off_t off = 0;

if (!evlist)
return 0;

- maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
+ nr_mmaps = thread->nr_mmaps;
+ maps = overwrite ? thread->overwrite_maps : thread->maps;
+
if (!maps)
return 0;

@@ -1327,9 +1339,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
if (record__aio_enabled(rec))
off = record__aio_get_pos(trace_fd);

- for (i = 0; i < evlist->core.nr_mmaps; i++) {
+ for (i = 0; i < nr_mmaps; i++) {
u64 flush = 0;
- struct mmap *map = &maps[i];
+ struct mmap *map = maps[i];

if (map->core.base) {
record__adjust_affinity(rec, map);
@@ -1392,6 +1404,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
return record__mmap_read_evlist(rec, rec->evlist, true, synch);
}

+static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
+ void *arg __maybe_unused)
+{
+ struct perf_mmap *map = fda->priv[fd].ptr;
+
+ if (map)
+ perf_mmap__put(map);
+}
+
static void record__init_features(struct record *rec)
{
struct perf_session *session = rec->session;
@@ -1836,6 +1857,33 @@ static void record__uniquify_name(struct record *rec)
}
}

+static int record__start_threads(struct record *rec)
+{
+ struct thread_data *thread_data = rec->thread_data;
+
+ thread = &thread_data[0];
+
+ pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
+
+ return 0;
+}
+
+static int record__stop_threads(struct record *rec, unsigned long *waking)
+{
+ int t;
+ struct thread_data *thread_data = rec->thread_data;
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ rec->samples += thread_data[t].samples;
+ *waking += thread_data[t].waking;
+ pr_debug("threads[%d]: samples=%lld, wakes=%ld, transferred=%ld, compressed=%ld\n",
+ thread_data[t].tid, thread_data[t].samples, thread_data[t].waking,
+ rec->session->bytes_transferred, rec->session->bytes_compressed);
+ }
+
+ return 0;
+}
+
static int __cmd_record(struct record *rec, int argc, const char **argv)
{
int err;
@@ -1944,7 +1992,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)

if (record__open(rec) != 0) {
err = -1;
- goto out_child;
+ goto out_free_threads;
}
session->header.env.comp_mmap_len = session->evlist->core.mmap_len;

@@ -1952,7 +2000,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
err = record__kcore_copy(&session->machines.host, data);
if (err) {
pr_err("ERROR: Failed to copy kcore\n");
- goto out_child;
+ goto out_free_threads;
}
}

@@ -1963,7 +2011,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
pr_err("ERROR: Apply config to BPF failed: %s\n",
errbuf);
- goto out_child;
+ goto out_free_threads;
}

/*
@@ -1981,11 +2029,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (data->is_pipe) {
err = perf_header__write_pipe(fd);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
} else {
err = perf_session__write_header(session, rec->evlist, fd, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;
}

err = -1;
@@ -1993,16 +2041,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
&& !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
pr_err("Couldn't generate buildids. "
"Use --no-buildid to profile anyway.\n");
- goto out_child;
+ goto out_free_threads;
}

err = record__setup_sb_evlist(rec);
if (err)
- goto out_child;
+ goto out_free_threads;

err = record__synthesize(rec, false);
if (err < 0)
- goto out_child;
+ goto out_free_threads;

if (rec->realtime_prio) {
struct sched_param param;
@@ -2011,10 +2059,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (sched_setscheduler(0, SCHED_FIFO, &param)) {
pr_err("Could not set realtime priority.\n");
err = -1;
- goto out_child;
+ goto out_free_threads;
}
}

+ if (record__start_threads(rec))
+ goto out_free_threads;
+
/*
* When perf is starting the traced process, all the events
* (apart from group members) have enable_on_exec=1 set,
@@ -2085,7 +2136,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
trigger_ready(&switch_output_trigger);
perf_hooks__invoke_record_start();
for (;;) {
- unsigned long long hits = rec->samples;
+ unsigned long long hits = thread->samples;

/*
* rec->evlist->bkw_mmap_state is possible to be
@@ -2154,20 +2205,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
alarm(rec->switch_output.time);
}

- if (hits == rec->samples) {
+ if (hits == thread->samples) {
if (done || draining)
break;
- err = evlist__poll(rec->evlist, -1);
+ err = fdarray__poll(&thread->pollfd, -1);
/*
* Propagate error, only if there's any. Ignore positive
* number of returned events and interrupt error.
*/
if (err > 0 || (err < 0 && errno == EINTR))
err = 0;
- waking++;
+ thread->waking++;

- if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
+ if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
+ record__thread_munmap_filtered, NULL) == 0)
draining = true;
+
+ evlist__ctlfd_update(rec->evlist,
+ &thread->pollfd.entries[thread->ctlfd_pos]);
}

if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
@@ -2220,18 +2275,20 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
goto out_child;
}

- if (!quiet)
- fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
-
if (target__none(&rec->opts.target))
record__synthesize_workload(rec, true);

out_child:
- evlist__finalize_ctlfd(rec->evlist);
+ record__stop_threads(rec, &waking);
record__mmap_read_all(rec, true);
+out_free_threads:
record__free_thread_data(rec);
+ evlist__finalize_ctlfd(rec->evlist);
record__aio_mmap_read_sync(rec);

+ if (!quiet)
+ fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
+
if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
session->header.env.comp_ratio = ratio + 0.5;
@@ -3093,17 +3150,6 @@ int cmd_record(int argc, const char **argv)

symbol__init(NULL);

- if (rec->opts.affinity != PERF_AFFINITY_SYS) {
- rec->affinity_mask.nbits = cpu__max_cpu();
- rec->affinity_mask.bits = bitmap_alloc(rec->affinity_mask.nbits);
- if (!rec->affinity_mask.bits) {
- pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
- err = -ENOMEM;
- goto out_opts;
- }
- pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
- }
-
err = record__auxtrace_init(rec);
if (err)
goto out;
@@ -3241,7 +3287,6 @@ int cmd_record(int argc, const char **argv)

err = __cmd_record(&record, argc, argv);
out:
- bitmap_free(rec->affinity_mask.bits);
evlist__delete(rec->evlist);
symbol__exit();
auxtrace_record__free(rec->itr);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 6ba9664089bd..3d555a98c037 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -2132,6 +2132,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
return err;
}

+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
+{
+ int ctlfd_pos = evlist->ctl_fd.pos;
+ struct pollfd *entries = evlist->core.pollfd.entries;
+
+ if (!evlist__ctlfd_initialized(evlist))
+ return 0;
+
+ if (entries[ctlfd_pos].fd != update->fd ||
+ entries[ctlfd_pos].events != update->events)
+ return -1;
+
+ entries[ctlfd_pos].revents = update->revents;
+ return 0;
+}
+
struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
{
struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 2073cfa79f79..b7aa719c638d 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -358,6 +358,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
int evlist__finalize_ctlfd(struct evlist *evlist);
bool evlist__ctlfd_initialized(struct evlist *evlist);
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
int evlist__ctlfd_ack(struct evlist *evlist);

--
2.19.0

2021-06-30 15:57:51

by Bayduraev, Alexey V

Subject: [PATCH v8 05/22] perf record: Start threads in the beginning of trace streaming

Start threads in a detached state, because their management is
implemented via messaging, to avoid any scaling issues. Block signals
prior to thread start so that only the main tool thread is notified of
external async signals during data collection. The thread affinity mask
is used to assign the eligible cpus for the thread to run on. Wait for
and sync on thread start using the thread ack pipe.

Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 108 +++++++++++++++++++++++++++++++++++-
1 file changed, 107 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 82a21da2af16..cead2b3c56d7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1423,6 +1423,66 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
perf_mmap__put(map);
}

+static void *record__thread(void *arg)
+{
+ enum thread_msg msg = THREAD_MSG__READY;
+ bool terminate = false;
+ struct fdarray *pollfd;
+ int err, ctlfd_pos;
+
+ thread = arg;
+ thread->tid = syscall(SYS_gettid);
+
+ err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads[%d]: failed to notify on start: %s", thread->tid, strerror(errno));
+
+ pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
+
+ pollfd = &thread->pollfd;
+ ctlfd_pos = thread->ctlfd_pos;
+
+ for (;;) {
+ unsigned long long hits = thread->samples;
+
+ if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+ break;
+
+ if (hits == thread->samples) {
+
+ err = fdarray__poll(pollfd, -1);
+ /*
+ * Propagate error, only if there's any. Ignore positive
+ * number of returned events and interrupt error.
+ */
+ if (err > 0 || (err < 0 && errno == EINTR))
+ err = 0;
+ thread->waking++;
+
+ if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+ record__thread_munmap_filtered, NULL) == 0)
+ break;
+ }
+
+ if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+ terminate = true;
+ close(thread->pipes.msg[0]);
+ pollfd->entries[ctlfd_pos].fd = -1;
+ pollfd->entries[ctlfd_pos].events = 0;
+ }
+
+ pollfd->entries[ctlfd_pos].revents = 0;
+ }
+ record__mmap_read_all(thread->rec, true);
+
+ err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+ if (err == -1)
+ pr_err("threads[%d]: failed to notify on termination: %s",
+ thread->tid, strerror(errno));
+
+ return NULL;
+}
+
static void record__init_features(struct record *rec)
{
struct perf_session *session = rec->session;
@@ -1886,13 +1946,59 @@ static int record__terminate_thread(struct thread_data *thread_data)

static int record__start_threads(struct record *rec)
{
+ int t, tt, ret = 0, nr_threads = rec->nr_threads;
struct thread_data *thread_data = rec->thread_data;
+ sigset_t full, mask;
+ pthread_t handle;
+ pthread_attr_t attrs;
+
+ sigfillset(&full);
+ if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+ pr_err("Failed to block signals on threads start: %s\n", strerror(errno));
+ return -1;
+ }
+
+ pthread_attr_init(&attrs);
+ pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+ for (t = 1; t < nr_threads; t++) {
+ enum thread_msg msg = THREAD_MSG__UNDEFINED;
+
+ pthread_attr_setaffinity_np(&attrs,
+ MMAP_CPU_MASK_BYTES(&(thread_data[t].mask->affinity)),
+ (cpu_set_t *)(thread_data[t].mask->affinity.bits));
+
+ if (pthread_create(&handle, &attrs, record__thread, &thread_data[t])) {
+ for (tt = 1; tt < t; tt++)
+ record__terminate_thread(&thread_data[tt]);
+ pr_err("Failed to start threads: %s\n", strerror(errno));
+ ret = -1;
+ goto out_err;
+ }
+
+ if (read(thread_data[t].pipes.ack[0], &msg, sizeof(msg)) > 0)
+ pr_debug2("threads[%d]: sent %s\n", rec->thread_data[t].tid,
+ thread_msg_tags[msg]);
+ }
+
+ if (nr_threads > 1) {
+ sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread_data[0].mask->affinity),
+ (cpu_set_t *)thread_data[0].mask->affinity.bits);
+ }

thread = &thread_data[0];

pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());

- return 0;
+out_err:
+ pthread_attr_destroy(&attrs);
+
+ if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+ pr_err("Failed to unblock signals on threads start: %s\n", strerror(errno));
+ ret = -1;
+ }
+
+ return ret;
}

static int record__stop_threads(struct record *rec, unsigned long *waking)
--
2.19.0

2021-06-30 15:57:55

by Bayduraev, Alexey V

Subject: [PATCH v8 04/22] perf record: Stop threads in the end of trace streaming

Signal threads to terminate by closing the write fd of the msg pipe.
Receive a THREAD_MSG__READY message as confirmation of a thread's
termination. Stop the threads created for parallel trace streaming
before processing their stats.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3935c0fabe01..82a21da2af16 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -112,6 +112,16 @@ struct thread_data {

static __thread struct thread_data *thread;

+enum thread_msg {
+ THREAD_MSG__UNDEFINED = 0,
+ THREAD_MSG__READY,
+ THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+ "UNDEFINED", "READY"
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -1857,6 +1867,23 @@ static void record__uniquify_name(struct record *rec)
}
}

+static int record__terminate_thread(struct thread_data *thread_data)
+{
+ int res;
+ enum thread_msg ack = THREAD_MSG__UNDEFINED;
+ pid_t tid = thread_data->tid;
+
+ close(thread_data->pipes.msg[1]);
+ res = read(thread_data->pipes.ack[0], &ack, sizeof(ack));
+ if (res != -1)
+ pr_debug2("threads[%d]: sent %s\n", tid, thread_msg_tags[ack]);
+ else
+ pr_err("threads[%d]: failed to recv msg=%s from tid=%d\n",
+ thread->tid, thread_msg_tags[ack], tid);
+
+ return 0;
+}
+
static int record__start_threads(struct record *rec)
{
struct thread_data *thread_data = rec->thread_data;
@@ -1873,6 +1900,9 @@ static int record__stop_threads(struct record *rec, unsigned long *waking)
int t;
struct thread_data *thread_data = rec->thread_data;

+ for (t = 1; t < rec->nr_threads; t++)
+ record__terminate_thread(&thread_data[t]);
+
for (t = 0; t < rec->nr_threads; t++) {
rec->samples += thread_data[t].samples;
*waking += thread_data[t].waking;
--
2.19.0

2021-06-30 15:58:01

by Bayduraev, Alexey V

Subject: [PATCH v8 06/22] perf record: Introduce data file at mmap buffer object

Introduce data file and compressor objects into the mmap object so
they can be used to process and store the data stream from the
corresponding kernel data buffer. Make use of the introduced per-mmap
file and compressor when they are initialized and available.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 3 +++
tools/perf/util/mmap.c | 6 ++++++
tools/perf/util/mmap.h | 3 +++
3 files changed, 12 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index cead2b3c56d7..9b7e7a5dc116 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -190,6 +190,9 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
{
struct perf_data_file *file = &rec->session->data->file;

+ if (map && map->file)
+ file = map->file;
+
if (perf_data_file__write(file, bf, size) < 0) {
pr_err("failed to write perf data, error: %m\n");
return -1;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ab7108d22428..a2c5e4237592 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
{
bitmap_free(map->affinity_mask.bits);

+ zstd_fini(&map->zstd_data);
+
perf_mmap__aio_munmap(map);
if (map->data != NULL) {
munmap(map->data, mmap__mmap_len(map));
@@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
map->core.flush = mp->flush;

map->comp_level = mp->comp_level;
+ if (zstd_init(&map->zstd_data, map->comp_level)) {
+ pr_debug2("failed to init mmap compressor, error %d\n", errno);
+ return -1;
+ }

if (map->comp_level && !perf_mmap__aio_enabled(map)) {
map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 9d5f589f02ae..c4aed6e89549 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -13,6 +13,7 @@
#endif
#include "auxtrace.h"
#include "event.h"
+#include "util/compress.h"

struct aiocb;

@@ -43,6 +44,8 @@ struct mmap {
struct mmap_cpu_mask affinity_mask;
void *data;
int comp_level;
+ struct perf_data_file *file;
+ struct zstd_data zstd_data;
};

struct mmap_params {
--
2.19.0

2021-06-30 15:58:51

by Bayduraev, Alexey V

Subject: [PATCH v8 07/22] perf record: Introduce data transferred and compressed stats

Introduce bytes_transferred and bytes_compressed stats so they
capture statistics for the related data buffer transfers.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 64 +++++++++++++++++++++++++++++--------
tools/perf/util/mmap.h | 3 ++
2 files changed, 54 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9b7e7a5dc116..f1a6edc0824c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -198,6 +198,11 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
return -1;
}

+ if (map && map->file) {
+ map->bytes_written += size;
+ return 0;
+ }
+
rec->bytes_written += size;

if (record__output_max_size_exceeded(rec) && !done) {
@@ -215,8 +220,8 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,

static int record__aio_enabled(struct record *rec);
static int record__comp_enabled(struct record *rec);
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
- void *src, size_t src_size);
+static size_t zstd_compress(struct zstd_data *data,
+ void *dst, size_t dst_size, void *src, size_t src_size);

#ifdef HAVE_AIO_SUPPORT
static int record__aio_write(struct aiocb *cblock, int trace_fd,
@@ -350,9 +355,13 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
*/

if (record__comp_enabled(aio->rec)) {
- size = zstd_compress(aio->rec->session, aio->data + aio->size,
- mmap__mmap_len(map) - aio->size,
+ struct zstd_data *zstd_data = &aio->rec->session->zstd_data;
+
+ aio->rec->session->bytes_transferred += size;
+ size = zstd_compress(zstd_data,
+ aio->data + aio->size, mmap__mmap_len(map) - aio->size,
buf, size);
+ aio->rec->session->bytes_compressed += size;
} else {
memcpy(aio->data + aio->size, buf, size);
}
@@ -577,8 +586,22 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
struct record *rec = to;

if (record__comp_enabled(rec)) {
- size = zstd_compress(rec->session, map->data, mmap__mmap_len(map), bf, size);
+ struct zstd_data *zstd_data = &rec->session->zstd_data;
+
+ if (map->file) {
+ zstd_data = &map->zstd_data;
+ map->bytes_transferred += size;
+ } else {
+ rec->session->bytes_transferred += size;
+ }
+
+ size = zstd_compress(zstd_data, map->data, mmap__mmap_len(map), bf, size);
bf = map->data;
+
+ if (map->file)
+ map->bytes_compressed += size;
+ else
+ rec->session->bytes_compressed += size;
}

thread->samples++;
@@ -1311,18 +1334,15 @@ static size_t process_comp_header(void *record, size_t increment)
return size;
}

-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+static size_t zstd_compress(struct zstd_data *zstd_data, void *dst, size_t dst_size,
void *src, size_t src_size)
{
size_t compressed;
size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct perf_record_compressed) - 1;

- compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
+ compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
max_record_size, process_comp_header);

- session->bytes_transferred += src_size;
- session->bytes_compressed += compressed;
-
return compressed;
}

@@ -2006,8 +2026,10 @@ static int record__start_threads(struct record *rec)

static int record__stop_threads(struct record *rec, unsigned long *waking)
{
- int t;
+ int t, tm;
+ struct mmap *map, *overwrite_map;
struct thread_data *thread_data = rec->thread_data;
+ u64 bytes_written = 0, bytes_transferred = 0, bytes_compressed = 0;

for (t = 1; t < rec->nr_threads; t++)
record__terminate_thread(&thread_data[t]);
@@ -2015,9 +2037,25 @@ static int record__stop_threads(struct record *rec, unsigned long *waking)
for (t = 0; t < rec->nr_threads; t++) {
rec->samples += thread_data[t].samples;
*waking += thread_data[t].waking;
- pr_debug("threads[%d]: samples=%lld, wakes=%ld, trasferred=%ld, compressed=%ld\n",
+ for (tm = 0; tm < thread_data[t].nr_mmaps; tm++) {
+ if (thread_data[t].maps) {
+ map = thread_data[t].maps[tm];
+ bytes_transferred += map->bytes_transferred;
+ bytes_compressed += map->bytes_compressed;
+ bytes_written += map->bytes_written;
+ }
+ if (thread_data[t].overwrite_maps) {
+ overwrite_map = thread_data[t].overwrite_maps[tm];
+ bytes_transferred += overwrite_map->bytes_transferred;
+ bytes_compressed += overwrite_map->bytes_compressed;
+ bytes_written += overwrite_map->bytes_written;
+ }
+ }
+ rec->session->bytes_transferred += bytes_transferred;
+ rec->session->bytes_compressed += bytes_compressed;
+ pr_debug("threads[%d]: samples=%lld, wakes=%ld, transferred=%ld, compressed=%ld, written=%ld\n",
thread_data[t].tid, thread_data[t].samples, thread_data[t].waking,
- rec->session->bytes_transferred, rec->session->bytes_compressed);
+ bytes_transferred, bytes_compressed, bytes_written);
}

return 0;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index c4aed6e89549..c04ca4b5adf5 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -46,6 +46,9 @@ struct mmap {
int comp_level;
struct perf_data_file *file;
struct zstd_data zstd_data;
+ u64 bytes_transferred;
+ u64 bytes_compressed;
+ u64 bytes_written;
};

struct mmap_params {
--
2.19.0

2021-06-30 15:59:09

by Bayduraev, Alexey V

Subject: [PATCH v8 08/22] perf record: Init data file at mmap buffer object

Initialize the data files referenced by mmap buffer objects so trace
data can be written into several data files located in the data
directory.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 41 ++++++++++++++++++++++++++++++-------
tools/perf/util/record.h | 1 +
2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f1a6edc0824c..2ade17308467 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -160,6 +160,11 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
"SYS", "NODE", "CPU"
};

+static int record__threads_enabled(struct record *rec)
+{
+ return rec->opts.threads_spec;
+}
+
static bool switch_output_signal(struct record *rec)
{
return rec->switch_output.signal &&
@@ -1070,7 +1075,7 @@ static void record__free_thread_data(struct record *rec)
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
- int ret;
+ int i, ret;
struct record_opts *opts = &rec->opts;
bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
opts->auxtrace_sample_mode;
@@ -1109,6 +1114,18 @@ static int record__mmap_evlist(struct record *rec,
if (ret)
return ret;

+ if (record__threads_enabled(rec)) {
+ ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+ if (ret)
+ return ret;
+ for (i = 0; i < evlist->core.nr_mmaps; i++) {
+ if (evlist->mmap)
+ evlist->mmap[i].file = &rec->data.dir.files[i];
+ if (evlist->overwrite_mmap)
+ evlist->overwrite_mmap[i].file = &rec->data.dir.files[i];
+ }
+ }
+
return 0;
}

@@ -1416,8 +1433,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
/*
* Mark the round finished in case we wrote
* at least one event.
+ *
+ * No need for round events in directory mode,
+ * because per-cpu maps and files have data
+ * sorted by kernel.
*/
- if (bytes_written != rec->bytes_written)
+ if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));

if (overwrite)
@@ -1532,7 +1553,9 @@ static void record__init_features(struct record *rec)
if (!rec->opts.use_clockid)
perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);

- perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__threads_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
if (!record__comp_enabled(rec))
perf_header__clear_feat(&session->header, HEADER_COMPRESSED);

@@ -1543,15 +1566,21 @@ static void
record__finish_output(struct record *rec)
{
struct perf_data *data = &rec->data;
- int fd = perf_data__fd(data);
+ int i, fd = perf_data__fd(data);

if (data->is_pipe)
return;

rec->session->header.data_size += rec->bytes_written;
data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+ if (record__threads_enabled(rec)) {
+ for (i = 0; i < data->dir.nr; i++)
+ data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+ }

if (!rec->no_buildid) {
+ /* this will be recalculated during process_buildids() */
+ rec->samples = 0;
process_buildids(rec);

if (rec->buildid_all)
@@ -2489,8 +2518,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
status = err;

record__synthesize(rec, true);
- /* this will be recalculated during process_buildids() */
- rec->samples = 0;

if (!err) {
if (!rec->timestamp_filename) {
@@ -3283,7 +3310,7 @@ int cmd_record(int argc, const char **argv)
goto out_opts;
}

- if (rec->opts.kcore)
+ if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

if (rec->opts.comp_level != 0) {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 68f471d9a88b..4d68b7e27272 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -77,6 +77,7 @@ struct record_opts {
int ctl_fd;
int ctl_fd_ack;
bool ctl_fd_close;
+ int threads_spec;
};

extern const char * const *record_usage;
--
2.19.0

2021-06-30 15:59:13

by Bayduraev, Alexey V

Subject: [PATCH v8 09/22] tools lib: Introduce bitmap_intersects() operation

Introduce a bitmap_intersects() routine that tests whether
bitmaps bitmap1 and bitmap2 intersect. This routine will
be used during thread masks initialization.

Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/include/linux/bitmap.h | 11 +++++++++++
tools/lib/bitmap.c | 14 ++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 330dbf7509cc..9d959bc24859 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
int __bitmap_equal(const unsigned long *bitmap1,
const unsigned long *bitmap2, unsigned int bits);
void bitmap_clear(unsigned long *map, unsigned int start, int len);
+int __bitmap_intersects(const unsigned long *bitmap1,
+ const unsigned long *bitmap2, unsigned int bits);

#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
@@ -170,4 +172,13 @@ static inline int bitmap_equal(const unsigned long *src1,
return __bitmap_equal(src1, src2, nbits);
}

+static inline int bitmap_intersects(const unsigned long *src1,
+ const unsigned long *src2, unsigned int nbits)
+{
+ if (small_const_nbits(nbits))
+ return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
+ else
+ return __bitmap_intersects(src1, src2, nbits);
+}
+
#endif /* _PERF_BITOPS_H */
diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
index f4e914712b6f..db466ef7be9d 100644
--- a/tools/lib/bitmap.c
+++ b/tools/lib/bitmap.c
@@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,

return 1;
}
+
+int __bitmap_intersects(const unsigned long *bitmap1,
+ const unsigned long *bitmap2, unsigned int bits)
+{
+ unsigned int k, lim = bits/BITS_PER_LONG;
+ for (k = 0; k < lim; ++k)
+ if (bitmap1[k] & bitmap2[k])
+ return 1;
+
+ if (bits % BITS_PER_LONG)
+ if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
+ return 1;
+ return 0;
+}
--
2.19.0

2021-06-30 15:59:44

by Bayduraev, Alexey V

Subject: [PATCH v8 12/22] perf report: Output data file name in raw trace dump

Print the path and name of the data file into the raw dump (-D) as
<file_offset>@<path/file>. Print the offset of the PERF_RECORD_COMPRESSED
record instead of zero for decompressed records:
<file_offset>@perf.data [0x30]: event: 9
or
<file_offset>@perf.data/data.7 [0x30]: event: 9

Acked-by: Namhyung Kim <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-inject.c | 3 +-
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/session.c | 77 +++++++++++++++++++-------------
tools/perf/util/session.h | 1 +
tools/perf/util/tool.h | 3 +-
5 files changed, 52 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 5d6f583e2cd3..649e6746cc70 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -109,7 +109,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,

static int perf_event__repipe_op4_synth(struct perf_session *session,
union perf_event *event,
- u64 data __maybe_unused)
+ u64 data __maybe_unused,
+ const char *str __maybe_unused)
{
return perf_event__repipe_synth(session->tool, event);
}
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 75345946c4b9..42c9764c6b5b 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -9,6 +9,7 @@ struct perf_sample;
struct ordered_event {
u64 timestamp;
u64 file_offset;
+ const char *file_path;
union perf_event *event;
struct list_head list;
};
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 86145dd699ca..762de808bc03 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -38,7 +38,8 @@

#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
void *src;
size_t decomp_size, src_size;
@@ -60,6 +61,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
}

decomp->file_pos = file_offset;
+ decomp->file_path = file_path;
decomp->mmap_len = mmap_len;
decomp->head = 0;

@@ -100,7 +102,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

static int perf_session__open(struct perf_session *session)
{
@@ -182,7 +185,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
ordered_events);

return perf_session__deliver_event(session, event->event,
- session->tool, event->file_offset);
+ session->tool, event->file_offset,
+ event->file_path);
}

struct perf_session *perf_session__new(struct perf_data *data,
@@ -464,7 +468,8 @@ static int process_event_time_conv_stub(struct perf_session *perf_session __mayb

static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
union perf_event *event __maybe_unused,
- u64 file_offset __maybe_unused)
+ u64 file_offset __maybe_unused,
+ const char *file_path __maybe_unused)
{
dump_printf(": unhandled!\n");
return 0;
@@ -1267,13 +1272,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
}

static void dump_event(struct evlist *evlist, union perf_event *event,
- u64 file_offset, struct perf_sample *sample)
+ u64 file_offset, struct perf_sample *sample,
+ const char *file_path)
{
if (!dump_trace)
return;

- printf("\n%#" PRIx64 " [%#x]: event: %d\n",
- file_offset, event->header.size, event->header.type);
+ printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+ file_offset, file_path, event->header.size, event->header.type);

trace_event(event);
if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1476,12 +1482,13 @@ static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample,
- struct perf_tool *tool, u64 file_offset)
+ struct perf_tool *tool, u64 file_offset,
+ const char *file_path)
{
struct evsel *evsel;
struct machine *machine;

- dump_event(evlist, event, file_offset, sample);
+ dump_event(evlist, event, file_offset, sample, file_path);

evsel = evlist__id2evsel(evlist, sample->id);

@@ -1558,7 +1565,8 @@ static int machines__deliver_event(struct machines *machines,
static int perf_session__deliver_event(struct perf_session *session,
union perf_event *event,
struct perf_tool *tool,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct perf_sample sample;
int ret = evlist__parse_sample(session->evlist, event, &sample);
@@ -1575,7 +1583,7 @@ static int perf_session__deliver_event(struct perf_session *session,
return 0;

ret = machines__deliver_event(&session->machines, session->evlist,
- event, &sample, tool, file_offset);
+ event, &sample, tool, file_offset, file_path);

if (dump_trace && sample.aux_sample.size)
auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1585,7 +1593,8 @@ static int perf_session__deliver_event(struct perf_session *session,

static s64 perf_session__process_user_event(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
@@ -1595,7 +1604,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,

if (event->header.type != PERF_RECORD_COMPRESSED ||
tool->compressed == perf_session__process_compressed_event_stub)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);

/* These events are processed right away */
switch (event->header.type) {
@@ -1654,9 +1663,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
case PERF_RECORD_COMPRESSED:
- err = tool->compressed(session, event, file_offset);
+ err = tool->compressed(session, event, file_offset, file_path);
if (err)
- dump_event(session->evlist, event, file_offset, &sample);
+ dump_event(session->evlist, event, file_offset, &sample, file_path);
return err;
default:
return -EINVAL;
@@ -1673,9 +1682,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, 0);
+ return perf_session__process_user_event(session, event, 0, NULL);

- return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+ return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
}

static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1772,7 +1781,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
}

static s64 perf_session__process_event(struct perf_session *session,
- union perf_event *event, u64 file_offset)
+ union perf_event *event, u64 file_offset,
+ const char *file_path)
{
struct evlist *evlist = session->evlist;
struct perf_tool *tool = session->tool;
@@ -1787,7 +1797,7 @@ static s64 perf_session__process_event(struct perf_session *session,
events_stats__inc(&evlist->stats, event->header.type);

if (event->header.type >= PERF_RECORD_USER_TYPE_START)
- return perf_session__process_user_event(session, event, file_offset);
+ return perf_session__process_user_event(session, event, file_offset, file_path);

if (tool->ordered_events) {
u64 timestamp = -1ULL;
@@ -1801,7 +1811,7 @@ static s64 perf_session__process_event(struct perf_session *session,
return ret;
}

- return perf_session__deliver_event(session, event, tool, file_offset);
+ return perf_session__deliver_event(session, event, tool, file_offset, file_path);
}

void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -2021,7 +2031,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
}
}

- if ((skip = perf_session__process_event(session, event, head)) < 0) {
+ skip = perf_session__process_event(session, event, head, "pipe");
+ if (skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
head, event->header.size, event->header.type);
err = -EINVAL;
@@ -2102,7 +2113,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
static int __perf_session__process_decomp_events(struct perf_session *session)
{
s64 skip;
- u64 size, file_pos = 0;
+ u64 size;
struct decomp *decomp = session->decomp_last;

if (!decomp)
@@ -2116,9 +2127,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
break;

size = event->header.size;
-
- if (size < sizeof(struct perf_event_header) ||
- (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+ skip = perf_session__process_event(session, event, decomp->file_pos,
+ decomp->file_path);
+ if (size < sizeof(struct perf_event_header) || skip < 0) {
pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
decomp->file_pos + decomp->head, event->header.size, event->header.type);
return -EINVAL;
@@ -2149,10 +2160,12 @@ struct reader;

typedef s64 (*reader_cb_t)(struct perf_session *session,
union perf_event *event,
- u64 file_offset);
+ u64 file_offset,
+ const char *file_path);

struct reader {
int fd;
+ const char *path;
u64 data_size;
u64 data_offset;
reader_cb_t process;
@@ -2234,9 +2247,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, file_pos)) < 0) {
- pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
- file_offset + head, event->header.size,
+ (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+ pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
+ file_offset + head, rd->path, event->header.size,
event->header.type, strerror(-skip));
err = skip;
goto out;
@@ -2266,9 +2279,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,

static s64 process_simple(struct perf_session *session,
union perf_event *event,
- u64 file_offset)
+ u64 file_offset,
+ const char *file_path)
{
- return perf_session__process_event(session, event, file_offset);
+ return perf_session__process_event(session, event, file_offset, file_path);
}

static int __perf_session__process_events(struct perf_session *session)
@@ -2278,6 +2292,7 @@ static int __perf_session__process_events(struct perf_session *session)
.data_size = session->header.data_size,
.data_offset = session->header.data_offset,
.process = process_simple,
+ .path = session->data->file.path,
.in_place_update = session->data->in_place_update,
};
struct ordered_events *oe = &session->ordered_events;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index e31ba4c92a6c..6895a22ab5b7 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -46,6 +46,7 @@ struct perf_session {
struct decomp {
struct decomp *next;
u64 file_pos;
+ const char *file_path;
size_t mmap_len;
u64 head;
size_t size;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index bbbc0dcd461f..c966531d3eca 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,

typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+ const char *str);

typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
struct ordered_events *oe);
--
2.19.0

2021-06-30 16:00:03

by Bayduraev, Alexey V

Subject: [PATCH v8 14/22] perf session: Introduce reader_state in reader object

We need all the reader state info in a separate object to load data
from multiple files, so we can keep multiple readers at the same time.
Add struct reader_state holding all the items that need to be kept.

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 74 +++++++++++++++++++++++----------------
1 file changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7b8e19350095..d9e0c54a74c1 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -55,6 +55,17 @@ typedef s64 (*reader_cb_t)(struct perf_session *session,
u64 file_offset,
const char *file_path);

+struct reader_state {
+ char *mmaps[NUM_MMAPS];
+ size_t mmap_size;
+ int mmap_idx;
+ char *mmap_cur;
+ u64 file_pos;
+ u64 file_offset;
+ u64 data_size;
+ u64 head;
+};
+
struct reader {
int fd;
const char *path;
@@ -62,6 +73,7 @@ struct reader {
u64 data_offset;
reader_cb_t process;
bool in_place_update;
+ struct reader_state state;
};

#ifdef HAVE_ZSTD_SUPPORT
@@ -2176,29 +2188,28 @@ static int
reader__process_events(struct reader *rd, struct perf_session *session,
struct ui_progress *prog)
{
- u64 data_size = rd->data_size;
- u64 head, page_offset, file_offset, file_pos, size;
- int err = 0, mmap_prot, mmap_flags, map_idx = 0;
- size_t mmap_size;
- char *buf, *mmaps[NUM_MMAPS];
+ struct reader_state *st = &rd->state;
+ u64 page_offset, size;
+ int err = 0, mmap_prot, mmap_flags;
+ char *buf, **mmaps = st->mmaps;
union perf_event *event;
s64 skip;

page_offset = page_size * (rd->data_offset / page_size);
- file_offset = page_offset;
- head = rd->data_offset - page_offset;
+ st->file_offset = page_offset;
+ st->head = rd->data_offset - page_offset;

- ui_progress__init_size(prog, data_size, "Processing events...");
+ ui_progress__init_size(prog, rd->data_size, "Processing events...");

- data_size += rd->data_offset;
+ st->data_size = rd->data_size + rd->data_offset;

- mmap_size = MMAP_SIZE;
- if (mmap_size > data_size) {
- mmap_size = data_size;
+ st->mmap_size = MMAP_SIZE;
+ if (st->mmap_size > st->data_size) {
+ st->mmap_size = st->data_size;
session->one_mmap = true;
}

- memset(mmaps, 0, sizeof(mmaps));
+ memset(mmaps, 0, sizeof(st->mmaps));

mmap_prot = PROT_READ;
mmap_flags = MAP_SHARED;
@@ -2210,35 +2221,36 @@ reader__process_events(struct reader *rd, struct perf_session *session,
mmap_flags = MAP_PRIVATE;
}
remap:
- buf = mmap(NULL, mmap_size, mmap_prot, mmap_flags, rd->fd,
- file_offset);
+ buf = mmap(NULL, st->mmap_size, mmap_prot, mmap_flags, rd->fd,
+ st->file_offset);
if (buf == MAP_FAILED) {
pr_err("failed to mmap file\n");
err = -errno;
goto out;
}
- mmaps[map_idx] = buf;
- map_idx = (map_idx + 1) & (ARRAY_SIZE(mmaps) - 1);
- file_pos = file_offset + head;
+ mmaps[st->mmap_idx] = st->mmap_cur = buf;
+ st->mmap_idx = (st->mmap_idx + 1) & (ARRAY_SIZE(st->mmaps) - 1);
+ st->file_pos = st->file_offset + st->head;
if (session->one_mmap) {
session->one_mmap_addr = buf;
- session->one_mmap_offset = file_offset;
+ session->one_mmap_offset = st->file_offset;
}

more:
- event = fetch_mmaped_event(head, mmap_size, buf, session->header.needs_swap);
+ event = fetch_mmaped_event(st->head, st->mmap_size, st->mmap_cur,
+ session->header.needs_swap);
if (IS_ERR(event))
return PTR_ERR(event);

if (!event) {
- if (mmaps[map_idx]) {
- munmap(mmaps[map_idx], mmap_size);
- mmaps[map_idx] = NULL;
+ if (mmaps[st->mmap_idx]) {
+ munmap(mmaps[st->mmap_idx], st->mmap_size);
+ mmaps[st->mmap_idx] = NULL;
}

- page_offset = page_size * (head / page_size);
- file_offset += page_offset;
- head -= page_offset;
+ page_offset = page_size * (st->head / page_size);
+ st->file_offset += page_offset;
+ st->head -= page_offset;
goto remap;
}

@@ -2247,9 +2259,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+ (skip = rd->process(session, event, st->file_pos, rd->path)) < 0) {
pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
- file_offset + head, rd->path, event->header.size,
+ st->file_offset + st->head, rd->path, event->header.size,
event->header.type, strerror(-skip));
err = skip;
goto out;
@@ -2258,8 +2270,8 @@ reader__process_events(struct reader *rd, struct perf_session *session,
if (skip)
size += skip;

- head += size;
- file_pos += size;
+ st->head += size;
+ st->file_pos += size;

err = __perf_session__process_decomp_events(session);
if (err)
@@ -2270,7 +2282,7 @@ reader__process_events(struct reader *rd, struct perf_session *session,
if (session_done())
goto out;

- if (file_pos < data_size)
+ if (st->file_pos < st->data_size)
goto more;

out:
--
2.19.0

2021-06-30 16:00:12

by Bayduraev, Alexey V

Subject: [PATCH v8 13/22] perf session: Move reader structure to the top

Move the reader structure to the top of the file. This is necessary so
that the structure can later be used by the decompressor.

Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 56 +++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 762de808bc03..7b8e19350095 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -36,6 +36,34 @@
#include "units.h"
#include <internal/lib.h>

+/*
+ * On 64bit we can mmap the data file in one go. No need for tiny mmap
+ * slices. On 32bit we use 32MB.
+ */
+#if BITS_PER_LONG == 64
+#define MMAP_SIZE ULLONG_MAX
+#define NUM_MMAPS 1
+#else
+#define MMAP_SIZE (32 * 1024 * 1024ULL)
+#define NUM_MMAPS 128
+#endif
+
+struct reader;
+
+typedef s64 (*reader_cb_t)(struct perf_session *session,
+ union perf_event *event,
+ u64 file_offset,
+ const char *file_path);
+
+struct reader {
+ int fd;
+ const char *path;
+ u64 data_size;
+ u64 data_offset;
+ reader_cb_t process;
+ bool in_place_update;
+};
+
#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
union perf_event *event, u64 file_offset,
@@ -2144,34 +2172,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
return 0;
}

-/*
- * On 64bit we can mmap the data file in one go. No need for tiny mmap
- * slices. On 32bit we use 32MB.
- */
-#if BITS_PER_LONG == 64
-#define MMAP_SIZE ULLONG_MAX
-#define NUM_MMAPS 1
-#else
-#define MMAP_SIZE (32 * 1024 * 1024ULL)
-#define NUM_MMAPS 128
-#endif
-
-struct reader;
-
-typedef s64 (*reader_cb_t)(struct perf_session *session,
- union perf_event *event,
- u64 file_offset,
- const char *file_path);
-
-struct reader {
- int fd;
- const char *path;
- u64 data_size;
- u64 data_offset;
- reader_cb_t process;
- bool in_place_update;
-};
-
static int
reader__process_events(struct reader *rd, struct perf_session *session,
struct ui_progress *prog)
--
2.19.0

2021-06-30 16:00:24

by Bayduraev, Alexey V

Subject: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Provide a --threads option in the perf record command line interface.
The option value is a set of masks specifying the cpus to be monitored
by data streaming threads and the layout of those threads in the system
topology. The masks can be filtered using the cpu mask provided via
the -C option.

The specification value can be a user-defined list of masks. Masks
separated by colons define the cpus to be monitored by one thread, and
the affinity mask of that thread is separated by a slash. For example:
<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
specifies a parallel threads layout consisting of two threads with
correspondingly assigned cpus to be monitored.

The specification value can also be a string, e.g. "cpu", "core" or
"socket", meaning that a data streaming thread is created for every
cpu, core or socket in order to monitor distinct cpus or cpus grouped
by core or socket.

When the option is provided with no value or an empty value, it
defaults to the per-cpu parallel threads layout, creating a data
streaming thread for every monitored cpu.

Feature design and implementation are based on prototypes [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Suggested-by: Namhyung Kim <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 355 +++++++++++++++++++++++++++++++++++-
tools/perf/util/record.h | 1 +
2 files changed, 354 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2ade17308467..8d452797d175 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -51,6 +51,7 @@
#include "util/evlist-hybrid.h"
#include "asm/bug.h"
#include "perf.h"
+#include "cputopo.h"

#include <errno.h>
#include <inttypes.h>
@@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
"UNDEFINED", "READY"
};

+enum thread_spec {
+ THREAD_SPEC__UNDEFINED = 0,
+ THREAD_SPEC__CPU,
+ THREAD_SPEC__CORE,
+ THREAD_SPEC__SOCKET,
+ THREAD_SPEC__NUMA,
+ THREAD_SPEC__USER,
+ THREAD_SPEC__MAX,
+};
+
+static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
+ "undefined", "cpu", "core", "socket", "numa", "user"
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -2723,6 +2738,70 @@ static void record__thread_mask_free(struct thread_mask *mask)
record__mmap_cpu_mask_free(&mask->affinity);
}

+static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
+ struct thread_mask *src2)
+{
+ if (src1->maps.nbits != src2->maps.nbits ||
+ dest->maps.nbits != src1->maps.nbits ||
+ src1->affinity.nbits != src2->affinity.nbits ||
+ dest->affinity.nbits != src1->affinity.nbits)
+ return -EINVAL;
+
+ bitmap_or(dest->maps.bits, src1->maps.bits,
+ src2->maps.bits, src1->maps.nbits);
+ bitmap_or(dest->affinity.bits, src1->affinity.bits,
+ src2->affinity.bits, src1->affinity.nbits);
+
+ return 0;
+}
+
+static int record__thread_mask_intersects(struct thread_mask *mask_1, struct thread_mask *mask_2)
+{
+ int res1, res2;
+
+ if (mask_1->maps.nbits != mask_2->maps.nbits ||
+ mask_1->affinity.nbits != mask_2->affinity.nbits)
+ return -EINVAL;
+
+ res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
+ mask_1->maps.nbits);
+ res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
+ mask_1->affinity.nbits);
+ if (res1 || res2)
+ return 1;
+
+ return 0;
+}
+
+static int record__parse_threads(const struct option *opt, const char *str, int unset)
+{
+ int s;
+ struct record_opts *opts = opt->value;
+
+ if (unset || !str || !strlen(str)) {
+ opts->threads_spec = THREAD_SPEC__CPU;
+ } else {
+ for (s = 1; s < THREAD_SPEC__MAX; s++) {
+ if (s == THREAD_SPEC__USER) {
+ opts->threads_user_spec = strdup(str);
+ opts->threads_spec = THREAD_SPEC__USER;
+ break;
+ }
+ if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
+ opts->threads_spec = s;
+ break;
+ }
+ }
+ }
+
+ pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
+ if (opts->threads_spec == THREAD_SPEC__USER)
+ pr_debug("=[%s]", opts->threads_user_spec);
+ pr_debug("\n");
+
+ return 0;
+}
+
static int parse_output_max_size(const struct option *opt,
const char *str, int unset)
{
@@ -3166,6 +3245,9 @@ static struct option __record_options[] = {
"\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
"\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
parse_control_option),
+ OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
+ "write collected trace data into several data files using parallel threads",
+ record__parse_threads),
OPT_END()
};

@@ -3179,6 +3261,17 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
set_bit(cpus->map[c], mask->bits);
}

+static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char *mask_spec)
+{
+ struct perf_cpu_map *cpus;
+
+ cpus = perf_cpu_map__new(mask_spec);
+ if (cpus) {
+ record__mmap_cpu_mask_init(mask, cpus);
+ perf_cpu_map__put(cpus);
+ }
+}
+
static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
{
int t, ret;
@@ -3198,6 +3291,235 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr

return 0;
}
+
+static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
+
+ ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
+ if (ret)
+ return ret;
+
+ rec->nr_threads = nr_cpus;
+ pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
+ pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus->map[t]);
+ set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
+ pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus->map[t]);
+ }
+
+ return 0;
+}
+
+static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
+ char **maps_spec, char **affinity_spec, u32 nr_spec)
+{
+ u32 s;
+ int ret = 0, nr_threads = 0;
+ struct mmap_cpu_mask cpus_mask;
+ struct thread_mask thread_mask, full_mask, *prev_masks;
+
+ ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
+ if (ret)
+ goto out;
+ record__mmap_cpu_mask_init(&cpus_mask, cpus);
+ ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_cpu_mask;
+ ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_thread_mask;
+ record__thread_mask_clear(&full_mask);
+
+ for (s = 0; s < nr_spec; s++) {
+ record__thread_mask_clear(&thread_mask);
+
+ record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
+ record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
+
+ if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
+ cpus_mask.bits, thread_mask.maps.nbits) ||
+ !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
+ cpus_mask.bits, thread_mask.affinity.nbits))
+ continue;
+
+ ret = record__thread_mask_intersects(&thread_mask, &full_mask);
+ if (ret)
+ goto out_free_full_mask;
+ record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
+
+ prev_masks = rec->thread_masks;
+ rec->thread_masks = realloc(rec->thread_masks,
+ (nr_threads + 1) * sizeof(struct thread_mask));
+ if (!rec->thread_masks) {
+ pr_err("Failed to allocate thread masks\n");
+ rec->thread_masks = prev_masks;
+ ret = -ENOMEM;
+ goto out_free_full_mask;
+ }
+ rec->thread_masks[nr_threads] = thread_mask;
+ if (verbose) {
+ pr_debug("thread_masks[%d]: addr=", nr_threads);
+ mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
+ pr_debug("thread_masks[%d]: addr=", nr_threads);
+ mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity,
+ "affinity");
+ }
+ nr_threads++;
+ ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
+ if (ret)
+ goto out_free_full_mask;
+ }
+
+ rec->nr_threads = nr_threads;
+ pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
+
+ if (rec->nr_threads <= 0)
+ ret = -EINVAL;
+
+out_free_full_mask:
+ record__thread_mask_free(&full_mask);
+out_free_thread_mask:
+ record__thread_mask_free(&thread_mask);
+out_free_cpu_mask:
+ record__mmap_cpu_mask_free(&cpus_mask);
+out:
+ return ret;
+}
+
+static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int ret;
+ struct cpu_topology *topo;
+
+ topo = cpu_topology__new();
+ if (!topo)
+ return -EINVAL;
+
+ ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
+ topo->thread_siblings, topo->thread_sib);
+ cpu_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_socket_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int ret;
+ struct cpu_topology *topo;
+
+ topo = cpu_topology__new();
+ if (!topo)
+ return -EINVAL;
+
+ ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
+ topo->core_siblings, topo->core_sib);
+ cpu_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ u32 s;
+ int ret;
+ char **spec;
+ struct numa_topology *topo;
+
+ topo = numa_topology__new();
+ if (!topo)
+ return -EINVAL;
+ spec = zalloc(topo->nr * sizeof(char *));
+ if (!spec) {
+ ret = -ENOMEM;
+ goto out_delete_topo;
+ }
+ for (s = 0; s < topo->nr; s++)
+ spec[s] = topo->nodes[s].cpus;
+
+ ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
+
+ zfree(&spec);
+
+out_delete_topo:
+ numa_topology__delete(topo);
+
+ return ret;
+}
+
+static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+ int t, ret;
+ u32 s, nr_spec = 0;
+ char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
+ char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
+
+ for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
+ spec = strtok_r(user_spec, ":", &spec_ptr);
+ if (spec == NULL)
+ break;
+ pr_debug(" spec[%d]: %s\n", t, spec);
+ mask = strtok_r(spec, "/", &mask_ptr);
+ if (mask == NULL)
+ break;
+ pr_debug(" maps mask: %s\n", mask);
+ prev_spec = maps_spec;
+ maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
+ if (!maps_spec) {
+ pr_err("Failed to realloc maps_spec\n");
+ maps_spec = prev_spec;
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ maps_spec[nr_spec] = strdup(mask);
+ if (!maps_spec[nr_spec]) {
+ pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ mask = strtok_r(NULL, "/", &mask_ptr);
+ if (mask == NULL) {
+ free(maps_spec[nr_spec]);
+ ret = -EINVAL;
+ goto out_free_all_specs;
+ }
+ pr_debug(" affinity mask: %s\n", mask);
+ prev_spec = affinity_spec;
+ affinity_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
+ if (!affinity_spec) {
+ pr_err("Failed to realloc affinity_spec\n");
+ affinity_spec = prev_spec;
+ free(maps_spec[nr_spec]);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ affinity_spec[nr_spec] = strdup(mask);
+ if (!affinity_spec[nr_spec]) {
+ pr_err("Failed to alloc affinity_spec[%d]\n", nr_spec);
+ free(maps_spec[nr_spec]);
+ ret = -ENOMEM;
+ goto out_free_all_specs;
+ }
+ nr_spec++;
+ }
+
+ ret = record__init_thread_masks_spec(rec, cpus, maps_spec, affinity_spec, nr_spec);
+
+out_free_all_specs:
+ for (s = 0; s < nr_spec; s++) {
+ if (maps_spec)
+ free(maps_spec[s]);
+ if (affinity_spec)
+ free(affinity_spec[s]);
+ }
+ free(affinity_spec);
+ free(maps_spec);
+
+ return ret;
+}
+
static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
{
int ret;
@@ -3215,9 +3537,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu

static int record__init_thread_masks(struct record *rec)
{
+ int ret = 0;
struct perf_cpu_map *cpus = rec->evlist->core.cpus;

- return record__init_thread_default_masks(rec, cpus);
+ if (!record__threads_enabled(rec))
+ return record__init_thread_default_masks(rec, cpus);
+
+ switch (rec->opts.threads_spec) {
+ case THREAD_SPEC__CPU:
+ ret = record__init_thread_cpu_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__CORE:
+ ret = record__init_thread_core_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__SOCKET:
+ ret = record__init_thread_socket_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__NUMA:
+ ret = record__init_thread_numa_masks(rec, cpus);
+ break;
+ case THREAD_SPEC__USER:
+ ret = record__init_thread_user_masks(rec, cpus);
+ break;
+ default:
+ break;
+ }
+
+ return ret;
}

static int record__fini_thread_masks(struct record *rec)
@@ -3474,7 +3820,12 @@ int cmd_record(int argc, const char **argv)

err = record__init_thread_masks(rec);
if (err) {
- pr_err("record__init_thread_masks failed, error %d\n", err);
+ if (err > 0)
+ pr_err("ERROR: parallel data streaming masks (--threads) intersect\n");
+ else if (err == -EINVAL)
+ pr_err("ERROR: invalid parallel data streaming masks (--threads)\n");
+ else
+ pr_err("record__init_thread_masks failed, error %d\n", err);
goto out;
}

diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 4d68b7e27272..3da156498f47 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -78,6 +78,7 @@ struct record_opts {
int ctl_fd_ack;
bool ctl_fd_close;
int threads_spec;
+ const char *threads_user_spec;
};

extern const char * const *record_usage;
--
2.19.0

2021-06-30 16:00:40

by Bayduraev, Alexey V

Subject: [PATCH v8 11/22] perf record: Document parallel data streaming mode

Document the --threads option syntax and parallel data streaming modes
in Documentation/perf-record.txt. Implement compatibility checks for
other modes and related command line options: asynchronous (--aio)
trace streaming and affinity (--affinity) modes, pipe mode, the AUX
area tracing --snapshot and --aux-sample options, and the
--switch-output, --switch-output-event, --switch-max-files and
--timestamp-filename options. Parallel data streaming is compatible
with Zstd compression (--compression-level) and external control
commands (--control). The cpu mask provided via the -C option filters
the --threads specification masks.

Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 30 +++++++++++++++
tools/perf/builtin-record.c | 49 ++++++++++++++++++++++--
2 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d71bac847936..2046b28d9822 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -695,6 +695,36 @@ measurements:
wait -n ${perf_pid}
exit $?

+--threads=<spec>::
+Write collected trace data into several data files using parallel threads.
+<spec> value can be user defined list of masks. Masks separated by colon
+define cpus to be monitored by a thread and affinity mask of that thread
+is separated by slash:
+
+ <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>:...
+
+For example user specification like the following:
+
+ 0,2-4/2-4:1,5-7/5-7
+
+specifies parallel threads layout that consists of two threads,
+the first thread monitors cpus 0 and 2-4 with the affinity mask 2-4,
+the second monitors cpus 1 and 5-7 with the affinity mask 5-7.
+
+<spec> value can also be a string meaning predefined parallel threads
+layout:
+
+ cpu - create new data streaming thread for every monitored cpu
+ core - create new thread to monitor cpus grouped by a core
+ socket - create new thread to monitor cpus grouped by a socket
+ numa - create new thread to monitor cpus grouped by a numa domain
+
+Predefined layouts can be used on systems with large number of cpus in
+order not to spawn multiple per-cpu streaming threads but still avoid LOST
+events in data directory files. Option specified with no or empty value
+defaults to cpu layout. Masks defined or provided by the option value are
+filtered through the mask provided by -C option.
+
include::intel-hybrid.txt[]

SEE ALSO
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8d452797d175..c5954cb3e787 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -800,6 +800,12 @@ static int record__auxtrace_init(struct record *rec)
{
int err;

+ if ((rec->opts.auxtrace_snapshot_opts || rec->opts.auxtrace_sample_opts)
+ && record__threads_enabled(rec)) {
+ pr_err("AUX area tracing options are not available in parallel streaming mode.\n");
+ return -EINVAL;
+ }
+
if (!rec->itr) {
rec->itr = auxtrace_record__init(rec->evlist, &err);
if (err)
@@ -2154,6 +2160,17 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
return PTR_ERR(session);
}

+ if (record__threads_enabled(rec)) {
+ if (perf_data__is_pipe(&rec->data)) {
+ pr_err("Parallel trace streaming is not available in pipe mode.\n");
+ return -1;
+ }
+ if (rec->opts.full_auxtrace) {
+ pr_err("Parallel trace streaming is not available in AUX area tracing mode.\n");
+ return -1;
+ }
+ }
+
fd = perf_data__fd(data);
rec->session = session;

@@ -2922,12 +2939,22 @@ static int switch_output_setup(struct record *rec)
* --switch-output=signal, as we'll send a SIGUSR2 from the side band
* thread to its parent.
*/
- if (rec->switch_output_event_set)
+ if (rec->switch_output_event_set) {
+ if (record__threads_enabled(rec)) {
+ pr_warning("WARNING: --switch-output-event option is not available in parallel streaming mode.\n");
+ return 0;
+ }
goto do_signal;
+ }

if (!s->set)
return 0;

+ if (record__threads_enabled(rec)) {
+ pr_warning("WARNING: --switch-output option is not available in parallel streaming mode.\n");
+ return 0;
+ }
+
if (!strcmp(s->str, "signal")) {
do_signal:
s->signal = true;
@@ -3225,8 +3252,8 @@ static struct option __record_options[] = {
"Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
record__parse_affinity),
#ifdef HAVE_ZSTD_SUPPORT
- OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
- "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
+ OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default, "n",
+ "Compress records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
record__parse_comp_level),
#endif
OPT_CALLBACK(0, "max-size", &record.output_max_size,
@@ -3659,6 +3686,17 @@ int cmd_record(int argc, const char **argv)
if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

+ if (record__threads_enabled(rec)) {
+ if (rec->opts.affinity != PERF_AFFINITY_SYS) {
+ pr_err("--affinity option is mutually exclusive to parallel streaming mode.\n");
+ goto out_opts;
+ }
+ if (record__aio_enabled(rec)) {
+ pr_err("Asynchronous streaming mode (--aio) is mutually exclusive to parallel streaming mode.\n");
+ goto out_opts;
+ }
+ }
+
if (rec->opts.comp_level != 0) {
pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
rec->no_buildid = true;
@@ -3692,6 +3730,11 @@ int cmd_record(int argc, const char **argv)
}
}

+ if (rec->timestamp_filename && record__threads_enabled(rec)) {
+ rec->timestamp_filename = false;
+ pr_warning("WARNING: --timestamp-filename option is not available in parallel streaming mode.\n");
+ }
+
/*
* Allow aliases to facilitate the lookup of symbols for address
* filters. Refer to auxtrace_parse_filters().
--
2.19.0

2021-06-30 16:01:15

by Bayduraev, Alexey V

Subject: [PATCH v8 16/22] perf session: Introduce decompressor into trace reader object

Introduce decompressor into trace reader object so that decompression
could be executed on per data file basis separately for every data
file located in data directory.

Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 54 +++++++++++++++++++++++++++++----------
tools/perf/util/session.h | 1 +
2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index d9abca5c3904..ab243010148e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -73,6 +73,9 @@ struct reader {
u64 data_offset;
reader_cb_t process;
bool in_place_update;
+ struct zstd_data zstd_data;
+ struct decomp *decomp;
+ struct decomp *decomp_last;
struct reader_state state;
};

@@ -85,7 +88,10 @@ static int perf_session__process_compressed_event(struct perf_session *session,
size_t decomp_size, src_size;
u64 decomp_last_rem = 0;
size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
- struct decomp *decomp, *decomp_last = session->decomp_last;
+ struct decomp *decomp, *decomp_last = session->active_reader ?
+ session->active_reader->decomp_last : session->decomp_last;
+ struct zstd_data *zstd_data = session->active_reader ?
+ &session->active_reader->zstd_data : &session->zstd_data;

if (decomp_last) {
decomp_last_rem = decomp_last->size - decomp_last->head;
@@ -113,7 +119,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
src = (void *)event + sizeof(struct perf_record_compressed);
src_size = event->pack.header.size - sizeof(struct perf_record_compressed);

- decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
+ decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
&(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
if (!decomp_size) {
munmap(decomp, mmap_len);
@@ -123,12 +129,22 @@ static int perf_session__process_compressed_event(struct perf_session *session,

decomp->size += decomp_size;

- if (session->decomp == NULL) {
- session->decomp = decomp;
- session->decomp_last = decomp;
+ if (session->active_reader) {
+ if (session->active_reader->decomp == NULL) {
+ session->active_reader->decomp = decomp;
+ session->active_reader->decomp_last = decomp;
+ } else {
+ session->active_reader->decomp_last->next = decomp;
+ session->active_reader->decomp_last = decomp;
+ }
} else {
- session->decomp_last->next = decomp;
- session->decomp_last = decomp;
+ if (session->decomp == NULL) {
+ session->decomp = decomp;
+ session->decomp_last = decomp;
+ } else {
+ session->decomp_last->next = decomp;
+ session->decomp_last = decomp;
+ }
}

pr_debug("decomp (B): %zd to %zd\n", src_size, decomp_size);
@@ -319,11 +335,11 @@ static void perf_session__delete_threads(struct perf_session *session)
machine__delete_threads(&session->machines.host);
}

-static void perf_session__release_decomp_events(struct perf_session *session)
+static void perf_decomp__release_events(struct decomp *next)
{
- struct decomp *next, *decomp;
+ struct decomp *decomp;
size_t mmap_len;
- next = session->decomp;
+
do {
decomp = next;
if (decomp == NULL)
@@ -336,6 +352,8 @@ static void perf_session__release_decomp_events(struct perf_session *session)

void perf_session__delete(struct perf_session *session)
{
+ int r;
+
if (session == NULL)
return;
auxtrace__free(session);
@@ -343,10 +361,14 @@ void perf_session__delete(struct perf_session *session)
perf_session__destroy_kernel_maps(session);
perf_session__delete_threads(session);
if (session->readers) {
+ for (r = 0; r < session->nr_readers; r++) {
+ perf_decomp__release_events(session->readers[r].decomp);
+ zstd_fini(&session->readers[r].zstd_data);
+ }
zfree(&session->readers);
session->nr_readers = 0;
}
- perf_session__release_decomp_events(session);
+ perf_decomp__release_events(session->decomp);
perf_env__exit(&session->header.env);
machines__exit(&session->machines);
if (session->data)
@@ -2158,7 +2180,8 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
{
s64 skip;
u64 size;
- struct decomp *decomp = session->decomp_last;
+ struct decomp *decomp = session->active_reader ?
+ session->active_reader->decomp_last : session->decomp_last;

if (!decomp)
return 0;
@@ -2215,6 +2238,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,

memset(mmaps, 0, sizeof(st->mmaps));

+ if (zstd_init(&rd->zstd_data, 0))
+ return -1;
+
mmap_prot = PROT_READ;
mmap_flags = MAP_SHARED;

@@ -2258,12 +2284,13 @@ reader__process_events(struct reader *rd, struct perf_session *session,
goto remap;
}

+ session->active_reader = rd;
size = event->header.size;

skip = -EINVAL;

if (size < sizeof(struct perf_event_header) ||
- (skip = rd->process(session, event, st->file_pos, rd->path)) < 0) {
+ (skip = perf_session__process_event(session, event, st->file_pos, rd->path)) < 0) {
pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
st->file_offset + st->head, rd->path, event->header.size,
event->header.type, strerror(-skip));
@@ -2290,6 +2317,7 @@ reader__process_events(struct reader *rd, struct perf_session *session,
goto more;

out:
+ session->active_reader = NULL;
return err;
}

diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 2815d00b5467..e0a8712f8770 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -44,6 +44,7 @@ struct perf_session {
struct decomp *decomp_last;
struct reader *readers;
int nr_readers;
+ struct reader *active_reader;
};

struct decomp {
--
2.19.0

2021-06-30 16:01:51

by Bayduraev, Alexey V

Subject: [PATCH v8 17/22] perf session: Move init into reader__init function

Separate the initialization code into a reader__init function.

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ab243010148e..578ff304fdb4 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2212,28 +2212,25 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
}

static int
-reader__process_events(struct reader *rd, struct perf_session *session,
- struct ui_progress *prog)
+reader__init(struct reader *rd, bool *one_mmap)
{
struct reader_state *st = &rd->state;
- u64 page_offset, size;
- int err = 0, mmap_prot, mmap_flags;
- char *buf, **mmaps = st->mmaps;
- union perf_event *event;
- s64 skip;
+ u64 page_offset;
+ char **mmaps = st->mmaps;
+
+ pr_debug("reader processing %s\n", rd->path);

page_offset = page_size * (rd->data_offset / page_size);
st->file_offset = page_offset;
st->head = rd->data_offset - page_offset;

- ui_progress__init_size(prog, rd->data_size, "Processing events...");
-
st->data_size = rd->data_size + rd->data_offset;

st->mmap_size = MMAP_SIZE;
if (st->mmap_size > st->data_size) {
st->mmap_size = st->data_size;
- session->one_mmap = true;
+ if (one_mmap)
+ *one_mmap = true;
}

memset(mmaps, 0, sizeof(st->mmaps));
@@ -2241,6 +2238,20 @@ reader__process_events(struct reader *rd, struct perf_session *session,
if (zstd_init(&rd->zstd_data, 0))
return -1;

+ return 0;
+}
+
+static int
+reader__process_events(struct reader *rd, struct perf_session *session,
+ struct ui_progress *prog)
+{
+ struct reader_state *st = &rd->state;
+ u64 page_offset, size;
+ int err = 0, mmap_prot, mmap_flags;
+ char *buf, **mmaps = st->mmaps;
+ union perf_event *event;
+ s64 skip;
+
mmap_prot = PROT_READ;
mmap_flags = MAP_SHARED;

@@ -2356,6 +2367,9 @@ static int __perf_session__process_events(struct perf_session *session)

ui_progress__init_size(&prog, rd->data_size, "Processing events...");

+ err = reader__init(rd, &session->one_mmap);
+ if (err)
+ goto out_err;
err = reader__process_events(rd, session, &prog);
if (err)
goto out_err;
--
2.19.0

2021-06-30 16:01:58

by Bayduraev, Alexey V

Subject: [PATCH v8 18/22] perf session: Move map/unmap into reader__mmap function

Moving the mapping and unmapping code into a reader__mmap function, so
the mmap code is kept together. Moving the head/file_offset computation
into reader__mmap as well, so all the offset computation is located in
one place.
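The alignment step this patch centralizes can be sketched as a small stand-alone helper (align_to_page is a hypothetical name, not a perf function): mmap() requires a page-aligned file offset, so before each mapping the head is split into an aligned offset plus an in-page remainder.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the offset computation moved into
 * reader__mmap: round the current head down to a page boundary,
 * advance the file offset by that amount, and keep the remainder
 * in head so mmap() always receives an aligned offset. */
static void align_to_page(uint64_t page_size, uint64_t *file_offset,
			  uint64_t *head)
{
	uint64_t page_offset = page_size * (*head / page_size);

	*file_offset += page_offset;
	*head -= page_offset;
}
```

With a 4096-byte page size and head at byte 10000, the file offset advances by 8192 and the remaining 1808 bytes stay in head.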

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 60 +++++++++++++++++++++++----------------
1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 578ff304fdb4..f956f078e047 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2215,14 +2215,11 @@ static int
reader__init(struct reader *rd, bool *one_mmap)
{
struct reader_state *st = &rd->state;
- u64 page_offset;
char **mmaps = st->mmaps;

pr_debug("reader processing %s\n", rd->path);

- page_offset = page_size * (rd->data_offset / page_size);
- st->file_offset = page_offset;
- st->head = rd->data_offset - page_offset;
+ st->head = rd->data_offset;

st->data_size = rd->data_size + rd->data_offset;

@@ -2242,15 +2239,12 @@ reader__init(struct reader *rd, bool *one_mmap)
}

static int
-reader__process_events(struct reader *rd, struct perf_session *session,
- struct ui_progress *prog)
+reader__mmap(struct reader *rd, struct perf_session *session)
{
struct reader_state *st = &rd->state;
- u64 page_offset, size;
- int err = 0, mmap_prot, mmap_flags;
+ int mmap_prot, mmap_flags;
char *buf, **mmaps = st->mmaps;
- union perf_event *event;
- s64 skip;
+ u64 page_offset;

mmap_prot = PROT_READ;
mmap_flags = MAP_SHARED;
@@ -2261,20 +2255,45 @@ reader__process_events(struct reader *rd, struct perf_session *session,
mmap_prot |= PROT_WRITE;
mmap_flags = MAP_PRIVATE;
}
-remap:
+
+ if (mmaps[st->mmap_idx]) {
+ munmap(mmaps[st->mmap_idx], st->mmap_size);
+ mmaps[st->mmap_idx] = NULL;
+ }
+
+ page_offset = page_size * (st->head / page_size);
+ st->file_offset += page_offset;
+ st->head -= page_offset;
+
buf = mmap(NULL, st->mmap_size, mmap_prot, mmap_flags, rd->fd,
st->file_offset);
if (buf == MAP_FAILED) {
pr_err("failed to mmap file\n");
- err = -errno;
- goto out;
+ return -errno;
}
mmaps[st->mmap_idx] = st->mmap_cur = buf;
st->mmap_idx = (st->mmap_idx + 1) & (ARRAY_SIZE(st->mmaps) - 1);
st->file_pos = st->file_offset + st->head;
+ return 0;
+}
+
+static int
+reader__process_events(struct reader *rd, struct perf_session *session,
+ struct ui_progress *prog)
+{
+ struct reader_state *st = &rd->state;
+ u64 size;
+ int err = 0;
+ union perf_event *event;
+ s64 skip;
+
+remap:
+ err = reader__mmap(rd, session);
+ if (err)
+ goto out;
if (session->one_mmap) {
- session->one_mmap_addr = buf;
- session->one_mmap_offset = st->file_offset;
+ session->one_mmap_addr = rd->state.mmap_cur;
+ session->one_mmap_offset = rd->state.file_offset;
}

more:
@@ -2283,17 +2302,8 @@ reader__process_events(struct reader *rd, struct perf_session *session,
if (IS_ERR(event))
return PTR_ERR(event);

- if (!event) {
- if (mmaps[st->mmap_idx]) {
- munmap(mmaps[st->mmap_idx], st->mmap_size);
- mmaps[st->mmap_idx] = NULL;
- }
-
- page_offset = page_size * (st->head / page_size);
- st->file_offset += page_offset;
- st->head -= page_offset;
+ if (!event)
goto remap;
- }

session->active_reader = rd;
size = event->header.size;
--
2.19.0

2021-06-30 16:02:11

by Bayduraev, Alexey V

Subject: [PATCH v8 19/22] perf session: Load single file for analysis

Adding an eof flag to the reader state and moving the end-of-file check
to reader__mmap. Separating the code that reads a single event into a
reader__read_event function. Adding basic reader return codes to
simplify the code and introducing a reader remap/read_event loop based
on them.
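A minimal toy model of the resulting remap/read_event loop, using the same two return codes (the toy_* names and the per-window bookkeeping are stand-ins, not the perf implementation):

```c
#include <assert.h>

/* Basic reader return codes, as introduced by this patch. */
enum { READER_EOF = 0, READER_OK = 1 };

/* Stand-in reader: maps_left mmap windows, each holding
 * events_per_map events. */
struct toy_reader {
	int maps_left;
	int events_per_map;
	int events_left;
};

static int toy_mmap(struct toy_reader *rd)
{
	if (rd->maps_left <= 0)
		return READER_EOF;
	rd->maps_left--;
	rd->events_left = rd->events_per_map;
	return READER_OK;
}

static int toy_read_event(struct toy_reader *rd)
{
	if (rd->events_left <= 0)
		return READER_EOF;
	rd->events_left--;
	return READER_OK;
}

/* Drive the loop: read events until the current window is drained,
 * then remap; stop when the mapping itself reports EOF.
 * Returns the number of events processed. */
static int toy_process(struct toy_reader *rd)
{
	int processed = 0;

	if (toy_mmap(rd) != READER_OK)
		return processed;

	while (1) {
		if (toy_read_event(rd) == READER_OK) {
			processed++;
			continue;
		}
		if (toy_mmap(rd) != READER_OK)
			break;
	}
	return processed;
}
```

The same shape appears in __perf_session__process_events after this patch: reader__mmap once up front, then a loop of reader__read_event with a re-mmap on EOF.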

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 71 ++++++++++++++++++++++++---------------
1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f956f078e047..a4296dfd6554 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -64,6 +64,12 @@ struct reader_state {
u64 file_offset;
u64 data_size;
u64 head;
+ bool eof;
+};
+
+enum {
+ READER_EOF = 0,
+ READER_OK = 1,
};

struct reader {
@@ -2246,6 +2252,11 @@ reader__mmap(struct reader *rd, struct perf_session *session)
char *buf, **mmaps = st->mmaps;
u64 page_offset;

+ if (st->file_pos >= st->data_size) {
+ st->eof = true;
+ return READER_EOF;
+ }
+
mmap_prot = PROT_READ;
mmap_flags = MAP_SHARED;

@@ -2274,36 +2285,26 @@ reader__mmap(struct reader *rd, struct perf_session *session)
mmaps[st->mmap_idx] = st->mmap_cur = buf;
st->mmap_idx = (st->mmap_idx + 1) & (ARRAY_SIZE(st->mmaps) - 1);
st->file_pos = st->file_offset + st->head;
- return 0;
+ return READER_OK;
}

static int
-reader__process_events(struct reader *rd, struct perf_session *session,
- struct ui_progress *prog)
+reader__read_event(struct reader *rd, struct perf_session *session,
+ struct ui_progress *prog)
{
struct reader_state *st = &rd->state;
- u64 size;
- int err = 0;
+ int err = READER_OK;
union perf_event *event;
+ u64 size;
s64 skip;

-remap:
- err = reader__mmap(rd, session);
- if (err)
- goto out;
- if (session->one_mmap) {
- session->one_mmap_addr = rd->state.mmap_cur;
- session->one_mmap_offset = rd->state.file_offset;
- }
-
-more:
event = fetch_mmaped_event(st->head, st->mmap_size, st->mmap_cur,
session->header.needs_swap);
if (IS_ERR(event))
return PTR_ERR(event);

if (!event)
- goto remap;
+ return READER_EOF;

session->active_reader = rd;
size = event->header.size;
@@ -2325,18 +2326,12 @@ reader__process_events(struct reader *rd, struct perf_session *session,
st->head += size;
st->file_pos += size;

- err = __perf_session__process_decomp_events(session);
- if (err)
- goto out;
+ skip = __perf_session__process_decomp_events(session);
+ if (skip)
+ err = skip;

ui_progress__update(prog, size);

- if (session_done())
- goto out;
-
- if (st->file_pos < st->data_size)
- goto more;
-
out:
session->active_reader = NULL;
return err;
@@ -2380,9 +2375,31 @@ static int __perf_session__process_events(struct perf_session *session)
err = reader__init(rd, &session->one_mmap);
if (err)
goto out_err;
- err = reader__process_events(rd, session, &prog);
- if (err)
+ err = reader__mmap(rd, session);
+ if (err != READER_OK) {
+ if (err == READER_EOF)
+ err = -EINVAL;
goto out_err;
+ }
+ if (session->one_mmap) {
+ session->one_mmap_addr = rd->state.mmap_cur;
+ session->one_mmap_offset = rd->state.file_offset;
+ }
+
+ while (true) {
+ if (session_done())
+ break;
+
+ err = reader__read_event(rd, session, &prog);
+ if (err < 0)
+ break;
+ if (err == READER_EOF) {
+ err = reader__mmap(rd, session);
+ if (err <= 0)
+ break;
+ }
+ }
+
/* do the final flush for ordered samples */
err = ordered_events__flush(oe, OE_FLUSH__FINAL);
if (err)
--
2.19.0

2021-06-30 16:02:45

by Bayduraev, Alexey V

Subject: [PATCH v8 20/22] perf session: Load data directory files for analysis

Load data directory files and provide basic raw dump and aggregated
analysis support of data directories in report mode, still with no
memory consumption optimizations.
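The merge loop in the diff rotates between readers in roughly 10MB slices so the ordered-events sorting works efficiently. A self-contained toy model of that round-robin scheduling (all names, event sizes, and reader sizes here are stand-ins):

```c
#include <assert.h>

/* Rotate to the next reader after ~10MB, as in the patch. */
#define SLICE (10ULL * 1024 * 1024)
#define EVENT (1ULL << 20) /* stand-in "event" size: 1MB */

struct toy_rd {
	unsigned long long remaining; /* data left in this reader */
	unsigned long long size;      /* bytes read in current slice */
	int eof;
};

/* Drain all readers round-robin; returns total bytes consumed. */
static unsigned long long toy_drain(struct toy_rd *rd, int nr)
{
	unsigned long long total = 0;
	int i = 0, live = nr;

	while (live) {
		struct toy_rd *r = &rd[i];
		unsigned long long ev;

		if (r->eof) {
			i = (i + 1) % nr;
			continue;
		}

		ev = r->remaining < EVENT ? r->remaining : EVENT;
		if (!ev) {
			r->eof = 1;
			live--;
			continue;
		}
		r->remaining -= ev;
		r->size += ev;
		total += ev;

		/* after ~10MB from this reader, move to the next one */
		if (r->size >= SLICE) {
			r->size = 0;
			i = (i + 1) % nr;
		}
	}
	return total;
}
```

Two readers holding 15MB and 5MB are drained as 10MB from the first, all 5MB from the second, then the first reader's remaining 5MB.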

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 129 ++++++++++++++++++++++++++++++++++++++
1 file changed, 129 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a4296dfd6554..b11a502c22a3 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -65,6 +65,7 @@ struct reader_state {
u64 data_size;
u64 head;
bool eof;
+ u64 size;
};

enum {
@@ -2323,6 +2324,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
if (skip)
size += skip;

+ st->size += size;
st->head += size;
st->file_pos += size;

@@ -2422,6 +2424,130 @@ static int __perf_session__process_events(struct perf_session *session)
return err;
}

+/*
+ * This function reads, merges and processes directory data.
+ * It assumes version 1 of the directory data, where each
+ * data file holds per-cpu data, already sorted by kernel.
+ */
+static int __perf_session__process_dir_events(struct perf_session *session)
+{
+ struct perf_data *data = session->data;
+ struct perf_tool *tool = session->tool;
+ int i, ret = 0, readers = 1;
+ struct ui_progress prog;
+ u64 total_size = perf_data__size(session->data);
+ struct reader *rd;
+
+ perf_tool__fill_defaults(tool);
+
+ ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+ for (i = 0; i < data->dir.nr; i++) {
+ if (data->dir.files[i].size)
+ readers++;
+ }
+
+ rd = session->readers = zalloc(readers * sizeof(struct reader));
+ if (!rd)
+ return -ENOMEM;
+ session->nr_readers = readers;
+ readers = 0;
+
+ rd[readers] = (struct reader) {
+ .fd = perf_data__fd(session->data),
+ .path = session->data->file.path,
+ .data_size = session->header.data_size,
+ .data_offset = session->header.data_offset,
+ .in_place_update = session->data->in_place_update,
+ };
+ ret = reader__init(&rd[readers], NULL);
+ if (ret)
+ goto out_err;
+ ret = reader__mmap(&rd[readers], session);
+ if (ret != READER_OK) {
+ if (ret == READER_EOF)
+ ret = -EINVAL;
+ goto out_err;
+ }
+ readers++;
+
+ for (i = 0; i < data->dir.nr; i++) {
+ if (data->dir.files[i].size) {
+ rd[readers] = (struct reader) {
+ .fd = data->dir.files[i].fd,
+ .path = data->dir.files[i].path,
+ .data_size = data->dir.files[i].size,
+ .data_offset = 0,
+ .in_place_update = session->data->in_place_update,
+ };
+ ret = reader__init(&rd[readers], NULL);
+ if (ret)
+ goto out_err;
+ ret = reader__mmap(&rd[readers], session);
+ if (ret != READER_OK) {
+ if (ret == READER_EOF)
+ ret = -EINVAL;
+ goto out_err;
+ }
+ readers++;
+ }
+ }
+
+ i = 0;
+
+ while ((ret >= 0) && readers) {
+ if (session_done())
+ return 0;
+
+ if (rd[i].state.eof) {
+ i = (i + 1) % session->nr_readers;
+ continue;
+ }
+
+ ret = reader__read_event(&rd[i], session, &prog);
+ if (ret < 0)
+ break;
+ if (ret == READER_EOF) {
+ ret = reader__mmap(&rd[i], session);
+ if (ret < 0)
+ goto out_err;
+ if (ret == READER_EOF)
+ readers--;
+ }
+
+ /*
+ * Processing 10MBs of data from each reader in sequence,
+ * because that's the way the ordered events sorting works
+ * most efficiently.
+ */
+ if (rd[i].state.size >= 10*1024*1024) {
+ rd[i].state.size = 0;
+ i = (i + 1) % session->nr_readers;
+ }
+ }
+
+ ret = ordered_events__flush(&session->ordered_events, OE_FLUSH__FINAL);
+ if (ret)
+ goto out_err;
+
+ ret = perf_session__flush_thread_stacks(session);
+out_err:
+ ui_progress__finish();
+
+ if (!tool->no_warn)
+ perf_session__warn_about_errors(session);
+
+ /*
+ * We may be switching perf.data output, make ordered_events
+ * reusable.
+ */
+ ordered_events__reinit(&session->ordered_events);
+
+ session->one_mmap = false;
+
+ return ret;
+}
+
int perf_session__process_events(struct perf_session *session)
{
if (perf_session__register_idle_thread(session) < 0)
@@ -2430,6 +2556,9 @@ int perf_session__process_events(struct perf_session *session)
if (perf_data__is_pipe(session->data))
return __perf_session__process_pipe_events(session);

+ if (perf_data__is_dir(session->data))
+ return __perf_session__process_dir_events(session);
+
return __perf_session__process_events(session);
}

--
2.19.0

2021-06-30 16:02:47

by Bayduraev, Alexey V

Subject: [PATCH v8 15/22] perf session: Introduce reader objects in session object

Allow allocating multiple reader objects, so multiple data files located
in the data directory can be loaded at the same time.

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 33 +++++++++++++++++++++------------
tools/perf/util/session.h | 3 +++
2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index d9e0c54a74c1..d9abca5c3904 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -342,6 +342,10 @@ void perf_session__delete(struct perf_session *session)
auxtrace_index__free(&session->auxtrace_index);
perf_session__destroy_kernel_maps(session);
perf_session__delete_threads(session);
+ if (session->readers) {
+ zfree(&session->readers);
+ session->nr_readers = 0;
+ }
perf_session__release_decomp_events(session);
perf_env__exit(&session->header.env);
machines__exit(&session->machines);
@@ -2299,14 +2303,7 @@ static s64 process_simple(struct perf_session *session,

static int __perf_session__process_events(struct perf_session *session)
{
- struct reader rd = {
- .fd = perf_data__fd(session->data),
- .data_size = session->header.data_size,
- .data_offset = session->header.data_offset,
- .process = process_simple,
- .path = session->data->file.path,
- .in_place_update = session->data->in_place_update,
- };
+ struct reader *rd;
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
struct ui_progress prog;
@@ -2314,12 +2311,24 @@ static int __perf_session__process_events(struct perf_session *session)

perf_tool__fill_defaults(tool);

- if (rd.data_size == 0)
- return -1;
+ rd = session->readers = zalloc(sizeof(struct reader));
+ if (!rd)
+ return -ENOMEM;
+
+ session->nr_readers = 1;
+
+ *rd = (struct reader) {
+ .fd = perf_data__fd(session->data),
+ .data_size = session->header.data_size,
+ .data_offset = session->header.data_offset,
+ .process = process_simple,
+ .path = session->data->file.path,
+ .in_place_update = session->data->in_place_update,
+ };

- ui_progress__init_size(&prog, rd.data_size, "Processing events...");
+ ui_progress__init_size(&prog, rd->data_size, "Processing events...");

- err = reader__process_events(&rd, session, &prog);
+ err = reader__process_events(rd, session, &prog);
if (err)
goto out_err;
/* do the final flush for ordered samples */
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 6895a22ab5b7..2815d00b5467 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -19,6 +19,7 @@ struct thread;

struct auxtrace;
struct itrace_synth_opts;
+struct reader;

struct perf_session {
struct perf_header header;
@@ -41,6 +42,8 @@ struct perf_session {
struct zstd_data zstd_data;
struct decomp *decomp;
struct decomp *decomp_last;
+ struct reader *readers;
+ int nr_readers;
};

struct decomp {
--
2.19.0

2021-06-30 16:02:50

by Bayduraev, Alexey V

Subject: [PATCH v8 22/22] perf record: Introduce record__bytes_written and fix --max-size option

Adding a function to calculate the total amount of data transferred
and using it to fix the --max-size option.
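The accounting fix can be modeled with a small stand-alone sketch (the toy_* types are stand-ins for perf's thread_data and mmap structures): with threaded streaming, each per-thread mmap tracks its own bytes_written, so the session total is the main counter plus every per-mmap counter.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mmap {
	unsigned long long bytes_written;
};

struct toy_thread {
	int nr_mmaps;
	struct toy_mmap **maps;
};

/* Sum the main counter and all per-thread mmap counters, mirroring
 * the shape of record__bytes_written in the patch. */
static unsigned long long toy_total_bytes(unsigned long long main_bytes,
					  struct toy_thread *threads,
					  int nr_threads)
{
	unsigned long long total = main_bytes;
	int t, m;

	for (t = 0; t < nr_threads; t++) {
		for (m = 0; m < threads[t].nr_mmaps; m++) {
			if (threads[t].maps)
				total += threads[t].maps[m]->bytes_written;
		}
	}
	return total;
}
```

Comparing this total (rather than only rec->bytes_written, which per-thread writes bypass) against the limit is what makes --max-size work again.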

Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index c5954cb3e787..16a81d8a840a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -199,10 +199,28 @@ static bool switch_output_time(struct record *rec)
trigger_is_ready(&switch_output_trigger);
}

+static u64 record__bytes_written(struct record *rec)
+{
+ int t, tm;
+ struct thread_data *thread_data = rec->thread_data;
+ u64 bytes_written = rec->bytes_written;
+
+ for (t = 0; t < rec->nr_threads; t++) {
+ for (tm = 0; tm < thread_data[t].nr_mmaps; tm++) {
+ if (thread_data[t].maps)
+ bytes_written += thread_data[t].maps[tm]->bytes_written;
+ if (thread_data[t].overwrite_maps)
+ bytes_written += thread_data[t].overwrite_maps[tm]->bytes_written;
+ }
+ }
+
+ return bytes_written;
+}
+
static bool record__output_max_size_exceeded(struct record *rec)
{
return rec->output_max_size &&
- (rec->bytes_written >= rec->output_max_size);
+ (record__bytes_written(rec) >= rec->output_max_size);
}

static int record__write(struct record *rec, struct mmap *map __maybe_unused,
@@ -218,20 +236,21 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
return -1;
}

- if (map && map->file) {
+ if (map && map->file)
map->bytes_written += size;
- return 0;
- }
-
- rec->bytes_written += size;
+ else
+ rec->bytes_written += size;

if (record__output_max_size_exceeded(rec) && !done) {
fprintf(stderr, "[ perf record: perf size limit reached (%" PRIu64 " KB),"
" stopping session ]\n",
- rec->bytes_written >> 10);
+ record__bytes_written(rec) >> 10);
done = 1;
}

+ if (map && map->file)
+ return 0;
+
if (switch_output_size(rec))
trigger_hit(&switch_output_trigger);

--
2.19.0

2021-06-30 16:04:28

by Bayduraev, Alexey V

Subject: [PATCH v8 21/22] perf session: Introduce READER_NODATA state

Adding a READER_NODATA state to differentiate it from the real
end-of-file state. Also reducing the indent depth in the readers'
initialization loop.

Suggested-by: Namhyung Kim <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 45 ++++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b11a502c22a3..c2b6c5f4e119 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -70,7 +70,8 @@ struct reader_state {

enum {
READER_EOF = 0,
- READER_OK = 1,
+ READER_NODATA = 1,
+ READER_OK = 2,
};

struct reader {
@@ -2305,7 +2306,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
return PTR_ERR(event);

if (!event)
- return READER_EOF;
+ return READER_NODATA;

session->active_reader = rd;
size = event->header.size;
@@ -2395,7 +2396,7 @@ static int __perf_session__process_events(struct perf_session *session)
err = reader__read_event(rd, session, &prog);
if (err < 0)
break;
- if (err == READER_EOF) {
+ if (err == READER_NODATA) {
err = reader__mmap(rd, session);
if (err <= 0)
break;
@@ -2472,25 +2473,25 @@ static int __perf_session__process_dir_events(struct perf_session *session)
readers++;

for (i = 0; i < data->dir.nr; i++) {
- if (data->dir.files[i].size) {
- rd[readers] = (struct reader) {
- .fd = data->dir.files[i].fd,
- .path = data->dir.files[i].path,
- .data_size = data->dir.files[i].size,
- .data_offset = 0,
- .in_place_update = session->data->in_place_update,
- };
- ret = reader__init(&rd[readers], NULL);
- if (ret)
- goto out_err;
- ret = reader__mmap(&rd[readers], session);
- if (ret != READER_OK) {
- if (ret == READER_EOF)
- ret = -EINVAL;
- goto out_err;
- }
- readers++;
+ if (!data->dir.files[i].size)
+ continue;
+ rd[readers] = (struct reader) {
+ .fd = data->dir.files[i].fd,
+ .path = data->dir.files[i].path,
+ .data_size = data->dir.files[i].size,
+ .data_offset = 0,
+ .in_place_update = session->data->in_place_update,
+ };
+ ret = reader__init(&rd[readers], NULL);
+ if (ret)
+ goto out_err;
+ ret = reader__mmap(&rd[readers], session);
+ if (ret != READER_OK) {
+ if (ret == READER_EOF)
+ ret = -EINVAL;
+ goto out_err;
}
+ readers++;
}

i = 0;
@@ -2507,7 +2508,7 @@ static int __perf_session__process_dir_events(struct perf_session *session)
ret = reader__read_event(&rd[i], session, &prog);
if (ret < 0)
break;
- if (ret == READER_EOF) {
+ if (ret == READER_NODATA) {
ret = reader__mmap(&rd[i], session);
if (ret < 0)
goto out_err;
--
2.19.0

2021-06-30 16:27:12

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 02/22] perf record: Introduce thread specific data array

Em Wed, Jun 30, 2021 at 06:54:41PM +0300, Alexey Bayduraev escreveu:
> Introduce thread specific data object and array of such objects
> to store and manage thread local data. Implement functions to
> allocate, initialize, finalize and release thread specific data.
>
> Thread local maps and overwrite_maps arrays keep pointers to
> mmap buffer objects to serve according to maps thread mask.
> Thread local pollfd array keeps event fds connected to mmaps
> buffers according to maps thread mask.
>
> Thread control commands are delivered via thread local comm pipes
> and ctlfd_pos fd. External control commands (--control option)
> are delivered via evlist ctlfd_pos fd and handled by the main
> tool thread.
>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/lib/api/fd/array.c | 17 ++++
> tools/lib/api/fd/array.h | 1 +
> tools/perf/builtin-record.c | 196 +++++++++++++++++++++++++++++++++++-
> 3 files changed, 211 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/api/fd/array.c b/tools/lib/api/fd/array.c
> index 5e6cb9debe37..de8bcbaea3f1 100644
> --- a/tools/lib/api/fd/array.c
> +++ b/tools/lib/api/fd/array.c
> @@ -88,6 +88,23 @@ int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags
> return pos;
> }
>
> +int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base)
> +{
> + struct pollfd *entry;
> + int npos;
> +
> + if (pos >= base->nr)
> + return -EINVAL;
> +
> + entry = &base->entries[pos];
> +
> + npos = fdarray__add(fda, entry->fd, entry->events, base->priv[pos].flags);
> + if (npos >= 0)
> + fda->priv[npos] = base->priv[pos];
> +
> + return npos;
> +}
> +
> int fdarray__filter(struct fdarray *fda, short revents,
> void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
> void *arg)
> diff --git a/tools/lib/api/fd/array.h b/tools/lib/api/fd/array.h
> index 7fcf21a33c0c..4a03da7f1fc1 100644
> --- a/tools/lib/api/fd/array.h
> +++ b/tools/lib/api/fd/array.h
> @@ -42,6 +42,7 @@ struct fdarray *fdarray__new(int nr_alloc, int nr_autogrow);
> void fdarray__delete(struct fdarray *fda);
>
> int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags flags);
> +int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base);
> int fdarray__poll(struct fdarray *fda, int timeout);
> int fdarray__filter(struct fdarray *fda, short revents,
> void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),


Please split the fdarray.[ch] parts into a separate patch, then the rest
uses it in a second patch.

If, theoretically, we needed to revert the builtin-record.c part, we
could do it with 'git revert' instead of a patch that removes it while
leaving the fdarray__clone part, which at the time of the revert could
already be in use in other parts of the code.

> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 31b3a515abc1..11ce64b23db4 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -58,6 +58,7 @@
> #include <poll.h>
> #include <pthread.h>
> #include <unistd.h>
> +#include <sys/syscall.h>
> #include <sched.h>
> #include <signal.h>
> #ifdef HAVE_EVENTFD_SUPPORT
> @@ -92,6 +93,23 @@ struct thread_mask {
> struct mmap_cpu_mask affinity;
> };
>
> +struct thread_data {
> + pid_t tid;
> + struct thread_mask *mask;
> + struct {
> + int msg[2];
> + int ack[2];
> + } pipes;
> + struct fdarray pollfd;
> + int ctlfd_pos;
> + struct mmap **maps;
> + struct mmap **overwrite_maps;
> + int nr_mmaps;

Move nr_mmaps to after ctlfd_pos

> + struct record *rec;
> + unsigned long long samples;
> + unsigned long waking;
> +};
> +
> struct record {
> struct perf_tool tool;
> struct record_opts opts;
> @@ -117,6 +135,7 @@ struct record {
> struct mmap_cpu_mask affinity_mask;
> unsigned long output_max_size; /* = 0: unlimited */
> struct thread_mask *thread_masks;
> + struct thread_data *thread_data;
> int nr_threads;
> };
>
> @@ -847,9 +866,174 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
> return kcore_copy(from_dir, kcore_dir);
> }
>
> +static int record__thread_data_init_pipes(struct thread_data *thread_data)
> +{
> + if (pipe(thread_data->pipes.msg) || pipe(thread_data->pipes.ack)) {
> + pr_err("Failed to create thread communication pipes: %s\n", strerror(errno));
> + return -ENOMEM;
> + }

If one fails you should clean up the other
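A sketch of what that cleanup could look like (toy_* names are stand-ins for the perf structures; the error reporting mirrors the patch's style): if the second pipe() fails, the first one is closed before returning, so no descriptors leak.

```c
#include <errno.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct toy_pipes {
	int msg[2];
	int ack[2];
};

/* Create both communication pipes; on failure of the second,
 * release the first before reporting the error. */
static int toy_init_pipes(struct toy_pipes *p)
{
	if (pipe(p->msg)) {
		fprintf(stderr, "msg pipe failed: %s\n", strerror(errno));
		return -ENOMEM;
	}
	if (pipe(p->ack)) {
		fprintf(stderr, "ack pipe failed: %s\n", strerror(errno));
		close(p->msg[0]);
		close(p->msg[1]);
		return -ENOMEM;
	}
	return 0;
}
```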

> +
> + pr_debug2("thread_data[%p]: msg=[%d,%d], ack=[%d,%d]\n", thread_data,
> + thread_data->pipes.msg[0], thread_data->pipes.msg[1],
> + thread_data->pipes.ack[0], thread_data->pipes.ack[1]);
> +
> + return 0;
> +}
> +
> +static int record__thread_data_init_maps(struct thread_data *thread_data, struct evlist *evlist)
> +{
> + int m, tm, nr_mmaps = evlist->core.nr_mmaps;
> + struct mmap *mmap = evlist->mmap;
> + struct mmap *overwrite_mmap = evlist->overwrite_mmap;
> + struct perf_cpu_map *cpus = evlist->core.cpus;
> +
> + thread_data->nr_mmaps = bitmap_weight(thread_data->mask->maps.bits,
> + thread_data->mask->maps.nbits);
> + if (mmap) {
> + thread_data->maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
> + if (!thread_data->maps) {
> + pr_err("Failed to allocate maps thread data\n");
> + return -ENOMEM;
> + }
> + }
> + if (overwrite_mmap) {
> + thread_data->overwrite_maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
> + if (!thread_data->overwrite_maps) {
> + pr_err("Failed to allocate overwrite maps thread data\n");
> + return -ENOMEM;
> + }

ditto, release the allocated resources on error exit

> + }
> + pr_debug2("thread_data[%p]: nr_mmaps=%d, maps=%p, ow_maps=%p\n", thread_data,
> + thread_data->nr_mmaps, thread_data->maps, thread_data->overwrite_maps);
> +
> + for (m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
> + if (test_bit(cpus->map[m], thread_data->mask->maps.bits)) {
> + if (thread_data->maps) {
> + thread_data->maps[tm] = &mmap[m];
> + pr_debug2("thread_data[%p]: maps[%d] -> mmap[%d], cpus[%d]\n",
> + thread_data, tm, m, cpus->map[m]);
> + }
> + if (thread_data->overwrite_maps) {
> + thread_data->overwrite_maps[tm] = &overwrite_mmap[m];
> + pr_debug2("thread_data[%p]: ow_maps[%d] -> ow_mmap[%d], cpus[%d]\n",
> + thread_data, tm, m, cpus->map[m]);
> + }
> + tm++;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int record__thread_data_init_pollfd(struct thread_data *thread_data, struct evlist *evlist)
> +{
> + int f, tm, pos;
> + struct mmap *map, *overwrite_map;
> +
> + fdarray__init(&thread_data->pollfd, 64);
> +
> + for (tm = 0; tm < thread_data->nr_mmaps; tm++) {
> + map = thread_data->maps ? thread_data->maps[tm] : NULL;
> + overwrite_map = thread_data->overwrite_maps ?
> + thread_data->overwrite_maps[tm] : NULL;
> +
> + for (f = 0; f < evlist->core.pollfd.nr; f++) {
> + void *ptr = evlist->core.pollfd.priv[f].ptr;
> +
> + if ((map && ptr == map) || (overwrite_map && ptr == overwrite_map)) {
> + pos = fdarray__clone(&thread_data->pollfd, f, &evlist->core.pollfd);
> + if (pos < 0)
> + return pos;
> + pr_debug2("thread_data[%p]: pollfd[%d] <- event_fd=%d\n",
> + thread_data, pos, evlist->core.pollfd.entries[f].fd);
> + }
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
> +{
> + int t, ret;
> + struct thread_data *thread_data;
> +
> + rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
> + if (!rec->thread_data) {
> + pr_err("Failed to allocate thread data\n");
> + return -ENOMEM;
> + }
> + thread_data = rec->thread_data;
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + thread_data[t].rec = rec;
> + thread_data[t].mask = &rec->thread_masks[t];
> + ret = record__thread_data_init_maps(&thread_data[t], evlist);
> + if (ret)
> + return ret;

Also release allocated resources on exit

> + ret = record__thread_data_init_pollfd(&thread_data[t], evlist);

So record__thread_data_init_pollfd() can fail; you emitted a warning
here for zalloc() failure, and in the record__thread_data_init_maps()
case that function emits error messages. For consistency please emit one
here or in record__thread_data_init_pollfd() when fdarray__clone() fails

> + if (ret)
> + return ret;

release resources on exit

> + if (t) {
> + thread_data[t].tid = -1;
> + ret = record__thread_data_init_pipes(&thread_data[t]);
> + if (ret)
> + return ret;

release resources on exit

> + thread_data[t].ctlfd_pos = fdarray__add(&thread_data[t].pollfd,
> + thread_data[t].pipes.msg[0],
> + POLLIN | POLLERR | POLLHUP,
> + fdarray_flag__nonfilterable);
> + if (thread_data[t].ctlfd_pos < 0)
> + return -ENOMEM;

pr_err() and release resources


> + pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
> + thread_data, thread_data[t].ctlfd_pos,
> + thread_data[t].pipes.msg[0]);
> + } else {
> + thread_data[t].tid = syscall(SYS_gettid);
> + if (evlist->ctl_fd.pos == -1)
> + continue;
> + thread_data[t].ctlfd_pos = fdarray__clone(&thread_data[t].pollfd,
> + evlist->ctl_fd.pos,
> + &evlist->core.pollfd);
> + if (thread_data[t].ctlfd_pos < 0)
> + return -ENOMEM;

Ditto

> + pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
> + thread_data, thread_data[t].ctlfd_pos,
> + evlist->core.pollfd.entries[evlist->ctl_fd.pos].fd);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void record__free_thread_data(struct record *rec)
> +{
> + int t;
> +
> + if (rec->thread_data == NULL)
> + return;
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + if (rec->thread_data[t].pipes.msg[0])
> + close(rec->thread_data[t].pipes.msg[0]);

Just to make this consistent with the zfree() use below, please initialize
rec->thread_data[t].pipes.msg[0] — probably best to -1, since 0 is a valid
file descriptor

> + if (rec->thread_data[t].pipes.msg[1])
> + close(rec->thread_data[t].pipes.msg[1]);
> + if (rec->thread_data[t].pipes.ack[0])
> + close(rec->thread_data[t].pipes.ack[0]);
> + if (rec->thread_data[t].pipes.ack[1])
> + close(rec->thread_data[t].pipes.ack[1]);
> + zfree(&rec->thread_data[t].maps);
> + zfree(&rec->thread_data[t].overwrite_maps);
> + fdarray__exit(&rec->thread_data[t].pollfd);
> + }
> +
> + zfree(&rec->thread_data);
> +}
> +
> static int record__mmap_evlist(struct record *rec,
> struct evlist *evlist)
> {
> + int ret;
> struct record_opts *opts = &rec->opts;
> bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
> opts->auxtrace_sample_mode;
> @@ -880,6 +1064,14 @@ static int record__mmap_evlist(struct record *rec,
> return -EINVAL;
> }
> }
> +
> + if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
> + return -1;
> +
> + ret = record__alloc_thread_data(rec, evlist);
> + if (ret)
> + return ret;
> +
> return 0;
> }
>
> @@ -1880,9 +2072,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> evlist__start_workload(rec->evlist);
> }
>
> - if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
> - goto out_child;
> -
> if (opts->initial_delay) {
> pr_info(EVLIST_DISABLED_MSG);
> if (opts->initial_delay > 0) {
> @@ -2040,6 +2229,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> out_child:
> evlist__finalize_ctlfd(rec->evlist);
> record__mmap_read_all(rec, true);
> + record__free_thread_data(rec);
> record__aio_mmap_read_sync(rec);
>
> if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:18:26

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 03/22] perf record: Introduce thread local variable

Em Wed, Jun 30, 2021 at 06:54:42PM +0300, Alexey Bayduraev escreveu:
> Introduce thread local variable and use it for threaded trace streaming.
> Use thread affinity mask instead of record affinity mask in affinity
> modes.
> Introduce and use evlist__ctlfd_update() function to propagate external
> control commands to global evlist object.
>
> Acked-by: Andi Kleen <[email protected]>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/builtin-record.c | 135 ++++++++++++++++++++++++------------
> tools/perf/util/evlist.c | 16 +++++
> tools/perf/util/evlist.h | 1 +
> 3 files changed, 107 insertions(+), 45 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 11ce64b23db4..3935c0fabe01 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -110,6 +110,8 @@ struct thread_data {
> unsigned long waking;
> };
>
> +static __thread struct thread_data *thread;
> +
> struct record {
> struct perf_tool tool;
> struct record_opts opts;
> @@ -132,7 +134,6 @@ struct record {
> bool timestamp_boundary;
> struct switch_output switch_output;
> unsigned long long samples;
> - struct mmap_cpu_mask affinity_mask;
> unsigned long output_max_size; /* = 0: unlimited */
> struct thread_mask *thread_masks;
> struct thread_data *thread_data;
> @@ -567,7 +568,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
> bf = map->data;
> }
>
> - rec->samples++;
> + thread->samples++;
> return record__write(rec, map, bf, size);
> }
>
> @@ -1260,16 +1261,24 @@ static struct perf_event_header finished_round_event = {
>
> static void record__adjust_affinity(struct record *rec, struct mmap *map)
> {
> + int ret = 0;

Why do you set this to zero here if it is only used inside the if block?

> +
> if (rec->opts.affinity != PERF_AFFINITY_SYS &&
> - !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
> - rec->affinity_mask.nbits)) {
> - bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
> - bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
> - map->affinity_mask.bits, rec->affinity_mask.nbits);
> - sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
> - (cpu_set_t *)rec->affinity_mask.bits);
> - if (verbose == 2)
> - mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
> + !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
> + thread->mask->affinity.nbits)) {
> + bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
> + bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
> + map->affinity_mask.bits, thread->mask->affinity.nbits);
> + ret = sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
> + (cpu_set_t *)thread->mask->affinity.bits);
> + if (ret)
> + pr_err("threads[%d]: sched_setaffinity() call failed: %s\n",
> + thread->tid, strerror(errno));

Also, if record__adjust_affinity() fails because sched_setaffinity()
didn't work, shouldn't we propagate this error?

> + if (verbose == 2) {
> + pr_debug("threads[%d]: addr=", thread->tid);
> + mmap_cpu_mask__scnprintf(&thread->mask->affinity, "thread");
> + pr_debug("threads[%d]: on cpu=%d\n", thread->tid, sched_getcpu());
> + }
> }
> }
>
> @@ -1310,14 +1319,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
> u64 bytes_written = rec->bytes_written;
> int i;
> int rc = 0;
> - struct mmap *maps;
> + int nr_mmaps;
> + struct mmap **maps;
> int trace_fd = rec->data.file.fd;
> off_t off = 0;
>
> if (!evlist)
> return 0;
>
> - maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
> + nr_mmaps = thread->nr_mmaps;
> + maps = overwrite ? thread->overwrite_maps : thread->maps;
> +
> if (!maps)
> return 0;
>
> @@ -1327,9 +1339,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
> if (record__aio_enabled(rec))
> off = record__aio_get_pos(trace_fd);
>
> - for (i = 0; i < evlist->core.nr_mmaps; i++) {
> + for (i = 0; i < nr_mmaps; i++) {
> u64 flush = 0;
> - struct mmap *map = &maps[i];
> + struct mmap *map = maps[i];
>
> if (map->core.base) {
> record__adjust_affinity(rec, map);
> @@ -1392,6 +1404,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
> return record__mmap_read_evlist(rec, rec->evlist, true, synch);
> }
>
> +static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
> + void *arg __maybe_unused)
> +{
> + struct perf_mmap *map = fda->priv[fd].ptr;
> +
> + if (map)
> + perf_mmap__put(map);
> +}
> +
> static void record__init_features(struct record *rec)
> {
> struct perf_session *session = rec->session;
> @@ -1836,6 +1857,33 @@ static void record__uniquify_name(struct record *rec)
> }
> }
>
> +static int record__start_threads(struct record *rec)
> +{
> + struct thread_data *thread_data = rec->thread_data;
> +
> + thread = &thread_data[0];
> +
> + pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
> +
> + return 0;
> +}
> +
> +static int record__stop_threads(struct record *rec, unsigned long *waking)
> +{
> + int t;
> + struct thread_data *thread_data = rec->thread_data;
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + rec->samples += thread_data[t].samples;
> + *waking += thread_data[t].waking;
> + pr_debug("threads[%d]: samples=%lld, wakes=%ld, trasferred=%ld, compressed=%ld\n",
> + thread_data[t].tid, thread_data[t].samples, thread_data[t].waking,
> + rec->session->bytes_transferred, rec->session->bytes_compressed);
> + }
> +
> + return 0;
> +}
> +
> static int __cmd_record(struct record *rec, int argc, const char **argv)
> {
> int err;
> @@ -1944,7 +1992,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>
> if (record__open(rec) != 0) {
> err = -1;
> - goto out_child;
> + goto out_free_threads;
> }
> session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
>
> @@ -1952,7 +2000,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> err = record__kcore_copy(&session->machines.host, data);
> if (err) {
> pr_err("ERROR: Failed to copy kcore\n");
> - goto out_child;
> + goto out_free_threads;
> }
> }
>
> @@ -1963,7 +2011,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
> pr_err("ERROR: Apply config to BPF failed: %s\n",
> errbuf);
> - goto out_child;
> + goto out_free_threads;
> }
>
> /*
> @@ -1981,11 +2029,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> if (data->is_pipe) {
> err = perf_header__write_pipe(fd);
> if (err < 0)
> - goto out_child;
> + goto out_free_threads;
> } else {
> err = perf_session__write_header(session, rec->evlist, fd, false);
> if (err < 0)
> - goto out_child;
> + goto out_free_threads;
> }
>
> err = -1;
> @@ -1993,16 +2041,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
> pr_err("Couldn't generate buildids. "
> "Use --no-buildid to profile anyway.\n");
> - goto out_child;
> + goto out_free_threads;
> }
>
> err = record__setup_sb_evlist(rec);
> if (err)
> - goto out_child;
> + goto out_free_threads;
>
> err = record__synthesize(rec, false);
> if (err < 0)
> - goto out_child;
> + goto out_free_threads;
>
> if (rec->realtime_prio) {
> struct sched_param param;
> @@ -2011,10 +2059,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> if (sched_setscheduler(0, SCHED_FIFO, &param)) {
> pr_err("Could not set realtime priority.\n");
> err = -1;
> - goto out_child;
> + goto out_free_threads;
> }
> }
>
> + if (record__start_threads(rec))
> + goto out_free_threads;
> +
> /*
> * When perf is starting the traced process, all the events
> * (apart from group members) have enable_on_exec=1 set,
> @@ -2085,7 +2136,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> trigger_ready(&switch_output_trigger);
> perf_hooks__invoke_record_start();
> for (;;) {
> - unsigned long long hits = rec->samples;
> + unsigned long long hits = thread->samples;
>
> /*
> * rec->evlist->bkw_mmap_state is possible to be
> @@ -2154,20 +2205,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> alarm(rec->switch_output.time);
> }
>
> - if (hits == rec->samples) {
> + if (hits == thread->samples) {
> if (done || draining)
> break;
> - err = evlist__poll(rec->evlist, -1);
> + err = fdarray__poll(&thread->pollfd, -1);
> /*
> * Propagate error, only if there's any. Ignore positive
> * number of returned events and interrupt error.
> */
> if (err > 0 || (err < 0 && errno == EINTR))
> err = 0;
> - waking++;
> + thread->waking++;
>
> - if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
> + if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
> + record__thread_munmap_filtered, NULL) == 0)
> draining = true;
> +
> + evlist__ctlfd_update(rec->evlist,
> + &thread->pollfd.entries[thread->ctlfd_pos]);
> }
>
> if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
> @@ -2220,18 +2275,20 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> goto out_child;
> }
>
> - if (!quiet)
> - fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
> -
> if (target__none(&rec->opts.target))
> record__synthesize_workload(rec, true);
>
> out_child:
> - evlist__finalize_ctlfd(rec->evlist);
> + record__stop_threads(rec, &waking);
> record__mmap_read_all(rec, true);
> +out_free_threads:
> record__free_thread_data(rec);
> + evlist__finalize_ctlfd(rec->evlist);
> record__aio_mmap_read_sync(rec);
>
> + if (!quiet)
> + fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
> +
> if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
> ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
> session->header.env.comp_ratio = ratio + 0.5;
> @@ -3093,17 +3150,6 @@ int cmd_record(int argc, const char **argv)
>
> symbol__init(NULL);
>
> - if (rec->opts.affinity != PERF_AFFINITY_SYS) {
> - rec->affinity_mask.nbits = cpu__max_cpu();
> - rec->affinity_mask.bits = bitmap_alloc(rec->affinity_mask.nbits);
> - if (!rec->affinity_mask.bits) {
> - pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
> - err = -ENOMEM;
> - goto out_opts;
> - }
> - pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
> - }
> -
> err = record__auxtrace_init(rec);
> if (err)
> goto out;
> @@ -3241,7 +3287,6 @@ int cmd_record(int argc, const char **argv)
>
> err = __cmd_record(&record, argc, argv);
> out:
> - bitmap_free(rec->affinity_mask.bits);
> evlist__delete(rec->evlist);
> symbol__exit();
> auxtrace_record__free(rec->itr);


Can the following be moved to a separate patch?

> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 6ba9664089bd..3d555a98c037 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -2132,6 +2132,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
> return err;
> }
>
> +int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
> +{
> + int ctlfd_pos = evlist->ctl_fd.pos;
> + struct pollfd *entries = evlist->core.pollfd.entries;
> +
> + if (!evlist__ctlfd_initialized(evlist))
> + return 0;
> +
> + if (entries[ctlfd_pos].fd != update->fd ||
> + entries[ctlfd_pos].events != update->events)
> + return -1;
> +
> + entries[ctlfd_pos].revents = update->revents;
> + return 0;
> +}
> +
> struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
> {
> struct evsel *evsel;
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index 2073cfa79f79..b7aa719c638d 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -358,6 +358,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
> int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
> int evlist__finalize_ctlfd(struct evlist *evlist);
> bool evlist__ctlfd_initialized(struct evlist *evlist);
> +int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
> int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
> int evlist__ctlfd_ack(struct evlist *evlist);
>
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:20:37

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 02/22] perf record: Introduce thread specific data array

Em Wed, Jun 30, 2021 at 06:54:41PM +0300, Alexey Bayduraev escreveu:
> Introduce thread specific data object and array of such objects
> to store and manage thread local data. Implement functions to
> allocate, initialize, finalize and release thread specific data.
>
> Thread local maps and overwrite_maps arrays keep pointers to
> mmap buffer objects to serve according to maps thread mask.
> Thread local pollfd array keeps event fds connected to mmaps
> buffers according to maps thread mask.
>
> Thread control commands are delivered via thread local comm pipes
> and ctlfd_pos fd. External control commands (--control option)
> are delivered via evlist ctlfd_pos fd and handled by the main
> tool thread.
>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/lib/api/fd/array.c | 17 ++++
> tools/lib/api/fd/array.h | 1 +
> tools/perf/builtin-record.c | 196 +++++++++++++++++++++++++++++++++++-
> 3 files changed, 211 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/api/fd/array.c b/tools/lib/api/fd/array.c
> index 5e6cb9debe37..de8bcbaea3f1 100644
> --- a/tools/lib/api/fd/array.c
> +++ b/tools/lib/api/fd/array.c
> @@ -88,6 +88,23 @@ int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags
> return pos;
> }
>
> +int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base)
> +{
> + struct pollfd *entry;
> + int npos;
> +
> + if (pos >= base->nr)
> + return -EINVAL;
> +
> + entry = &base->entries[pos];
> +
> + npos = fdarray__add(fda, entry->fd, entry->events, base->priv[pos].flags);
> + if (npos >= 0)
> + fda->priv[npos] = base->priv[pos];
> +
> + return npos;
> +}
> +
> int fdarray__filter(struct fdarray *fda, short revents,
> void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
> void *arg)
> diff --git a/tools/lib/api/fd/array.h b/tools/lib/api/fd/array.h
> index 7fcf21a33c0c..4a03da7f1fc1 100644
> --- a/tools/lib/api/fd/array.h
> +++ b/tools/lib/api/fd/array.h
> @@ -42,6 +42,7 @@ struct fdarray *fdarray__new(int nr_alloc, int nr_autogrow);
> void fdarray__delete(struct fdarray *fda);
>
> int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags flags);
> +int fdarray__clone(struct fdarray *fda, int pos, struct fdarray *base);
> int fdarray__poll(struct fdarray *fda, int timeout);
> int fdarray__filter(struct fdarray *fda, short revents,
> void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 31b3a515abc1..11ce64b23db4 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -58,6 +58,7 @@
> #include <poll.h>
> #include <pthread.h>
> #include <unistd.h>
> +#include <sys/syscall.h>
> #include <sched.h>
> #include <signal.h>
> #ifdef HAVE_EVENTFD_SUPPORT
> @@ -92,6 +93,23 @@ struct thread_mask {
> struct mmap_cpu_mask affinity;
> };
>
> +struct thread_data {

Please rename this to 'struct record_thread', 'data' is way too generic.

> + pid_t tid;
> + struct thread_mask *mask;
> + struct {
> + int msg[2];
> + int ack[2];
> + } pipes;
> + struct fdarray pollfd;
> + int ctlfd_pos;
> + struct mmap **maps;
> + struct mmap **overwrite_maps;
> + int nr_mmaps;
> + struct record *rec;
> + unsigned long long samples;
> + unsigned long waking;
> +};
> +
> struct record {
> struct perf_tool tool;
> struct record_opts opts;
> @@ -117,6 +135,7 @@ struct record {
> struct mmap_cpu_mask affinity_mask;
> unsigned long output_max_size; /* = 0: unlimited */
> struct thread_mask *thread_masks;
> + struct thread_data *thread_data;
> int nr_threads;
> };
>
> @@ -847,9 +866,174 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
> return kcore_copy(from_dir, kcore_dir);
> }
>
> +static int record__thread_data_init_pipes(struct thread_data *thread_data)
> +{
> + if (pipe(thread_data->pipes.msg) || pipe(thread_data->pipes.ack)) {
> + pr_err("Failed to create thread communication pipes: %s\n", strerror(errno));
> + return -ENOMEM;
> + }
> +
> + pr_debug2("thread_data[%p]: msg=[%d,%d], ack=[%d,%d]\n", thread_data,
> + thread_data->pipes.msg[0], thread_data->pipes.msg[1],
> + thread_data->pipes.ack[0], thread_data->pipes.ack[1]);
> +
> + return 0;
> +}
> +
> +static int record__thread_data_init_maps(struct thread_data *thread_data, struct evlist *evlist)
> +{
> + int m, tm, nr_mmaps = evlist->core.nr_mmaps;
> + struct mmap *mmap = evlist->mmap;
> + struct mmap *overwrite_mmap = evlist->overwrite_mmap;
> + struct perf_cpu_map *cpus = evlist->core.cpus;
> +
> + thread_data->nr_mmaps = bitmap_weight(thread_data->mask->maps.bits,
> + thread_data->mask->maps.nbits);
> + if (mmap) {
> + thread_data->maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
> + if (!thread_data->maps) {
> + pr_err("Failed to allocate maps thread data\n");
> + return -ENOMEM;
> + }
> + }
> + if (overwrite_mmap) {
> + thread_data->overwrite_maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
> + if (!thread_data->overwrite_maps) {
> + pr_err("Failed to allocate overwrite maps thread data\n");
> + return -ENOMEM;
> + }
> + }
> + pr_debug2("thread_data[%p]: nr_mmaps=%d, maps=%p, ow_maps=%p\n", thread_data,
> + thread_data->nr_mmaps, thread_data->maps, thread_data->overwrite_maps);
> +
> + for (m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
> + if (test_bit(cpus->map[m], thread_data->mask->maps.bits)) {
> + if (thread_data->maps) {
> + thread_data->maps[tm] = &mmap[m];
> + pr_debug2("thread_data[%p]: maps[%d] -> mmap[%d], cpus[%d]\n",
> + thread_data, tm, m, cpus->map[m]);
> + }
> + if (thread_data->overwrite_maps) {
> + thread_data->overwrite_maps[tm] = &overwrite_mmap[m];
> + pr_debug2("thread_data[%p]: ow_maps[%d] -> ow_mmap[%d], cpus[%d]\n",
> + thread_data, tm, m, cpus->map[m]);
> + }
> + tm++;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int record__thread_data_init_pollfd(struct thread_data *thread_data, struct evlist *evlist)
> +{
> + int f, tm, pos;
> + struct mmap *map, *overwrite_map;
> +
> + fdarray__init(&thread_data->pollfd, 64);
> +
> + for (tm = 0; tm < thread_data->nr_mmaps; tm++) {
> + map = thread_data->maps ? thread_data->maps[tm] : NULL;
> + overwrite_map = thread_data->overwrite_maps ?
> + thread_data->overwrite_maps[tm] : NULL;
> +
> + for (f = 0; f < evlist->core.pollfd.nr; f++) {
> + void *ptr = evlist->core.pollfd.priv[f].ptr;
> +
> + if ((map && ptr == map) || (overwrite_map && ptr == overwrite_map)) {
> + pos = fdarray__clone(&thread_data->pollfd, f, &evlist->core.pollfd);
> + if (pos < 0)
> + return pos;
> + pr_debug2("thread_data[%p]: pollfd[%d] <- event_fd=%d\n",
> + thread_data, pos, evlist->core.pollfd.entries[f].fd);
> + }
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
> +{
> + int t, ret;
> + struct thread_data *thread_data;
> +
> + rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
> + if (!rec->thread_data) {
> + pr_err("Failed to allocate thread data\n");
> + return -ENOMEM;
> + }
> + thread_data = rec->thread_data;
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + thread_data[t].rec = rec;
> + thread_data[t].mask = &rec->thread_masks[t];
> + ret = record__thread_data_init_maps(&thread_data[t], evlist);
> + if (ret)
> + return ret;
> + ret = record__thread_data_init_pollfd(&thread_data[t], evlist);
> + if (ret)
> + return ret;
> + if (t) {
> + thread_data[t].tid = -1;
> + ret = record__thread_data_init_pipes(&thread_data[t]);
> + if (ret)
> + return ret;
> + thread_data[t].ctlfd_pos = fdarray__add(&thread_data[t].pollfd,
> + thread_data[t].pipes.msg[0],
> + POLLIN | POLLERR | POLLHUP,
> + fdarray_flag__nonfilterable);
> + if (thread_data[t].ctlfd_pos < 0)
> + return -ENOMEM;
> + pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
> + thread_data, thread_data[t].ctlfd_pos,
> + thread_data[t].pipes.msg[0]);
> + } else {
> + thread_data[t].tid = syscall(SYS_gettid);
> + if (evlist->ctl_fd.pos == -1)
> + continue;
> + thread_data[t].ctlfd_pos = fdarray__clone(&thread_data[t].pollfd,
> + evlist->ctl_fd.pos,
> + &evlist->core.pollfd);
> + if (thread_data[t].ctlfd_pos < 0)
> + return -ENOMEM;
> + pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
> + thread_data, thread_data[t].ctlfd_pos,
> + evlist->core.pollfd.entries[evlist->ctl_fd.pos].fd);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void record__free_thread_data(struct record *rec)
> +{
> + int t;
> +
> + if (rec->thread_data == NULL)
> + return;
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + if (rec->thread_data[t].pipes.msg[0])
> + close(rec->thread_data[t].pipes.msg[0]);
> + if (rec->thread_data[t].pipes.msg[1])
> + close(rec->thread_data[t].pipes.msg[1]);
> + if (rec->thread_data[t].pipes.ack[0])
> + close(rec->thread_data[t].pipes.ack[0]);
> + if (rec->thread_data[t].pipes.ack[1])
> + close(rec->thread_data[t].pipes.ack[1]);
> + zfree(&rec->thread_data[t].maps);
> + zfree(&rec->thread_data[t].overwrite_maps);
> + fdarray__exit(&rec->thread_data[t].pollfd);
> + }
> +
> + zfree(&rec->thread_data);
> +}
> +
> static int record__mmap_evlist(struct record *rec,
> struct evlist *evlist)
> {
> + int ret;
> struct record_opts *opts = &rec->opts;
> bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
> opts->auxtrace_sample_mode;
> @@ -880,6 +1064,14 @@ static int record__mmap_evlist(struct record *rec,
> return -EINVAL;
> }
> }
> +
> + if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
> + return -1;
> +
> + ret = record__alloc_thread_data(rec, evlist);
> + if (ret)
> + return ret;
> +
> return 0;
> }
>
> @@ -1880,9 +2072,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> evlist__start_workload(rec->evlist);
> }
>
> - if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
> - goto out_child;
> -
> if (opts->initial_delay) {
> pr_info(EVLIST_DISABLED_MSG);
> if (opts->initial_delay > 0) {
> @@ -2040,6 +2229,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> out_child:
> evlist__finalize_ctlfd(rec->evlist);
> record__mmap_read_all(rec, true);
> + record__free_thread_data(rec);
> record__aio_mmap_read_sync(rec);
>
> if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:24:34

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 05/22] perf record: Start threads in the beginning of trace streaming

Em Wed, Jun 30, 2021 at 06:54:44PM +0300, Alexey Bayduraev escreveu:
> Start thread in detached state because its management is implemented
> via messaging to avoid any scaling issues. Block signals prior thread
> start so only main tool thread would be notified on external async
> signals during data collection. Thread affinity mask is used to assign
> eligible cpus for the thread to run. Wait and sync on thread start using
> thread ack pipe.
>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/builtin-record.c | 108 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 82a21da2af16..cead2b3c56d7 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -1423,6 +1423,66 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
> perf_mmap__put(map);
> }
>
> +static void *record__thread(void *arg)
> +{
> + enum thread_msg msg = THREAD_MSG__READY;
> + bool terminate = false;
> + struct fdarray *pollfd;
> + int err, ctlfd_pos;
> +
> + thread = arg;
> + thread->tid = syscall(SYS_gettid);

We have 'gettid()' in tools/perf. It's not in a nice place, but we have
tools/build/feature/test-gettid.c to test whether gettid() is available in
system headers, and if not, then:

tools/perf/jvmti/jvmti_agent.c

#ifndef HAVE_GETTID
static inline pid_t gettid(void)
{
return (pid_t)syscall(__NR_gettid);
}
#endif

I'll move it to a more suitable place so that you can use it here.


> + err = write(thread->pipes.ack[1], &msg, sizeof(msg));
> + if (err == -1)
> + pr_err("threads[%d]: failed to notify on start: %s", thread->tid, strerror(errno));
> +
> + pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
> +
> + pollfd = &thread->pollfd;
> + ctlfd_pos = thread->ctlfd_pos;
> +
> + for (;;) {
> + unsigned long long hits = thread->samples;
> +
> + if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
> + break;
> +
> + if (hits == thread->samples) {
> +
> + err = fdarray__poll(pollfd, -1);
> + /*
> + * Propagate error, only if there's any. Ignore positive
> + * number of returned events and interrupt error.
> + */
> + if (err > 0 || (err < 0 && errno == EINTR))
> + err = 0;
> + thread->waking++;
> +
> + if (fdarray__filter(pollfd, POLLERR | POLLHUP,
> + record__thread_munmap_filtered, NULL) == 0)
> + break;
> + }
> +
> + if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
> + terminate = true;
> + close(thread->pipes.msg[0]);
> + pollfd->entries[ctlfd_pos].fd = -1;
> + pollfd->entries[ctlfd_pos].events = 0;
> + }
> +
> + pollfd->entries[ctlfd_pos].revents = 0;
> + }
> + record__mmap_read_all(thread->rec, true);
> +
> + err = write(thread->pipes.ack[1], &msg, sizeof(msg));
> + if (err == -1)
> + pr_err("threads[%d]: failed to notify on termination: %s",
> + thread->tid, strerror(errno));
> +
> + return NULL;
> +}
> +
> static void record__init_features(struct record *rec)
> {
> struct perf_session *session = rec->session;
> @@ -1886,13 +1946,59 @@ static int record__terminate_thread(struct thread_data *thread_data)
>
> static int record__start_threads(struct record *rec)
> {
> + int t, tt, ret = 0, nr_threads = rec->nr_threads;
> struct thread_data *thread_data = rec->thread_data;
> + sigset_t full, mask;
> + pthread_t handle;
> + pthread_attr_t attrs;
> +
> + sigfillset(&full);
> + if (sigprocmask(SIG_SETMASK, &full, &mask)) {
> + pr_err("Failed to block signals on threads start: %s\n", strerror(errno));
> + return -1;
> + }
> +
> + pthread_attr_init(&attrs);
> + pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
> +
> + for (t = 1; t < nr_threads; t++) {
> + enum thread_msg msg = THREAD_MSG__UNDEFINED;
> +
> + pthread_attr_setaffinity_np(&attrs,
> + MMAP_CPU_MASK_BYTES(&(thread_data[t].mask->affinity)),
> + (cpu_set_t *)(thread_data[t].mask->affinity.bits));
> +
> + if (pthread_create(&handle, &attrs, record__thread, &thread_data[t])) {
> + for (tt = 1; tt < t; tt++)
> + record__terminate_thread(&thread_data[t]);
> + pr_err("Failed to start threads: %s\n", strerror(errno));
> + ret = -1;
> + goto out_err;
> + }
> +
> + if (read(thread_data[t].pipes.ack[0], &msg, sizeof(msg)) > 0)
> + pr_debug2("threads[%d]: sent %s\n", rec->thread_data[t].tid,
> + thread_msg_tags[msg]);
> + }
> +
> + if (nr_threads > 1) {
> + sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread_data[0].mask->affinity),
> + (cpu_set_t *)thread_data[0].mask->affinity.bits);
> + }
>
> thread = &thread_data[0];
>
> pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
>
> - return 0;
> +out_err:
> + pthread_attr_destroy(&attrs);
> +
> + if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
> + pr_err("Failed to unblock signals on threads start: %s\n", strerror(errno));
> + ret = -1;
> + }
> +
> + return ret;
> }
>
> static int record__stop_threads(struct record *rec, unsigned long *waking)
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:25:18

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 09/22] tools lib: Introduce bitmap_intersects() operation

Em Wed, Jun 30, 2021 at 06:54:48PM +0300, Alexey Bayduraev escreveu:
> Introduce bitmap_intersects() routine that tests whether

Is this _adopting_ bitmap_intersects() from the kernel sources?

> bitmaps bitmap1 and bitmap2 intersects. This routine will
> be used during thread masks initialization.
>
> Acked-by: Andi Kleen <[email protected]>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/include/linux/bitmap.h | 11 +++++++++++
> tools/lib/bitmap.c | 14 ++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
> index 330dbf7509cc..9d959bc24859 100644
> --- a/tools/include/linux/bitmap.h
> +++ b/tools/include/linux/bitmap.h
> @@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
> int __bitmap_equal(const unsigned long *bitmap1,
> const unsigned long *bitmap2, unsigned int bits);
> void bitmap_clear(unsigned long *map, unsigned int start, int len);
> +int __bitmap_intersects(const unsigned long *bitmap1,
> + const unsigned long *bitmap2, unsigned int bits);
>
> #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
> #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
> @@ -170,4 +172,13 @@ static inline int bitmap_equal(const unsigned long *src1,
> return __bitmap_equal(src1, src2, nbits);
> }
>
> +static inline int bitmap_intersects(const unsigned long *src1,
> + const unsigned long *src2, unsigned int nbits)
> +{
> + if (small_const_nbits(nbits))
> + return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
> + else
> + return __bitmap_intersects(src1, src2, nbits);
> +}
> +
> #endif /* _PERF_BITOPS_H */
> diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
> index f4e914712b6f..db466ef7be9d 100644
> --- a/tools/lib/bitmap.c
> +++ b/tools/lib/bitmap.c
> @@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,
>
> return 1;
> }
> +
> +int __bitmap_intersects(const unsigned long *bitmap1,
> + const unsigned long *bitmap2, unsigned int bits)
> +{
> + unsigned int k, lim = bits/BITS_PER_LONG;
> + for (k = 0; k < lim; ++k)
> + if (bitmap1[k] & bitmap2[k])
> + return 1;
> +
> + if (bits % BITS_PER_LONG)
> + if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
> + return 1;
> + return 0;
> +}
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:26:25

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 06/22] perf record: Introduce data file at mmap buffer object

Em Wed, Jun 30, 2021 at 06:54:45PM +0300, Alexey Bayduraev escreveu:
> Introduce data file and compressor objects into mmap object so
> they could be used to process and store data stream from the
> corresponding kernel data buffer. Make use of the introduced
> per mmap file and compressor when they are initialized and
> available.

So you're introducing even compressed storage in this patchset? To make
the patchset smaller, I thought this could be in a followup cset.

- Arnaldo

> Acked-by: Andi Kleen <[email protected]>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/builtin-record.c | 3 +++
> tools/perf/util/mmap.c | 6 ++++++
> tools/perf/util/mmap.h | 3 +++
> 3 files changed, 12 insertions(+)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index cead2b3c56d7..9b7e7a5dc116 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -190,6 +190,9 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
> {
> struct perf_data_file *file = &rec->session->data->file;
>
> + if (map && map->file)
> + file = map->file;
> +
> if (perf_data_file__write(file, bf, size) < 0) {
> pr_err("failed to write perf data, error: %m\n");
> return -1;
> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index ab7108d22428..a2c5e4237592 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
> {
> bitmap_free(map->affinity_mask.bits);
>
> + zstd_fini(&map->zstd_data);
> +
> perf_mmap__aio_munmap(map);
> if (map->data != NULL) {
> munmap(map->data, mmap__mmap_len(map));
> @@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
> map->core.flush = mp->flush;
>
> map->comp_level = mp->comp_level;
> + if (zstd_init(&map->zstd_data, map->comp_level)) {
> + pr_debug2("failed to init mmap compressor, error %d\n", errno);
> + return -1;
> + }
>
> if (map->comp_level && !perf_mmap__aio_enabled(map)) {
> map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
> index 9d5f589f02ae..c4aed6e89549 100644
> --- a/tools/perf/util/mmap.h
> +++ b/tools/perf/util/mmap.h
> @@ -13,6 +13,7 @@
> #endif
> #include "auxtrace.h"
> #include "event.h"
> +#include "util/compress.h"
>
> struct aiocb;
>
> @@ -43,6 +44,8 @@ struct mmap {
> struct mmap_cpu_mask affinity_mask;
> void *data;
> int comp_level;
> + struct perf_data_file *file;
> + struct zstd_data zstd_data;
> };
>
> struct mmap_params {
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:29:47

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 11/22] perf record: Document parallel data streaming mode

Em Wed, Jun 30, 2021 at 06:54:50PM +0300, Alexey Bayduraev escreveu:
> Document --threads option syntax and parallel data streaming modes
> in Documentation/perf-record.txt. Implement compatibility checks for
> other modes and related command line options: asynchronous(--aio)
> trace streaming and affinity (--affinity) modes, pipe mode, AUX
> area tracing --snapshot and --aux-sample options, --switch-output,
> --switch-output-event, --switch-max-files and --timestamp-filename
> options. Parallel data streaming is compatible with Zstd compression
> (--compression-level) and external control commands (--control).
> Cpu mask provided via -C option filters --threads specification masks.

Please don't separate the documentation of an option from the patch
where it is introduced.

- Arnaldo

> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/Documentation/perf-record.txt | 30 +++++++++++++++
> tools/perf/builtin-record.c | 49 ++++++++++++++++++++++--
> 2 files changed, 76 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index d71bac847936..2046b28d9822 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -695,6 +695,36 @@ measurements:
> wait -n ${perf_pid}
> exit $?
>
> +--threads=<spec>::
> +Write collected trace data into several data files using parallel threads.
> +<spec> value can be user defined list of masks. Masks separated by colon
> +define cpus to be monitored by a thread and affinity mask of that thread
> +is separated by slash:
> +
> + <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>:...
> +
> +For example user specification like the following:
> +
> + 0,2-4/2-4:1,5-7/5-7
> +
> +specifies parallel threads layout that consists of two threads,
> +the first thread monitors cpus 0 and 2-4 with the affinity mask 2-4,
> +the second monitors cpus 1 and 5-7 with the affinity mask 5-7.
> +
> +<spec> value can also be a string meaning predefined parallel threads
> +layout:
> +
> + cpu - create new data streaming thread for every monitored cpu
> + core - create new thread to monitor cpus grouped by a core
> + socket - create new thread to monitor cpus grouped by a socket
> + numa - create new thread to monitor cpus grouped by a numa domain
> +
> +Predefined layouts can be used on systems with large number of cpus in
> +order not to spawn multiple per-cpu streaming threads but still avoid LOST
> +events in data directory files. Option specified with no or empty value
> +defaults to cpu layout. Masks defined or provided by the option value are
> +filtered through the mask provided by -C option.
> +
> include::intel-hybrid.txt[]
>
> SEE ALSO
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 8d452797d175..c5954cb3e787 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -800,6 +800,12 @@ static int record__auxtrace_init(struct record *rec)
> {
> int err;
>
> + if ((rec->opts.auxtrace_snapshot_opts || rec->opts.auxtrace_sample_opts)
> + && record__threads_enabled(rec)) {
> + pr_err("AUX area tracing options are not available in parallel streaming mode.\n");
> + return -EINVAL;
> + }
> +
> if (!rec->itr) {
> rec->itr = auxtrace_record__init(rec->evlist, &err);
> if (err)
> @@ -2154,6 +2160,17 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> return PTR_ERR(session);
> }
>
> + if (record__threads_enabled(rec)) {
> + if (perf_data__is_pipe(&rec->data)) {
> + pr_err("Parallel trace streaming is not available in pipe mode.\n");
> + return -1;
> + }
> + if (rec->opts.full_auxtrace) {
> + pr_err("Parallel trace streaming is not available in AUX area tracing mode.\n");
> + return -1;
> + }
> + }
> +
> fd = perf_data__fd(data);
> rec->session = session;
>
> @@ -2922,12 +2939,22 @@ static int switch_output_setup(struct record *rec)
> * --switch-output=signal, as we'll send a SIGUSR2 from the side band
> * thread to its parent.
> */
> - if (rec->switch_output_event_set)
> + if (rec->switch_output_event_set) {
> + if (record__threads_enabled(rec)) {
> + pr_warning("WARNING: --switch-output-event option is not available in parallel streaming mode.\n");
> + return 0;
> + }
> goto do_signal;
> + }
>
> if (!s->set)
> return 0;
>
> + if (record__threads_enabled(rec)) {
> + pr_warning("WARNING: --switch-output option is not available in parallel streaming mode.\n");
> + return 0;
> + }
> +
> if (!strcmp(s->str, "signal")) {
> do_signal:
> s->signal = true;
> @@ -3225,8 +3252,8 @@ static struct option __record_options[] = {
> "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
> record__parse_affinity),
> #ifdef HAVE_ZSTD_SUPPORT
> - OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
> - "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
> + OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default, "n",
> + "Compress records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
> record__parse_comp_level),
> #endif
> OPT_CALLBACK(0, "max-size", &record.output_max_size,
> @@ -3659,6 +3686,17 @@ int cmd_record(int argc, const char **argv)
> if (rec->opts.kcore || record__threads_enabled(rec))
> rec->data.is_dir = true;
>
> + if (record__threads_enabled(rec)) {
> + if (rec->opts.affinity != PERF_AFFINITY_SYS) {
> + pr_err("--affinity option is mutually exclusive to parallel streaming mode.\n");
> + goto out_opts;
> + }
> + if (record__aio_enabled(rec)) {
> + pr_err("Asynchronous streaming mode (--aio) is mutually exclusive to parallel streaming mode.\n");
> + goto out_opts;
> + }
> + }
> +
> if (rec->opts.comp_level != 0) {
> pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
> rec->no_buildid = true;
> @@ -3692,6 +3730,11 @@ int cmd_record(int argc, const char **argv)
> }
> }
>
> + if (rec->timestamp_filename && record__threads_enabled(rec)) {
> + rec->timestamp_filename = false;
> + pr_warning("WARNING: --timestamp-filename option is not available in parallel streaming mode.\n");
> + }
> +
> /*
> * Allow aliases to facilitate the lookup of symbols for address
> * filters. Refer to auxtrace_parse_filters().
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:32:14

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Em Wed, Jun 30, 2021 at 06:54:49PM +0300, Alexey Bayduraev escreveu:
> Provide --threads option in perf record command line interface.
> The option can have a value in the form of masks that specify
> cpus to be monitored with data streaming threads and its layout
> in system topology. The masks can be filtered using cpu mask
> provided via -C option.

Perhaps make this --jobs/-j to reuse what 'make' and 'ninja' use?

Unfortunately:

[acme@quaco pahole]$ perf record -h -j

Usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]

-j, --branch-filter <branch filter mask>
branch stack filter modes

[acme@quaco pahole]$

But we could reuse --jobs

I thought you would start with plain:

-j N

And start one thread per CPU in 'perf record' existing CPU affinity
mask, then go on introducing more sophisticated modes.

Have you done it this way because it's how VTune has evolved over the
years and this is what it now expects from 'perf record'?

- Arnaldo

> The specification value can be user defined list of masks. Masks
> separated by colon define cpus to be monitored by one thread and
> affinity mask of that thread is separated by slash. For example:
> <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
> specifies parallel threads layout that consists of two threads
> with corresponding assigned cpus to be monitored.
>
> The specification value can be a string e.g. "cpu", "core" or
> "socket" meaning creation of data streaming thread for every
> cpu or core or socket to monitor distinct cpus or cpus grouped
> by core or socket.
>
> The option provided with no or empty value defaults to per-cpu
> parallel threads layout creating data streaming thread for every
> cpu being monitored.
>
> Feature design and implementation are based on prototypes [1], [2].
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Jiri Olsa <[email protected]>
> Suggested-by: Namhyung Kim <[email protected]>
> Acked-by: Andi Kleen <[email protected]>
> Acked-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/builtin-record.c | 355 +++++++++++++++++++++++++++++++++++-
> tools/perf/util/record.h | 1 +
> 2 files changed, 354 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 2ade17308467..8d452797d175 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -51,6 +51,7 @@
> #include "util/evlist-hybrid.h"
> #include "asm/bug.h"
> #include "perf.h"
> +#include "cputopo.h"
>
> #include <errno.h>
> #include <inttypes.h>
> @@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
> "UNDEFINED", "READY"
> };
>
> +enum thread_spec {
> + THREAD_SPEC__UNDEFINED = 0,
> + THREAD_SPEC__CPU,
> + THREAD_SPEC__CORE,
> + THREAD_SPEC__SOCKET,
> + THREAD_SPEC__NUMA,
> + THREAD_SPEC__USER,
> + THREAD_SPEC__MAX,
> +};
> +
> +static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
> + "undefined", "cpu", "core", "socket", "numa", "user"
> +};
> +
> struct record {
> struct perf_tool tool;
> struct record_opts opts;
> @@ -2723,6 +2738,70 @@ static void record__thread_mask_free(struct thread_mask *mask)
> record__mmap_cpu_mask_free(&mask->affinity);
> }
>
> +static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
> + struct thread_mask *src2)
> +{
> + if (src1->maps.nbits != src2->maps.nbits ||
> + dest->maps.nbits != src1->maps.nbits ||
> + src1->affinity.nbits != src2->affinity.nbits ||
> + dest->affinity.nbits != src1->affinity.nbits)
> + return -EINVAL;
> +
> + bitmap_or(dest->maps.bits, src1->maps.bits,
> + src2->maps.bits, src1->maps.nbits);
> + bitmap_or(dest->affinity.bits, src1->affinity.bits,
> + src2->affinity.bits, src1->affinity.nbits);
> +
> + return 0;
> +}
> +
> +static int record__thread_mask_intersects(struct thread_mask *mask_1, struct thread_mask *mask_2)
> +{
> + int res1, res2;
> +
> + if (mask_1->maps.nbits != mask_2->maps.nbits ||
> + mask_1->affinity.nbits != mask_2->affinity.nbits)
> + return -EINVAL;
> +
> + res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
> + mask_1->maps.nbits);
> + res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
> + mask_1->affinity.nbits);
> + if (res1 || res2)
> + return 1;
> +
> + return 0;
> +}
> +
> +static int record__parse_threads(const struct option *opt, const char *str, int unset)
> +{
> + int s;
> + struct record_opts *opts = opt->value;
> +
> + if (unset || !str || !strlen(str)) {
> + opts->threads_spec = THREAD_SPEC__CPU;
> + } else {
> + for (s = 1; s < THREAD_SPEC__MAX; s++) {
> + if (s == THREAD_SPEC__USER) {
> + opts->threads_user_spec = strdup(str);
> + opts->threads_spec = THREAD_SPEC__USER;
> + break;
> + }
> + if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
> + opts->threads_spec = s;
> + break;
> + }
> + }
> + }
> +
> + pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
> + if (opts->threads_spec == THREAD_SPEC__USER)
> + pr_debug("=[%s]", opts->threads_user_spec);
> + pr_debug("\n");
> +
> + return 0;
> +}
> +
> static int parse_output_max_size(const struct option *opt,
> const char *str, int unset)
> {
> @@ -3166,6 +3245,9 @@ static struct option __record_options[] = {
> "\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
> "\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
> parse_control_option),
> + OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> + "write collected trace data into several data files using parallel threads",
> + record__parse_threads),
> OPT_END()
> };
>
> @@ -3179,6 +3261,17 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
> set_bit(cpus->map[c], mask->bits);
> }
>
> +static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char *mask_spec)
> +{
> + struct perf_cpu_map *cpus;
> +
> + cpus = perf_cpu_map__new(mask_spec);
> + if (cpus) {
> + record__mmap_cpu_mask_init(mask, cpus);
> + perf_cpu_map__put(cpus);
> + }
> +}
> +
> static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> {
> int t, ret;
> @@ -3198,6 +3291,235 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>
> return 0;
> }
> +
> +static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> + int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
> +
> + ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
> + if (ret)
> + return ret;
> +
> + rec->nr_threads = nr_cpus;
> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> + for (t = 0; t < rec->nr_threads; t++) {
> + set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
> + pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus->map[t]);
> + set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
> + pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus->map[t]);
> + }
> +
> + return 0;
> +}
> +
> +static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
> + char **maps_spec, char **affinity_spec, u32 nr_spec)
> +{
> + u32 s;
> + int ret = 0, nr_threads = 0;
> + struct mmap_cpu_mask cpus_mask;
> + struct thread_mask thread_mask, full_mask, *prev_masks;
> +
> + ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
> + if (ret)
> + goto out;
> + record__mmap_cpu_mask_init(&cpus_mask, cpus);
> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> + if (ret)
> + goto out_free_cpu_mask;
> + ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
> + if (ret)
> + goto out_free_thread_mask;
> + record__thread_mask_clear(&full_mask);
> +
> + for (s = 0; s < nr_spec; s++) {
> + record__thread_mask_clear(&thread_mask);
> +
> + record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
> + record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
> +
> + if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
> + cpus_mask.bits, thread_mask.maps.nbits) ||
> + !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
> + cpus_mask.bits, thread_mask.affinity.nbits))
> + continue;
> +
> + ret = record__thread_mask_intersects(&thread_mask, &full_mask);
> + if (ret)
> + goto out_free_full_mask;
> + record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
> +
> + prev_masks = rec->thread_masks;
> + rec->thread_masks = realloc(rec->thread_masks,
> + (nr_threads + 1) * sizeof(struct thread_mask));
> + if (!rec->thread_masks) {
> + pr_err("Failed to allocate thread masks\n");
> + rec->thread_masks = prev_masks;
> + ret = -ENOMEM;
> + goto out_free_full_mask;
> + }
> + rec->thread_masks[nr_threads] = thread_mask;
> + if (verbose) {
> + pr_debug("thread_masks[%d]: addr=", nr_threads);
> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
> + pr_debug("thread_masks[%d]: addr=", nr_threads);
> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity,
> + "affinity");
> + }
> + nr_threads++;
> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> + if (ret)
> + goto out_free_full_mask;
> + }
> +
> + rec->nr_threads = nr_threads;
> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> + if (rec->nr_threads <= 0)
> + ret = -EINVAL;
> +
> +out_free_full_mask:
> + record__thread_mask_free(&full_mask);
> +out_free_thread_mask:
> + record__thread_mask_free(&thread_mask);
> +out_free_cpu_mask:
> + record__mmap_cpu_mask_free(&cpus_mask);
> +out:
> + return ret;
> +}
> +
> +static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> + int ret;
> + struct cpu_topology *topo;
> +
> + topo = cpu_topology__new();
> + if (!topo)
> + return -EINVAL;
> +
> + ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
> + topo->thread_siblings, topo->thread_sib);
> + cpu_topology__delete(topo);
> +
> + return ret;
> +}
> +
> +static int record__init_thread_socket_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> + int ret;
> + struct cpu_topology *topo;
> +
> + topo = cpu_topology__new();
> + if (!topo)
> + return -EINVAL;
> +
> + ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
> + topo->core_siblings, topo->core_sib);
> + cpu_topology__delete(topo);
> +
> + return ret;
> +}
> +
> +static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> + u32 s;
> + int ret;
> + char **spec;
> + struct numa_topology *topo;
> +
> + topo = numa_topology__new();
> + if (!topo)
> + return -EINVAL;
> + spec = zalloc(topo->nr * sizeof(char *));
> + if (!spec) {
> + ret = -ENOMEM;
> + goto out_delete_topo;
> + }
> + for (s = 0; s < topo->nr; s++)
> + spec[s] = topo->nodes[s].cpus;
> +
> + ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
> +
> + zfree(&spec);
> +
> +out_delete_topo:
> + numa_topology__delete(topo);
> +
> + return ret;
> +}
> +
> +static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> + int t, ret;
> + u32 s, nr_spec = 0;
> + char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
> + char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
> +
> + for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
> + spec = strtok_r(user_spec, ":", &spec_ptr);
> + if (spec == NULL)
> + break;
> + pr_debug(" spec[%d]: %s\n", t, spec);
> + mask = strtok_r(spec, "/", &mask_ptr);
> + if (mask == NULL)
> + break;
> + pr_debug(" maps mask: %s\n", mask);
> + prev_spec = maps_spec;
> + maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
> + if (!maps_spec) {
> + pr_err("Failed to realloc maps_spec\n");
> + maps_spec = prev_spec;
> + ret = -ENOMEM;
> + goto out_free_all_specs;
> + }
> + maps_spec[nr_spec] = strdup(mask);
> + if (!maps_spec[nr_spec]) {
> + pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
> + ret = -ENOMEM;
> + goto out_free_all_specs;
> + }
> + mask = strtok_r(NULL, "/", &mask_ptr);
> + if (mask == NULL) {
> + free(maps_spec[nr_spec]);
> + ret = -EINVAL;
> + goto out_free_all_specs;
> + }
> + pr_debug(" affinity mask: %s\n", mask);
> + prev_spec = affinity_spec;
> + affinity_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
> + if (!affinity_spec) {
> + pr_err("Failed to realloc affinity_spec\n");
> + affinity_spec = prev_spec;
> + free(maps_spec[nr_spec]);
> + ret = -ENOMEM;
> + goto out_free_all_specs;
> + }
> + affinity_spec[nr_spec] = strdup(mask);
> + if (!affinity_spec[nr_spec]) {
> + pr_err("Failed to alloc affinity_spec[%d]\n", nr_spec);
> + free(maps_spec[nr_spec]);
> + ret = -ENOMEM;
> + goto out_free_all_specs;
> + }
> + nr_spec++;
> + }
> +
> + ret = record__init_thread_masks_spec(rec, cpus, maps_spec, affinity_spec, nr_spec);
> +
> +out_free_all_specs:
> + for (s = 0; s < nr_spec; s++) {
> + if (maps_spec)
> + free(maps_spec[s]);
> + if (affinity_spec)
> + free(affinity_spec[s]);
> + }
> + free(affinity_spec);
> + free(maps_spec);
> +
> + return ret;
> +}
> +
> static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> {
> int ret;
> @@ -3215,9 +3537,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu
>
> static int record__init_thread_masks(struct record *rec)
> {
> + int ret = 0;
> struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>
> - return record__init_thread_default_masks(rec, cpus);
> + if (!record__threads_enabled(rec))
> + return record__init_thread_default_masks(rec, cpus);
> +
> + switch (rec->opts.threads_spec) {
> + case THREAD_SPEC__CPU:
> + ret = record__init_thread_cpu_masks(rec, cpus);
> + break;
> + case THREAD_SPEC__CORE:
> + ret = record__init_thread_core_masks(rec, cpus);
> + break;
> + case THREAD_SPEC__SOCKET:
> + ret = record__init_thread_socket_masks(rec, cpus);
> + break;
> + case THREAD_SPEC__NUMA:
> + ret = record__init_thread_numa_masks(rec, cpus);
> + break;
> + case THREAD_SPEC__USER:
> + ret = record__init_thread_user_masks(rec, cpus);
> + break;
> + default:
> + break;
> + }
> +
> + return ret;
> }
>
> static int record__fini_thread_masks(struct record *rec)
> @@ -3474,7 +3820,12 @@ int cmd_record(int argc, const char **argv)
>
> err = record__init_thread_masks(rec);
> if (err) {
> - pr_err("record__init_thread_masks failed, error %d\n", err);
> + if (err > 0)
> + pr_err("ERROR: parallel data streaming masks (--threads) intersect\n");
> + else if (err == -EINVAL)
> + pr_err("ERROR: invalid parallel data streaming masks (--threads)\n");
> + else
> + pr_err("record__init_thread_masks failed, error %d\n", err);
> goto out;
> }
>
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index 4d68b7e27272..3da156498f47 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -78,6 +78,7 @@ struct record_opts {
> int ctl_fd_ack;
> bool ctl_fd_close;
> int threads_spec;
> + const char *threads_user_spec;
> };
>
> extern const char * const *record_usage;
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 17:35:09

by Bayduraev, Alexey V

[permalink] [raw]
Subject: Re: [PATCH v8 09/22] tools lib: Introduce bitmap_intersects() operation



On 30.06.2021 20:24, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 30, 2021 at 06:54:48PM +0300, Alexey Bayduraev escreveu:
>> Introduce bitmap_intersects() routine that tests whether
>
> Is this _adopting_ bitmap_intersects() from the kernel sources?

Yes, __bitmap_intersects is copy-pasted from the kernel sources,
like the others in bitmap.h

Regards,
Alexey

>
>> bitmaps bitmap1 and bitmap2 intersect. This routine will
>> be used during thread masks initialization.
>>
>> Acked-by: Andi Kleen <[email protected]>
>> Acked-by: Namhyung Kim <[email protected]>
>> Signed-off-by: Alexey Bayduraev <[email protected]>
>> ---
>> tools/include/linux/bitmap.h | 11 +++++++++++
>> tools/lib/bitmap.c | 14 ++++++++++++++
>> 2 files changed, 25 insertions(+)
>>
>> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
>> index 330dbf7509cc..9d959bc24859 100644
>> --- a/tools/include/linux/bitmap.h
>> +++ b/tools/include/linux/bitmap.h
>> @@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
>> int __bitmap_equal(const unsigned long *bitmap1,
>> const unsigned long *bitmap2, unsigned int bits);
>> void bitmap_clear(unsigned long *map, unsigned int start, int len);
>> +int __bitmap_intersects(const unsigned long *bitmap1,
>> + const unsigned long *bitmap2, unsigned int bits);
>>
>> #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
>> #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
>> @@ -170,4 +172,13 @@ static inline int bitmap_equal(const unsigned long *src1,
>> return __bitmap_equal(src1, src2, nbits);
>> }
>>
>> +static inline int bitmap_intersects(const unsigned long *src1,
>> + const unsigned long *src2, unsigned int nbits)
>> +{
>> + if (small_const_nbits(nbits))
>> + return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
>> + else
>> + return __bitmap_intersects(src1, src2, nbits);
>> +}
>> +
>> #endif /* _PERF_BITOPS_H */
>> diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
>> index f4e914712b6f..db466ef7be9d 100644
>> --- a/tools/lib/bitmap.c
>> +++ b/tools/lib/bitmap.c
>> @@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,
>>
>> return 1;
>> }
>> +
>> +int __bitmap_intersects(const unsigned long *bitmap1,
>> + const unsigned long *bitmap2, unsigned int bits)
>> +{
>> + unsigned int k, lim = bits/BITS_PER_LONG;
>> + for (k = 0; k < lim; ++k)
>> + if (bitmap1[k] & bitmap2[k])
>> + return 1;
>> +
>> + if (bits % BITS_PER_LONG)
>> + if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
>> + return 1;
>> + return 0;
>> +}
>> --
>> 2.19.0
>>
>

2021-06-30 17:43:50

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 09/22] tools lib: Introduce bitmap_intersects() operation

Em Wed, Jun 30, 2021 at 02:24:26PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Jun 30, 2021 at 06:54:48PM +0300, Alexey Bayduraev escreveu:
> > Introduce bitmap_intersects() routine that tests whether
>
> Is this _adopting_ bitmap_intersects() from the kernel sources?

Ok, clarified that in the changeset comment and applied this patch to
reduce the number of patches in this patchset; there is another patch I
think can be cherry-picked, checking.

- Arnaldo

> > bitmaps bitmap1 and bitmap2 intersect. This routine will
> > be used during thread masks initialization.
> >
> > Acked-by: Andi Kleen <[email protected]>
> > Acked-by: Namhyung Kim <[email protected]>
> > Signed-off-by: Alexey Bayduraev <[email protected]>
> > ---
> > tools/include/linux/bitmap.h | 11 +++++++++++
> > tools/lib/bitmap.c | 14 ++++++++++++++
> > 2 files changed, 25 insertions(+)
> >
> > diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
> > index 330dbf7509cc..9d959bc24859 100644
> > --- a/tools/include/linux/bitmap.h
> > +++ b/tools/include/linux/bitmap.h
> > @@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
> > int __bitmap_equal(const unsigned long *bitmap1,
> > const unsigned long *bitmap2, unsigned int bits);
> > void bitmap_clear(unsigned long *map, unsigned int start, int len);
> > +int __bitmap_intersects(const unsigned long *bitmap1,
> > + const unsigned long *bitmap2, unsigned int bits);
> >
> > #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
> > #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
> > @@ -170,4 +172,13 @@ static inline int bitmap_equal(const unsigned long *src1,
> > return __bitmap_equal(src1, src2, nbits);
> > }
> >
> > +static inline int bitmap_intersects(const unsigned long *src1,
> > + const unsigned long *src2, unsigned int nbits)
> > +{
> > + if (small_const_nbits(nbits))
> > + return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
> > + else
> > + return __bitmap_intersects(src1, src2, nbits);
> > +}
> > +
> > #endif /* _PERF_BITOPS_H */
> > diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
> > index f4e914712b6f..db466ef7be9d 100644
> > --- a/tools/lib/bitmap.c
> > +++ b/tools/lib/bitmap.c
> > @@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,
> >
> > return 1;
> > }
> > +
> > +int __bitmap_intersects(const unsigned long *bitmap1,
> > + const unsigned long *bitmap2, unsigned int bits)
> > +{
> > + unsigned int k, lim = bits/BITS_PER_LONG;
> > + for (k = 0; k < lim; ++k)
> > + if (bitmap1[k] & bitmap2[k])
> > + return 1;
> > +
> > + if (bits % BITS_PER_LONG)
> > + if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
> > + return 1;
> > + return 0;
> > +}
> > --
> > 2.19.0
> >
>
> --
>
> - Arnaldo

--

- Arnaldo

2021-06-30 18:38:18

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 12/22] perf report: Output data file name in raw trace dump

Em Wed, Jun 30, 2021 at 06:54:51PM +0300, Alexey Bayduraev escreveu:
> Print path and name of a data file into raw dump (-D)
> <file_offset>@<path/file>. Print offset of PERF_RECORD_COMPRESSED
> record instead of zero for decompressed records:
> [email protected] [0x30]: event: 9
> or
> [email protected]/data.7 [0x30]: event: 9

You're changing the logic in a subtle way here, see below

> Acked-by: Namhyung Kim <[email protected]>
> Acked-by: Andi Kleen <[email protected]>

<SNIP>

> @@ -2021,7 +2031,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
> }
> }
>
> - if ((skip = perf_session__process_event(session, event, head)) < 0) {
> + skip = perf_session__process_event(session, event, head, "pipe");
> + if (skip < 0) {


Why do this in this patch? It's not needed, leave it alone to make the
patch smaller.

> pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
> head, event->header.size, event->header.type);
> err = -EINVAL;
> @@ -2102,7 +2113,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
> static int __perf_session__process_decomp_events(struct perf_session *session)
> {
> s64 skip;
> - u64 size, file_pos = 0;
> + u64 size;
> struct decomp *decomp = session->decomp_last;
>
> if (!decomp)
> @@ -2116,9 +2127,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
> break;
>
> size = event->header.size;
> -
> - if (size < sizeof(struct perf_event_header) ||
> - (skip = perf_session__process_event(session, event, file_pos)) < 0) {


The call to perf_session__process_event() will not be made if

(size < sizeof(struct perf_event_header))

evaluates to true; with your change it is being made unconditionally.
Also, before it was using that file_pos variable, set to zero and
possibly modified by the logic in this function.

And I read just "perf report: Output data file name in raw trace", so
when I saw this separate change to pass 'decomp->file_pos' and remove
that 'file_pos = 0' part I scratched my head, then I read again the
commit log message and there it says it also does this separate change.

Please make it a separate patch where you explain why this has to be done
this way and what previous cset this fixes, so that the
[email protected] guys pick it as it sounds like a fix.

> + skip = perf_session__process_event(session, event, decomp->file_pos,
> + decomp->file_path);
> + if (size < sizeof(struct perf_event_header) || skip < 0) {
> pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
> decomp->file_pos + decomp->head, event->header.size, event->header.type);
> return -EINVAL;
> @@ -2149,10 +2160,12 @@ struct reader;
>
> typedef s64 (*reader_cb_t)(struct perf_session *session,
> union perf_event *event,
> - u64 file_offset);
> + u64 file_offset,
> + const char *file_path);
>
> struct reader {
> int fd;
> + const char *path;
> u64 data_size;
> u64 data_offset;
> reader_cb_t process;
> @@ -2234,9 +2247,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
> skip = -EINVAL;
>
> if (size < sizeof(struct perf_event_header) ||
> - (skip = rd->process(session, event, file_pos)) < 0) {
> - pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
> - file_offset + head, event->header.size,
> + (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
> + pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
> + file_offset + head, rd->path, event->header.size,
> event->header.type, strerror(-skip));
> err = skip;
> goto out;
> @@ -2266,9 +2279,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,
>
> static s64 process_simple(struct perf_session *session,
> union perf_event *event,
> - u64 file_offset)
> + u64 file_offset,
> + const char *file_path)
> {
> - return perf_session__process_event(session, event, file_offset);
> + return perf_session__process_event(session, event, file_offset, file_path);
> }
>
> static int __perf_session__process_events(struct perf_session *session)
> @@ -2278,6 +2292,7 @@ static int __perf_session__process_events(struct perf_session *session)
> .data_size = session->header.data_size,
> .data_offset = session->header.data_offset,
> .process = process_simple,
> + .path = session->data->file.path,
> .in_place_update = session->data->in_place_update,
> };
> struct ordered_events *oe = &session->ordered_events;
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index e31ba4c92a6c..6895a22ab5b7 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -46,6 +46,7 @@ struct perf_session {
> struct decomp {
> struct decomp *next;
> u64 file_pos;
> + const char *file_path;
> size_t mmap_len;
> u64 head;
> size_t size;
> diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
> index bbbc0dcd461f..c966531d3eca 100644
> --- a/tools/perf/util/tool.h
> +++ b/tools/perf/util/tool.h
> @@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
>
> typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
> typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
> -typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
> +typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
> + const char *str);
>
> typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
> struct ordered_events *oe);
> --
> 2.19.0
>

--

- Arnaldo

2021-06-30 18:57:53

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Hi,

On 30.06.2021 20:28, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 30, 2021 at 06:54:49PM +0300, Alexey Bayduraev escreveu:
>> Provide --threads option in perf record command line interface.
>> The option can have a value in the form of masks that specify
>> cpus to be monitored with data streaming threads and their layout
>> in system topology. The masks can be filtered using the cpu mask
>> provided via the -C option.
>
> Perhaps make this --jobs/-j to reuse what 'make' and 'ninja' uses?
>
> Unfortunately:
>
> [acme@quaco pahole]$ perf record -h -j
>
> Usage: perf record [<options>] [<command>]
> or: perf record [<options>] -- <command> [<options>]
>
> -j, --branch-filter <branch filter mask>
> branch stack filter modes
>
> [acme@quaco pahole]$
>
> But we could reuse --jobs
>
> I thought you would start with plain:
>
> -j N
>
> And start one thread per CPU in 'perf record' existing CPU affinity
> mask, then go on introducing more sophisticated modes.

As I remember the first prototype [1] and
[2] https://lore.kernel.org/lkml/[email protected]/

introduces:

--thread=mode|number_of_threads

where mode defines cpu masks (cpu/numa/socket/etc)

Then somewhere while discussing this patchset it was decided, for unification,
that --thread should only define CPU/affinity masks or their aliases.
I think Alexei or Jiri could clarify this more.

>
> Have you done it this way because it's how VTune has evolved over the years
> and what it now expects from 'perf record'?

VTune uses only --thread=cpu or no threading.

Regards,
Alexey

>
> - Arnaldo
>
>> The specification value can be user defined list of masks. Masks
>> separated by colon define cpus to be monitored by one thread and
>> affinity mask of that thread is separated by slash. For example:
>> <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
>> specifies parallel threads layout that consists of two threads
>> with corresponding assigned cpus to be monitored.
>>
>> The specification value can be a string e.g. "cpu", "core" or
>> "socket" meaning creation of data streaming thread for every
>> cpu or core or socket to monitor distinct cpus or cpus grouped
>> by core or socket.
>>
>> The option provided with no or empty value defaults to per-cpu
>> parallel threads layout creating data streaming thread for every
>> cpu being monitored.
>>
>> Feature design and implementation are based on prototypes [1], [2].
>>
>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>> [2] https://lore.kernel.org/lkml/[email protected]/
>>
>> Suggested-by: Jiri Olsa <[email protected]>
>> Suggested-by: Namhyung Kim <[email protected]>
>> Acked-by: Andi Kleen <[email protected]>
>> Acked-by: Namhyung Kim <[email protected]>
>> Signed-off-by: Alexey Bayduraev <[email protected]>
>> ---
>> tools/perf/builtin-record.c | 355 +++++++++++++++++++++++++++++++++++-
>> tools/perf/util/record.h | 1 +
>> 2 files changed, 354 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 2ade17308467..8d452797d175 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -51,6 +51,7 @@
>> #include "util/evlist-hybrid.h"
>> #include "asm/bug.h"
>> #include "perf.h"
>> +#include "cputopo.h"
>>
>> #include <errno.h>
>> #include <inttypes.h>
>> @@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
>> "UNDEFINED", "READY"
>> };
>>
>> +enum thread_spec {
>> + THREAD_SPEC__UNDEFINED = 0,
>> + THREAD_SPEC__CPU,
>> + THREAD_SPEC__CORE,
>> + THREAD_SPEC__SOCKET,
>> + THREAD_SPEC__NUMA,
>> + THREAD_SPEC__USER,
>> + THREAD_SPEC__MAX,
>> +};
>> +
>> +static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
>> + "undefined", "cpu", "core", "socket", "numa", "user"
>> +};
>> +
>> struct record {
>> struct perf_tool tool;
>> struct record_opts opts;
>> @@ -2723,6 +2738,70 @@ static void record__thread_mask_free(struct thread_mask *mask)
>> record__mmap_cpu_mask_free(&mask->affinity);
>> }
>>
>> +static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
>> + struct thread_mask *src2)
>> +{
>> + if (src1->maps.nbits != src2->maps.nbits ||
>> + dest->maps.nbits != src1->maps.nbits ||
>> + src1->affinity.nbits != src2->affinity.nbits ||
>> + dest->affinity.nbits != src1->affinity.nbits)
>> + return -EINVAL;
>> +
>> + bitmap_or(dest->maps.bits, src1->maps.bits,
>> + src2->maps.bits, src1->maps.nbits);
>> + bitmap_or(dest->affinity.bits, src1->affinity.bits,
>> + src2->affinity.bits, src1->affinity.nbits);
>> +
>> + return 0;
>> +}
>> +
>> +static int record__thread_mask_intersects(struct thread_mask *mask_1, struct thread_mask *mask_2)
>> +{
>> + int res1, res2;
>> +
>> + if (mask_1->maps.nbits != mask_2->maps.nbits ||
>> + mask_1->affinity.nbits != mask_2->affinity.nbits)
>> + return -EINVAL;
>> +
>> + res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
>> + mask_1->maps.nbits);
>> + res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
>> + mask_1->affinity.nbits);
>> + if (res1 || res2)
>> + return 1;
>> +
>> + return 0;
>> +}
>> +
>> +static int record__parse_threads(const struct option *opt, const char *str, int unset)
>> +{
>> + int s;
>> + struct record_opts *opts = opt->value;
>> +
>> + if (unset || !str || !strlen(str)) {
>> + opts->threads_spec = THREAD_SPEC__CPU;
>> + } else {
>> + for (s = 1; s < THREAD_SPEC__MAX; s++) {
>> + if (s == THREAD_SPEC__USER) {
>> + opts->threads_user_spec = strdup(str);
>> + opts->threads_spec = THREAD_SPEC__USER;
>> + break;
>> + }
>> + if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
>> + opts->threads_spec = s;
>> + break;
>> + }
>> + }
>> + }
>> +
>> + pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
>> + if (opts->threads_spec == THREAD_SPEC__USER)
>> + pr_debug("=[%s]", opts->threads_user_spec);
>> + pr_debug("\n");
>> +
>> + return 0;
>> +}
>> +
>> static int parse_output_max_size(const struct option *opt,
>> const char *str, int unset)
>> {
>> @@ -3166,6 +3245,9 @@ static struct option __record_options[] = {
>> "\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
>> "\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
>> parse_control_option),
>> + OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
>> + "write collected trace data into several data files using parallel threads",
>> + record__parse_threads),
>> OPT_END()
>> };
>>
>> @@ -3179,6 +3261,17 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
>> set_bit(cpus->map[c], mask->bits);
>> }
>>
>> +static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char *mask_spec)
>> +{
>> + struct perf_cpu_map *cpus;
>> +
>> + cpus = perf_cpu_map__new(mask_spec);
>> + if (cpus) {
>> + record__mmap_cpu_mask_init(mask, cpus);
>> + perf_cpu_map__put(cpus);
>> + }
>> +}
>> +
>> static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>> {
>> int t, ret;
>> @@ -3198,6 +3291,235 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>>
>> return 0;
>> }
>> +
>> +static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> + int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
>> +
>> + ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
>> + if (ret)
>> + return ret;
>> +
>> + rec->nr_threads = nr_cpus;
>> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
>> +
>> + for (t = 0; t < rec->nr_threads; t++) {
>> + set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
>> + pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus->map[t]);
>> + set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
>> + pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus->map[t]);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
>> + char **maps_spec, char **affinity_spec, u32 nr_spec)
>> +{
>> + u32 s;
>> + int ret = 0, nr_threads = 0;
>> + struct mmap_cpu_mask cpus_mask;
>> + struct thread_mask thread_mask, full_mask, *prev_masks;
>> +
>> + ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
>> + if (ret)
>> + goto out;
>> + record__mmap_cpu_mask_init(&cpus_mask, cpus);
>> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
>> + if (ret)
>> + goto out_free_cpu_mask;
>> + ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
>> + if (ret)
>> + goto out_free_thread_mask;
>> + record__thread_mask_clear(&full_mask);
>> +
>> + for (s = 0; s < nr_spec; s++) {
>> + record__thread_mask_clear(&thread_mask);
>> +
>> + record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
>> + record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
>> +
>> + if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
>> + cpus_mask.bits, thread_mask.maps.nbits) ||
>> + !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
>> + cpus_mask.bits, thread_mask.affinity.nbits))
>> + continue;
>> +
>> + ret = record__thread_mask_intersects(&thread_mask, &full_mask);
>> + if (ret)
>> + goto out_free_full_mask;
>> + record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
>> +
>> + prev_masks = rec->thread_masks;
>> + rec->thread_masks = realloc(rec->thread_masks,
>> + (nr_threads + 1) * sizeof(struct thread_mask));
>> + if (!rec->thread_masks) {
>> + pr_err("Failed to allocate thread masks\n");
>> + rec->thread_masks = prev_masks;
>> + ret = -ENOMEM;
>> + goto out_free_full_mask;
>> + }
>> + rec->thread_masks[nr_threads] = thread_mask;
>> + if (verbose) {
>> + pr_debug("thread_masks[%d]: addr=", nr_threads);
>> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
>> + pr_debug("thread_masks[%d]: addr=", nr_threads);
>> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity,
>> + "affinity");
>> + }
>> + nr_threads++;
>> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
>> + if (ret)
>> + goto out_free_full_mask;
>> + }
>> +
>> + rec->nr_threads = nr_threads;
>> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
>> +
>> + if (rec->nr_threads <= 0)
>> + ret = -EINVAL;
>> +
>> +out_free_full_mask:
>> + record__thread_mask_free(&full_mask);
>> +out_free_thread_mask:
>> + record__thread_mask_free(&thread_mask);
>> +out_free_cpu_mask:
>> + record__mmap_cpu_mask_free(&cpus_mask);
>> +out:
>> + return ret;
>> +}
>> +
>> +static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> + int ret;
>> + struct cpu_topology *topo;
>> +
>> + topo = cpu_topology__new();
>> + if (!topo)
>> + return -EINVAL;
>> +
>> + ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
>> + topo->thread_siblings, topo->thread_sib);
>> + cpu_topology__delete(topo);
>> +
>> + return ret;
>> +}
>> +
>> +static int record__init_thread_socket_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> + int ret;
>> + struct cpu_topology *topo;
>> +
>> + topo = cpu_topology__new();
>> + if (!topo)
>> + return -EINVAL;
>> +
>> + ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
>> + topo->core_siblings, topo->core_sib);
>> + cpu_topology__delete(topo);
>> +
>> + return ret;
>> +}
>> +
>> +static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> + u32 s;
>> + int ret;
>> + char **spec;
>> + struct numa_topology *topo;
>> +
>> + topo = numa_topology__new();
>> + if (!topo)
>> + return -EINVAL;
>> + spec = zalloc(topo->nr * sizeof(char *));
>> + if (!spec) {
>> + ret = -ENOMEM;
>> + goto out_delete_topo;
>> + }
>> + for (s = 0; s < topo->nr; s++)
>> + spec[s] = topo->nodes[s].cpus;
>> +
>> + ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
>> +
>> + zfree(&spec);
>> +
>> +out_delete_topo:
>> + numa_topology__delete(topo);
>> +
>> + return ret;
>> +}
>> +
>> +static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> + int t, ret;
>> + u32 s, nr_spec = 0;
>> + char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
>> + char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
>> +
>> + for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
>> + spec = strtok_r(user_spec, ":", &spec_ptr);
>> + if (spec == NULL)
>> + break;
>> + pr_debug(" spec[%d]: %s\n", t, spec);
>> + mask = strtok_r(spec, "/", &mask_ptr);
>> + if (mask == NULL)
>> + break;
>> + pr_debug(" maps mask: %s\n", mask);
>> + prev_spec = maps_spec;
>> + maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
>> + if (!maps_spec) {
>> + pr_err("Failed to realloc maps_spec\n");
>> + maps_spec = prev_spec;
>> + ret = -ENOMEM;
>> + goto out_free_all_specs;
>> + }
>> + maps_spec[nr_spec] = strdup(mask);
>> + if (!maps_spec[nr_spec]) {
>> + pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
>> + ret = -ENOMEM;
>> + goto out_free_all_specs;
>> + }
>> + mask = strtok_r(NULL, "/", &mask_ptr);
>> + if (mask == NULL) {
>> + free(maps_spec[nr_spec]);
>> + ret = -EINVAL;
>> + goto out_free_all_specs;
>> + }
>> + pr_debug(" affinity mask: %s\n", mask);
>> + prev_spec = affinity_spec;
>> + affinity_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
>> + if (!affinity_spec) {
>> + pr_err("Failed to realloc affinity_spec\n");
>> + affinity_spec = prev_spec;
>> + free(maps_spec[nr_spec]);
>> + ret = -ENOMEM;
>> + goto out_free_all_specs;
>> + }
>> + affinity_spec[nr_spec] = strdup(mask);
>> + if (!affinity_spec[nr_spec]) {
>> + pr_err("Failed to alloc affinity_spec[%d]\n", nr_spec);
>> + free(maps_spec[nr_spec]);
>> + ret = -ENOMEM;
>> + goto out_free_all_specs;
>> + }
>> + nr_spec++;
>> + }
>> +
>> + ret = record__init_thread_masks_spec(rec, cpus, maps_spec, affinity_spec, nr_spec);
>> +
>> +out_free_all_specs:
>> + for (s = 0; s < nr_spec; s++) {
>> + if (maps_spec)
>> + free(maps_spec[s]);
>> + if (affinity_spec)
>> + free(affinity_spec[s]);
>> + }
>> + free(affinity_spec);
>> + free(maps_spec);
>> +
>> + return ret;
>> +}
>> +
>> static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>> {
>> int ret;
>> @@ -3215,9 +3537,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu
>>
>> static int record__init_thread_masks(struct record *rec)
>> {
>> + int ret = 0;
>> struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>>
>> - return record__init_thread_default_masks(rec, cpus);
>> + if (!record__threads_enabled(rec))
>> + return record__init_thread_default_masks(rec, cpus);
>> +
>> + switch (rec->opts.threads_spec) {
>> + case THREAD_SPEC__CPU:
>> + ret = record__init_thread_cpu_masks(rec, cpus);
>> + break;
>> + case THREAD_SPEC__CORE:
>> + ret = record__init_thread_core_masks(rec, cpus);
>> + break;
>> + case THREAD_SPEC__SOCKET:
>> + ret = record__init_thread_socket_masks(rec, cpus);
>> + break;
>> + case THREAD_SPEC__NUMA:
>> + ret = record__init_thread_numa_masks(rec, cpus);
>> + break;
>> + case THREAD_SPEC__USER:
>> + ret = record__init_thread_user_masks(rec, cpus);
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return ret;
>> }
>>
>> static int record__fini_thread_masks(struct record *rec)
>> @@ -3474,7 +3820,12 @@ int cmd_record(int argc, const char **argv)
>>
>> err = record__init_thread_masks(rec);
>> if (err) {
>> - pr_err("record__init_thread_masks failed, error %d\n", err);
>> + if (err > 0)
>> + pr_err("ERROR: parallel data streaming masks (--threads) intersect\n");
>> + else if (err == -EINVAL)
>> + pr_err("ERROR: invalid parallel data streaming masks (--threads)\n");
>> + else
>> + pr_err("record__init_thread_masks failed, error %d\n", err);
>> goto out;
>> }
>>
>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>> index 4d68b7e27272..3da156498f47 100644
>> --- a/tools/perf/util/record.h
>> +++ b/tools/perf/util/record.h
>> @@ -78,6 +78,7 @@ struct record_opts {
>> int ctl_fd_ack;
>> bool ctl_fd_close;
>> int threads_spec;
>> + const char *threads_user_spec;
>> };
>>
>> extern const char * const *record_usage;
>> --
>> 2.19.0
>>
>

2021-07-01 10:10:32

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 21/22] perf session: Introduce READER_NODATA state

Hi,

On 30.06.2021 18:55, Alexey Bayduraev wrote:
> Adding READER_NODATA state to differentiate it from the real end-of-file
> state. Also the indent depth in the readers initialization loop is reduced.

Just a little note: I will squash this patch to 19 and 20.

Regards,
Alexey

>
> Suggested-by: Namhyung Kim <[email protected]>
> Signed-off-by: Alexey Bayduraev <[email protected]>
> ---
> tools/perf/util/session.c | 45 ++++++++++++++++++++-------------------
> 1 file changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index b11a502c22a3..c2b6c5f4e119 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -70,7 +70,8 @@ struct reader_state {
>
> enum {
> READER_EOF = 0,
> - READER_OK = 1,
> + READER_NODATA = 1,
> + READER_OK = 2,
> };
>
> struct reader {
> @@ -2305,7 +2306,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
> return PTR_ERR(event);
>
> if (!event)
> - return READER_EOF;
> + return READER_NODATA;
>
> session->active_reader = rd;
> size = event->header.size;
> @@ -2395,7 +2396,7 @@ static int __perf_session__process_events(struct perf_session *session)
> err = reader__read_event(rd, session, &prog);
> if (err < 0)
> break;
> - if (err == READER_EOF) {
> + if (err == READER_NODATA) {
> err = reader__mmap(rd, session);
> if (err <= 0)
> break;
> @@ -2472,25 +2473,25 @@ static int __perf_session__process_dir_events(struct perf_session *session)
> readers++;
>
> for (i = 0; i < data->dir.nr; i++) {
> - if (data->dir.files[i].size) {
> - rd[readers] = (struct reader) {
> - .fd = data->dir.files[i].fd,
> - .path = data->dir.files[i].path,
> - .data_size = data->dir.files[i].size,
> - .data_offset = 0,
> - .in_place_update = session->data->in_place_update,
> - };
> - ret = reader__init(&rd[readers], NULL);
> - if (ret)
> - goto out_err;
> - ret = reader__mmap(&rd[readers], session);
> - if (ret != READER_OK) {
> - if (ret == READER_EOF)
> - ret = -EINVAL;
> - goto out_err;
> - }
> - readers++;
> + if (!data->dir.files[i].size)
> + continue;
> + rd[readers] = (struct reader) {
> + .fd = data->dir.files[i].fd,
> + .path = data->dir.files[i].path,
> + .data_size = data->dir.files[i].size,
> + .data_offset = 0,
> + .in_place_update = session->data->in_place_update,
> + };
> + ret = reader__init(&rd[readers], NULL);
> + if (ret)
> + goto out_err;
> + ret = reader__mmap(&rd[readers], session);
> + if (ret != READER_OK) {
> + if (ret == READER_EOF)
> + ret = -EINVAL;
> + goto out_err;
> }
> + readers++;
> }
>
> i = 0;
> @@ -2507,7 +2508,7 @@ static int __perf_session__process_dir_events(struct perf_session *session)
> ret = reader__read_event(&rd[i], session, &prog);
> if (ret < 0)
> break;
> - if (ret == READER_EOF) {
> + if (ret == READER_NODATA) {
> ret = reader__mmap(&rd[i], session);
> if (ret < 0)
> goto out_err;
>

2021-07-01 11:51:52

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Hi,

On 30.06.2021 21:54, Bayduraev, Alexey V wrote:
> Hi,
>
> On 30.06.2021 20:28, Arnaldo Carvalho de Melo wrote:
>> Em Wed, Jun 30, 2021 at 06:54:49PM +0300, Alexey Bayduraev escreveu:
>>> Provide --threads option in perf record command line interface.
>>> The option can have a value in the form of masks that specify
>>> cpus to be monitored with data streaming threads and their layout
>>> in system topology. The masks can be filtered using the cpu mask
>>> provided via the -C option.
>>
>> Perhaps make this --jobs/-j to reuse what 'make' and 'ninja' uses?
>>
>> Unfortunately:
>>
>> [acme@quaco pahole]$ perf record -h -j
>>
>> Usage: perf record [<options>] [<command>]
>> or: perf record [<options>] -- <command> [<options>]
>>
>> -j, --branch-filter <branch filter mask>
>> branch stack filter modes
>>
>> [acme@quaco pahole]$
>>
>> But we could reuse --jobs
>>
>> I thought you would start with plain:
>>
>> -j N
>>
>> And start one thread per CPU in 'perf record' existing CPU affinity
>> mask, then go on introducing more sophisticated modes.
>
> As I remember the first prototype [1] and
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> introduces:
>
> --thread=mode|number_of_threads
>
> where mode defines cpu masks (cpu/numa/socket/etc)
>
> Then somewhere while discussing this patchset it was decided, for unification,
> that --thread should only define CPU/affinity masks or their aliases.
> I think Alexei or Jiri could clarify this more.
>
>>
>> Have you done it this way because it's how VTune has evolved over the years
>> and what it now expects from 'perf record'?
>
> VTune uses only --thread=cpu or no threading.

However we would like to have such sophisticated cpu/affinity masks to
tune perf-record for different workloads.
For example, some HPC workloads prefer the "numa" mask, while most telecom
workloads disallow using cpus where their non-preemptible
communication threads run.

Regards,
Alexey

>
> Regards,
> Alexey
>
>>
>> - Arnaldo
>>
>>> The specification value can be user defined list of masks. Masks
>>> separated by colon define cpus to be monitored by one thread and
>>> affinity mask of that thread is separated by slash. For example:
>>> <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
>>> specifies parallel threads layout that consists of two threads
>>> with corresponding assigned cpus to be monitored.
>>>
>>> The specification value can be a string e.g. "cpu", "core" or
>>> "socket" meaning creation of data streaming thread for every
>>> cpu or core or socket to monitor distinct cpus or cpus grouped
>>> by core or socket.
>>>
>>> The option provided with no or empty value defaults to per-cpu
>>> parallel threads layout creating data streaming thread for every
>>> cpu being monitored.
>>>
>>> Feature design and implementation are based on prototypes [1], [2].
>>>
>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>>> [2] https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Suggested-by: Jiri Olsa <[email protected]>
>>> Suggested-by: Namhyung Kim <[email protected]>
>>> Acked-by: Andi Kleen <[email protected]>
>>> Acked-by: Namhyung Kim <[email protected]>
>>> Signed-off-by: Alexey Bayduraev <[email protected]>
>>> ---
>>> tools/perf/builtin-record.c | 355 +++++++++++++++++++++++++++++++++++-
>>> tools/perf/util/record.h | 1 +
>>> 2 files changed, 354 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>> index 2ade17308467..8d452797d175 100644
>>> --- a/tools/perf/builtin-record.c
>>> +++ b/tools/perf/builtin-record.c
>>> @@ -51,6 +51,7 @@
>>> #include "util/evlist-hybrid.h"
>>> #include "asm/bug.h"
>>> #include "perf.h"
>>> +#include "cputopo.h"
>>>
>>> #include <errno.h>
>>> #include <inttypes.h>
>>> @@ -122,6 +123,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
>>> "UNDEFINED", "READY"
>>> };
>>>
>>> +enum thread_spec {
>>> + THREAD_SPEC__UNDEFINED = 0,
>>> + THREAD_SPEC__CPU,
>>> + THREAD_SPEC__CORE,
>>> + THREAD_SPEC__SOCKET,
>>> + THREAD_SPEC__NUMA,
>>> + THREAD_SPEC__USER,
>>> + THREAD_SPEC__MAX,
>>> +};
>>> +
>>> +static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
>>> + "undefined", "cpu", "core", "socket", "numa", "user"
>>> +};
>>> +
>>> struct record {
>>> struct perf_tool tool;
>>> struct record_opts opts;
>>> @@ -2723,6 +2738,70 @@ static void record__thread_mask_free(struct thread_mask *mask)
>>> record__mmap_cpu_mask_free(&mask->affinity);
>>> }
>>>
>>> +static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
>>> + struct thread_mask *src2)
>>> +{
>>> + if (src1->maps.nbits != src2->maps.nbits ||
>>> + dest->maps.nbits != src1->maps.nbits ||
>>> + src1->affinity.nbits != src2->affinity.nbits ||
>>> + dest->affinity.nbits != src1->affinity.nbits)
>>> + return -EINVAL;
>>> +
>>> + bitmap_or(dest->maps.bits, src1->maps.bits,
>>> + src2->maps.bits, src1->maps.nbits);
>>> + bitmap_or(dest->affinity.bits, src1->affinity.bits,
>>> + src2->affinity.bits, src1->affinity.nbits);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int record__thread_mask_intersects(struct thread_mask *mask_1, struct thread_mask *mask_2)
>>> +{
>>> + int res1, res2;
>>> +
>>> + if (mask_1->maps.nbits != mask_2->maps.nbits ||
>>> + mask_1->affinity.nbits != mask_2->affinity.nbits)
>>> + return -EINVAL;
>>> +
>>> + res1 = bitmap_intersects(mask_1->maps.bits, mask_2->maps.bits,
>>> + mask_1->maps.nbits);
>>> + res2 = bitmap_intersects(mask_1->affinity.bits, mask_2->affinity.bits,
>>> + mask_1->affinity.nbits);
>>> + if (res1 || res2)
>>> + return 1;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int record__parse_threads(const struct option *opt, const char *str, int unset)
>>> +{
>>> + int s;
>>> + struct record_opts *opts = opt->value;
>>> +
>>> + if (unset || !str || !strlen(str)) {
>>> + opts->threads_spec = THREAD_SPEC__CPU;
>>> + } else {
>>> + for (s = 1; s < THREAD_SPEC__MAX; s++) {
>>> + if (s == THREAD_SPEC__USER) {
>>> + opts->threads_user_spec = strdup(str);
>>> + opts->threads_spec = THREAD_SPEC__USER;
>>> + break;
>>> + }
>>> + if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
>>> + opts->threads_spec = s;
>>> + break;
>>> + }
>>> + }
>>> + }
>>> +
>>> + pr_debug("threads_spec: %s", thread_spec_tags[opts->threads_spec]);
>>> + if (opts->threads_spec == THREAD_SPEC__USER)
>>> + pr_debug("=[%s]", opts->threads_user_spec);
>>> + pr_debug("\n");
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static int parse_output_max_size(const struct option *opt,
>>> const char *str, int unset)
>>> {
>>> @@ -3166,6 +3245,9 @@ static struct option __record_options[] = {
>>> "\t\t\t Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
>>> "\t\t\t Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
>>> parse_control_option),
>>> + OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
>>> + "write collected trace data into several data files using parallel threads",
>>> + record__parse_threads),
>>> OPT_END()
>>> };
>>>
>>> @@ -3179,6 +3261,17 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
>>> set_bit(cpus->map[c], mask->bits);
>>> }
>>>
>>> +static void record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, char *mask_spec)
>>> +{
>>> + struct perf_cpu_map *cpus;
>>> +
>>> + cpus = perf_cpu_map__new(mask_spec);
>>> + if (cpus) {
>>> + record__mmap_cpu_mask_init(mask, cpus);
>>> + perf_cpu_map__put(cpus);
>>> + }
>>> +}
>>> +
>>> static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>>> {
>>> int t, ret;
>>> @@ -3198,6 +3291,235 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>>>
>>> return 0;
>>> }
>>> +
>>> +static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> + int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
>>> +
>>> + ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu());
>>> + if (ret)
>>> + return ret;
>>> +
>>> + rec->nr_threads = nr_cpus;
>>> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
>>> +
>>> + for (t = 0; t < rec->nr_threads; t++) {
>>> + set_bit(cpus->map[t], rec->thread_masks[t].maps.bits);
>>> + pr_debug("thread_masks[%d]: maps mask [%d]\n", t, cpus->map[t]);
>>> + set_bit(cpus->map[t], rec->thread_masks[t].affinity.bits);
>>> + pr_debug("thread_masks[%d]: affinity mask [%d]\n", t, cpus->map[t]);
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
>>> + char **maps_spec, char **affinity_spec, u32 nr_spec)
>>> +{
>>> + u32 s;
>>> + int ret = 0, nr_threads = 0;
>>> + struct mmap_cpu_mask cpus_mask;
>>> + struct thread_mask thread_mask, full_mask, *prev_masks;
>>> +
>>> + ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
>>> + if (ret)
>>> + goto out;
>>> + record__mmap_cpu_mask_init(&cpus_mask, cpus);
>>> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
>>> + if (ret)
>>> + goto out_free_cpu_mask;
>>> + ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
>>> + if (ret)
>>> + goto out_free_thread_mask;
>>> + record__thread_mask_clear(&full_mask);
>>> +
>>> + for (s = 0; s < nr_spec; s++) {
>>> + record__thread_mask_clear(&thread_mask);
>>> +
>>> + record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
>>> + record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
>>> +
>>> + if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
>>> + cpus_mask.bits, thread_mask.maps.nbits) ||
>>> + !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
>>> + cpus_mask.bits, thread_mask.affinity.nbits))
>>> + continue;
>>> +
>>> + ret = record__thread_mask_intersects(&thread_mask, &full_mask);
>>> + if (ret)
>>> + goto out_free_full_mask;
>>> + record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
>>> +
>>> + prev_masks = rec->thread_masks;
>>> + rec->thread_masks = realloc(rec->thread_masks,
>>> + (nr_threads + 1) * sizeof(struct thread_mask));
>>> + if (!rec->thread_masks) {
>>> + pr_err("Failed to allocate thread masks\n");
>>> + rec->thread_masks = prev_masks;
>>> + ret = -ENOMEM;
>>> + goto out_free_full_mask;
>>> + }
>>> + rec->thread_masks[nr_threads] = thread_mask;
>>> + if (verbose) {
>>> + pr_debug("thread_masks[%d]: addr=", nr_threads);
>>> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
>>> + pr_debug("thread_masks[%d]: addr=", nr_threads);
>>> + mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity,
>>> + "affinity");
>>> + }
>>> + nr_threads++;
>>> + ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
>>> + if (ret)
>>> + goto out_free_full_mask;
>>> + }
>>> +
>>> + rec->nr_threads = nr_threads;
>>> + pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
>>> +
>>> + if (rec->nr_threads <= 0)
>>> + ret = -EINVAL;
>>> +
>>> +out_free_full_mask:
>>> + record__thread_mask_free(&full_mask);
>>> +out_free_thread_mask:
>>> + record__thread_mask_free(&thread_mask);
>>> +out_free_cpu_mask:
>>> + record__mmap_cpu_mask_free(&cpus_mask);
>>> +out:
>>> + return ret;
>>> +}
>>> +
>>> +static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> + int ret;
>>> + struct cpu_topology *topo;
>>> +
>>> + topo = cpu_topology__new();
>>> + if (!topo)
>>> + return -EINVAL;
>>> +
>>> + ret = record__init_thread_masks_spec(rec, cpus, topo->thread_siblings,
>>> + topo->thread_siblings, topo->thread_sib);
>>> + cpu_topology__delete(topo);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int record__init_thread_socket_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> + int ret;
>>> + struct cpu_topology *topo;
>>> +
>>> + topo = cpu_topology__new();
>>> + if (!topo)
>>> + return -EINVAL;
>>> +
>>> + ret = record__init_thread_masks_spec(rec, cpus, topo->core_siblings,
>>> + topo->core_siblings, topo->core_sib);
>>> + cpu_topology__delete(topo);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> + u32 s;
>>> + int ret;
>>> + char **spec;
>>> + struct numa_topology *topo;
>>> +
>>> + topo = numa_topology__new();
>>> + if (!topo)
>>> + return -EINVAL;
>>> + spec = zalloc(topo->nr * sizeof(char *));
>>> + if (!spec) {
>>> + ret = -ENOMEM;
>>> + goto out_delete_topo;
>>> + }
>>> + for (s = 0; s < topo->nr; s++)
>>> + spec[s] = topo->nodes[s].cpus;
>>> +
>>> + ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
>>> +
>>> + zfree(&spec);
>>> +
>>> +out_delete_topo:
>>> + numa_topology__delete(topo);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> + int t, ret;
>>> + u32 s, nr_spec = 0;
>>> + char **maps_spec = NULL, **affinity_spec = NULL, **prev_spec;
>>> + char *spec, *spec_ptr, *user_spec, *mask, *mask_ptr;
>>> +
>>> + for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
>>> + spec = strtok_r(user_spec, ":", &spec_ptr);
>>> + if (spec == NULL)
>>> + break;
>>> + pr_debug(" spec[%d]: %s\n", t, spec);
>>> + mask = strtok_r(spec, "/", &mask_ptr);
>>> + if (mask == NULL)
>>> + break;
>>> + pr_debug(" maps mask: %s\n", mask);
>>> + prev_spec = maps_spec;
>>> + maps_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
>>> + if (!maps_spec) {
>>> + pr_err("Failed to realloc maps_spec\n");
>>> + maps_spec = prev_spec;
>>> + ret = -ENOMEM;
>>> + goto out_free_all_specs;
>>> + }
>>> + maps_spec[nr_spec] = strdup(mask);
>>> + if (!maps_spec[nr_spec]) {
>>> + pr_err("Failed to alloc maps_spec[%d]\n", nr_spec);
>>> + ret = -ENOMEM;
>>> + goto out_free_all_specs;
>>> + }
>>> + mask = strtok_r(NULL, "/", &mask_ptr);
>>> + if (mask == NULL) {
>>> + free(maps_spec[nr_spec]);
>>> + ret = -EINVAL;
>>> + goto out_free_all_specs;
>>> + }
>>> + pr_debug(" affinity mask: %s\n", mask);
>>> + prev_spec = affinity_spec;
>>> + affinity_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
>>> + if (!affinity_spec) {
>>> + pr_err("Failed to realloc affinity_spec\n");
>>> + affinity_spec = prev_spec;
>>> + free(maps_spec[nr_spec]);
>>> + ret = -ENOMEM;
>>> + goto out_free_all_specs;
>>> + }
>>> + affinity_spec[nr_spec] = strdup(mask);
>>> + if (!affinity_spec[nr_spec]) {
>>> + pr_err("Failed to alloc affinity_spec[%d]\n", nr_spec);
>>> + free(maps_spec[nr_spec]);
>>> + ret = -ENOMEM;
>>> + goto out_free_all_specs;
>>> + }
>>> + nr_spec++;
>>> + }
>>> +
>>> + ret = record__init_thread_masks_spec(rec, cpus, maps_spec, affinity_spec, nr_spec);
>>> +
>>> +out_free_all_specs:
>>> + for (s = 0; s < nr_spec; s++) {
>>> + if (maps_spec)
>>> + free(maps_spec[s]);
>>> + if (affinity_spec)
>>> + free(affinity_spec[s]);
>>> + }
>>> + free(affinity_spec);
>>> + free(maps_spec);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> {
>>> int ret;
>>> @@ -3215,9 +3537,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu
>>>
>>> static int record__init_thread_masks(struct record *rec)
>>> {
>>> + int ret = 0;
>>> struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>>>
>>> - return record__init_thread_default_masks(rec, cpus);
>>> + if (!record__threads_enabled(rec))
>>> + return record__init_thread_default_masks(rec, cpus);
>>> +
>>> + switch (rec->opts.threads_spec) {
>>> + case THREAD_SPEC__CPU:
>>> + ret = record__init_thread_cpu_masks(rec, cpus);
>>> + break;
>>> + case THREAD_SPEC__CORE:
>>> + ret = record__init_thread_core_masks(rec, cpus);
>>> + break;
>>> + case THREAD_SPEC__SOCKET:
>>> + ret = record__init_thread_socket_masks(rec, cpus);
>>> + break;
>>> + case THREAD_SPEC__NUMA:
>>> + ret = record__init_thread_numa_masks(rec, cpus);
>>> + break;
>>> + case THREAD_SPEC__USER:
>>> + ret = record__init_thread_user_masks(rec, cpus);
>>> + break;
>>> + default:
>>> + break;
>>> + }
>>> +
>>> + return ret;
>>> }
>>>
>>> static int record__fini_thread_masks(struct record *rec)
>>> @@ -3474,7 +3820,12 @@ int cmd_record(int argc, const char **argv)
>>>
>>> err = record__init_thread_masks(rec);
>>> if (err) {
>>> - pr_err("record__init_thread_masks failed, error %d\n", err);
>>> + if (err > 0)
>>> + pr_err("ERROR: parallel data streaming masks (--threads) intersect\n");
>>> + else if (err == -EINVAL)
>>> + pr_err("ERROR: invalid parallel data streaming masks (--threads)\n");
>>> + else
>>> + pr_err("record__init_thread_masks failed, error %d\n", err);
>>> goto out;
>>> }
>>>
>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>> index 4d68b7e27272..3da156498f47 100644
>>> --- a/tools/perf/util/record.h
>>> +++ b/tools/perf/util/record.h
>>> @@ -78,6 +78,7 @@ struct record_opts {
>>> int ctl_fd_ack;
>>> bool ctl_fd_close;
>>> int threads_spec;
>>> + const char *threads_user_spec;
>>> };
>>>
>>> extern const char * const *record_usage;
>>> --
>>> 2.19.0
>>>
>>
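The user-defined spec accepted by the code quoted above follows a `<maps mask>/<affinity mask>[:<maps mask>/<affinity mask>...]` grammar, with the predefined aliases cpu, core, socket and numa matched case-insensitively. A minimal Python model of that grammar (function names and the simplified error handling are illustrative, not perf's):

```python
# Illustrative model of record__parse_threads() and
# record__init_thread_user_masks(): alias prefixes are matched
# case-insensitively; anything else is treated as a user spec of
# colon-separated "maps/affinity" mask pairs.

THREAD_SPEC_ALIASES = ("cpu", "core", "socket", "numa")

def parse_threads_option(spec):
    """Return ("cpu", None) when no argument is given, (alias, None)
    for a recognized alias prefix, or ("user", spec) otherwise."""
    if not spec:
        return ("cpu", None)
    for alias in THREAD_SPEC_ALIASES:
        if spec.lower().startswith(alias):
            return (alias, None)
    return ("user", spec)

def parse_user_spec(user_spec):
    """Split "maps/affinity:maps/affinity:..." into (maps, affinity)
    pairs; a chunk without '/' is rejected, as in the C code."""
    pairs = []
    for chunk in user_spec.split(":"):
        if not chunk:        # strtok_r() would skip empty chunks
            continue
        maps_mask, sep, affinity_mask = chunk.partition("/")
        if not sep or not affinity_mask:
            raise ValueError("missing affinity mask in %r" % chunk)
        pairs.append((maps_mask, affinity_mask))
    return pairs
```

Under this model, `--threads=0-3/0-3:4-7/4-7` yields two streaming threads, each reading from and pinned to its own CPU range.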

2021-07-01 14:28:18

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Em Thu, Jul 01, 2021 at 02:50:40PM +0300, Bayduraev, Alexey V escreveu:
> On 30.06.2021 21:54, Bayduraev, Alexey V wrote:
> > On 30.06.2021 20:28, Arnaldo Carvalho de Melo wrote:
> >> I thought you would start with plain:

> >> -j N

> >> And start one thread per CPU in 'perf record' existing CPU affinity
> >> mask, then go on introducing more sophisticated modes.

> > As I remember the first prototype [1] and
> > [2] https://lore.kernel.org/lkml/[email protected]/

> > introduces:

> > --thread=mode|number_of_threads

> > where mode defines cpu masks (cpu/numa/socket/etc)

> > Then somewhere while discussing this patchset it was decided, for unification,
> > that --thread should only define CPU/affinity masks or their aliases.
> > I think Alexei or Jiri could clarify this more.

> >> Have you done this way because its how VTune has evolved over the years
> >> and now expects from 'perf record'?

> > VTune uses only --thread=cpu or no threading.

> However we would like to have such sophisticated cpu/affinity masks to
> tune perf-record for different workloads.

I don't have, a priori, anything against the modes you propose, as you
have a justification for them; it's just a question of how we should introduce them.

I.e. first doing the simple case of '-j NCPUS' and then doing what you
need, so that we get more granular patches.

Not adding too much complexity per patch pays off when/if we find bugs
and need to bisect.

> For example, some HPC workloads prefer the "numa" mask, while most telecom
> workloads disallow using CPUs where their non-preemptible
> communication threads run.

- Arnaldo

2021-07-01 16:45:21

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 06/22] perf record: Introduce data file at mmap buffer object

Hi,

On 30.06.2021 20:23, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 30, 2021 at 06:54:45PM +0300, Alexey Bayduraev escreveu:
>> Introduce data file and compressor objects into mmap object so
>> they could be used to process and store data stream from the
>> corresponding kernel data buffer. Make use of the introduced
>> per mmap file and compressor when they are initialized and
>> available.
>
> So you're introducing even compressed storage in this patchset? To make
> it smaller I thought this could be in a followup cset.

These patches were split by request from Jiri:

Changes in v5:
- split patch 06/12 to 06/20 and 07/20

request:
https://lore.kernel.org/lkml/YG97LMsttP4VEWPU@krava/

And I'm a bit confused :)

Regards,
Alexey

>
> - Arnaldo
>
>> Acked-by: Andi Kleen <[email protected]>
>> Acked-by: Namhyung Kim <[email protected]>
>> Signed-off-by: Alexey Bayduraev <[email protected]>
>> ---
>> tools/perf/builtin-record.c | 3 +++
>> tools/perf/util/mmap.c | 6 ++++++
>> tools/perf/util/mmap.h | 3 +++
>> 3 files changed, 12 insertions(+)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index cead2b3c56d7..9b7e7a5dc116 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -190,6 +190,9 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
>> {
>> struct perf_data_file *file = &rec->session->data->file;
>>
>> + if (map && map->file)
>> + file = map->file;
>> +
>> if (perf_data_file__write(file, bf, size) < 0) {
>> pr_err("failed to write perf data, error: %m\n");
>> return -1;
>> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
>> index ab7108d22428..a2c5e4237592 100644
>> --- a/tools/perf/util/mmap.c
>> +++ b/tools/perf/util/mmap.c
>> @@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
>> {
>> bitmap_free(map->affinity_mask.bits);
>>
>> + zstd_fini(&map->zstd_data);
>> +
>> perf_mmap__aio_munmap(map);
>> if (map->data != NULL) {
>> munmap(map->data, mmap__mmap_len(map));
>> @@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
>> map->core.flush = mp->flush;
>>
>> map->comp_level = mp->comp_level;
>> + if (zstd_init(&map->zstd_data, map->comp_level)) {
>> + pr_debug2("failed to init mmap commpressor, error %d\n", errno);
>> + return -1;
>> + }
>>
>> if (map->comp_level && !perf_mmap__aio_enabled(map)) {
>> map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
>> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
>> index 9d5f589f02ae..c4aed6e89549 100644
>> --- a/tools/perf/util/mmap.h
>> +++ b/tools/perf/util/mmap.h
>> @@ -13,6 +13,7 @@
>> #endif
>> #include "auxtrace.h"
>> #include "event.h"
>> +#include "util/compress.h"
>>
>> struct aiocb;
>>
>> @@ -43,6 +44,8 @@ struct mmap {
>> struct mmap_cpu_mask affinity_mask;
>> void *data;
>> int comp_level;
>> + struct perf_data_file *file;
>> + struct zstd_data zstd_data;
>> };
>>
>> struct mmap_params {
>> --
>> 2.19.0
>>
>

2021-07-01 17:25:28

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 03/22] perf record: Introduce thread local variable

Hi,

On 30.06.2021 20:16, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 30, 2021 at 06:54:42PM +0300, Alexey Bayduraev escreveu:
>> Introduce thread local variable and use it for threaded trace streaming.
[SNIP]
>> static void record__adjust_affinity(struct record *rec, struct mmap *map)
>> {
>> + int ret = 0;
>
> Why you set this to zero here if it is going to be used only insde the
> if block?
>
>> +
>> if (rec->opts.affinity != PERF_AFFINITY_SYS &&
>> - !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
>> - rec->affinity_mask.nbits)) {
>> - bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
>> - bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
>> - map->affinity_mask.bits, rec->affinity_mask.nbits);
>> - sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
>> - (cpu_set_t *)rec->affinity_mask.bits);
>> - if (verbose == 2)
>> - mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
>> + !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
>> + thread->mask->affinity.nbits)) {
>> + bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
>> + bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
>> + map->affinity_mask.bits, thread->mask->affinity.nbits);
>> + ret = sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
>> + (cpu_set_t *)thread->mask->affinity.bits);
>> + if (ret)
>> + pr_err("threads[%d]: sched_setaffinity() call failed: %s\n",
>> + thread->tid, strerror(errno));
>
> Also, if record__adjust_affinity() fails by means of sched_setaffinity
> not working, shouldn't we propagate this error?
>

I am also in doubt. record__adjust_affinity() is called inside the mmap read
function, and I don't think it's a good idea to terminate the data
collection in the middle of it. I guess this can be critical only for
real-time workloads.

Regards,
Alexey

>> + if (verbose == 2) {
>> + pr_debug("threads[%d]: addr=", thread->tid);
>> + mmap_cpu_mask__scnprintf(&thread->mask->affinity, "thread");
>> + pr_debug("threads[%d]: on cpu=%d\n", thread->tid, sched_getcpu());
>> + }
>> }
>> }
>>
>> @@ -1310,14 +1319,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>> u64 bytes_written = rec->bytes_written;
>> int i;
>> int rc = 0;
>> - struct mmap *maps;
>> + int nr_mmaps;
>> + struct mmap **maps;
>> int trace_fd = rec->data.file.fd;
>> off_t off = 0;
>>
>> if (!evlist)
>> return 0;
>>
>> - maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
>> + nr_mmaps = thread->nr_mmaps;
>> + maps = overwrite ? thread->overwrite_maps : thread->maps;
>> +
>> if (!maps)
>> return 0;
>>
>> @@ -1327,9 +1339,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>> if (record__aio_enabled(rec))
>> off = record__aio_get_pos(trace_fd);
>>
>> - for (i = 0; i < evlist->core.nr_mmaps; i++) {
>> + for (i = 0; i < nr_mmaps; i++) {
>> u64 flush = 0;
>> - struct mmap *map = &maps[i];
>> + struct mmap *map = maps[i];
>>
>> if (map->core.base) {
>> record__adjust_affinity(rec, map);
>> @@ -1392,6 +1404,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
>> return record__mmap_read_evlist(rec, rec->evlist, true, synch);
>> }
>>
>> +static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
>> + void *arg __maybe_unused)
>> +{
>> + struct perf_mmap *map = fda->priv[fd].ptr;
>> +
>> + if (map)
>> + perf_mmap__put(map);
>> +}
>> +
>> static void record__init_features(struct record *rec)
>> {
>> struct perf_session *session = rec->session;
>> @@ -1836,6 +1857,33 @@ static void record__uniquify_name(struct record *rec)
>> }
>> }
>>
>> +static int record__start_threads(struct record *rec)
>> +{
>> + struct thread_data *thread_data = rec->thread_data;
>> +
>> + thread = &thread_data[0];
>> +
>> + pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
>> +
>> + return 0;
>> +}
>> +
>> +static int record__stop_threads(struct record *rec, unsigned long *waking)
>> +{
>> + int t;
>> + struct thread_data *thread_data = rec->thread_data;
>> +
>> + for (t = 0; t < rec->nr_threads; t++) {
>> + rec->samples += thread_data[t].samples;
>> + *waking += thread_data[t].waking;
>> + pr_debug("threads[%d]: samples=%lld, wakes=%ld, trasferred=%ld, compressed=%ld\n",
>> + thread_data[t].tid, thread_data[t].samples, thread_data[t].waking,
>> + rec->session->bytes_transferred, rec->session->bytes_compressed);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int __cmd_record(struct record *rec, int argc, const char **argv)
>> {
>> int err;
>> @@ -1944,7 +1992,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>>
>> if (record__open(rec) != 0) {
>> err = -1;
>> - goto out_child;
>> + goto out_free_threads;
>> }
>> session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
>>
>> @@ -1952,7 +2000,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> err = record__kcore_copy(&session->machines.host, data);
>> if (err) {
>> pr_err("ERROR: Failed to copy kcore\n");
>> - goto out_child;
>> + goto out_free_threads;
>> }
>> }
>>
>> @@ -1963,7 +2011,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
>> pr_err("ERROR: Apply config to BPF failed: %s\n",
>> errbuf);
>> - goto out_child;
>> + goto out_free_threads;
>> }
>>
>> /*
>> @@ -1981,11 +2029,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> if (data->is_pipe) {
>> err = perf_header__write_pipe(fd);
>> if (err < 0)
>> - goto out_child;
>> + goto out_free_threads;
>> } else {
>> err = perf_session__write_header(session, rec->evlist, fd, false);
>> if (err < 0)
>> - goto out_child;
>> + goto out_free_threads;
>> }
>>
>> err = -1;
>> @@ -1993,16 +2041,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
>> pr_err("Couldn't generate buildids. "
>> "Use --no-buildid to profile anyway.\n");
>> - goto out_child;
>> + goto out_free_threads;
>> }
>>
>> err = record__setup_sb_evlist(rec);
>> if (err)
>> - goto out_child;
>> + goto out_free_threads;
>>
>> err = record__synthesize(rec, false);
>> if (err < 0)
>> - goto out_child;
>> + goto out_free_threads;
>>
>> if (rec->realtime_prio) {
>> struct sched_param param;
>> @@ -2011,10 +2059,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> if (sched_setscheduler(0, SCHED_FIFO, &param)) {
>> pr_err("Could not set realtime priority.\n");
>> err = -1;
>> - goto out_child;
>> + goto out_free_threads;
>> }
>> }
>>
>> + if (record__start_threads(rec))
>> + goto out_free_threads;
>> +
>> /*
>> * When perf is starting the traced process, all the events
>> * (apart from group members) have enable_on_exec=1 set,
>> @@ -2085,7 +2136,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> trigger_ready(&switch_output_trigger);
>> perf_hooks__invoke_record_start();
>> for (;;) {
>> - unsigned long long hits = rec->samples;
>> + unsigned long long hits = thread->samples;
>>
>> /*
>> * rec->evlist->bkw_mmap_state is possible to be
>> @@ -2154,20 +2205,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> alarm(rec->switch_output.time);
>> }
>>
>> - if (hits == rec->samples) {
>> + if (hits == thread->samples) {
>> if (done || draining)
>> break;
>> - err = evlist__poll(rec->evlist, -1);
>> + err = fdarray__poll(&thread->pollfd, -1);
>> /*
>> * Propagate error, only if there's any. Ignore positive
>> * number of returned events and interrupt error.
>> */
>> if (err > 0 || (err < 0 && errno == EINTR))
>> err = 0;
>> - waking++;
>> + thread->waking++;
>>
>> - if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
>> + if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
>> + record__thread_munmap_filtered, NULL) == 0)
>> draining = true;
>> +
>> + evlist__ctlfd_update(rec->evlist,
>> + &thread->pollfd.entries[thread->ctlfd_pos]);
>> }
>>
>> if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
>> @@ -2220,18 +2275,20 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>> goto out_child;
>> }
>>
>> - if (!quiet)
>> - fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
>> -
>> if (target__none(&rec->opts.target))
>> record__synthesize_workload(rec, true);
>>
>> out_child:
>> - evlist__finalize_ctlfd(rec->evlist);
>> + record__stop_threads(rec, &waking);
>> record__mmap_read_all(rec, true);
>> +out_free_threads:
>> record__free_thread_data(rec);
>> + evlist__finalize_ctlfd(rec->evlist);
>> record__aio_mmap_read_sync(rec);
>>
>> + if (!quiet)
>> + fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
>> +
>> if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
>> ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
>> session->header.env.comp_ratio = ratio + 0.5;
>> @@ -3093,17 +3150,6 @@ int cmd_record(int argc, const char **argv)
>>
>> symbol__init(NULL);
>>
>> - if (rec->opts.affinity != PERF_AFFINITY_SYS) {
>> - rec->affinity_mask.nbits = cpu__max_cpu();
>> - rec->affinity_mask.bits = bitmap_alloc(rec->affinity_mask.nbits);
>> - if (!rec->affinity_mask.bits) {
>> - pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
>> - err = -ENOMEM;
>> - goto out_opts;
>> - }
>> - pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
>> - }
>> -
>> err = record__auxtrace_init(rec);
>> if (err)
>> goto out;
>> @@ -3241,7 +3287,6 @@ int cmd_record(int argc, const char **argv)
>>
>> err = __cmd_record(&record, argc, argv);
>> out:
>> - bitmap_free(rec->affinity_mask.bits);
>> evlist__delete(rec->evlist);
>> symbol__exit();
>> auxtrace_record__free(rec->itr);
>
>
> Can the following be moved to a separate patch?
>
>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
>> index 6ba9664089bd..3d555a98c037 100644
>> --- a/tools/perf/util/evlist.c
>> +++ b/tools/perf/util/evlist.c
>> @@ -2132,6 +2132,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
>> return err;
>> }
>>
>> +int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
>> +{
>> + int ctlfd_pos = evlist->ctl_fd.pos;
>> + struct pollfd *entries = evlist->core.pollfd.entries;
>> +
>> + if (!evlist__ctlfd_initialized(evlist))
>> + return 0;
>> +
>> + if (entries[ctlfd_pos].fd != update->fd ||
>> + entries[ctlfd_pos].events != update->events)
>> + return -1;
>> +
>> + entries[ctlfd_pos].revents = update->revents;
>> + return 0;
>> +}
>> +
>> struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
>> {
>> struct evsel *evsel;
>> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
>> index 2073cfa79f79..b7aa719c638d 100644
>> --- a/tools/perf/util/evlist.h
>> +++ b/tools/perf/util/evlist.h
>> @@ -358,6 +358,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
>> int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
>> int evlist__finalize_ctlfd(struct evlist *evlist);
>> bool evlist__ctlfd_initialized(struct evlist *evlist);
>> +int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
>> int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
>> int evlist__ctlfd_ack(struct evlist *evlist);
>>
>> --
>> 2.19.0
>>
>

2021-07-01 17:30:34

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 06/22] perf record: Introduce data file at mmap buffer object

Em Thu, Jul 01, 2021 at 07:41:18PM +0300, Bayduraev, Alexey V escreveu:
> Hi,
>
> On 30.06.2021 20:23, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Jun 30, 2021 at 06:54:45PM +0300, Alexey Bayduraev escreveu:
> >> Introduce data file and compressor objects into mmap object so
> >> they could be used to process and store data stream from the
> >> corresponding kernel data buffer. Make use of the introduced
> >> per mmap file and compressor when they are initialized and
> >> available.
> >
> > So you're introducing even compressed storage in this patchset? To make
> > it smaller I thought this could be in a followup cset.
>
> These patches were split by request from Jiri:
>
> Changes in v5:
> - split patch 06/12 to 06/20 and 07/20
>
> request:
> https://lore.kernel.org/lkml/YG97LMsttP4VEWPU@krava/
>
> And I'm a bit confused :)

And it is good that this patch is split; I just didn't expect it in the
middle of the first submission for threaded mode.

- Arnaldo

> Regards,
> Alexey
>
> >
> > - Arnaldo
> >
> >> Acked-by: Andi Kleen <[email protected]>
> >> Acked-by: Namhyung Kim <[email protected]>
> >> Signed-off-by: Alexey Bayduraev <[email protected]>
> >> ---
> >> tools/perf/builtin-record.c | 3 +++
> >> tools/perf/util/mmap.c | 6 ++++++
> >> tools/perf/util/mmap.h | 3 +++
> >> 3 files changed, 12 insertions(+)
> >>
> >> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> >> index cead2b3c56d7..9b7e7a5dc116 100644
> >> --- a/tools/perf/builtin-record.c
> >> +++ b/tools/perf/builtin-record.c
> >> @@ -190,6 +190,9 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
> >> {
> >> struct perf_data_file *file = &rec->session->data->file;
> >>
> >> + if (map && map->file)
> >> + file = map->file;
> >> +
> >> if (perf_data_file__write(file, bf, size) < 0) {
> >> pr_err("failed to write perf data, error: %m\n");
> >> return -1;
> >> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> >> index ab7108d22428..a2c5e4237592 100644
> >> --- a/tools/perf/util/mmap.c
> >> +++ b/tools/perf/util/mmap.c
> >> @@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
> >> {
> >> bitmap_free(map->affinity_mask.bits);
> >>
> >> + zstd_fini(&map->zstd_data);
> >> +
> >> perf_mmap__aio_munmap(map);
> >> if (map->data != NULL) {
> >> munmap(map->data, mmap__mmap_len(map));
> >> @@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
> >> map->core.flush = mp->flush;
> >>
> >> map->comp_level = mp->comp_level;
> >> + if (zstd_init(&map->zstd_data, map->comp_level)) {
> >> + pr_debug2("failed to init mmap commpressor, error %d\n", errno);
> >> + return -1;
> >> + }
> >>
> >> if (map->comp_level && !perf_mmap__aio_enabled(map)) {
> >> map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
> >> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
> >> index 9d5f589f02ae..c4aed6e89549 100644
> >> --- a/tools/perf/util/mmap.h
> >> +++ b/tools/perf/util/mmap.h
> >> @@ -13,6 +13,7 @@
> >> #endif
> >> #include "auxtrace.h"
> >> #include "event.h"
> >> +#include "util/compress.h"
> >>
> >> struct aiocb;
> >>
> >> @@ -43,6 +44,8 @@ struct mmap {
> >> struct mmap_cpu_mask affinity_mask;
> >> void *data;
> >> int comp_level;
> >> + struct perf_data_file *file;
> >> + struct zstd_data zstd_data;
> >> };
> >>
> >> struct mmap_params {
> >> --
> >> 2.19.0
> >>
> >

--

- Arnaldo

2021-07-01 18:16:51

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 10/22] perf record: Introduce --threads=<spec> command line option

Hi,

On 01.07.2021 17:26, Arnaldo Carvalho de Melo wrote:
> Em Thu, Jul 01, 2021 at 02:50:40PM +0300, Bayduraev, Alexey V escreveu:
>> On 30.06.2021 21:54, Bayduraev, Alexey V wrote:
>>> On 30.06.2021 20:28, Arnaldo Carvalho de Melo wrote:
>>>> I thought you would start with plain:
>
>>>> -j N
>
>>>> And start one thread per CPU in 'perf record' existing CPU affinity
>>>> mask, then go on introducing more sophisticated modes.
>
>>> As I remember the first prototype [1] and
>>> [2] https://lore.kernel.org/lkml/[email protected]/
>
>>> introduces:
>
>>> --thread=mode|number_of_threads
>
>>> where mode defines cpu masks (cpu/numa/socket/etc)
>
>>> Then somewhere while discussing this patchset it was decided, for unification,
>>> that --thread should only define CPU/affinity masks or their aliases.
>>> I think Alexei or Jiri could clarify this more.
>
>>>> Have you done this way because its how VTune has evolved over the years
>>>> and now expects from 'perf record'?
>
>>> VTune uses only --thread=cpu or no threading.
>
>> However we would like to have such sophisticated cpu/affinity masks to
>> tune perf-record for different workloads.
>
> I don't have, a priori, anything against the modes you propose, as you
> have a justification for them; it's just a question of how we should introduce them.
>
> I.e. first doing the simple case of '-j NCPUS' and then doing what you
> need, so that we get more granular patches.
>
> Not adding too much complexity per patch pays off when/if we find bugs
> and need to bisect.

This is a good idea, especially since this patch (10/22) is the most complex
in the patchset. I also think we can keep the simple --threads=NCPUS
form along with the --threads=masks option.
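The plain NCPUS mode discussed here is not part of this patchset; as a purely hypothetical sketch, one way such a mode could derive per-thread CPU masks is to round-robin the available CPUs over the requested thread count:

```python
# Hypothetical sketch only (not from the patchset): distribute CPUs
# round-robin across N streaming threads, the kind of mapping a plain
# --threads=NCPUS (or -j N) mode could use.

def ncpus_thread_masks(cpus, nr_threads):
    """Thread t gets every nr_threads-th CPU starting at cpus[t]."""
    nr_threads = min(nr_threads, len(cpus))  # never more threads than CPUs
    return [cpus[t::nr_threads] for t in range(nr_threads)]
```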

Regards,
Alexey

>
>> For example, some HPC workloads prefer the "numa" mask, while most telecom
>> workloads disallow using CPUs where their non-preemptible
>> communication threads run.
>
> - Arnaldo
>

2021-07-01 22:48:17

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 12/22] perf report: Output data file name in raw trace dump


On 30.06.2021 21:36, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 30, 2021 at 06:54:51PM +0300, Alexey Bayduraev escreveu:
<SNIP>
>> @@ -2116,9 +2127,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>> break;
>>
>> size = event->header.size;
>> -
>> - if (size < sizeof(struct perf_event_header) ||
>> - (skip = perf_session__process_event(session, event, file_pos)) < 0) {
>
>
> The call to perf_session__process_event() will not be made if
>
> (size < sizeof(struct perf_event_header)
>
> evaluates to true, with your change it is being made unconditionally,
> also before it was using that file_pos variable, set to zero and
> possibly modified by the logic in this function.
>
> And I read just "perf report: Output data file name in raw trace", so
> when I saw this separate change to pass 'decomp->file_pos' and remove
> that 'file_pos = 0' part I scratched my head, then I read again the
> commit log message and there it says it also does this separate change.
>
> Please make it separate patch where you explain why this has to be done
> this way and what previous cset this fixes, so that the
> [email protected] guys pick it as it sounds like a fix.

As I understand it, file_pos is mostly used to show file offset
in dump_event(), like:

_0x17cf08_ [0x28]: event: 9

In the current implementation, file_pos is always 0 for _uncompressed_ events:

0 [0x28]: event: 9

Also, file_pos is used to lseek() for some user events:
PERF_RECORD_HEADER_TRACING_DATA, PERF_RECORD_AUXTRACE, etc.

As long as the compressed event container does not contain user events,
everything is fine. Currently only CPU events are compressed.

We can only fix the zero offset for uncompressed events in dump_event();
unfortunately, we cannot show the original file offset of each event,
because we uncompress the entire compressed container at once and do not
know the compressed size of each event in the container.

Thus we have 3 options:

1. Show decomp->file_pos for each uncompressed event, i.e. the file
offset of the compressed container event:

We will see a series of CPU events with the same file offset.
This is what this patch does.

2. Show decomp->file_pos plus the offset in the uncompressed buffer.

We will see a series of CPU events with overlapping file offsets.
It would be better to show something like file_pos:uncompressed_pos,
but this would require changing all calls to dump_event().

3. Keep 0

Which of these options do you think is preferable?

>
>> + skip = perf_session__process_event(session, event, decomp->file_pos,
>> + decomp->file_path);
>> + if (size < sizeof(struct perf_event_header) || skip < 0) {
>> pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
>> decomp->file_pos + decomp->head, event->header.size, event->header.type);
>> return -EINVAL;

Also, checking

(size < sizeof(struct perf_event_header))

after perf_session__process_event() has already been called is incorrect.

Thanks,
Alexey

>> @@ -2149,10 +2160,12 @@ struct reader;
<SNIP>

2021-07-02 10:33:27

by Jiri Olsa

Subject: Re: [PATCH v8 20/22] perf session: Load data directory files for analysis

On Wed, Jun 30, 2021 at 06:54:59PM +0300, Alexey Bayduraev wrote:

SNIP

> + while ((ret >= 0) && readers) {
> + if (session_done())
> + return 0;
> +
> + if (rd[i].state.eof) {
> + i = (i + 1) % session->nr_readers;
> + continue;
> + }
> +
> + ret = reader__read_event(&rd[i], session, &prog);
> + if (ret < 0)
> + break;
> + if (ret == READER_EOF) {
> + ret = reader__mmap(&rd[i], session);
> + if (ret < 0)
> + goto out_err;
> + if (ret == READER_EOF)
> + readers--;
> + }
> +
> + /*
> + * Processing 10MBs of data from each reader in sequence,
> + * because that's the way the ordered events sorting works
> + * most efficiently.
> + */
> + if (rd[i].state.size >= 10*1024*1024) {
> + rd[i].state.size = 0;
> + i = (i + 1) % session->nr_readers;
> + }

hi,
so this was sort of hack to make this faster.. we need some
justification for this and make that configurable as well,
if we keep it

jirka

2021-07-02 12:16:04

by Bayduraev, Alexey V

Subject: Re: [PATCH v8 20/22] perf session: Load data directory files for analysis


On 02.07.2021 13:30, Jiri Olsa wrote:
> On Wed, Jun 30, 2021 at 06:54:59PM +0300, Alexey Bayduraev wrote:
>
> SNIP
>
>> + while ((ret >= 0) && readers) {
>> + if (session_done())
>> + return 0;
>> +
>> + if (rd[i].state.eof) {
>> + i = (i + 1) % session->nr_readers;
>> + continue;
>> + }
>> +
>> + ret = reader__read_event(&rd[i], session, &prog);
>> + if (ret < 0)
>> + break;
>> + if (ret == READER_EOF) {
>> + ret = reader__mmap(&rd[i], session);
>> + if (ret < 0)
>> + goto out_err;
>> + if (ret == READER_EOF)
>> + readers--;
>> + }
>> +
>> + /*
>> + * Processing 10MBs of data from each reader in sequence,
>> + * because that's the way the ordered events sorting works
>> + * most efficiently.
>> + */
>> + if (rd[i].state.size >= 10*1024*1024) {
>> + rd[i].state.size = 0;
>> + i = (i + 1) % session->nr_readers;
>> + }
>
> hi,
> so this was sort of hack to make this faster.. we need some
> justification for this and make that configurable as well,
> if we keep it

Hi,

I am currently thinking of another round-robin read algorithm, based on
timestamps, with periodic ordered_queue flushing after each round. This
might be better, but I'm not sure it should be included in this patchset.

Probably introducing a named constant will be enough for the current
implementation.

Regards,
Alexey

>
> jirka
>