2021-05-04 07:51:30

by Bayduraev, Alexey V

Subject: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation

Changes in v5:
- fixed leaks in record__init_thread_masks_spec()
- fixed leaks after failed realloc
- replaced "%m" with strerror()
- added masks examples to the documentation
- captured Acked-by: tags by Andi Kleen
- disallowed the --threads option in full_auxtrace mode
- split patch 06/12 into 06/20 and 07/20
- split patch 08/12 into 09/20 and 10/20
- split patches 11/12 and 12/12 into 13/20-20/20

v4: https://lore.kernel.org/lkml/[email protected]/

Changes in v4:
- renamed 'comm' structure to 'pipes'
- moved thread fd/maps messages to verbose=2
- fixed leaks during allocation of thread_data structures
- fixed leaks during allocation of thread masks
- fixed possible failures when releasing thread masks

v3: https://lore.kernel.org/lkml/[email protected]/

Changes in v3:
- skipped redundant patch 3/15
- applied "data file" and "data directory" terms all over the patch set
- captured Acked-by: tags by Namhyung Kim
- removed braces where they are not needed
- employed thread local variable for serial trace streaming
- added specs for --threads option - core, socket, numa and user defined
- added parallel loading of data directory files similar to the prototype [1]

v2: https://lore.kernel.org/lkml/[email protected]/

Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
in addition to citations [1], [2]
- updated description of 3/15 to explicitly mention the reason
to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- elaborated on the issues found that must be resolved for
threaded AUX trace capture

v1: https://lore.kernel.org/lkml/[email protected]/

This patch set provides a parallel threaded trace streaming mode for basic
perf record operation. The mode mitigates profiling data losses
and resolves scalability issues of the serial and asynchronous (--aio)
trace streaming modes on multicore server systems. The design and
implementation are based on the prototype [1], [2].

Parallel threaded mode runs trace streaming threads that read kernel
data buffers and write the captured data into several data files located in
the data directory. The layout of the streaming threads and their mapping to
the data buffers is configured with the value of the --threads command line
option. The value is a list of colon-separated entries, one per thread. Each
entry consists of the mask of cpus to be monitored by that thread and the
thread's affinity mask, separated by a slash. For example,
<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
specifies a layout of two threads, each with its own set of cpus to
monitor. The value can also be a string such as "cpu", "core", "socket" or
"numa", which creates a data streaming thread for every cpu, core, socket or
NUMA node respectively. With no value or an empty value, the option defaults
to the "cpu" layout, which creates a data streaming thread for every cpu
being monitored. The masks in the specification are filtered by the mask
provided via the -C option.
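
To make the specification format concrete, here is a minimal sketch in C of
parsing a user-defined value such as 0-3/3:4-7/4. It is not the code from this
patch set: struct thread_mask_sketch, parse_cpu_list() and
parse_threads_spec() are hypothetical names, masks are limited to cpus 0-63 so
that they fit in a uint64_t, and applying the -C filter only to the
monitored-cpu mask is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: masks of one streaming thread (cpus 0..63 only). */
struct thread_mask_sketch {
	uint64_t maps;      /* cpus whose buffers the thread reads */
	uint64_t affinity;  /* cpus the thread itself may run on */
};

/* Parse a cpu list such as "0-3" or "0,2,4-5" of length len into a bitmask. */
static int parse_cpu_list(const char *s, size_t len, uint64_t *mask)
{
	const char *end = s + len;

	*mask = 0;
	while (s < end) {
		char *next;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = hi = strtoul(s, &next, 10);
		if (next < end && *next == '-') {
			s = next + 1;
			if (s >= end || *s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &next, 10);
		}
		if (next > end || lo > hi || hi > 63)
			return -1;
		for (; lo <= hi; lo++)
			*mask |= UINT64_C(1) << lo;
		s = next;
		if (s < end) {
			/* only a comma followed by more input is valid here */
			if (*s != ',' || s + 1 == end)
				return -1;
			s++;
		}
	}
	return len ? 0 : -1;
}

/*
 * Parse "<cpus>/<affinity>:<cpus>/<affinity>..." into out[].
 * cpu_filter models the -C list and is AND-ed into the maps mask
 * (an assumption of this sketch). Returns the thread count, or -1.
 */
static int parse_threads_spec(const char *spec, uint64_t cpu_filter,
			      struct thread_mask_sketch *out, int max)
{
	int n = 0;

	assert(spec && out);
	while (*spec) {
		const char *colon = strchr(spec, ':');
		size_t len = colon ? (size_t)(colon - spec) : strlen(spec);
		const char *slash = memchr(spec, '/', len);

		if (!slash || n == max)
			return -1;
		if (parse_cpu_list(spec, (size_t)(slash - spec), &out[n].maps) ||
		    parse_cpu_list(slash + 1, len - (size_t)(slash + 1 - spec),
				   &out[n].affinity))
			return -1;
		out[n].maps &= cpu_filter;
		n++;
		spec += len;
		if (*spec == ':') {
			spec++;
			if (!*spec)
				return -1;  /* trailing colon */
		}
	}
	return n ? n : -1;
}
```

With this sketch, 0-3/3:4-7/4 yields two entries (maps 0x0f with affinity
0x08, and maps 0xf0 with affinity 0x10). Passing a filter for cpus 2 and 5
leaves maps 0x04 and 0x20.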

Parallel streaming mode is compatible with Zstd compression/decompression
(--compression-level) and external control commands (--control). The mode
is not enabled for pipe mode, nor for AUX area tracing and its related and
derived modes such as --snapshot or --aux-sample. The --switch-output-*
and --timestamp-filename options are not enabled for parallel streaming either.
The initial attempt to enable AUX area tracing ran into the need to define an
optimal way to store index data in the data directory, and the use cases for
--switch-output-* and --timestamp-filename are not clear for data directories.
Asynchronous (--aio) trace streaming and affinity (--affinity) modes are
mutually exclusive with parallel streaming mode.

Basic analysis of data directories is provided in perf report mode.
Raw dump and aggregated reports are available for data directories,
still with no memory consumption optimizations.

Tested:

tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -a
tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30

tools/perf/perf report -i prof.data
tools/perf/perf report -i prof.data --call-graph=callee
tools/perf/perf report -i prof.data --stdio --header
tools/perf/perf report -i prof.data -D --header

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

---

Alexey Bayduraev (20):
perf record: introduce thread affinity and mmap masks
perf record: introduce thread specific data array
perf record: introduce thread local variable
perf record: stop threads in the end of trace streaming
perf record: start threads in the beginning of trace streaming
perf record: introduce data file at mmap buffer object
perf record: introduce data transferred and compressed stats
perf record: init data file at mmap buffer object
tools lib: introduce bitmap_intersects() operation
perf record: introduce --threads=<spec> command line option
perf record: document parallel data streaming mode
perf report: output data file name in raw trace dump
perf session: move reader structure to the top
perf session: introduce reader_state in reader object
perf session: introduce reader objects in session object
perf session: introduce decompressor into trace reader object
perf session: move init into reader__init function
perf session: move map/unmap into reader__mmap function
perf session: load single file for analysis
perf session: load data directory files for analysis

tools/include/linux/bitmap.h | 11 +
tools/lib/api/fd/array.c | 17 +
tools/lib/api/fd/array.h | 1 +
tools/lib/bitmap.c | 14 +
tools/perf/Documentation/perf-record.txt | 30 +
tools/perf/builtin-inject.c | 3 +-
tools/perf/builtin-record.c | 1066 ++++++++++++++++++++--
tools/perf/util/evlist.c | 16 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/mmap.c | 6 +
tools/perf/util/mmap.h | 6 +
tools/perf/util/ordered-events.h | 1 +
tools/perf/util/record.h | 2 +
tools/perf/util/session.c | 491 +++++++---
tools/perf/util/session.h | 5 +
tools/perf/util/tool.h | 3 +-
16 files changed, 1474 insertions(+), 199 deletions(-)

--
2.19.0


2021-05-04 07:53:41

by Bayduraev, Alexey V

Subject: [PATCH v5 08/20] perf record: init data file at mmap buffer object

Initialize the data files attached to mmap buffer objects so that trace data
can be written into several data files located in the data directory.

Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/builtin-record.c | 41 ++++++++++++++++++++++++++++++-------
tools/perf/util/record.h | 1 +
2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 75cebec57357..bf730e1220dc 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -160,6 +160,11 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
"SYS", "NODE", "CPU"
};

+static int record__threads_enabled(struct record *rec)
+{
+ return rec->opts.threads_spec;
+}
+
static bool switch_output_signal(struct record *rec)
{
return rec->switch_output.signal &&
@@ -1070,7 +1075,7 @@ static void record__free_thread_data(struct record *rec)
static int record__mmap_evlist(struct record *rec,
struct evlist *evlist)
{
- int ret;
+ int i, ret;
struct record_opts *opts = &rec->opts;
bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
opts->auxtrace_sample_mode;
@@ -1109,6 +1114,18 @@ static int record__mmap_evlist(struct record *rec,
if (ret)
return ret;

+ if (record__threads_enabled(rec)) {
+ ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+ if (ret)
+ return ret;
+ for (i = 0; i < evlist->core.nr_mmaps; i++) {
+ if (evlist->mmap)
+ evlist->mmap[i].file = &rec->data.dir.files[i];
+ if (evlist->overwrite_mmap)
+ evlist->overwrite_mmap[i].file = &rec->data.dir.files[i];
+ }
+ }
+
return 0;
}

@@ -1410,8 +1427,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
/*
* Mark the round finished in case we wrote
* at least one event.
+ *
+ * No need for round events in directory mode,
+ * because per-cpu maps and files have data
+ * sorted by kernel.
*/
- if (bytes_written != rec->bytes_written)
+ if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));

if (overwrite)
@@ -1525,7 +1546,9 @@ static void record__init_features(struct record *rec)
if (!rec->opts.use_clockid)
perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);

- perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+ if (!record__threads_enabled(rec))
+ perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
if (!record__comp_enabled(rec))
perf_header__clear_feat(&session->header, HEADER_COMPRESSED);

@@ -1536,15 +1559,21 @@ static void
record__finish_output(struct record *rec)
{
struct perf_data *data = &rec->data;
- int fd = perf_data__fd(data);
+ int i, fd = perf_data__fd(data);

if (data->is_pipe)
return;

rec->session->header.data_size += rec->bytes_written;
data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+ if (record__threads_enabled(rec)) {
+ for (i = 0; i < data->dir.nr; i++)
+ data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+ }

if (!rec->no_buildid) {
+ /* this will be recalculated during process_buildids() */
+ rec->samples = 0;
process_buildids(rec);

if (rec->buildid_all)
@@ -2480,8 +2509,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
status = err;

record__synthesize(rec, true);
- /* this will be recalculated during process_buildids() */
- rec->samples = 0;

if (!err) {
if (!rec->timestamp_filename) {
@@ -3245,7 +3272,7 @@ int cmd_record(int argc, const char **argv)
rec->no_buildid = true;
}

- if (rec->opts.kcore)
+ if (rec->opts.kcore || record__threads_enabled(rec))
rec->data.is_dir = true;

if (rec->opts.comp_level != 0) {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 68f471d9a88b..4d68b7e27272 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -77,6 +77,7 @@ struct record_opts {
int ctl_fd;
int ctl_fd_ack;
bool ctl_fd_close;
+ int threads_spec;
};

extern const char * const *record_usage;
--
2.19.0

2021-05-04 07:53:56

by Bayduraev, Alexey V

Subject: [PATCH v5 13/20] perf session: move reader structure to the top

Move the reader structure to the top of the file. This is necessary
for the further use of this structure in the decompressor.

Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 54 +++++++++++++++++++--------------------
1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 573bdb2f446b..a4d225e0569c 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -36,6 +36,33 @@
#include "units.h"
#include <internal/lib.h>

+/*
+ * On 64bit we can mmap the data file in one go. No need for tiny mmap
+ * slices. On 32bit we use 32MB.
+ */
+#if BITS_PER_LONG == 64
+#define MMAP_SIZE ULLONG_MAX
+#define NUM_MMAPS 1
+#else
+#define MMAP_SIZE (32 * 1024 * 1024ULL)
+#define NUM_MMAPS 128
+#endif
+
+struct reader;
+
+typedef s64 (*reader_cb_t)(struct perf_session *session,
+ union perf_event *event,
+ u64 file_offset,
+ const char *file_path);
+
+struct reader {
+ int fd;
+ const char *path;
+ u64 data_size;
+ u64 data_offset;
+ reader_cb_t process;
+};
+
#ifdef HAVE_ZSTD_SUPPORT
static int perf_session__process_compressed_event(struct perf_session *session,
union perf_event *event, u64 file_offset,
@@ -2143,33 +2170,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
return 0;
}

-/*
- * On 64bit we can mmap the data file in one go. No need for tiny mmap
- * slices. On 32bit we use 32MB.
- */
-#if BITS_PER_LONG == 64
-#define MMAP_SIZE ULLONG_MAX
-#define NUM_MMAPS 1
-#else
-#define MMAP_SIZE (32 * 1024 * 1024ULL)
-#define NUM_MMAPS 128
-#endif
-
-struct reader;
-
-typedef s64 (*reader_cb_t)(struct perf_session *session,
- union perf_event *event,
- u64 file_offset,
- const char *file_path);
-
-struct reader {
- int fd;
- const char *path;
- u64 data_size;
- u64 data_offset;
- reader_cb_t process;
-};
-
static int
reader__process_events(struct reader *rd, struct perf_session *session,
struct ui_progress *prog)
--
2.19.0

2021-05-04 07:54:56

by Bayduraev, Alexey V

Subject: [PATCH v5 15/20] perf session: introduce reader objects in session object

Allow allocating multiple reader objects so that multiple data files
located in the data directory can be loaded at the same time.

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 31 ++++++++++++++++++++-----------
tools/perf/util/session.h | 3 +++
2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 054f4d04eea9..65ce798eb27d 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -341,6 +341,10 @@ void perf_session__delete(struct perf_session *session)
auxtrace_index__free(&session->auxtrace_index);
perf_session__destroy_kernel_maps(session);
perf_session__delete_threads(session);
+ if (session->readers) {
+ zfree(&session->readers);
+ session->nr_readers = 0;
+ }
perf_session__release_decomp_events(session);
perf_env__exit(&session->header.env);
machines__exit(&session->machines);
@@ -2295,13 +2299,7 @@ static s64 process_simple(struct perf_session *session,

static int __perf_session__process_events(struct perf_session *session)
{
- struct reader rd = {
- .fd = perf_data__fd(session->data),
- .data_size = session->header.data_size,
- .data_offset = session->header.data_offset,
- .process = process_simple,
- .path = session->data->file.path,
- };
+ struct reader *rd;
struct ordered_events *oe = &session->ordered_events;
struct perf_tool *tool = session->tool;
struct ui_progress prog;
@@ -2309,12 +2307,23 @@ static int __perf_session__process_events(struct perf_session *session)

perf_tool__fill_defaults(tool);

- if (rd.data_size == 0)
- return -1;
+ rd = session->readers = zalloc(sizeof(struct reader));
+ if (!rd)
+ return -ENOMEM;
+
+ session->nr_readers = 1;
+
+ *rd = (struct reader) {
+ .fd = perf_data__fd(session->data),
+ .data_size = session->header.data_size,
+ .data_offset = session->header.data_offset,
+ .process = process_simple,
+ .path = session->data->file.path,
+ };

- ui_progress__init_size(&prog, rd.data_size, "Processing events...");
+ ui_progress__init_size(&prog, rd->data_size, "Processing events...");

- err = reader__process_events(&rd, session, &prog);
+ err = reader__process_events(rd, session, &prog);
if (err)
goto out_err;
/* do the final flush for ordered samples */
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 6895a22ab5b7..2815d00b5467 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -19,6 +19,7 @@ struct thread;

struct auxtrace;
struct itrace_synth_opts;
+struct reader;

struct perf_session {
struct perf_header header;
@@ -41,6 +42,8 @@ struct perf_session {
struct zstd_data zstd_data;
struct decomp *decomp;
struct decomp *decomp_last;
+ struct reader *readers;
+ int nr_readers;
};

struct decomp {
--
2.19.0

2021-05-04 07:55:00

by Bayduraev, Alexey V

Subject: [PATCH v5 20/20] perf session: load data directory files for analysis

Load data directory files and provide basic raw dump and aggregated
analysis support of data directories in report mode, still with no
memory consumption optimizations.
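
The reading order used by the loop in this patch (take roughly a fixed amount
of data from one reader, then move on to the next one, skipping readers that
reached EOF) can be illustrated with a small self-contained model. This is
only a sketch and not the patch code: struct reader_sketch and
read_round_robin() are hypothetical names, records have a fixed size, and no
mmap or event processing is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one data file made of fixed-size records. */
struct reader_sketch {
	size_t left;      /* bytes not yet consumed */
	size_t rec_size;  /* size of each record */
	size_t chunk;     /* bytes consumed during the current turn */
	int eof;          /* nonzero once no complete record is left */
};

/*
 * Consume records from nr readers in turn, switching to the next reader
 * once chunk_limit bytes were taken from the current one. order[] receives
 * the reader index of every consumed record, up to max entries.
 * Returns the number of records consumed.
 */
static size_t read_round_robin(struct reader_sketch *rd, int nr,
			       size_t chunk_limit, int *order, size_t max)
{
	size_t n = 0;
	int i = 0, active = 0, k;

	assert(rd && order && nr > 0);
	for (k = 0; k < nr; k++) {
		rd[k].chunk = 0;
		rd[k].eof = rd[k].rec_size == 0 || rd[k].left < rd[k].rec_size;
		active += !rd[k].eof;
	}

	while (active && n < max) {
		if (rd[i].eof) {
			/* nothing left here, try the next reader */
			i = (i + 1) % nr;
			continue;
		}

		rd[i].left -= rd[i].rec_size;
		rd[i].chunk += rd[i].rec_size;
		order[n++] = i;

		if (rd[i].left < rd[i].rec_size) {
			rd[i].eof = 1;
			active--;
		}
		if (rd[i].chunk >= chunk_limit) {
			/* this reader's turn is over */
			rd[i].chunk = 0;
			i = (i + 1) % nr;
		}
	}
	return n;
}
```

For two readers holding 30 and 20 bytes of 10-byte records and a chunk limit
of 20 bytes, the model consumes records in the order 0, 0, 1, 1, 0.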

Design and implementation are based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Jiri Olsa <[email protected]>
Signed-off-by: Alexey Bayduraev <[email protected]>
---
tools/perf/util/session.c | 127 ++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 47a414016510..1a741e6fc35b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -65,6 +65,7 @@ struct reader_state {
u64 data_size;
u64 head;
bool eof;
+ u64 size;
};

enum {
@@ -2316,6 +2317,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
if (skip)
size += skip;

+ st->size += size;
st->head += size;
st->file_pos += size;

@@ -2414,6 +2416,128 @@ static int __perf_session__process_events(struct perf_session *session)
return err;
}

+/*
+ * This function reads, merges and processes directory data.
+ * It assumes version 1 of the directory data format, where each
+ * data file holds per-cpu data, already sorted by kernel.
+ */
+static int __perf_session__process_dir_events(struct perf_session *session)
+{
+ struct perf_data *data = session->data;
+ struct perf_tool *tool = session->tool;
+ int i, ret = 0, readers = 1;
+ struct ui_progress prog;
+ u64 total_size = perf_data__size(session->data);
+ struct reader *rd;
+
+ perf_tool__fill_defaults(tool);
+
+ ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+ for (i = 0; i < data->dir.nr; i++) {
+ if (data->dir.files[i].size)
+ readers++;
+ }
+
+ rd = session->readers = zalloc(readers * sizeof(struct reader));
+ if (!rd)
+ return -ENOMEM;
+ session->nr_readers = readers;
+ readers = 0;
+
+ rd[readers] = (struct reader) {
+ .fd = perf_data__fd(session->data),
+ .path = session->data->file.path,
+ .data_size = session->header.data_size,
+ .data_offset = session->header.data_offset,
+ };
+ ret = reader__init(&rd[readers], NULL);
+ if (ret)
+ goto out_err;
+ ret = reader__mmap(&rd[readers], session);
+ if (ret != READER_OK) {
+ if (ret == READER_EOF)
+ ret = -EINVAL;
+ goto out_err;
+ }
+ readers++;
+
+ for (i = 0; i < data->dir.nr; i++) {
+ if (data->dir.files[i].size) {
+ rd[readers] = (struct reader) {
+ .fd = data->dir.files[i].fd,
+ .path = data->dir.files[i].path,
+ .data_size = data->dir.files[i].size,
+ .data_offset = 0,
+ };
+ ret = reader__init(&rd[readers], NULL);
+ if (ret)
+ goto out_err;
+ ret = reader__mmap(&rd[readers], session);
+ if (ret != READER_OK) {
+ if (ret == READER_EOF)
+ ret = -EINVAL;
+ goto out_err;
+ }
+ readers++;
+ }
+ }
+
+ i = 0;
+
+ while ((ret >= 0) && readers) {
+ if (session_done())
+ return 0;
+
+ if (rd[i].state.eof) {
+ i = (i + 1) % session->nr_readers;
+ continue;
+ }
+
+ ret = reader__read_event(&rd[i], session, &prog);
+ if (ret < 0)
+ break;
+ if (ret == READER_EOF) {
+ ret = reader__mmap(&rd[i], session);
+ if (ret < 0)
+ goto out_err;
+ if (ret == READER_EOF)
+ readers--;
+ }
+
+ /*
+ * Processing 10MBs of data from each reader in sequence,
+ * because that's the way the ordered events sorting works
+ * most efficiently.
+ */
+ if (rd[i].state.size >= 10*1024*1024) {
+ rd[i].state.size = 0;
+ i = (i + 1) % session->nr_readers;
+ }
+ }
+
+ ret = ordered_events__flush(&session->ordered_events, OE_FLUSH__FINAL);
+ if (ret)
+ goto out_err;
+
+ ret = perf_session__flush_thread_stacks(session);
+out_err:
+ ui_progress__finish();
+
+ if (!tool->no_warn)
+ perf_session__warn_about_errors(session);
+
+ /*
+ * We may be switching perf.data output, so make ordered_events
+ * reusable.
+ */
+ ordered_events__reinit(&session->ordered_events);
+
+ session->one_mmap = false;
+
+ return ret;
+}
+
int perf_session__process_events(struct perf_session *session)
{
if (perf_session__register_idle_thread(session) < 0)
@@ -2422,6 +2546,9 @@ int perf_session__process_events(struct perf_session *session)
if (perf_data__is_pipe(session->data))
return __perf_session__process_pipe_events(session);

+ if (perf_data__is_dir(session->data))
+ return __perf_session__process_dir_events(session);
+
return __perf_session__process_events(session);
}

--
2.19.0

2021-05-06 06:21:43

by Namhyung Kim

Subject: Re: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation

Hello,

On Tue, May 4, 2021 at 12:05 AM Alexey Bayduraev
<[email protected]> wrote:
>
> Basic analysis of data directories is provided in perf report mode.
> Raw dump and aggregated reports are available for data directories,
> still with no memory consumption optimizations.

Do you have an idea how to improve it?

I have to say again that I don't like merely adding more threads to
record. Yeah, parallelizing the perf record is good, but we have to
think about the perf report (and others) too.

Thanks,
Namhyung



2021-05-06 12:45:12

by Bayduraev, Alexey V

Subject: Re: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation

Hi,

On 06.05.2021 9:20, Namhyung Kim wrote:
> Hello,
>
> On Tue, May 4, 2021 at 12:05 AM Alexey Bayduraev
> <[email protected]> wrote:
>>
<SNIP>
>>
>> Basic analysis of data directories is provided in perf report mode.
>> Raw dump and aggregated reports are available for data directories,
>> still with no memory consumption optimizations.
>
> Do you have an idea how to improve it?
>
> I have to say again that I don't like merely adding more threads to
> record. Yeah, parallelizing the perf record is good, but we have to
> think about the perf report (and others) too.

There is your idea of separating out the tracking records and processing
them first, but those changes could be much larger than my patch set. I think
they look like an independent patch set that could be introduced as an
extension of parallel data loading.

I have also thought about and experimented with intermediate flushing of
the ordered queue. This is simple for per-cpu data files (sorted
by time), but it is not clear how to do it for arbitrary CPU masks.
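
As a rough illustration of why the per-cpu case is simple, here is a minimal
sketch in C. It is not code from the patch set: struct stream_sketch and
safe_flush_time() are hypothetical names, and the sketch assumes that every
data file is sorted by time, which does not have to hold when one file
collects data from several cpus.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-file state kept while loading a data directory. */
struct stream_sketch {
	uint64_t last_ts;  /* timestamp of the newest event queued from this file */
	int eof;           /* nonzero once the file is fully read */
};

/*
 * Return a flush bound for the ordered queue: if every file is sorted by
 * time, no file that is still being read can later deliver an event older
 * than the returned value, so queued events older than it can be flushed.
 * Returns UINT64_MAX when all files are at EOF (everything may be flushed).
 */
static uint64_t safe_flush_time(const struct stream_sketch *s, int nr)
{
	uint64_t limit = UINT64_MAX;
	int i;

	assert(s && nr > 0);
	for (i = 0; i < nr; i++) {
		if (!s[i].eof && s[i].last_ts < limit)
			limit = s[i].last_ts;
	}
	return limit;
}
```

A file that has not delivered any event yet keeps last_ts at 0 in this
sketch, so nothing is flushed until every open file has contributed at least
one event.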

I think my patch set can be the first step toward introducing parallel mode
in the perf tool. It just extends perf-record (already used in our
VTune tool) and allows parallel data to be loaded in an experimental mode.
Later patches could optimize and extend parallel data loading.

Regards,
Alexey

>> tools/perf/util/record.h | 2 +
>> tools/perf/util/session.c | 491 +++++++---
>> tools/perf/util/session.h | 5 +
>> tools/perf/util/tool.h | 3 +-
>> 16 files changed, 1474 insertions(+), 199 deletions(-)
>>
>> --
>> 2.19.0
>>

2021-05-06 14:19:56

by Andi Kleen

Subject: Re: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation


On 5/5/2021 11:20 PM, Namhyung Kim wrote:
>
> Do you have an idea how to improve it?
>
> I have to say again that I don't like merely adding more threads to
> record. Yeah, parallelizing the perf record is good, but we have to
> think about the perf report (and others) too.

perf report/script can already be parallelized with --time xx/x% and a
simple shell script that runs multiple processes. While that's a bit
awkward for interactive use, it works fine for scripting. I use it all
the time for PT batch processing, for example. The real bottleneck we
have is record on systems with many CPUs (which are more and more
common), and that can only be fixed with some variant of this patch kit.
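
A rough sketch of such a driver, written here in Python rather than
shell. It assumes the percent-slice form of --time documented for perf
report (for example "10%/2" for the second 10% slice); the function
names, the prof.data file name and the output file naming are
illustrative assumptions only:

```python
import subprocess


def slice_specs(n_slices):
    """Return --time percent-slice strings such as '25%/1'."""
    if n_slices < 1 or 100 % n_slices != 0:
        raise ValueError("n_slices must divide 100 evenly")
    percent = 100 // n_slices
    return [f"{percent}%/{i}" for i in range(1, n_slices + 1)]


def build_commands(data_file, n_slices):
    """Build one 'perf report' command line per time slice."""
    return [
        ["perf", "report", "-i", data_file, "--stdio", "--time", spec]
        for spec in slice_specs(n_slices)
    ]


def run_parallel(commands, out_prefix="report"):
    """Start all commands at once, one output file each, and wait."""
    procs = []
    for i, cmd in enumerate(commands, start=1):
        out = open(f"{out_prefix}.{i}.txt", "w")
        procs.append((subprocess.Popen(cmd, stdout=out), out))
    codes = []
    for proc, out in procs:
        codes.append(proc.wait())  # collect each exit status
        out.close()
    return codes
```

Calling run_parallel(build_commands("prof.data", 4)) would start four
perf report processes, each covering a quarter of the trace's time
range. Combining the per-slice outputs is left to the caller.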

-Andi

2021-05-07 06:37:57

by Namhyung Kim

Subject: Re: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation

On Thu, May 6, 2021 at 5:44 AM Bayduraev, Alexey V
<[email protected]> wrote:
>
> Hi,
>
> On 06.05.2021 9:20, Namhyung Kim wrote:
> > Hello,
> >
> > On Tue, May 4, 2021 at 12:05 AM Alexey Bayduraev
> > <[email protected]> wrote:
> >>
> <SNIP>
> >>
> >> Basic analysis of data directories is provided in perf report mode.
> >> Raw dump and aggregated reports are available for data directories,
> >> still with no memory consumption optimizations.
> >
> > Do you have an idea how to improve it?
> >
> > I have to say again that I don't like merely adding more threads to
> > record. Yeah, parallelizing the perf record is good, but we have to
> > think about the perf report (and others) too.
>
> There is your idea of separating tracking records and processing them
> first, but those changes could be much larger than my patch. I think
> they look like an independent patch and could be introduced as an
> extension of parallel data loading.
>
> I also thought about and experimented with intermediate flushing of
> the ordered queue. This is simple for per-CPU data files (sorted
> by time), but it is not clear how to do it for arbitrary CPU masks.
>
> I think my patch can be the first step toward introducing a parallel
> mode in the perf tool. It just extends perf-record (already used in our
> VTune tool) and allows loading parallel data in experimental mode.
> Follow-up patches could optimize and extend parallel data loading.

Yeah, I agree that we can change it incrementally, and it's good to
know that you are thinking about the next step. :)

Thanks,
Namhyung

2021-05-07 07:04:01

by Namhyung Kim

Subject: Re: [PATCH v5 00/20] Introduce threaded trace streaming for basic perf record operation

Hi Andi,

On Thu, May 6, 2021 at 7:17 AM Andi Kleen <[email protected]> wrote:
>
>
> On 5/5/2021 11:20 PM, Namhyung Kim wrote:
> >
> > Do you have an idea how to improve it?
> >
> > I have to say again that I don't like merely adding more threads to
> > record. Yeah, parallelizing the perf record is good, but we have to
> > think about the perf report (and others) too.
>
> perf report/script can already be parallelized with --time xx/x% and a
> simple shell script that runs multiple processes. While that's a bit
> awkward for interactive use, it works fine for scripting. I use it all
> the time for PT batch processing, for example. The real bottleneck we
> have is record on systems with many CPUs (which are more and more
> common), and that can only be fixed with some variant of this patch kit.

Right, spreading partial analysis across multiple processes would work
for some use cases. I also agree that parallelizing perf record is
more important, but if there's a way to improve perf report/script,
we should try that too.

Thanks,
Namhyung