2019-12-23 06:09:54

by Namhyung Kim

Subject: [PATCHSET 0/9] perf: Improve cgroup profiling (v3)

Hello,

This work improves cgroup profiling in perf. Currently perf only
supports profiling tasks in a specific cgroup, and there is no way to
identify which cgroup a given sample belongs to. So I added
PERF_SAMPLE_CGROUP to add the cgroup id to each sample. It's a 64-bit
integer holding the file handle of the cgroup. The kernel also
generates a PERF_RECORD_CGROUP event for new groups to correlate the
cgroup id with the cgroup name (path in the cgroup filesystem). The
cgroup id can be read from userspace with the name_to_handle_at()
system call, so perf can synthesize the CGROUP event for existing groups.
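
As a rough illustration of the userspace side, the sketch below reads the
64-bit id of one cgroup directory with name_to_handle_at(). The helper name
read_cgroup_id() and the union layout are assumptions made for this example,
not code from the series; it also assumes the path lives on a cgroup
filesystem, where the file handle is 8 bytes wide.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read the cgroup id of the directory at 'path'.
 * Returns 0 on success, -1 on failure (errno is set by the syscall).
 */
static int read_cgroup_id(const char *path, uint64_t *id)
{
	/* struct file_handle ends in a variable-length payload; reserve 8 bytes for it. */
	union {
		struct file_handle fh;
		unsigned char raw[sizeof(struct file_handle) + sizeof(uint64_t)];
	} handle;
	int mount_id;

	assert(path != NULL && id != NULL);

	memset(&handle, 0, sizeof(handle));
	handle.fh.handle_bytes = sizeof(uint64_t);

	if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
		return -1;

	/* On cgroupfs the handle payload is the 64-bit cgroup id. */
	memcpy(id, handle.raw + offsetof(struct file_handle, f_handle), sizeof(*id));
	return 0;
}
```

A caller would pass something like "/sys/fs/cgroup/perf_event/sub/cgrp1" and
pair the returned id with the path to build a synthesized CGROUP event.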

So why do we want this? Systems running a large number of jobs in
different cgroups want to profile such jobs precisely. This includes
container hosting systems widely used today. Currently perf supports
namespace tracking, but such systems may not use (cgroup) namespaces
for their jobs. Also it'd be more intuitive to see cgroup names (as
they're given by the user or sysadmin) rather than numeric
cgroup/namespace ids, even if they use namespaces.

From Stephane Eranian:
> In data centers you care about attributing samples to a job not so
> much to a process. A job may have multiple processes which may come
> and go. The cgroup on the other hand stays around for the entire
> lifetime of the job. It is much easier to map a cgroup name to a
> particular job than it is to map a pid back to a job name,
> especially for offline post-processing.

Note that this only works for "perf_event" cgroups (obviously), so if
users are still using the cgroup-v1 interface, they need to have the
same hierarchy for the subsystem(s) they want to profile with it.

* Changes from v2:
- remove path_len from cgroup_event
- bail out if kernel doesn't support cgroup sampling
- add some description in the Kconfig

* Changes from v1:
- use new cgroup id (= file handle)

The testing result looks something like this:

[root@qemu build]# ./perf record --all-cgroups ./cgtest
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.009 MB perf.data (150 samples) ]

[root@qemu build]# ./perf report -s cgroup,comm --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 150 of event 'cpu-clock:pppH'
# Event count (approx.): 37500000
#
# Overhead  cgroup      Command
# ........  ..........  .......
#
    32.00%  /sub/cgrp2  looper2
    28.00%  /sub/cgrp1  looper1
    25.33%  /sub        looper0
     4.00%  /           cgtest
     4.00%  /sub        cgtest
     3.33%  /sub/cgrp2  cgtest
     2.67%  /sub/cgrp1  cgtest
     0.67%  /           looper0


The test program (cgtest) follows.

Thanks,
Namhyung


Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Adrian Hunter <[email protected]>


-------8<-----------------------------------------8<----------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <sys/mount.h>

char cgbase[] = "/sys/fs/cgroup/perf_event";

/* Create the cgroup directory cgbase + name. */
void mkcgrp(char *name) {
	char buf[256];

	snprintf(buf, sizeof(buf), "%s%s", cgbase, name);
	if (mkdir(buf, 0755) < 0)
		perror("mkdir");
}

/* Remove the cgroup directory cgbase + name. */
void rmcgrp(char *name) {
	char buf[256];

	snprintf(buf, sizeof(buf), "%s%s", cgbase, name);
	if (rmdir(buf) < 0)
		perror("rmdir");
}

/* Move the calling task into the cgroup by writing "0" to its tasks file. */
void setcgrp(char *name) {
	char buf[256];
	int fd;

	snprintf(buf, sizeof(buf), "%s%s/tasks", cgbase, name);

	fd = open(buf, O_WRONLY);
	if (fd >= 0) {
		if (write(fd, "0\n", 2) != 2)
			perror("write");
		close(fd);
	}
}

void create_sub_cgroup(int idx) {
	char buf[128];

	snprintf(buf, sizeof(buf), "/sub/cgrp%d", idx);
	mkcgrp(buf);
}

void remove_sub_cgroup(int idx) {
	char buf[128];

	snprintf(buf, sizeof(buf), "/sub/cgrp%d", idx);
	rmcgrp(buf);
}

void set_sub_cgroup(int idx) {
	char buf[128];

	snprintf(buf, sizeof(buf), "/sub/cgrp%d", idx);
	setcgrp(buf);
}

void set_task_name(int idx) {
	char buf[16];

	snprintf(buf, sizeof(buf), "looper%d", idx);
	prctl(PR_SET_NAME, buf, 0, 0, 0);
}

/* Busy loop to generate samples. */
void loop(unsigned max) {
	volatile unsigned int count = 1;

	while (count++ != max) {
		asm volatile ("pause");
	}
}

void worker(int idx, unsigned cnt, int new_ns) {
	create_sub_cgroup(idx);
	set_sub_cgroup(idx);

	if (new_ns) {
		if (unshare(CLONE_NEWCGROUP) < 0)
			perror("unshare");

#if 0  /* FIXME */
		if (unshare(CLONE_NEWNS) < 0)
			perror("unshare");

		if (mount("none", "/sys", NULL, MS_REMOUNT | MS_REC | MS_SLAVE, NULL) < 0)
			perror("mount --make-rslave");

		sleep(1);
		if (umount("/sys/fs/cgroup/perf_event") < 0)
			perror("umount");

		if (mount("cgroup", "/sys/fs/cgroup/perf_event", "cgroup",
			  MS_NODEV | MS_NOEXEC | MS_NOSUID, "perf_event") < 0)
			perror("mount again");
#endif
	}

	/* The looper runs in a child so it shows up as a separate task. */
	if (fork() == 0) {
		set_task_name(idx);
		loop(cnt);
		exit(0);
	}
	wait(NULL);
}

int main(int argc, char *argv[])
{
	int i, nr = 2;
	int new_ns = 1;
	unsigned cnt = 1000000;

	if (argc > 1)
		nr = atoi(argv[1]);
	if (argc > 2)
		cnt = atoi(argv[2]);
	if (argc > 3)
		new_ns = atoi(argv[3]);

	mkcgrp("/sub");
	setcgrp("/sub");

	for (i = 0; i < nr; i++) {
		if (fork() == 0) {
			worker(i+1, cnt, new_ns);
			exit(0);
		}
	}

	set_task_name(0);
	loop(cnt);

	for (i = 0; i < nr; i++)
		wait(NULL);

	for (i = 0; i < nr; i++)
		remove_sub_cgroup(i+1);

	setcgrp("/");
	rmcgrp("/sub");

	return 0;
}
-------8<-----------------------------------------8<----------------


Namhyung Kim (9):
perf/core: Add PERF_RECORD_CGROUP event
perf/core: Add PERF_SAMPLE_CGROUP feature
perf tools: Basic support for CGROUP event
perf tools: Maintain cgroup hierarchy
perf report: Add 'cgroup' sort key
perf record: Support synthesizing cgroup events
perf record: Add --all-cgroups option
perf top: Add --all-cgroups option
perf script: Add --show-cgroup-events option

include/linux/perf_event.h | 1 +
include/uapi/linux/perf_event.h | 16 ++-
init/Kconfig | 3 +-
kernel/events/core.c | 133 ++++++++++++++++++++++
tools/include/uapi/linux/perf_event.h | 16 ++-
tools/perf/Documentation/perf-record.txt | 5 +-
tools/perf/Documentation/perf-report.txt | 1 +
tools/perf/Documentation/perf-script.txt | 3 +
tools/perf/Documentation/perf-top.txt | 4 +
tools/perf/builtin-diff.c | 1 +
tools/perf/builtin-record.c | 10 ++
tools/perf/builtin-report.c | 1 +
tools/perf/builtin-script.c | 41 +++++++
tools/perf/builtin-top.c | 9 ++
tools/perf/lib/include/perf/event.h | 7 ++
tools/perf/util/cgroup.c | 75 +++++++++++-
tools/perf/util/cgroup.h | 16 ++-
tools/perf/util/event.c | 19 ++++
tools/perf/util/event.h | 6 +
tools/perf/util/evsel.c | 17 ++-
tools/perf/util/evsel.h | 1 +
tools/perf/util/hist.c | 7 ++
tools/perf/util/hist.h | 1 +
tools/perf/util/machine.c | 19 ++++
tools/perf/util/machine.h | 3 +
tools/perf/util/perf_event_attr_fprintf.c | 2 +
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 8 ++
tools/perf/util/sort.c | 31 +++++
tools/perf/util/sort.h | 2 +
tools/perf/util/synthetic-events.c | 119 +++++++++++++++++++
tools/perf/util/synthetic-events.h | 1 +
tools/perf/util/tool.h | 2 +
33 files changed, 568 insertions(+), 13 deletions(-)

--
2.24.1.735.g03f4e72817-goog


2019-12-23 06:09:54

by Namhyung Kim

Subject: [PATCH 5/9] perf report: Add 'cgroup' sort key

The cgroup sort key shows the cgroup membership of each task.
Currently it shows the full path in the cgroupfs (not relative to the
root of the cgroup namespace) since it'd be more intuitive IMHO.
Otherwise the root cgroup in different namespaces would all show the
same name - "/".

The cgroup sort key must come before cgroup_id in the dimension table,
otherwise sort_dimension__add() would match "cgroup" to cgroup_id, as
it only compares as many characters as the given key has.
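
To make the ordering issue concrete, here is a simplified model of
first-match prefix lookup. The struct and function names are made up for
this example and the real sort_dimension__add() does more than this; the
sketch only assumes that the comparison is limited to the length of the
user-supplied key.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical, simplified stand-in for a sort dimension table entry. */
struct demo_dim {
	const char *name;
};

/*
 * Return the index of the first entry whose name starts with 'tok'
 * (case-insensitive), or -1 if no entry matches.
 */
static int demo_find_dim(const struct demo_dim *dims, size_t nr, const char *tok)
{
	size_t i;

	assert(dims != NULL && tok != NULL);

	for (i = 0; i < nr; i++) {
		if (strncasecmp(tok, dims[i].name, strlen(tok)) == 0)
			return (int)i;
	}
	return -1;
}

/* "cgroup" listed first: both keys resolve to the intended entry. */
static const struct demo_dim demo_good_order[] = { { "cgroup" }, { "cgroup_id" } };

/* "cgroup_id" listed first: the key "cgroup" is captured by "cgroup_id". */
static const struct demo_dim demo_bad_order[] = { { "cgroup_id" }, { "cgroup" } };
```

With demo_bad_order, looking up "cgroup" stops at the "cgroup_id" entry
because the first six characters match, which is the situation the table
ordering in this patch avoids.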

For example, the output looks like the following. Note that the patch
adding the --all-cgroups option to perf record comes later in the series.

$ perf record -a --namespace --all-cgroups cgtest
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.208 MB perf.data (4090 samples) ]

$ perf report -s cgroup_id,cgroup,pid
...
# Overhead  cgroup id (dev/inode)  cgroup      Pid:Command
# ........  .....................  ..........  ...............
#
    93.96%  0/0x0                  /           0:swapper
     1.25%  3/0xeffffffb           /           278:looper0
     0.86%  3/0xf000015f           /sub/cgrp1  280:cgtest
     0.37%  3/0xf0000160           /sub/cgrp2  281:cgtest
     0.34%  3/0xf0000163           /sub/cgrp3  282:cgtest
     0.22%  3/0xeffffffb           /sub        278:looper0
     0.20%  3/0xeffffffb           /           280:cgtest
     0.15%  3/0xf0000163           /sub/cgrp3  285:looper3

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 1 +
tools/perf/util/hist.c | 7 ++++++
tools/perf/util/hist.h | 1 +
tools/perf/util/sort.c | 31 ++++++++++++++++++++++++
tools/perf/util/sort.h | 2 ++
5 files changed, 42 insertions(+)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 8dbe2119686a..5af43f3faf9f 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -95,6 +95,7 @@ OPTIONS
abort cost. This is the global weight.
- local_weight: Local weight version of the weight above.
- cgroup_id: ID derived from cgroup namespace device and inode numbers.
+ - cgroup: cgroup pathname in the cgroupfs.
- transaction: Transaction abort flags.
- overhead: Overhead percentage of sample
- overhead_sys: Overhead percentage of sample running in system mode
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ca5a8f4d007e..9f28d2f487c1 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -10,6 +10,7 @@
#include "mem-events.h"
#include "session.h"
#include "namespaces.h"
+#include "cgroup.h"
#include "sort.h"
#include "units.h"
#include "evlist.h"
@@ -222,6 +223,11 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)

if (h->trace_output)
hists__new_col_len(hists, HISTC_TRACE, strlen(h->trace_output));
+
+ if (h->cgroup) {
+ struct cgroup *cgrp = cgroup__findnew(h->cgroup, "<unknown>");
+ hists__new_col_len(hists, HISTC_CGROUP, strlen(cgrp->name));
+ }
}

void hists__output_recalc_col_len(struct hists *hists, int max_rows)
@@ -691,6 +697,7 @@ __hists__add_entry(struct hists *hists,
.dev = ns ? ns->link_info[CGROUP_NS_INDEX].dev : 0,
.ino = ns ? ns->link_info[CGROUP_NS_INDEX].ino : 0,
},
+ .cgroup = sample->cgroup,
.ms = {
.maps = al->maps,
.map = al->map,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 45286900aacb..191d6a84a96b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -38,6 +38,7 @@ enum hist_column {
HISTC_THREAD,
HISTC_COMM,
HISTC_CGROUP_ID,
+ HISTC_CGROUP,
HISTC_PARENT,
HISTC_CPU,
HISTC_SOCKET,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 9fcba2872130..9abdab67b0aa 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -25,6 +25,7 @@
#include "mem-events.h"
#include "annotate.h"
#include "time-utils.h"
+#include "cgroup.h"
#include <linux/kernel.h>
#include <linux/string.h>

@@ -635,6 +636,35 @@ struct sort_entry sort_cgroup_id = {
.se_width_idx = HISTC_CGROUP_ID,
};

+/* --sort cgroup */
+
+static int64_t
+sort__cgroup_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return right->cgroup - left->cgroup;
+}
+
+static int hist_entry__cgroup_snprintf(struct hist_entry *he,
+ char *bf, size_t size,
+ unsigned int width __maybe_unused)
+{
+ const char *cgrp_name = "N/A";
+
+ if (he->cgroup) {
+ struct cgroup *cgrp = cgroup__findnew(he->cgroup, cgrp_name);
+ cgrp_name = cgrp->name;
+ }
+
+ return repsep_snprintf(bf, size, "%s", cgrp_name);
+}
+
+struct sort_entry sort_cgroup = {
+ .se_header = "cgroup",
+ .se_cmp = sort__cgroup_cmp,
+ .se_snprintf = hist_entry__cgroup_snprintf,
+ .se_width_idx = HISTC_CGROUP,
+};
+
/* --sort socket */

static int64_t
@@ -1659,6 +1689,7 @@ static struct sort_dimension common_sort_dimensions[] = {
DIM(SORT_TRACE, "trace", sort_trace),
DIM(SORT_SYM_SIZE, "symbol_size", sort_sym_size),
DIM(SORT_DSO_SIZE, "dso_size", sort_dso_size),
+ DIM(SORT_CGROUP, "cgroup", sort_cgroup),
DIM(SORT_CGROUP_ID, "cgroup_id", sort_cgroup_id),
DIM(SORT_SYM_IPC_NULL, "ipc_null", sort_sym_ipc_null),
DIM(SORT_TIME, "time", sort_time),
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 5aff9542d9b7..321a3995f69b 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -101,6 +101,7 @@ struct hist_entry {
struct thread *thread;
struct comm *comm;
struct namespace_id cgroup_id;
+ u64 cgroup;
u64 ip;
u64 transaction;
s32 socket;
@@ -222,6 +223,7 @@ enum sort_type {
SORT_TRACE,
SORT_SYM_SIZE,
SORT_DSO_SIZE,
+ SORT_CGROUP,
SORT_CGROUP_ID,
SORT_SYM_IPC_NULL,
SORT_TIME,
--
2.24.1.735.g03f4e72817-goog

2019-12-23 06:09:54

by Namhyung Kim

Subject: [PATCH 3/9] perf tools: Basic support for CGROUP event

Implement basic functionality to support cgroup tracking. Each cgroup
can be identified by its inode number, which can be read from userspace
too. The actual cgroup processing will come in a later patch.
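
For readers unfamiliar with the record layout, the following sketch decodes
a PERF_RECORD_CGROUP-style buffer of the form { header, u64 id, char path[] }.
The demo_* names and the local header struct are assumptions for
illustration; the sketch ignores the optional trailing sample_id and is not
the parsing code used by perf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RECORD_CGROUP 19	/* value given to PERF_RECORD_CGROUP in this patch */

/* Local mirror of struct perf_event_header. */
struct demo_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;	/* total record size in bytes, header included */
};

/*
 * Decode the cgroup id and path from 'buf'. On success, store the id,
 * point *path into 'buf' and return 0. Return -1 if the record is not a
 * CGROUP record or is malformed.
 */
static int demo_parse_cgroup(const void *buf, size_t len,
			     uint64_t *id, const char **path)
{
	struct demo_header hdr;
	const char *p = buf;
	size_t off = sizeof(hdr);

	assert(buf != NULL && id != NULL && path != NULL);

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, p, sizeof(hdr));
	if (hdr.type != DEMO_RECORD_CGROUP)
		return -1;

	/* Need room for the id and at least a terminating NUL. */
	if (hdr.size > len || hdr.size < sizeof(hdr) + sizeof(*id) + 1)
		return -1;

	memcpy(id, p + off, sizeof(*id));
	off += sizeof(*id);

	/* The path must be NUL-terminated inside the record. */
	if (memchr(p + off, '\0', hdr.size - off) == NULL)
		return -1;

	*path = p + off;
	return 0;
}
```

The bounds checks matter because the path is variable length: a truncated
or corrupted record must not let the reader run past the record size.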

Cc: Adrian Hunter <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 16 ++++++++++++++--
tools/perf/builtin-diff.c | 1 +
tools/perf/builtin-report.c | 1 +
tools/perf/lib/include/perf/event.h | 7 +++++++
tools/perf/util/event.c | 18 ++++++++++++++++++
tools/perf/util/event.h | 6 ++++++
tools/perf/util/evsel.c | 6 ++++++
tools/perf/util/machine.c | 12 ++++++++++++
tools/perf/util/machine.h | 3 +++
tools/perf/util/perf_event_attr_fprintf.c | 2 ++
tools/perf/util/session.c | 4 ++++
tools/perf/util/tool.h | 1 +
12 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 377d794d3105..3a81e9806cb1 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -142,8 +142,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_REGS_INTR = 1U << 18,
PERF_SAMPLE_PHYS_ADDR = 1U << 19,
PERF_SAMPLE_AUX = 1U << 20,
+ PERF_SAMPLE_CGROUP = 1U << 21,

- PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */

__PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
};
@@ -377,7 +378,8 @@ struct perf_event_attr {
ksymbol : 1, /* include ksymbol events */
bpf_event : 1, /* include bpf events */
aux_output : 1, /* generate AUX records instead of events */
- __reserved_1 : 32;
+ cgroup : 1, /* include cgroup events */
+ __reserved_1 : 31;

union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1006,6 +1008,16 @@ enum perf_event_type {
*/
PERF_RECORD_BPF_EVENT = 18,

+ /*
+ * struct {
+ * struct perf_event_header header;
+ * u64 id;
+ * char path[];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CGROUP = 19,
+
PERF_RECORD_MAX, /* non-ABI */
};

diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index f8b6ae557d8b..83d4094bf152 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -455,6 +455,7 @@ static struct perf_diff pdiff = {
.fork = perf_event__process_fork,
.lost = perf_event__process_lost,
.namespaces = perf_event__process_namespaces,
+ .cgroup = perf_event__process_cgroup,
.ordered_events = true,
.ordering_requires_timestamps = true,
},
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 387311c67264..6ff0037b98ea 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1093,6 +1093,7 @@ int cmd_report(int argc, const char **argv)
.mmap2 = perf_event__process_mmap2,
.comm = perf_event__process_comm,
.namespaces = perf_event__process_namespaces,
+ .cgroup = perf_event__process_cgroup,
.exit = perf_event__process_exit,
.fork = perf_event__process_fork,
.lost = perf_event__process_lost,
diff --git a/tools/perf/lib/include/perf/event.h b/tools/perf/lib/include/perf/event.h
index 18106899cb4e..69b44d2cc0f5 100644
--- a/tools/perf/lib/include/perf/event.h
+++ b/tools/perf/lib/include/perf/event.h
@@ -105,6 +105,12 @@ struct perf_record_bpf_event {
__u8 tag[BPF_TAG_SIZE]; // prog tag
};

+struct perf_record_cgroup {
+ struct perf_event_header header;
+ __u64 id;
+ char path[PATH_MAX];
+};
+
struct perf_record_sample {
struct perf_event_header header;
__u64 array[];
@@ -352,6 +358,7 @@ union perf_event {
struct perf_record_mmap2 mmap2;
struct perf_record_comm comm;
struct perf_record_namespaces namespaces;
+ struct perf_record_cgroup cgroup;
struct perf_record_fork fork;
struct perf_record_lost lost;
struct perf_record_lost_samples lost_samples;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index c5447ff516a2..824c038e5c33 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -54,6 +54,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_NAMESPACES] = "NAMESPACES",
[PERF_RECORD_KSYMBOL] = "KSYMBOL",
[PERF_RECORD_BPF_EVENT] = "BPF_EVENT",
+ [PERF_RECORD_CGROUP] = "CGROUP",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -180,6 +181,12 @@ size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp)
return ret;
}

+size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp)
+{
+ return fprintf(fp, " cgroup: %" PRI_lu64 " %s\n",
+ event->cgroup.id, event->cgroup.path);
+}
+
int perf_event__process_comm(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample,
@@ -196,6 +203,14 @@ int perf_event__process_namespaces(struct perf_tool *tool __maybe_unused,
return machine__process_namespaces_event(machine, event, sample);
}

+int perf_event__process_cgroup(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ return machine__process_cgroup_event(machine, event, sample);
+}
+
int perf_event__process_lost(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample,
@@ -417,6 +432,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_NAMESPACES:
ret += perf_event__fprintf_namespaces(event, fp);
break;
+ case PERF_RECORD_CGROUP:
+ ret += perf_event__fprintf_cgroup(event, fp);
+ break;
case PERF_RECORD_MMAP2:
ret += perf_event__fprintf_mmap2(event, fp);
break;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 85223159737c..0ad3ba22817d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -135,6 +135,7 @@ struct perf_sample {
u32 raw_size;
u64 data_src;
u64 phys_addr;
+ u64 cgroup;
u32 flags;
u16 insn_len;
u8 cpumode;
@@ -321,6 +322,10 @@ int perf_event__process_namespaces(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_cgroup(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_event__process_mmap(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -376,6 +381,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index a69e64236120..0f5a67603139 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2250,6 +2250,12 @@ int perf_evsel__parse_sample(struct evsel *evsel, union perf_event *event,
array++;
}

+ data->cgroup = 0;
+ if (type & PERF_SAMPLE_CGROUP) {
+ data->cgroup = *array;
+ array++;
+ }
+
if (type & PERF_SAMPLE_AUX) {
OVERFLOW_CHECK_u64(array);
sz = *array++;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index c8c5410315e8..2c3223bec561 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -654,6 +654,16 @@ int machine__process_namespaces_event(struct machine *machine __maybe_unused,
return err;
}

+int machine__process_cgroup_event(struct machine *machine __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused)
+{
+ if (dump_trace)
+ perf_event__fprintf_cgroup(event, stdout);
+
+ return 0;
+}
+
int machine__process_lost_event(struct machine *machine __maybe_unused,
union perf_event *event, struct perf_sample *sample __maybe_unused)
{
@@ -1880,6 +1890,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
ret = machine__process_mmap_event(machine, event, sample); break;
case PERF_RECORD_NAMESPACES:
ret = machine__process_namespaces_event(machine, event, sample); break;
+ case PERF_RECORD_CGROUP:
+ ret = machine__process_cgroup_event(machine, event, sample); break;
case PERF_RECORD_MMAP2:
ret = machine__process_mmap2_event(machine, event, sample); break;
case PERF_RECORD_FORK:
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index be0a930eca89..fa1be9ea00fa 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -128,6 +128,9 @@ int machine__process_switch_event(struct machine *machine,
int machine__process_namespaces_event(struct machine *machine,
union perf_event *event,
struct perf_sample *sample);
+int machine__process_cgroup_event(struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample);
int machine__process_mmap_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 651203126c71..a37a89c747d8 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -35,6 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
+ bit_name(CGROUP),
{ .name = NULL, }
};
#undef bit_name
@@ -131,6 +132,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(ksymbol, p_unsigned);
PRINT_ATTRf(bpf_event, p_unsigned);
PRINT_ATTRf(aux_output, p_unsigned);
+ PRINT_ATTRf(cgroup, p_unsigned);

PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned);
PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index d0d7d25b23e3..6b4c12d48c3f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -471,6 +471,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->comm = process_event_stub;
if (tool->namespaces == NULL)
tool->namespaces = process_event_stub;
+ if (tool->cgroup == NULL)
+ tool->cgroup = process_event_stub;
if (tool->fork == NULL)
tool->fork = process_event_stub;
if (tool->exit == NULL)
@@ -1434,6 +1436,8 @@ static int machines__deliver_event(struct machines *machines,
return tool->comm(tool, event, sample, machine);
case PERF_RECORD_NAMESPACES:
return tool->namespaces(tool, event, sample, machine);
+ case PERF_RECORD_CGROUP:
+ return tool->cgroup(tool, event, sample, machine);
case PERF_RECORD_FORK:
return tool->fork(tool, event, sample, machine);
case PERF_RECORD_EXIT:
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 2abbf668b8de..472ef5eb4068 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -46,6 +46,7 @@ struct perf_tool {
mmap2,
comm,
namespaces,
+ cgroup,
fork,
exit,
lost,
--
2.24.1.735.g03f4e72817-goog

2019-12-23 06:10:06

by Namhyung Kim

Subject: [PATCH 9/9] perf script: Add --show-cgroup-events option

The --show-cgroup-events option prints CGROUP events in the output,
like the other --show-*-events options do for their event types.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-script.txt | 3 ++
tools/perf/builtin-script.c | 41 ++++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 2599b057e47b..3dd297600427 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -319,6 +319,9 @@ OPTIONS
--show-bpf-events
Display bpf events i.e. events of type PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT.

+--show-cgroup-events
+ Display cgroup events i.e. events of type PERF_RECORD_CGROUP.
+
--demangle::
Demangle symbol names to human readable form. It's enabled by default,
disable with --no-demangle.
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e2406b291c1c..3db4afc29430 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1681,6 +1681,7 @@ struct perf_script {
bool show_lost_events;
bool show_round_events;
bool show_bpf_events;
+ bool show_cgroup_events;
bool allocated;
bool per_event_dump;
struct evswitch evswitch;
@@ -2199,6 +2200,41 @@ static int process_namespaces_event(struct perf_tool *tool,
return ret;
}

+static int process_cgroup_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct thread *thread;
+ struct perf_script *script = container_of(tool, struct perf_script, tool);
+ struct perf_session *session = script->session;
+ struct evsel *evsel = perf_evlist__id2evsel(session->evlist, sample->id);
+ int ret = -1;
+
+ thread = machine__findnew_thread(machine, sample->pid, sample->tid);
+ if (thread == NULL) {
+ pr_debug("problem processing CGROUP event, skipping it.\n");
+ return -1;
+ }
+
+ if (perf_event__process_cgroup(tool, event, sample, machine) < 0)
+ goto out;
+
+ if (!evsel->core.attr.sample_id_all) {
+ sample->cpu = 0;
+ sample->time = 0;
+ }
+ if (!filter_cpu(sample)) {
+ perf_sample__fprintf_start(sample, thread, evsel,
+ PERF_RECORD_CGROUP, stdout);
+ perf_event__fprintf(event, stdout);
+ }
+ ret = 0;
+out:
+ thread__put(thread);
+ return ret;
+}
+
static int process_fork_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -2538,6 +2574,8 @@ static int __cmd_script(struct perf_script *script)
script->tool.context_switch = process_switch_event;
if (script->show_namespace_events)
script->tool.namespaces = process_namespaces_event;
+ if (script->show_cgroup_events)
+ script->tool.cgroup = process_cgroup_event;
if (script->show_lost_events)
script->tool.lost = process_lost_event;
if (script->show_round_events) {
@@ -3463,6 +3501,7 @@ int cmd_script(int argc, const char **argv)
.mmap2 = perf_event__process_mmap2,
.comm = perf_event__process_comm,
.namespaces = perf_event__process_namespaces,
+ .cgroup = perf_event__process_cgroup,
.exit = perf_event__process_exit,
.fork = perf_event__process_fork,
.attr = process_attr,
@@ -3563,6 +3602,8 @@ int cmd_script(int argc, const char **argv)
"Show context switch events (if recorded)"),
OPT_BOOLEAN('\0', "show-namespace-events", &script.show_namespace_events,
"Show namespace events (if recorded)"),
+ OPT_BOOLEAN('\0', "show-cgroup-events", &script.show_cgroup_events,
+ "Show cgroup events (if recorded)"),
OPT_BOOLEAN('\0', "show-lost-events", &script.show_lost_events,
"Show lost events (if recorded)"),
OPT_BOOLEAN('\0', "show-round-events", &script.show_round_events,
--
2.24.1.735.g03f4e72817-goog

2019-12-23 06:10:19

by Namhyung Kim

Subject: [PATCH 4/9] perf tools: Maintain cgroup hierarchy

Each cgroup is kept in the global cgroup_tree sorted by the cgroup id.
Hist entries carry the cgroup id, so they can be compared directly, and
later the id can be used to look up the cgroup name in this tree.
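
The find-or-create lookup can be sketched as below. This is a simplified
model that swaps the rb-tree for a plain unbalanced binary search tree, and
all demo_* names are hypothetical; it only illustrates how an id maps to a
name and how an unknown id ends up with a placeholder name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node; perf's struct cgroup embeds an rb_node instead. */
struct demo_cgroup {
	uint64_t id;
	char *name;
	struct demo_cgroup *left, *right;
};

/*
 * Return the node for 'id', creating it with 'path' as its name if it
 * does not exist yet. An existing node keeps its original name.
 * Returns NULL on allocation failure.
 */
static struct demo_cgroup *demo_findnew(struct demo_cgroup **root,
					uint64_t id, const char *path)
{
	struct demo_cgroup **p = root;
	struct demo_cgroup *cgrp;

	assert(root != NULL && path != NULL);

	while (*p != NULL) {
		cgrp = *p;
		if (cgrp->id == id)
			return cgrp;
		p = (id < cgrp->id) ? &cgrp->left : &cgrp->right;
	}

	cgrp = calloc(1, sizeof(*cgrp));
	if (cgrp == NULL)
		return NULL;

	cgrp->name = strdup(path);
	if (cgrp->name == NULL) {
		free(cgrp);
		return NULL;
	}
	cgrp->id = id;
	*p = cgrp;	/* link the new node at the position the search ended */
	return cgrp;
}

/* Free every node in the tree. */
static void demo_destroy(struct demo_cgroup *node)
{
	if (node == NULL)
		return;
	demo_destroy(node->left);
	demo_destroy(node->right);
	free(node->name);
	free(node);
}
```

Because a lookup with an unseen id inserts a node, callers that pass a
placeholder such as "<unknown>" should check for a NULL return before
dereferencing the result.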

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/cgroup.c | 72 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/cgroup.h | 15 +++++---
tools/perf/util/machine.c | 7 ++++
tools/perf/util/session.c | 4 +++
4 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index 4881d4af3381..4e8ef1db0c94 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -13,6 +13,8 @@

int nr_cgroups;

+static struct rb_root cgroup_tree = RB_ROOT;
+
static int
cgroupfs_find_mountpoint(char *buf, size_t maxlen)
{
@@ -250,3 +252,73 @@ int parse_cgroups(const struct option *opt, const char *str,
}
return 0;
}
+
+struct cgroup *cgroup__findnew(uint64_t id, const char *path)
+{
+ struct rb_node **p = &cgroup_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct cgroup *cgrp;
+
+ while (*p != NULL) {
+ parent = *p;
+ cgrp = rb_entry(parent, struct cgroup, node);
+
+ if (cgrp->id == id)
+ return cgrp;
+
+ if (cgrp->id < id)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ cgrp = malloc(sizeof(*cgrp));
+ if (cgrp == NULL)
+ return NULL;
+
+ cgrp->name = strdup(path);
+ if (cgrp->name == NULL) {
+ free(cgrp);
+ return NULL;
+ }
+
+ cgrp->fd = -1;
+ cgrp->id = id;
+ refcount_set(&cgrp->refcnt, 1);
+
+ rb_link_node(&cgrp->node, parent, p);
+ rb_insert_color(&cgrp->node, &cgroup_tree);
+
+ return cgrp;
+}
+
+struct cgroup *cgroup__find_by_path(const char *path)
+{
+ struct rb_node *node;
+
+ node = rb_first(&cgroup_tree);
+ while (node) {
+ struct cgroup *cgrp = rb_entry(node, struct cgroup, node);
+
+ if (!strcmp(cgrp->name, path))
+ return cgrp;
+
+ node = rb_next(&cgrp->node);
+ }
+
+ return NULL;
+}
+
+void destroy_cgroups(void)
+{
+ struct rb_node *node;
+ struct cgroup *cgrp;
+
+ while (!RB_EMPTY_ROOT(&cgroup_tree)) {
+ node = rb_first(&cgroup_tree);
+ cgrp = rb_entry(node, struct cgroup, node);
+
+ rb_erase(node, &cgroup_tree);
+ cgroup__put(cgrp);
+ }
+}
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 2ec11f01090d..381583df27c7 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -3,16 +3,18 @@
#define __CGROUP_H__

#include <linux/refcount.h>
+#include <linux/rbtree.h>

struct option;

struct cgroup {
- char *name;
- int fd;
- refcount_t refcnt;
+ struct rb_node node;
+ u64 id;
+ char *name;
+ int fd;
+ refcount_t refcnt;
};

-
extern int nr_cgroups; /* number of explicit cgroups defined */

struct cgroup *cgroup__get(struct cgroup *cgroup);
@@ -26,4 +28,9 @@ void evlist__set_default_cgroup(struct evlist *evlist, struct cgroup *cgroup);

int parse_cgroups(const struct option *opt, const char *str, int unset);

+struct cgroup *cgroup__findnew(uint64_t id, const char *path);
+struct cgroup *cgroup__find_by_path(const char *path);
+
+void destroy_cgroups(void);
+
#endif /* __CGROUP_H__ */
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2c3223bec561..19d40a2016d7 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -33,6 +33,7 @@
#include "asm/bug.h"
#include "bpf-event.h"
#include <internal/lib.h> // page_size
+#include "cgroup.h"

#include <linux/ctype.h>
#include <symbol/kallsyms.h>
@@ -658,9 +659,15 @@ int machine__process_cgroup_event(struct machine *machine __maybe_unused,
union perf_event *event,
struct perf_sample *sample __maybe_unused)
{
+ struct cgroup *cgrp;
+
if (dump_trace)
perf_event__fprintf_cgroup(event, stdout);

+ cgrp = cgroup__findnew(event->cgroup.id, event->cgroup.path);
+ if (cgrp == NULL)
+ return -ENOMEM;
+
return 0;
}

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6b4c12d48c3f..5e2c9f504184 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -289,6 +289,8 @@ static void perf_session__release_decomp_events(struct perf_session *session)
} while (1);
}

+extern void destroy_cgroups(void);
+
void perf_session__delete(struct perf_session *session)
{
if (session == NULL)
@@ -303,6 +305,8 @@ void perf_session__delete(struct perf_session *session)
if (session->data)
perf_data__close(session->data);
free(session);
+
+ destroy_cgroups();
}

static int process_event_synth_tracing_data_stub(struct perf_session *session
--
2.24.1.735.g03f4e72817-goog
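
For readers following the cgroup.h changes above, here is a minimal sketch of the find-or-create lookup that a function like cgroup__findnew() provides. It is not the perf implementation: it uses a plain unbalanced binary search tree instead of the kernel rbtree, and struct cgrp_node, cgrp_findnew() and cgrp_destroy() are hypothetical names chosen for illustration. Keeping the first name seen for an id is also an assumption.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for perf's struct cgroup: only id and name are kept. */
struct cgrp_node {
	uint64_t id;
	char *name;
	struct cgrp_node *left, *right;
};

/*
 * Find the node for 'id', or create one named 'path' if none exists.
 * Returns NULL only when memory allocation fails.
 */
static struct cgrp_node *cgrp_findnew(struct cgrp_node **root, uint64_t id,
				      const char *path)
{
	struct cgrp_node **p = root;
	size_t len = strlen(path) + 1;

	while (*p != NULL) {
		if (id == (*p)->id)
			return *p;	/* known id: keep the first name (assumption) */
		p = (id < (*p)->id) ? &(*p)->left : &(*p)->right;
	}

	*p = calloc(1, sizeof(**p));
	if (*p == NULL)
		return NULL;

	(*p)->name = malloc(len);
	if ((*p)->name == NULL) {
		free(*p);
		*p = NULL;
		return NULL;
	}
	memcpy((*p)->name, path, len);
	(*p)->id = id;
	return *p;
}

/* Free every node, as a destroy_cgroups()-style teardown would have to. */
static void cgrp_destroy(struct cgrp_node *node)
{
	if (node == NULL)
		return;
	cgrp_destroy(node->left);
	cgrp_destroy(node->right);
	free(node->name);
	free(node);
}
```

A caller would hand each PERF_RECORD_CGROUP event's id and path to cgrp_findnew() and later resolve sample ids through the same tree; a balanced tree such as the rbtree used in the patch avoids the worst-case linear lookups this sketch can degrade to.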

2019-12-23 06:10:24

by Namhyung Kim

Subject: [PATCH 7/9] perf record: Add --all-cgroups option

The --all-cgroups option enables cgroup profiling support. It tells
the kernel to record CGROUP events in the ring buffer so that perf
report can later identify the task/cgroup association.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 5 ++++-
tools/perf/builtin-record.c | 5 +++++
tools/perf/util/evsel.c | 11 ++++++++++-
tools/perf/util/evsel.h | 1 +
tools/perf/util/record.h | 1 +
5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index b23a4012a606..6dd9d69d7dee 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -385,7 +385,10 @@ displayed with the weight and local_weight sort keys. This currently works for
abort events and some memory events in precise mode on modern Intel CPUs.

--namespaces::
-Record events of type PERF_RECORD_NAMESPACES.
+Record events of type PERF_RECORD_NAMESPACES. This enables the 'cgroup_id' sort key.
+
+--all-cgroups::
+Record events of type PERF_RECORD_CGROUP. This enables the 'cgroup' sort key.

--transaction::
Record transaction flags for transaction related events.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2802de9538ff..7d7912e121d6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1433,6 +1433,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (rec->opts.record_namespaces)
tool->namespace_events = true;

+ if (rec->opts.record_cgroup)
+ tool->cgroup_events = true;
+
if (rec->opts.auxtrace_snapshot_mode || rec->switch_output.enabled) {
signal(SIGUSR2, snapshot_sig_handler);
if (rec->opts.auxtrace_snapshot_mode)
@@ -2363,6 +2366,8 @@ static struct option __record_options[] = {
"per thread proc mmap processing timeout in ms"),
OPT_BOOLEAN(0, "namespaces", &record.opts.record_namespaces,
"Record namespaces events"),
+ OPT_BOOLEAN(0, "all-cgroups", &record.opts.record_cgroup,
+ "Record cgroup events"),
OPT_BOOLEAN(0, "switch-events", &record.opts.record_switch_events,
"Record context switch events"),
OPT_BOOLEAN_FLAG(0, "all-kernel", &record.opts.all_kernel,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 0f5a67603139..ff6c6683a32b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1102,6 +1102,11 @@ void perf_evsel__config(struct evsel *evsel, struct record_opts *opts,
if (opts->record_namespaces)
attr->namespaces = track;

+ if (opts->record_cgroup) {
+ attr->cgroup = track && !perf_missing_features.cgroup;
+ perf_evsel__set_sample_bit(evsel, CGROUP);
+ }
+
if (opts->record_switch_events)
attr->context_switch = track;

@@ -1782,7 +1787,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
+ if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
+ perf_missing_features.cgroup = true;
+ pr_debug2_peo("Kernel has no cgroup sampling support, bailing out\n");
+ goto out_close;
+ } else if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
perf_missing_features.aux_output = true;
pr_debug2_peo("Kernel has no attr.aux_output support, bailing out\n");
goto out_close;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index dc14f4a823cd..df71b530740b 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -119,6 +119,7 @@ struct perf_missing_features {
bool ksymbol;
bool bpf;
bool aux_output;
+ bool cgroup;
};

extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 5421fd2ad383..24316458be20 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -34,6 +34,7 @@ struct record_opts {
bool auxtrace_snapshot_on_exit;
bool auxtrace_sample_mode;
bool record_namespaces;
+ bool record_cgroup;
bool record_switch_events;
bool all_kernel;
bool all_user;
--
2.24.1.735.g03f4e72817-goog
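
As a reading aid for the evsel.c hunks above, the following sketch models the probing order in isolation: after a failed open, the most recently added feature that was requested and not yet flagged gets the blame, and later configuration stops requesting it. The types and function names here (struct fake_attr, want_cgroup(), blame_newest_feature()) are hypothetical and only mirror the shape of the patch. No real perf_event_open() call is made.

```c
#include <stdbool.h>

/* Hypothetical miniature of perf_missing_features, newest feature last. */
struct missing_features {
	bool aux_output;
	bool cgroup;
};

/* Hypothetical miniature of the attr bits a caller asked for. */
struct fake_attr {
	bool aux_output;
	bool cgroup;
};

enum blamed_feature { BLAME_NONE, BLAME_AUX_OUTPUT, BLAME_CGROUP };

/* Config step: request cgroup events only while the kernel may support them. */
static bool want_cgroup(bool track, const struct missing_features *m)
{
	return track && !m->cgroup;
}

/*
 * Failure step: called after an open attempt failed. Features are checked
 * newest first, so cgroup is blamed before aux_output.
 */
static enum blamed_feature blame_newest_feature(struct missing_features *m,
						const struct fake_attr *attr)
{
	if (!m->cgroup && attr->cgroup) {
		m->cgroup = true;
		return BLAME_CGROUP;
	} else if (!m->aux_output && attr->aux_output) {
		m->aux_output = true;
		return BLAME_AUX_OUTPUT;
	}
	return BLAME_NONE;
}
```

In the patch both branches end in goto out_close, so the open fails rather than being retried without the feature; the flag only records which feature was found missing.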

2019-12-23 06:10:40

by Namhyung Kim

Subject: [PATCH 6/9] perf record: Support synthesizing cgroup events

Synthesize cgroup events by walking the cgroup filesystem directories.
Each cgroup event saves only the portion of the cgroup path after the
mount point, along with the cgroup id (which is actually a file handle).

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-record.c | 5 ++
tools/perf/util/cgroup.c | 3 +-
tools/perf/util/cgroup.h | 1 +
tools/perf/util/event.c | 1 +
tools/perf/util/synthetic-events.c | 119 +++++++++++++++++++++++++++++
tools/perf/util/synthetic-events.h | 1 +
tools/perf/util/tool.h | 1 +
7 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4c301466101b..2802de9538ff 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1397,6 +1397,11 @@ static int record__synthesize(struct record *rec, bool tail)
if (err < 0)
pr_warning("Couldn't synthesize bpf events.\n");

+ err = perf_event__synthesize_cgroups(tool, process_synthesized_event,
+ machine);
+ if (err < 0)
+ pr_warning("Couldn't synthesize cgroup events.\n");
+
err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->core.threads,
process_synthesized_event, opts->sample_address,
1);
diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index 4e8ef1db0c94..5147d22b3bda 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -15,8 +15,7 @@ int nr_cgroups;

static struct rb_root cgroup_tree = RB_ROOT;

-static int
-cgroupfs_find_mountpoint(char *buf, size_t maxlen)
+int cgroupfs_find_mountpoint(char *buf, size_t maxlen)
{
FILE *fp;
char mountpoint[PATH_MAX + 1], tokens[PATH_MAX + 1], type[PATH_MAX + 1];
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 381583df27c7..9a67060723fa 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -17,6 +17,7 @@ struct cgroup {

extern int nr_cgroups; /* number of explicit cgroups defined */

+int cgroupfs_find_mountpoint(char *buf, size_t maxlen);
struct cgroup *cgroup__get(struct cgroup *cgroup);
void cgroup__put(struct cgroup *cgroup);

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 824c038e5c33..28801c867f39 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -33,6 +33,7 @@
#include "bpf-event.h"
#include "tool.h"
#include "../perf.h"
+#include "cgroup.h"

static const char *perf_event__names[] = {
[0] = "TOTAL",
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index c423298fe62d..cfc850efa318 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -16,6 +16,7 @@
#include "util/synthetic-events.h"
#include "util/target.h"
#include "util/time-utils.h"
+#include "util/cgroup.h"
#include <linux/bitops.h>
#include <linux/kernel.h>
#include <linux/string.h>
@@ -413,6 +414,124 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
return rc;
}

+static int perf_event__synthesize_cgroup(struct perf_tool *tool,
+ union perf_event *event,
+ char *path, size_t mount_len,
+ perf_event__handler_t process,
+ struct machine *machine)
+{
+ size_t event_size = sizeof(event->cgroup) - sizeof(event->cgroup.path);
+ size_t path_len = strlen(path) - mount_len + 1;
+ struct {
+ struct file_handle fh;
+ uint64_t cgroup_id;
+ } handle;
+ int mount_id;
+
+ while (path_len % sizeof(u64))
+ path[mount_len + path_len++] = '\0';
+
+ memset(&event->cgroup, 0, event_size);
+
+ event->cgroup.header.type = PERF_RECORD_CGROUP;
+ event->cgroup.header.size = event_size + path_len + machine->id_hdr_size;
+
+ handle.fh.handle_bytes = sizeof(handle.cgroup_id);
+ if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0) {
+ pr_debug("name_to_handle_at failed: %s\n", path);
+ return -1;
+ }
+
+ event->cgroup.id = handle.cgroup_id;
+ strncpy(event->cgroup.path, path + mount_len, path_len);
+ memset(event->cgroup.path + path_len, 0, machine->id_hdr_size);
+
+ if (perf_tool__process_synth_event(tool, event, machine, process) < 0) {
+ pr_debug("process synth event failed\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int perf_event__walk_cgroup_tree(struct perf_tool *tool,
+ union perf_event *event,
+ char *path, size_t mount_len,
+ perf_event__handler_t process,
+ struct machine *machine)
+{
+ size_t pos = strlen(path);
+ DIR *d;
+ struct dirent *dent;
+ int ret = 0;
+
+ if (perf_event__synthesize_cgroup(tool, event, path, mount_len,
+ process, machine) < 0)
+ return -1;
+
+ d = opendir(path);
+ if (d == NULL) {
+ pr_debug("failed to open directory: %s\n", path);
+ return -1;
+ }
+
+ while ((dent = readdir(d)) != NULL) {
+ if (dent->d_type != DT_DIR)
+ continue;
+ if (!strcmp(dent->d_name, ".") ||
+ !strcmp(dent->d_name, ".."))
+ continue;
+
+ if (path[pos - 1] != '/')
+ strcat(path, "/");
+ strcat(path, dent->d_name);
+
+ ret = perf_event__walk_cgroup_tree(tool, event, path,
+ mount_len, process, machine);
+ if (ret < 0)
+ break;
+
+ path[pos] = '\0';
+ }
+
+ closedir(d);
+ return ret;
+}
+
+int perf_event__synthesize_cgroups(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine)
+{
+ union perf_event event;
+ char *cgrp_root;
+ size_t mount_len; /* length of mount point in the path */
+ int ret = -1;
+
+ cgrp_root = malloc(PATH_MAX);
+ if (cgrp_root == NULL)
+ return -1;
+
+ if (cgroupfs_find_mountpoint(cgrp_root, PATH_MAX) < 0) {
+ pr_debug("cannot find cgroup mount point\n");
+ goto out;
+ }
+
+ mount_len = strlen(cgrp_root);
+ /* make sure the path starts with a slash (after mount point) */
+ strcat(cgrp_root, "/");
+
+ if (perf_event__walk_cgroup_tree(tool, &event, cgrp_root, mount_len,
+ process, machine) < 0)
+ goto out;
+
+ ret = 0;
+
+out:
+ free(cgrp_root);
+
+ return ret;
+}
+
int perf_event__synthesize_modules(struct perf_tool *tool, perf_event__handler_t process,
struct machine *machine)
{
diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
index baead0cdc381..e7a3e9589738 100644
--- a/tools/perf/util/synthetic-events.h
+++ b/tools/perf/util/synthetic-events.h
@@ -45,6 +45,7 @@ int perf_event__synthesize_kernel_mmap(struct perf_tool *tool, perf_event__handl
int perf_event__synthesize_mmap_events(struct perf_tool *tool, union perf_event *event, pid_t pid, pid_t tgid, perf_event__handler_t process, struct machine *machine, bool mmap_data);
int perf_event__synthesize_modules(struct perf_tool *tool, perf_event__handler_t process, struct machine *machine);
int perf_event__synthesize_namespaces(struct perf_tool *tool, union perf_event *event, pid_t pid, pid_t tgid, perf_event__handler_t process, struct machine *machine);
+int perf_event__synthesize_cgroups(struct perf_tool *tool, perf_event__handler_t process, struct machine *machine);
int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_format, const struct perf_sample *sample);
int perf_event__synthesize_stat_config(struct perf_tool *tool, struct perf_stat_config *config, perf_event__handler_t process, struct machine *machine);
int perf_event__synthesize_stat_events(struct perf_stat_config *config, struct perf_tool *tool, struct evlist *evlist, perf_event__handler_t process, bool attrs);
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 472ef5eb4068..3fb67bd31e4a 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -79,6 +79,7 @@ struct perf_tool {
bool ordered_events;
bool ordering_requires_timestamps;
bool namespace_events;
+ bool cgroup_events;
bool no_warn;
enum show_feature_header show_feat_hdr;
};
--
2.24.1.735.g03f4e72817-goog
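
The path handling in perf_event__synthesize_cgroup() above can be checked in isolation. The sketch below computes the same two quantities: the part of the path stored in the event (everything after the mount point) and its length including the terminating NUL, rounded up to a multiple of 8 bytes. The helper names and the mount point used in the example are assumptions for illustration, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portion of the path that the event stores: everything after the mount
 * point. The caller must ensure mount_len <= strlen(full_path).
 */
static const char *cgroup_rel_path(const char *full_path, size_t mount_len)
{
	return full_path + mount_len;
}

/*
 * Bytes reserved for that portion: its length plus the NUL terminator,
 * rounded up to the next multiple of sizeof(uint64_t).
 */
static size_t cgroup_path_record_len(const char *full_path, size_t mount_len)
{
	size_t len = strlen(full_path) - mount_len + 1;

	return (len + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1);
}
```

With a hypothetical mount point of /sys/fs/cgroup/perf_event (25 bytes), the path /sys/fs/cgroup/perf_event/jobs/a stores "/jobs/a", which is 7 characters plus the NUL and needs no padding, while "/jobs/ab" takes 9 bytes and is padded to 16.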

2019-12-23 17:36:45

by Vince Weaver

Subject: Re: [PATCHSET 0/9] perf: Improve cgroup profiling (v3)

On Mon, 23 Dec 2019, Namhyung Kim wrote:

> This work is to improve cgroup profiling in perf. Currently it only
> supports profiling tasks in a specific cgroup and there's no way to
> identify which cgroup the current sample belongs to. So I added
> PERF_SAMPLE_CGROUP to add cgroup id into each sample. It's a 64-bit
> integer having file handle of the cgroup. And kernel also generates
> PERF_RECORD_CGROUP event for new groups to correlate the cgroup id and
> cgroup name (path in the cgroup filesystem). The cgroup id can be
> read from userspace by name_to_handle_at() system call so it can
> synthesize the CGROUP event for existing groups.

so is there a patch to the manpage that describes this new behavior in
perf_event_open()?

Vince

2019-12-24 00:41:13

by Namhyung Kim

Subject: Re: [PATCHSET 0/9] perf: Improve cgroup profiling (v3)

Hi Vince,

On Tue, Dec 24, 2019 at 2:35 AM Vince Weaver <[email protected]> wrote:
>
> On Mon, 23 Dec 2019, Namhyung Kim wrote:
>
> > This work is to improve cgroup profiling in perf. Currently it only
> > supports profiling tasks in a specific cgroup and there's no way to
> > identify which cgroup the current sample belongs to. So I added
> > PERF_SAMPLE_CGROUP to add cgroup id into each sample. It's a 64-bit
> > integer having file handle of the cgroup. And kernel also generates
> > PERF_RECORD_CGROUP event for new groups to correlate the cgroup id and
> > cgroup name (path in the cgroup filesystem). The cgroup id can be
> > read from userspace by name_to_handle_at() system call so it can
> > synthesize the CGROUP event for existing groups.
>
> so is there a patch to the manpage that describes this new behavior in
> perf_event_open()?

Not yet. I'll cook a patch once it's merged into Linus' tree.

Thanks
Namhyung

2019-12-26 12:47:55

by Arnaldo Carvalho de Melo

Subject: Re: [PATCHSET 0/9] perf: Improve cgroup profiling (v3)

Em Tue, Dec 24, 2019 at 09:40:04AM +0900, Namhyung Kim escreveu:
> On Tue, Dec 24, 2019 at 2:35 AM Vince Weaver <[email protected]> wrote:
> > On Mon, 23 Dec 2019, Namhyung Kim wrote:
> > > This work is to improve cgroup profiling in perf. Currently it only
> > > supports profiling tasks in a specific cgroup and there's no way to
> > > identify which cgroup the current sample belongs to. So I added
> > > PERF_SAMPLE_CGROUP to add cgroup id into each sample. It's a 64-bit
> > > integer having file handle of the cgroup. And kernel also generates
> > > PERF_RECORD_CGROUP event for new groups to correlate the cgroup id and
> > > cgroup name (path in the cgroup filesystem). The cgroup id can be
> > > read from userspace by name_to_handle_at() system call so it can
> > > synthesize the CGROUP event for existing groups.

> > so is there a patch to the manpage that describes this new behavior in
> > perf_event_open()?

> Not yet. I'll cook a patch once it's merged into Linus' tree.

Vince, was it ever considered to carry the man page in the kernel
sources and then make it so that new features need to come with the
respective changes to the man page? I think that would be a good move,
you would be the maintainer for that file, what do you think?

- Arnaldo

2019-12-27 18:32:46

by Vince Weaver

Subject: Re: [PATCHSET 0/9] perf: Improve cgroup profiling (v3)

On Thu, 26 Dec 2019, Arnaldo Carvalho de Melo wrote:

> Em Tue, Dec 24, 2019 at 09:40:04AM +0900, Namhyung Kim escreveu:
> > On Tue, Dec 24, 2019 at 2:35 AM Vince Weaver <[email protected]> wrote:
> > > On Mon, 23 Dec 2019, Namhyung Kim wrote:
> > > > This work is to improve cgroup profiling in perf. Currently it only
> > > > supports profiling tasks in a specific cgroup and there's no way to
> > > > identify which cgroup the current sample belongs to. So I added
> > > > PERF_SAMPLE_CGROUP to add cgroup id into each sample. It's a 64-bit
> > > > integer having file handle of the cgroup. And kernel also generates
> > > > PERF_RECORD_CGROUP event for new groups to correlate the cgroup id and
> > > > cgroup name (path in the cgroup filesystem). The cgroup id can be
> > > > read from userspace by name_to_handle_at() system call so it can
> > > > synthesize the CGROUP event for existing groups.
>
> > > so is there a patch to the manpage that describes this new behavior in
> > > perf_event_open()?
>
> > Not yet. I'll cook a patch once it's merged into Linus' tree.
>
> Vince, was it ever considered to carry the man page in the kernel
> sources and then make it so that new features need to come with the
> respective changes to the man page? I think that would be a good move,
> you would be the maintainer for that file, what do you think?

While I do a lot of work on the perf_event_open() manpage, it's part of
the linux man-pages project so I don't really control where it is
maintained.

I personally do not think merging it into the kernel tree would help much.
I still think the idea of moving everything into linux-git (such as the
"perf" tool) isn't always the best idea and can make it harder for people
who aren't kernel developers to work on things.

Vince