2020-07-08 15:20:37

by Alexander Shishkin

Subject: [PATCH 0/2] perf: Allow closing siblings' file descriptors

Hi guys,

I've been looking at reducing the number of open file descriptors per perf
session. If we retain one descriptor per event, in a large group they add
up. At the same time, we're not actually using them for anything after the
SET_OUTPUT and maybe SET_FILTER ioctls. So, this series is a stab at that.

So, I added a new flag to perf_event_open() that tells perf to keep
the event around after its file descriptor is closed, for as long as its
group leader is alive. Since this is new behavior, the flag is opt-in.

I also hacked this into the perf tool (mostly perf record, but I'll hack
up perf stat as well if this general approach is agreeable).

Alexander Shishkin (2):
perf: Add closing sibling events' file descriptors
perf record: Support closing siblings' file descriptors

include/linux/perf_event.h | 7 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 149 +++++++++++++++++-------
tools/include/uapi/linux/perf_event.h | 1 +
tools/lib/perf/evlist.c | 30 ++++-
tools/lib/perf/evsel.c | 21 ++++
tools/lib/perf/include/internal/evsel.h | 4 +
tools/perf/builtin-record.c | 48 ++++++--
tools/perf/util/cpumap.c | 4 +
tools/perf/util/evlist.c | 4 +-
tools/perf/util/evsel.c | 17 ++-
tools/perf/util/evsel.h | 3 +
12 files changed, 234 insertions(+), 55 deletions(-)

--
2.27.0


2020-07-08 15:21:25

by Alexander Shishkin

Subject: [PATCH 2/2] perf record: Support closing siblings' file descriptors

This changes perf record to close siblings' file descriptors after they
are created, to reduce the number of open files. The trick is setting up
mappings and applying ioctls to these events before they get closed, which
means a slight rework to allow these steps to happen in the same loop as
evsel creation.

Before:

> $ perf record -e '{dummy,syscalls:*}' -o before.data uname
> Error:
> Too many events are opened.
> Probably the maximum number of open file descriptors has been reached.
> Hint: Try again after reducing the number of events.
> Hint: Try increasing the limit with 'ulimit -n <limit>'

After:

> $ perf record -e '{dummy,syscalls:*}' -o after.data uname
> Couldn't synthesize cgroup events.
> Linux
> [ perf record: Woken up 5 times to write data ]
> [ perf record: Captured and wrote 0.118 MB after.data (62 samples) ]

Signed-off-by: Alexander Shishkin <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 1 +
tools/lib/perf/evlist.c | 30 +++++++++++++++-
tools/lib/perf/evsel.c | 21 +++++++++++
tools/lib/perf/include/internal/evsel.h | 4 +++
tools/perf/builtin-record.c | 48 +++++++++++++++++++------
tools/perf/util/cpumap.c | 4 +++
tools/perf/util/evlist.c | 4 ++-
tools/perf/util/evsel.c | 17 ++++++++-
tools/perf/util/evsel.h | 3 ++
9 files changed, 119 insertions(+), 13 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 7b2d6fc9e6ed..90238b8ee2ae 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1069,6 +1069,7 @@ enum perf_callchain_context {
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
#define PERF_FLAG_PID_CGROUP (1UL << 2) /* pid=cgroup id, per-cpu mode only */
#define PERF_FLAG_FD_CLOEXEC (1UL << 3) /* O_CLOEXEC */
+#define PERF_FLAG_ALLOW_CLOSE (1UL << 4) /* retain the event past fd close */

#if defined(__LITTLE_ENDIAN_BITFIELD)
union perf_mem_data_src {
diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 6a875a0f01bb..8aeb5634bc61 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -403,6 +403,7 @@ perf_evlist__mmap_cb_get(struct perf_evlist *evlist, bool overwrite, int idx)
}

#define FD(e, x, y) (*(int *) xyarray__entry(e->fd, x, y))
+#define OUTPUT(e, x, y) (*(int *)xyarray__entry(e->output, x, y))

static int
perf_evlist__mmap_cb_mmap(struct perf_mmap *map, struct perf_mmap_param *mp,
@@ -433,6 +434,11 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
bool overwrite = evsel->attr.write_backward;
struct perf_mmap *map;
int *output, fd, cpu;
+ bool do_close = false;
+
+ /* Skip over not yet opened evsels */
+ if (!evsel->fd)
+ continue;

if (evsel->system_wide && thread)
continue;
@@ -454,6 +460,10 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
}

fd = FD(evsel, cpu, thread);
+ if (OUTPUT(evsel, cpu, thread) >= 0) {
+ *output = OUTPUT(evsel, cpu, thread);
+ continue;
+ }

if (*output == -1) {
*output = fd;
@@ -479,6 +489,10 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
if (!idx)
perf_evlist__set_mmap_first(evlist, map, overwrite);
} else {
+ if (fd < 0) {
+ fd = -fd;
+ do_close = true;
+ }
if (ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, *output) != 0)
return -1;

@@ -487,7 +501,7 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,

revent = !overwrite ? POLLIN : 0;

- if (!evsel->system_wide &&
+ if (!evsel->system_wide && !do_close &&
perf_evlist__add_pollfd(evlist, fd, map, revent) < 0) {
perf_mmap__put(map);
return -1;
@@ -500,6 +514,13 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
perf_evlist__set_sid_idx(evlist, evsel, idx, cpu,
thread);
}
+
+ if (do_close) {
+ close(fd);
+ perf_mmap__put(map);
+ FD(evsel, cpu, thread) = -1; /* *output */
+ }
+ OUTPUT(evsel, cpu, thread) = *output;
}

return 0;
@@ -587,10 +608,17 @@ int perf_evlist__mmap_ops(struct perf_evlist *evlist,
evlist->nr_mmaps = perf_evlist__nr_mmaps(evlist);

perf_evlist__for_each_entry(evlist, evsel) {
+ /* Skip over not yet opened evsels */
+ if (!evsel->fd)
+ continue;
+
if ((evsel->attr.read_format & PERF_FORMAT_ID) &&
evsel->sample_id == NULL &&
perf_evsel__alloc_id(evsel, perf_cpu_map__nr(cpus), threads->nr) < 0)
return -ENOMEM;
+ if (evsel->output == NULL &&
+ perf_evsel__alloc_output(evsel, perf_cpu_map__nr(cpus), threads->nr) < 0)
+ return -ENOMEM;
}

if (evlist->pollfd.entries == NULL && perf_evlist__alloc_pollfd(evlist) < 0)
diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index 4dc06289f4c7..6661145ca673 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -55,6 +55,21 @@ int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
return evsel->fd != NULL ? 0 : -ENOMEM;
}

+int perf_evsel__alloc_output(struct perf_evsel *evsel, int ncpus, int nthreads)
+{
+ evsel->output = xyarray__new(ncpus, nthreads, sizeof(int));
+
+ if (evsel->output) {
+ int cpu, thread;
+ for (cpu = 0; cpu < ncpus; cpu++) {
+ for (thread = 0; thread < nthreads; thread++)
+ *(int *)xyarray__entry(evsel->output, cpu, thread) = -1;
+ }
+ }
+
+ return evsel->output != NULL ? 0 : -ENOMEM;
+}
+
static int
sys_perf_event_open(struct perf_event_attr *attr,
pid_t pid, int cpu, int group_fd,
@@ -139,6 +154,12 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
evsel->fd = NULL;
}

+void perf_evsel__free_output(struct perf_evsel *evsel)
+{
+ xyarray__delete(evsel->output);
+ evsel->output = NULL;
+}
+
void perf_evsel__close(struct perf_evsel *evsel)
{
if (evsel->fd == NULL)
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index 1ffd083b235e..9bbbec7d92b1 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -42,6 +42,7 @@ struct perf_evsel {
struct perf_thread_map *threads;
struct xyarray *fd;
struct xyarray *sample_id;
+ struct xyarray *output;
u64 *id;
u32 ids;

@@ -60,4 +61,7 @@ int perf_evsel__apply_filter(struct perf_evsel *evsel, const char *filter);
int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads);
void perf_evsel__free_id(struct perf_evsel *evsel);

+int perf_evsel__alloc_output(struct perf_evsel *evsel, int ncpus, int nthreads);
+void perf_evsel__free_output(struct perf_evsel *evsel);
+
#endif /* __LIBPERF_INTERNAL_EVSEL_H */
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index e108d90ae2ed..7dd7db4ff37a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -837,6 +837,21 @@ static int record__mmap(struct record *rec)
return record__mmap_evlist(rec, rec->evlist);
}

+static int record__apply_filters(struct record *rec)
+{
+ struct evsel *pos;
+ char msg[BUFSIZ];
+
+ if (perf_evlist__apply_filters(rec->evlist, &pos)) {
+ pr_err("failed to set filter \"%s\" on event %s with %d (%s)\n",
+ pos->filter, evsel__name(pos), errno,
+ str_error_r(errno, msg, sizeof(msg)));
+ return -1;
+ }
+
+ return 0;
+}
+
static int record__open(struct record *rec)
{
char msg[BUFSIZ];
@@ -844,6 +859,7 @@ static int record__open(struct record *rec)
struct evlist *evlist = rec->evlist;
struct perf_session *session = rec->session;
struct record_opts *opts = &rec->opts;
+ bool late_mmap = true;
int rc = 0;

/*
@@ -894,6 +910,21 @@ static int record__open(struct record *rec)
}

pos->supported = true;
+
+ if (pos->allow_close) {
+ rc = record__mmap(rec);
+ if (rc)
+ goto out;
+ rc = record__apply_filters(rec);
+ if (rc)
+ goto out;
+ late_mmap = false;
+ }
+ }
+
+ evlist__for_each_entry(evlist, pos) {
+ if (pos->core.output)
+ perf_evsel__free_output(&pos->core);
}

if (symbol_conf.kptr_restrict && !perf_evlist__exclude_kernel(evlist)) {
@@ -907,18 +938,15 @@ static int record__open(struct record *rec)
"even with a suitable vmlinux or kallsyms file.\n\n");
}

- if (perf_evlist__apply_filters(evlist, &pos)) {
- pr_err("failed to set filter \"%s\" on event %s with %d (%s)\n",
- pos->filter, evsel__name(pos), errno,
- str_error_r(errno, msg, sizeof(msg)));
- rc = -1;
- goto out;
+ if (late_mmap) {
+ rc = record__mmap(rec);
+ if (rc)
+ goto out;
+ rc = record__apply_filters(rec);
+ if (rc)
+ goto out;
}

- rc = record__mmap(rec);
- if (rc)
- goto out;
-
session->evlist = evlist;
perf_session__set_id_hdr_size(session);
out:
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index dc5c5e6fc502..2eac58ada5ab 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -432,6 +432,10 @@ int cpu__setup_cpunode_map(void)
const char *mnt;
int n;

+ /* Only do this once */
+ if (cpunode_map)
+ return 0;
+
/* initialize globals */
if (init_cpunode_map())
return -1;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 173b4f0e0e6e..776a8339df99 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -977,7 +977,7 @@ int perf_evlist__apply_filters(struct evlist *evlist, struct evsel **err_evsel)
int err = 0;

evlist__for_each_entry(evlist, evsel) {
- if (evsel->filter == NULL)
+ if (evsel->filter == NULL || evsel->filter_applied)
continue;

/*
@@ -989,6 +989,8 @@ int perf_evlist__apply_filters(struct evlist *evlist, struct evsel **err_evsel)
*err_evsel = evsel;
break;
}
+
+ evsel->filter_applied = true;
}

return err;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 96e5171dce41..696c5d7190e2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1172,6 +1172,7 @@ int evsel__set_filter(struct evsel *evsel, const char *filter)
if (new_filter != NULL) {
free(evsel->filter);
evsel->filter = new_filter;
+ evsel->filter_applied = false;
return 0;
}

@@ -1632,6 +1633,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
pid = evsel->cgrp->fd;
}

+ if (evsel->leader != evsel)
+ flags |= PERF_FLAG_ALLOW_CLOSE;
+
fallback_missing_features:
if (perf_missing_features.clockid_wrong)
evsel->core.attr.clockid = CLOCK_MONOTONIC; /* should always work */
@@ -1641,6 +1645,8 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
}
if (perf_missing_features.cloexec)
flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
+ if (perf_missing_features.allow_close)
+ flags &= ~(unsigned long)PERF_FLAG_ALLOW_CLOSE;
if (perf_missing_features.mmap2)
evsel->core.attr.mmap2 = 0;
if (perf_missing_features.exclude_guest)
@@ -1729,6 +1735,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
err = -EINVAL;
goto out_close;
}
+
+ if (flags & PERF_FLAG_ALLOW_CLOSE) {
+ FD(evsel, cpu, thread) = -fd;
+ evsel->allow_close = true;
+ }
}
}

@@ -1766,7 +1777,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
+ if (!perf_missing_features.allow_close && (flags & PERF_FLAG_ALLOW_CLOSE)) {
+ perf_missing_features.allow_close = true;
+ pr_debug2("switching off ALLOW_CLOSE support\n");
+ goto fallback_missing_features;
+ } else if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
perf_missing_features.cgroup = true;
pr_debug2_peo("Kernel has no cgroup sampling support, bailing out\n");
goto out_close;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 0f963c2a88a5..076a541bb274 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -77,6 +77,8 @@ struct evsel {
bool forced_leader;
bool use_uncore_alias;
bool is_libpfm_event;
+ bool filter_applied;
+ bool allow_close;
/* parse modifier helper */
int exclude_GH;
int sample_read;
@@ -130,6 +132,7 @@ struct perf_missing_features {
bool aux_output;
bool branch_hw_idx;
bool cgroup;
+ bool allow_close;
};

extern struct perf_missing_features perf_missing_features;
--
2.27.0

2020-07-08 15:22:05

by Alexander Shishkin

Subject: [PATCH 1/2] perf: Add closing sibling events' file descriptors

Currently, perf requires one file descriptor per event. In large groups,
this may mean running into the limit on open file descriptors. However,
the sibling events in a group only need file descriptors for the initial
configuration stage, after which the descriptors may no longer be needed.

This adds an opt-in flag to the perf_event_open() syscall to retain
sibling events after their file descriptors are closed. In that case, the
events themselves are destroyed together with their group leader.

Signed-off-by: Alexander Shishkin <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
---
include/linux/perf_event.h | 7 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 149 +++++++++++++++++++++++---------
3 files changed, 115 insertions(+), 42 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3b22db08b6fb..46666ce2c303 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -623,6 +623,11 @@ struct perf_event {
* either sufficies for read.
*/
struct list_head sibling_list;
+ /*
+ * ALLOW_CLOSED siblings that were actually closed; when the group
+ * leader goes, so should they.
+ */
+ struct list_head closed_list;
struct list_head active_list;
/*
* Node on the pinned or flexible tree located at the event context;
@@ -644,6 +649,8 @@ struct perf_event {
int event_caps;
/* The cumulative AND of all event_caps for events in this group. */
int group_caps;
+ unsigned allow_close : 1,
+ closed : 1;

struct perf_event *group_leader;
struct pmu *pmu;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 52ca2093831c..69823c0e3cbd 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1093,6 +1093,7 @@ enum perf_callchain_context {
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
#define PERF_FLAG_PID_CGROUP (1UL << 2) /* pid=cgroup id, per-cpu mode only */
#define PERF_FLAG_FD_CLOEXEC (1UL << 3) /* O_CLOEXEC */
+#define PERF_FLAG_ALLOW_CLOSE (1UL << 4) /* retain the event past fd close */

#if defined(__LITTLE_ENDIAN_BITFIELD)
union perf_mem_data_src {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c436d705fbd..e61be9cfce98 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -352,7 +352,8 @@ static void event_function_local(struct perf_event *event, event_f func, void *d
#define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
PERF_FLAG_FD_OUTPUT |\
PERF_FLAG_PID_CGROUP |\
- PERF_FLAG_FD_CLOEXEC)
+ PERF_FLAG_FD_CLOEXEC |\
+ PERF_FLAG_ALLOW_CLOSE)

/*
* branch priv levels that need permission checks
@@ -2165,6 +2166,15 @@ static void perf_group_detach(struct perf_event *event)
* to whatever list we are on.
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
+ if (sibling->closed) {
+ list_move(&sibling->sibling_list, &event->closed_list);
+ event->nr_siblings--;
+ continue;
+ }
+
+ /* Proceed as if it was an ordinary sibling */
+ if (sibling->allow_close)
+ sibling->allow_close = 0;

sibling->group_leader = sibling;
list_del_init(&sibling->sibling_list);
@@ -2313,6 +2323,7 @@ __perf_remove_from_context(struct perf_event *event,
void *info)
{
unsigned long flags = (unsigned long)info;
+ struct perf_event *sibling;

if (ctx->is_active & EVENT_TIME) {
update_context_time(ctx);
@@ -2332,6 +2343,10 @@ __perf_remove_from_context(struct perf_event *event,
cpuctx->task_ctx = NULL;
}
}
+
+ flags &= ~DETACH_GROUP;
+ list_for_each_entry(sibling, &event->closed_list, sibling_list)
+ __perf_remove_from_context(sibling, cpuctx, ctx, (void *)flags);
}

/*
@@ -4906,51 +4921,12 @@ static void put_event(struct perf_event *event)
_free_event(event);
}

-/*
- * Kill an event dead; while event:refcount will preserve the event
- * object, it will not preserve its functionality. Once the last 'user'
- * gives up the object, we'll destroy the thing.
- */
-int perf_event_release_kernel(struct perf_event *event)
+static void perf_event_free_children(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
+ struct perf_event_context *ctx;
LIST_HEAD(free_list);

- /*
- * If we got here through err_file: fput(event_file); we will not have
- * attached to a context yet.
- */
- if (!ctx) {
- WARN_ON_ONCE(event->attach_state &
- (PERF_ATTACH_CONTEXT|PERF_ATTACH_GROUP));
- goto no_ctx;
- }
-
- if (!is_kernel_event(event))
- perf_remove_from_owner(event);
-
- ctx = perf_event_ctx_lock(event);
- WARN_ON_ONCE(ctx->parent_ctx);
- perf_remove_from_context(event, DETACH_GROUP);
-
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Mark this event as STATE_DEAD, there is no external reference to it
- * anymore.
- *
- * Anybody acquiring event->child_mutex after the below loop _must_
- * also see this, most importantly inherit_event() which will avoid
- * placing more children on the list.
- *
- * Thus this guarantees that we will in fact observe and kill _ALL_
- * child events.
- */
- event->state = PERF_EVENT_STATE_DEAD;
- raw_spin_unlock_irq(&ctx->lock);
-
- perf_event_ctx_unlock(event, ctx);
-
again:
mutex_lock(&event->child_mutex);
list_for_each_entry(child, &event->child_list, child_list) {
@@ -5016,6 +4992,82 @@ int perf_event_release_kernel(struct perf_event *event)
smp_mb(); /* pairs with wait_var_event() */
wake_up_var(var);
}
+}
+
+/*
+ * Kill an event dead; while event:refcount will preserve the event
+ * object, it will not preserve its functionality. Once the last 'user'
+ * gives up the object, we'll destroy the thing.
+ */
+int perf_event_release_kernel(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct perf_event *sibling;
+
+ /*
+ * If we got here through err_file: fput(event_file); we will not have
+ * attached to a context yet.
+ */
+ if (!ctx) {
+ WARN_ON_ONCE(event->attach_state &
+ (PERF_ATTACH_CONTEXT|PERF_ATTACH_GROUP));
+ goto no_ctx;
+ }
+
+ if (!is_kernel_event(event))
+ perf_remove_from_owner(event);
+
+ ctx = perf_event_ctx_lock(event);
+ WARN_ON_ONCE(ctx->parent_ctx);
+
+ if (event->allow_close && !event->closed) {
+ event->closed = 1;
+ perf_event_ctx_unlock(event, ctx);
+ return 0;
+ }
+
+ /*
+ * The below will also move all closed siblings to the closed_list,
+ * so that we can reap their children and remove them from the owner.
+ */
+ perf_remove_from_context(event, DETACH_GROUP);
+
+ raw_spin_lock_irq(&ctx->lock);
+ /*
+ * Mark this event as STATE_DEAD, there is no external reference to it
+ * anymore.
+ *
+ * Anybody acquiring event->child_mutex after the below loop _must_
+ * also see this, most importantly inherit_event() which will avoid
+ * placing more children on the list. It will also skip over closed
+ * siblings, as they are also going away together with their leader.
+ *
+ * Thus this guarantees that we will in fact observe and kill _ALL_
+ * child events.
+ */
+ event->state = PERF_EVENT_STATE_DEAD;
+ raw_spin_unlock_irq(&ctx->lock);
+
+ perf_event_ctx_unlock(event, ctx);
+
+ perf_event_free_children(event);
+
+ /*
+ * The events on the closed_list are former closed siblings; they
+ * don't have file descriptors, so this is their teardown.
+ */
+ list_for_each_entry(sibling, &event->closed_list, sibling_list) {
+ if (!is_kernel_event(sibling))
+ perf_remove_from_owner(sibling);
+ perf_event_free_children(sibling);
+ /*
+ * The below may be last, or it may be raced for it
+ * by the perf_event_exit_event() path; we can do better
+ * and ensure one way or the other, but it doesn't matter,
+ * the other path is fully equipped to free events.
+ */
+ put_event(sibling);
+ }

no_ctx:
put_event(event); /* Must be the 'last' reference */
@@ -11118,6 +11170,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
+ INIT_LIST_HEAD(&event->closed_list);
INIT_LIST_HEAD(&event->active_list);
init_event_group(event);
INIT_LIST_HEAD(&event->rb_entry);
@@ -11718,6 +11771,14 @@ SYSCALL_DEFINE5(perf_event_open,
}
}

+ if (flags & PERF_FLAG_ALLOW_CLOSE) {
+ if (!group_leader || group_leader->group_leader != group_leader) {
+ err = -EINVAL;
+ goto err_task;
+ }
+ event->allow_close = 1;
+ }
+
/*
* Special case software events and allow them to be part of
* any hardware group.
@@ -12498,6 +12559,10 @@ inherit_event(struct perf_event *parent_event,
if (parent_event->parent)
parent_event = parent_event->parent;

+ /* If group leader is getting closed, ignore its closed siblings */
+ if (!group_leader && parent_event->closed)
+ return NULL;
+
child_event = perf_event_alloc(&parent_event->attr,
parent_event->cpu,
child,
--
2.27.0

2020-07-08 21:10:36

by Jiri Olsa

Subject: Re: [PATCH 2/2] perf record: Support closing siblings' file descriptors

On Wed, Jul 08, 2020 at 06:16:35PM +0300, Alexander Shishkin wrote:
> This changes perf record to close siblings' file descriptors after they
> are created, to reduce the number of open files. The trick is setting up
> mappings and applying ioctls to these guys before they get closed, which
> means a slight rework to allow these things to happen in the same loop as
> evsel creation.
>
> Before:
>
> > $ perf record -e '{dummy,syscalls:*}' -o before.data uname
> > Error:
> > Too many events are opened.
> > Probably the maximum number of open file descriptors has been reached.
> > Hint: Try again after reducing the number of events.
> > Hint: Try increasing the limit with 'ulimit -n <limit>'
>
> After:
>
> > $ perf record -e '{dummy,syscalls:*}' -o after.data uname
> > Couldn't synthesize cgroup events.
> > Linux
> > [ perf record: Woken up 5 times to write data ]
> > [ perf record: Captured and wrote 0.118 MB after.data (62 samples) ]

hi,
really nice ;-) some questions/comments below

thanks,
jirka

>
> Signed-off-by: Alexander Shishkin <[email protected]>
> ---
> tools/include/uapi/linux/perf_event.h | 1 +
> tools/lib/perf/evlist.c | 30 +++++++++++++++-
> tools/lib/perf/evsel.c | 21 +++++++++++
> tools/lib/perf/include/internal/evsel.h | 4 +++
> tools/perf/builtin-record.c | 48 +++++++++++++++++++------
> tools/perf/util/cpumap.c | 4 +++
> tools/perf/util/evlist.c | 4 ++-
> tools/perf/util/evsel.c | 17 ++++++++-
> tools/perf/util/evsel.h | 3 ++
> 9 files changed, 119 insertions(+), 13 deletions(-)
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 7b2d6fc9e6ed..90238b8ee2ae 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1069,6 +1069,7 @@ enum perf_callchain_context {
> #define PERF_FLAG_FD_OUTPUT (1UL << 1)
> #define PERF_FLAG_PID_CGROUP (1UL << 2) /* pid=cgroup id, per-cpu mode only */
> #define PERF_FLAG_FD_CLOEXEC (1UL << 3) /* O_CLOEXEC */
> +#define PERF_FLAG_ALLOW_CLOSE (1UL << 4) /* XXX */
>
> #if defined(__LITTLE_ENDIAN_BITFIELD)
> union perf_mem_data_src {
> diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
> index 6a875a0f01bb..8aeb5634bc61 100644
> --- a/tools/lib/perf/evlist.c
> +++ b/tools/lib/perf/evlist.c
> @@ -403,6 +403,7 @@ perf_evlist__mmap_cb_get(struct perf_evlist *evlist, bool overwrite, int idx)
> }
>
> #define FD(e, x, y) (*(int *) xyarray__entry(e->fd, x, y))
> +#define OUTPUT(e, x, y) (*(int *)xyarray__entry(e->output, x, y))
>
> static int
> perf_evlist__mmap_cb_mmap(struct perf_mmap *map, struct perf_mmap_param *mp,
> @@ -433,6 +434,11 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
> bool overwrite = evsel->attr.write_backward;
> struct perf_mmap *map;
> int *output, fd, cpu;
> + bool do_close = false;
> +
> + /* Skip over not yet opened evsels */
> + if (!evsel->fd)
> + continue;
>
> if (evsel->system_wide && thread)
> continue;
> @@ -454,6 +460,10 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
> }
>
> fd = FD(evsel, cpu, thread);
> + if (OUTPUT(evsel, cpu, thread) >= 0) {
> + *output = OUTPUT(evsel, cpu, thread);
> + continue;
> + }

this check could be earlier in the loop to save some cycles

>
> if (*output == -1) {
> *output = fd;
> @@ -479,6 +489,10 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
> if (!idx)
> perf_evlist__set_mmap_first(evlist, map, overwrite);
> } else {
> + if (fd < 0) {
> + fd = -fd;
> + do_close = true;
> + }
> if (ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, *output) != 0)
> return -1;
>
> @@ -487,7 +501,7 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
>
> revent = !overwrite ? POLLIN : 0;
>
> - if (!evsel->system_wide &&
> + if (!evsel->system_wide && !do_close &&
> perf_evlist__add_pollfd(evlist, fd, map, revent) < 0) {
> perf_mmap__put(map);
> return -1;
> @@ -500,6 +514,13 @@ mmap_per_evsel(struct perf_evlist *evlist, struct perf_evlist_mmap_ops *ops,
> perf_evlist__set_sid_idx(evlist, evsel, idx, cpu,
> thread);
> }
> +
> + if (do_close) {
> + close(fd);
> + perf_mmap__put(map);
> + FD(evsel, cpu, thread) = -1; /* *output */
> + }

it's also possible to define a new callback in perf_evlist_mmap_ops
and do the close there.. not sure it'd be nicer


> + OUTPUT(evsel, cpu, thread) = *output;
> }
>
> return 0;
> @@ -587,10 +608,17 @@ int perf_evlist__mmap_ops(struct perf_evlist *evlist,
> evlist->nr_mmaps = perf_evlist__nr_mmaps(evlist);
>
> perf_evlist__for_each_entry(evlist, evsel) {
> + /* Skip over not yet opened evsels */
> + if (!evsel->fd)
> + continue;
> +
> if ((evsel->attr.read_format & PERF_FORMAT_ID) &&
> evsel->sample_id == NULL &&
> perf_evsel__alloc_id(evsel, perf_cpu_map__nr(cpus), threads->nr) < 0)
> return -ENOMEM;
> + if (evsel->output == NULL &&
> + perf_evsel__alloc_output(evsel, perf_cpu_map__nr(cpus), threads->nr) < 0)
> + return -ENOMEM;
> }
>
> if (evlist->pollfd.entries == NULL && perf_evlist__alloc_pollfd(evlist) < 0)
> diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
> index 4dc06289f4c7..6661145ca673 100644
> --- a/tools/lib/perf/evsel.c
> +++ b/tools/lib/perf/evsel.c
> @@ -55,6 +55,21 @@ int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
> return evsel->fd != NULL ? 0 : -ENOMEM;
> }
>
> +int perf_evsel__alloc_output(struct perf_evsel *evsel, int ncpus, int nthreads)
> +{
> + evsel->output = xyarray__new(ncpus, nthreads, sizeof(int));
> +
> + if (evsel->output) {
> + int cpu, thread;
> + for (cpu = 0; cpu < ncpus; cpu++) {
> + for (thread = 0; thread < nthreads; thread++)
> + *(int *)xyarray__entry(evsel->output, cpu, thread) = -1;
> + }
> + }
> +
> + return evsel->output != NULL ? 0 : -ENOMEM;
> +}
> +
> static int
> sys_perf_event_open(struct perf_event_attr *attr,
> pid_t pid, int cpu, int group_fd,
> @@ -139,6 +154,12 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
> evsel->fd = NULL;
> }
>
> +void perf_evsel__free_output(struct perf_evsel *evsel)
> +{
> + xyarray__delete(evsel->output);
> + evsel->output = NULL;
> +}
> +
> void perf_evsel__close(struct perf_evsel *evsel)
> {
> if (evsel->fd == NULL)
> diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
> index 1ffd083b235e..9bbbec7d92b1 100644
> --- a/tools/lib/perf/include/internal/evsel.h
> +++ b/tools/lib/perf/include/internal/evsel.h
> @@ -42,6 +42,7 @@ struct perf_evsel {
> struct perf_thread_map *threads;
> struct xyarray *fd;
> struct xyarray *sample_id;
> + struct xyarray *output;
> u64 *id;
> u32 ids;
>
> @@ -60,4 +61,7 @@ int perf_evsel__apply_filter(struct perf_evsel *evsel, const char *filter);
> int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads);
> void perf_evsel__free_id(struct perf_evsel *evsel);
>
> +int perf_evsel__alloc_output(struct perf_evsel *evsel, int ncpus, int nthreads);
> +void perf_evsel__free_output(struct perf_evsel *evsel);

introducing these helpers together with the re-entrancy check
should go to separate patch

> +
> #endif /* __LIBPERF_INTERNAL_EVSEL_H */
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index e108d90ae2ed..7dd7db4ff37a 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -837,6 +837,21 @@ static int record__mmap(struct record *rec)
> return record__mmap_evlist(rec, rec->evlist);
> }
>
> +static int record__apply_filters(struct record *rec)
> +{
> + struct evsel *pos;
> + char msg[BUFSIZ];
> +
> + if (perf_evlist__apply_filters(rec->evlist, &pos)) {
> + pr_err("failed to set filter \"%s\" on event %s with %d (%s)\n",
> + pos->filter, evsel__name(pos), errno,
> + str_error_r(errno, msg, sizeof(msg)));
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> static int record__open(struct record *rec)
> {
> char msg[BUFSIZ];
> @@ -844,6 +859,7 @@ static int record__open(struct record *rec)
> struct evlist *evlist = rec->evlist;
> struct perf_session *session = rec->session;
> struct record_opts *opts = &rec->opts;
> + bool late_mmap = true;
> int rc = 0;
>
> /*
> @@ -894,6 +910,21 @@ static int record__open(struct record *rec)
> }
>
> pos->supported = true;
> +
> + if (pos->allow_close) {
> + rc = record__mmap(rec);
> + if (rc)
> + goto out;
> + rc = record__apply_filters(rec);
> + if (rc)
> + goto out;
> + late_mmap = false;
> + }

I was wondering why record__mmap is called for each evsel,
and why there are all those new re-entrancy checks..

because we want to close those file descriptors along the way,
not after all is open.. it could have been mentioned in some
comment ;-)

with all those re-entrancy checks could we call it just from
here for both allow_close and not allow_close cases and skip
the call below?

> + }
> +
> + evlist__for_each_entry(evlist, pos) {
> + if (pos->core.output)
> + perf_evsel__free_output(&pos->core);
> }
>
> if (symbol_conf.kptr_restrict && !perf_evlist__exclude_kernel(evlist)) {
> @@ -907,18 +938,15 @@ static int record__open(struct record *rec)
> "even with a suitable vmlinux or kallsyms file.\n\n");
> }
>
> - if (perf_evlist__apply_filters(evlist, &pos)) {
> - pr_err("failed to set filter \"%s\" on event %s with %d (%s)\n",
> - pos->filter, evsel__name(pos), errno,
> - str_error_r(errno, msg, sizeof(msg)));
> - rc = -1;
> - goto out;
> + if (late_mmap) {
> + rc = record__mmap(rec);
> + if (rc)
> + goto out;
> + rc = record__apply_filters(rec);
> + if (rc)
> + goto out;
> }
>
> - rc = record__mmap(rec);
> - if (rc)
> - goto out;
> -
> session->evlist = evlist;
> perf_session__set_id_hdr_size(session);
> out:
> diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
> index dc5c5e6fc502..2eac58ada5ab 100644
> --- a/tools/perf/util/cpumap.c
> +++ b/tools/perf/util/cpumap.c
> @@ -432,6 +432,10 @@ int cpu__setup_cpunode_map(void)
> const char *mnt;
> int n;
>
> + /* Only do this once */
> + if (cpunode_map)
> + return 0;
> +

this should go to separate patch

> /* initialize globals */
> if (init_cpunode_map())
> return -1;
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 173b4f0e0e6e..776a8339df99 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -977,7 +977,7 @@ int perf_evlist__apply_filters(struct evlist *evlist, struct evsel **err_evsel)
> int err = 0;
>
> evlist__for_each_entry(evlist, evsel) {
> - if (evsel->filter == NULL)
> + if (evsel->filter == NULL || evsel->filter_applied)
> continue;
>
> /*
> @@ -989,6 +989,8 @@ int perf_evlist__apply_filters(struct evlist *evlist, struct evsel **err_evsel)
> *err_evsel = evsel;
> break;
> }
> +
> + evsel->filter_applied = true;
> }
>
> return err;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 96e5171dce41..696c5d7190e2 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1172,6 +1172,7 @@ int evsel__set_filter(struct evsel *evsel, const char *filter)
> if (new_filter != NULL) {
> free(evsel->filter);
> evsel->filter = new_filter;
> + evsel->filter_applied = false;
> return 0;
> }
>
> @@ -1632,6 +1633,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> pid = evsel->cgrp->fd;
> }
>
> + if (evsel->leader != evsel)
> + flags |= PERF_FLAG_ALLOW_CLOSE;
> +
> fallback_missing_features:
> if (perf_missing_features.clockid_wrong)
> evsel->core.attr.clockid = CLOCK_MONOTONIC; /* should always work */
> @@ -1641,6 +1645,8 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> }
> if (perf_missing_features.cloexec)
> flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
> + if (perf_missing_features.allow_close)
> + flags &= ~(unsigned long)PERF_FLAG_ALLOW_CLOSE;
> if (perf_missing_features.mmap2)
> evsel->core.attr.mmap2 = 0;
> if (perf_missing_features.exclude_guest)
> @@ -1729,6 +1735,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> err = -EINVAL;
> goto out_close;
> }
> +
> + if (flags & PERF_FLAG_ALLOW_CLOSE) {
> + FD(evsel, cpu, thread) = -fd;
> + evsel->allow_close = true;

can't we move allow_close to perf_evsel and use that directly
instead of that -fd thing?

> + }
> }
> }
>
> @@ -1766,7 +1777,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> * Must probe features in the order they were added to the
> * perf_event_attr interface.
> */
> - if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
> + if (!perf_missing_features.allow_close && (flags & PERF_FLAG_ALLOW_CLOSE)) {
> + perf_missing_features.allow_close = true;
> + pr_debug2("switching off ALLOW_CLOSE support\n");
> + goto fallback_missing_features;
> + } else if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
> perf_missing_features.cgroup = true;
> pr_debug2_peo("Kernel has no cgroup sampling support, bailing out\n");
> goto out_close;
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 0f963c2a88a5..076a541bb274 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -77,6 +77,8 @@ struct evsel {
> bool forced_leader;
> bool use_uncore_alias;
> bool is_libpfm_event;
> + bool filter_applied;

should go to separate patch

> + bool allow_close;
> /* parse modifier helper */
> int exclude_GH;
> int sample_read;
> @@ -130,6 +132,7 @@ struct perf_missing_features {
> bool aux_output;
> bool branch_hw_idx;
> bool cgroup;
> + bool allow_close;
> };
>
> extern struct perf_missing_features perf_missing_features;
> --
> 2.27.0
>

2020-07-09 08:32:58

by Alexey Budankov

Subject: Re: [PATCH 0/2] perf: Allow closing siblings' file descriptors

Hi Alex,

On 08.07.2020 18:16, Alexander Shishkin wrote:
> Hi guys,
>
> I've been looking at reducing the number of open file descriptors per perf
> session. If we retain one descriptor per event, in a large group they add
> up. At the same time, we're not actually using them for anything after the
> SET_OUTPUT and maybe SET_FILTER ioctls. So, this series is a stab at that.

PERF_EVENT_IOC_ENABLE, PERF_EVENT_IOC_DISABLE ioctls are still assumed to
work, right?

Asking w.r.t. functionality on --control fd:ctl_fd[,ack_fd] option for stat
and record modes [1].

Thanks,
Alexey

[1] https://lore.kernel.org/lkml/[email protected]/

>
> So, I added a new flag to the perf_event_open() that tells perf to keep
> the event around after its file descriptor gets closed, for as long as its
> group leader is alive. Since this is a new behavior, the flag is an opt-in.
>
> I also hacked this into the perf tool (mostly perf record, but I'll hack
> stat as well if this general approach is agreeable).
>
> Alexander Shishkin (2):
> perf: Add closing sibling events' file descriptors
> perf record: Support closing siblings' file descriptors
>
> include/linux/perf_event.h | 7 ++
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 149 +++++++++++++++++-------
> tools/include/uapi/linux/perf_event.h | 1 +
> tools/lib/perf/evlist.c | 30 ++++-
> tools/lib/perf/evsel.c | 21 ++++
> tools/lib/perf/include/internal/evsel.h | 4 +
> tools/perf/builtin-record.c | 48 ++++++--
> tools/perf/util/cpumap.c | 4 +
> tools/perf/util/evlist.c | 4 +-
> tools/perf/util/evsel.c | 17 ++-
> tools/perf/util/evsel.h | 3 +
> 12 files changed, 234 insertions(+), 55 deletions(-)
>

2020-07-16 08:46:27

by Chen, Rong A

Subject: [perf record] d65f0cbbc6: perf-sanity-tests.Setup_struct_perf_event_attr.fail

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d ("[PATCH 2/2] perf record: Support closing siblings' file descriptors")
url: https://github.com/0day-ci/linux/commits/Alexander-Shishkin/perf-Allow-closing-siblings-file-descriptors/20200708-232149
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 78c2141b654375079e3737f06f19cabfc0fcecd7

in testcase: perf-sanity-tests
with following parameters:

perf_compiler: gcc
ucode: 0x28



on test machine: 8 threads Intel(R) Core(TM) i7-4790 v3 @ 3.60GHz with 6G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):




If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>




2020-07-16 04:07:45 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 17
17: Setup struct perf_event_attr : FAILED!
2020-07-16 04:07:47 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 18
18: Match and link multiple hists : Ok
2020-07-16 04:07:47 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 19
19: 'import perf' in python : Ok
2020-07-16 04:07:47 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 20
20: Breakpoint overflow signal handler : Ok
2020-07-16 04:07:47 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 21
21: Breakpoint overflow sampling : Ok
2020-07-16 04:07:47 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 22
22: Breakpoint accounting : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 23
23: Watchpoint :
23.1: Read Only Watchpoint : Skip
23.2: Write Only Watchpoint : Ok
23.3: Read / Write Watchpoint : Ok
23.4: Modify Watchpoint : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 24
24: Number of exit events of a simple workload : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 25
25: Software clock events period values : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 26
26: Object code reading : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 27
27: Sample parsing : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 28
28: Use a dummy software event to keep tracking : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 29
29: Parse with no sample_id_all bit set : Ok
2020-07-16 04:07:48 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-d65f0cbbc6f06d7b0d79fd85e2c32d6ec1b5e61d/tools/perf/perf test 30
30: Filter hist entries : Ok



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml



Thanks,
Rong Chen


Attachments:
(No filename) (3.95 kB)
config-5.8.0-rc3-00023-gd65f0cbbc6f06 (160.77 kB)
job-script (5.43 kB)
kmsg.xz (30.27 kB)
perf-sanity-tests (30.22 kB)
job.yaml (4.40 kB)
reproduce (7.84 kB)

2020-08-06 06:17:24

by Adrian Hunter

Subject: Re: [PATCH 0/2] perf: Allow closing siblings' file descriptors

On 8/07/20 6:16 pm, Alexander Shishkin wrote:
> Hi guys,
>
> I've been looking at reducing the number of open file descriptors per perf
> session. If we retain one descriptor per event, in a large group they add
> up. At the same time, we're not actually using them for anything after the
> SET_OUTPUT and maybe SET_FILTER ioctls. So, this series is a stab at that.

I am wondering if instead we should be looking at creating a kernel API that
allows associating a multitude of tracepoints with a single event. Thoughts
anyone?

2020-08-06 08:38:39

by Peter Zijlstra

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Wed, Jul 08, 2020 at 06:16:34PM +0300, Alexander Shishkin wrote:
> Currently, perf requires one file descriptor per event. In large groups,
> this may mean running into the limit on open file descriptors. However,
> the sibling events in a group only need file descriptors for the initial
> configuration stage, after which they may not be needed any more.
>
> This adds an opt-in flag to the perf_event_open() syscall to retain
> sibling events after their file descriptors are closed. In this case, the
> actual events will be closed with the group leader.

So having the 1:1 relation with filedesc imposes a resource limit on
userspace.

This patch breaks that and enables a user to basically DoS the system by
creating unbound events.

2020-08-06 08:41:05

by Peter Zijlstra

Subject: Re: [PATCH 0/2] perf: Allow closing siblings' file descriptors

On Thu, Aug 06, 2020 at 09:15:08AM +0300, Adrian Hunter wrote:
> On 8/07/20 6:16 pm, Alexander Shishkin wrote:
> > Hi guys,
> >
> > I've been looking at reducing the number of open file descriptors per perf
> > session. If we retain one descriptor per event, in a large group they add
> > up. At the same time, we're not actually using them for anything after the
> > SET_OUTPUT and maybe SET_FILTER ioctls. So, this series is a stab at that.
>
> I am wondering if instead we should be looking at creating a kernel API that
> allows associating a multitude of tracepoints with a single event. Thoughts
> anyone?

https://lkml.kernel.org/r/1290445737.2072.338.camel@laptop

2020-08-06 17:48:50

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

> > This adds an opt-in flag to the perf_event_open() syscall to retain
> > sibling events after their file descriptors are closed. In this case, the
> > actual events will be closed with the group leader.
>
> So having the 1:1 relation with filedesc imposes a resource limit on
> userspace.
>
> This patch breaks that and enables a user to basically DoS the system by
> creating unbound events.

The idea was to account the events in the locked memory allocation too.
Not sure that made it into the patch though.

It has a minor issue that it might break some existing setups that rely
on the mmap fitting exactly into the mmap allocation, but that could
be solved by allowing a little slack, since the existing setups
likely don't have that many events.

There's also a secondary issue of DoS the kernel by creating very long
lists to iterate, but I suppose this is already quite possible, so probably
not a new issue.

-Andi

2020-08-10 13:59:09

by Alexander Shishkin

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

Andi Kleen <[email protected]> writes:

>> > This adds an opt-in flag to the perf_event_open() syscall to retain
>> > sibling events after their file descriptors are closed. In this case, the
>> > actual events will be closed with the group leader.
>>
>> So having the 1:1 relation with filedesc imposes a resource limit on
>> userspace.
>>
>> This patch breaks that and enables a user to basically DoS the system by
>> creating unbound events.
>
> The idea was to account the events in the locked memory allocation too.
> Not sure that made it into the patch though.

It didn't. I can't figure out what to charge on the locked memory, as
all that memory is in kernel-side objects. It also needs to make sense
as iirc the default MLOCK_LIMIT is quite low, you'd hit it sooner than
the file descriptor limit.

> It has a minor issue that it might break some existing setups that rely
> on the mmap fitting exactly into the mmap allocation, but that could
> be solved by allowing a little slack, since the existing setups
> likely don't have that many events.

I don't see how to make this work in a sane way. Besides, if we have to
have a limit anyway, sticking with the existing one is just easier and
1:1 is kind of more logical.

Regards,
--
Alex

2020-08-10 17:52:01

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

> It didn't. I can't figure out what to charge on the locked memory, as
> all that memory is in kernel-side objects. It also needs to make sense

I don't see how that makes a difference for the count. It just accounts
bytes. Can you elaborate?

> as iirc the default MLOCK_LIMIT is quite low, you'd hit it sooner than
> the file descriptor limit.

For a single process?

>
> > It has a minor issue that it might break some existing setups that rely
> > on the mmap fitting exactly into the mmap allocation, but that could
> > be solved by allowing a little slack, since the existing setups
> > likely don't have that many events.
>
> I don't see how to make this work in a sane way. Besides, if we have to
> have a limit anyway, sticking with the existing one is just easier and
> 1:1 is kind of more logical.

It's just a very wasteful way because we need an extra inode and file descriptor
for each event*cgroup.

And of course it's super user unfriendly because managing the fd limits
is a pain, apart from them not really working that well anyways
(since someone who really wants to do a memory DoS can still open
RLIMIT_NPROC*RLIMIT_NFILE fds just by forking)

Unfortunately we're kind of stuck with the old NFILE=1024 default
even though it makes little sense on modern servers. Right now a lot
of very reasonable perf stat command lines don't work out of the box
on larger machines because of this (and since core counts are growing
all the time, the "larger machines" of today are the standard servers
of tomorrow).

Maybe an alternative would be to allow a multiplier. For each open fd
you can have N perf events. With N being a little higher than the current
cost of the inode + file descriptor together.

But would really prefer to have some kind of limit per uid that is
actually sane.

-Andi

2020-08-10 20:38:30

by Peter Zijlstra

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:

> Unfortunately we're kind of stuck with the old NFILE=1024 default
> even though it makes little sense on modern servers.

Why can't that be changed? It seems to me all of userspace changes all
the time; heck that system-doofus thing flushed 20+ years of sysadmin
experience down the drain, just cause. Why can't we up a file limit?

2020-08-11 08:16:27

by Alexey Budankov

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors


On 10.08.2020 23:36, Peter Zijlstra wrote:
> On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:
>
>> Unfortunately we're kind of stuck with the old NFILE=1024 default
>> even though it makes little sense on modern servers.
>
> Why can't that be changed? It seems to me all of userspace changes all
> the time; heck that system-doofus thing flushed 20+ years of sysadmin
> experience down the drain, just cause. Why can't we up a file limit?
>

If you asked me, I would say that
Intel VTune HW microarchitecture analysis would also be unblocked
by pushing the RLIMIT_NOFILE limit beyond its current 1K default.

Currently the rule of thumb for VTune users is to set the limit to
#cores x #events. The recommended value is 64K [1].

BR,
Alexei

[1] https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/profiling-hardware-without-sampling-drivers.html

2020-08-11 09:51:24

by Alexander Shishkin

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

Andi Kleen <[email protected]> writes:

>> It didn't. I can't figure out what to charge on the locked memory, as
>> all that memory is in kernel-side objects. It also needs to make sense
>
> I don't see how that makes a difference for the count. It just accounts
> bytes. Can you elaborate?

Right, but which bytes? One byte per event? That's
arbitrary. sizeof(struct perf_event)? Then, probably also sizeof(struct
perf_event_context).

>> as iirc the default MLOCK_LIMIT is quite low, you'd hit it sooner than
>> the file descriptor limit.
>
> For a single process?

The above two structs add up to 2288 bytes on my local build. Given the
default RLIMIT_MEMLOCK of 64k, that's 28 events. As opposed to ~1k
events if we keep using the RLIMIT_NOFILE. Unless I'm missing your
point.

Regards,
--
Alex

2020-08-11 14:38:00

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Tue, Aug 11, 2020 at 12:47:24PM +0300, Alexander Shishkin wrote:
> Andi Kleen <[email protected]> writes:
>
> >> It didn't. I can't figure out what to charge on the locked memory, as
> >> all that memory is in kernel-side objects. It also needs to make sense
> >
> > I don't see how that makes a difference for the count. It just accounts
> > bytes. Can you elaborate?
>
> Right, but which bytes? One byte per event? That's
> arbitrary. sizeof(struct perf_event)? Then, probably also sizeof(struct
> perf_event_context).

Yes the sum of all the sizeofs needed for a perf_event.

>
> >> as iirc the default MLOCK_LIMIT is quite low, you'd hit it sooner than
> >> the file descriptor limit.
> >
> > For a single process?
>
> The above two structs add up to 2288 bytes on my local build. Given the
> default RLIMIT_MEMLOCK of 64k, that's 28 events. As opposed to ~1k
> events if we keep using the RLIMIT_NOFILE. Unless I'm missing your
> point.

Yes that's true. We would probably need to increase the limit to a few
MB at least.

Or maybe use some combination with the old rlimit for compatibility.
The old rlimit would give an implicit extra RLIMIT_NFILE * 2288 limit
for RLIMIT_MEMLOCK. This would only give full compatibility for a single
perf process, but I suspect that's good enough for most users.

-Andi

2020-08-11 14:43:53

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Mon, Aug 10, 2020 at 10:36:32PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:
>
> > Unfortunately we're kind of stuck with the old NFILE=1024 default
> > even though it makes little sense on modern servers.
>
> Why can't that be changed? It seems to me all of userspace changes all
> the time; heck that system-doofus thing flushed 20+ years of sysadmin
> experience down the drain, just cause. Why can't we up a file limit?

We could try, but I believe it's hard coded in various places outside
the kernel.

But this wouldn't solve the problem of the unnecessary overhead of
the inode + file descriptor per event that the patch was trying
to address. It would just waste more memory.

And increasing it makes the worst-case memory consumption you complained
about even worse, because the real limit per uid is
RLIMIT_NPROC*RLIMIT_NFILE. Of course NPROC needs to be large enough too,
so the total just becomes bigger and bigger.

Ultimatively you need another accounting mechanism that actually
works per uid without extra factors.

Now perf already has one so it's still unclear to me why we can't use it.

-Andi

2020-08-11 14:49:56

by David Laight

Subject: RE: [PATCH 1/2] perf: Add closing sibling events' file descriptors

From: Andi Kleen
> On Mon, Aug 10, 2020 at 10:36:32PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:
> >
> > > Unfortunately we're kind of stuck with the old NFILE=1024 default
> > > even though it makes little sense on modern servers.
> >
> > Why can't that be changed? It seems to me all of userspace changes all
> > the time; heck that system-doofus thing flushed 20+ years of sysadmin
> > experience down the drain, just cause. Why can't we up a file limit?
>
> We could try, but I believe it's hard coded in various places outside
> the kernel.

The place it really bites is select().
Although the kernel supports large bitmaps glibc doesn't.
The random bit overwrites are a PITA to debug.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-08-11 16:23:54

by Alexander Shishkin

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

Andi Kleen <[email protected]> writes:

> On Tue, Aug 11, 2020 at 12:47:24PM +0300, Alexander Shishkin wrote:
>>
>> Right, but which bytes? One byte per event? That's
>> arbitrary. sizeof(struct perf_event)? Then, probably also sizeof(struct
>> perf_event_context).
>
> Yes the sum of all the sizeofs needed for a perf_event.

Well, *all* of them will be tedious to collect, seeing as there is
ctx->task_ctx_data, there is ring_buffer, scheduling trees, there is
stuff that pmus allocate under the hood, like AUX SG tables.

>> The above two structs add up to 2288 bytes on my local build. Given the
>> default RLIMIT_MEMLOCK of 64k, that's 28 events. As opposed to ~1k
>> events if we keep using the RLIMIT_NOFILE. Unless I'm missing your
>> point.
>
> Yes that's true. We would probably need to increase the limit to a few
> MB at least.

Ok, but if we have to increase a limit anyway, we might as well increase
the NOFILE.

> Or maybe use some combination with the old rlimit for compatibility.
> The old rlimit would give an implicit extra RLIMIT_NFILE * 2288 limit
> for RLIMIT_MEMLOCK. This would only give full compatibility for a single
> perf process, but I suspect that's good enough for most users.

We'd need to settle on charging a fixed set of structures per event,
then. And, without increasing the file limit, this would still total at
1052 events.

We could also involve perf_event_mlock_kb *and* increase it too, but I
suspect distros don't just leave it at kernel's default either.

Regards,
--
Alex

2020-08-11 18:08:23

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Tue, Aug 11, 2020 at 02:47:03PM +0000, David Laight wrote:
> From: Andi Kleen
> > On Mon, Aug 10, 2020 at 10:36:32PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:
> > >
> > > > Unfortunately we're kind of stuck with the old NFILE=1024 default
> > > > even though it makes little sense on modern servers.
> > >
> > > Why can't that be changed? It seems to me all of userspace changes all
> > > the time; heck that system-doofus thing flushed 20+ years of sysadmin
> > > experience down the drain, just cause. Why can't we up a file limit?
> >
> > We could try, but I believe it's hard coded in various places outside
> > the kernel.
>
> The place it really bites is select().
> Although the kernel supports large bitmaps glibc doesn't.
> The random bit overwrites are a PITA to debug.

Good point.

I remember I asked for a simple define to increase like glibc5 had, but
they still didn't implement that.

-Andi

2020-08-11 21:09:29

by David Laight

Subject: RE: [PATCH 1/2] perf: Add closing sibling events' file descriptors

> On Tue, Aug 11, 2020 at 02:47:03PM +0000, David Laight wrote:
> > From: Andi Kleen
> > > On Mon, Aug 10, 2020 at 10:36:32PM +0200, Peter Zijlstra wrote:
> > > > On Mon, Aug 10, 2020 at 07:45:18AM -0700, Andi Kleen wrote:
> > > >
> > > > > Unfortunately we're kind of stuck with the old NFILE=1024 default
> > > > > even though it makes little sense on modern servers.
> > > >
> > > > Why can't that be changed? It seems to me all of userspace changes all
> > > > the time; heck that system-doofus thing flushed 20+ years of sysadmin
> > > > experience down the drain, just cause. Why can't we up a file limit?
> > >
> > > We could try, but I believe it's hard coded in various places outside
> > > the kernel.
> >
> > The place it really bites is select().
> > Although the kernel supports large bitmaps glibc doesn't.
> > The random bit overwrites are a PITA to debug.
>
> Good point.
>
> I remember I asked for a simple define to increase like glibc5 had, but
> they still didn't implement that.

Even Windows lets you redefine FD_SETSIZE before including the header.
And their select() is much more like poll() - so the limit is the number
of sockets, not the highest socket number (which do happen to be small
integers).

David


2020-08-11 22:19:25

by Andi Kleen

Subject: Re: [PATCH 1/2] perf: Add closing sibling events' file descriptors

On Tue, Aug 11, 2020 at 07:21:13PM +0300, Alexander Shishkin wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Tue, Aug 11, 2020 at 12:47:24PM +0300, Alexander Shishkin wrote:
> >>
> >> Right, but which bytes? One byte per event? That's
> >> arbitrary. sizeof(struct perf_event)? Then, probably also sizeof(struct
> >> perf_event_context).
> >
> > Yes the sum of all the sizeofs needed for a perf_event.
>
> Well, *all* of them will be tedious to collect, seeing as there is
> ctx->task_ctx_data, there is ring_buffer, scheduling trees, there is
> stuff that pmus allocate under the hood, like AUX SG tables.

Well I'm sure we can figure something out. I guess it doesn't need to be
fully accurate, just approximate enough, and be bounded.

>
> >> The above two structs add up to 2288 bytes on my local build. Given the
> >> default RLIMIT_MEMLOCK of 64k, that's 28 events. As opposed to ~1k
> >> events if we keep using the RLIMIT_NOFILE. Unless I'm missing your
> >> point.
> >
> > Yes that's true. We would probably need to increase the limit to a few
> > MB at least.
>
> Ok, but if we have to increase a limit anyway, we might as well increase
> the NOFILE.

NFILE is a terrible limit because the effective DoS bound is really
some large factor * NFILE. Also I suspect there will be many cases
where the kernel default is not used.

But yes I suspect it should be increased, not just for perf, but
for other use cases. AFAIK pretty much every non trivial network
server has to change it.

>
> > Or maybe use some combination with the old rlimit for compatibility.
> > The old rlimit would give an implicit extra RLIMIT_NFILE * 2288 limit
> > for RLIMIT_MEMLOCK. This would only give full compatibility for a single
> > perf process, but I suspect that's good enough for most users.
>
> We'd need to settle on charging a fixed set of structures per event,
> then. And, without increasing the file limit, this would still total at
> 1052 events.

True. For perf we really would like a limit that scales with the number
of CPUs.

>
> We could also involve perf_event_mlock_kb *and* increase it too, but I
> suspect distros don't just leave it at kernel's default either.

I haven't seen any distribution that changed it so far.

-Andi