This patch set implements the necessary kernel changes for persistent
events.
Persistent events run standalone in the system without the need of a
controlling process that holds an event's file descriptor. The events
are always enabled and collect data samples in a ring buffer.
Processes may connect to existing persistent events using the
perf_event_open() syscall. For this, the syscall must be called with
the new PERF_TYPE_PERSISTENT event type and the event's unique
identifier in attr.config. The id is published in sysfs or returned
by an ioctl (see below).
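As a rough userspace illustration of the paragraph above, the sketch
below fills a perf_event_attr for an existing persistent event and
opens it on one cpu. It is only a sketch: SKETCH_PERF_TYPE_PERSISTENT
and both helper names are hypothetical, the value 6 is taken from this
patch set and is not part of mainline headers, and on a kernel without
these patches the syscall is expected to fail.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: value proposed by this patch set, absent from mainline. */
#define SKETCH_PERF_TYPE_PERSISTENT 6

/* Fill attr so that it refers to an existing persistent event. */
void persistent_attr_init(struct perf_event_attr *attr, uint64_t id)
{
	assert(id != 0);	/* the id allocator in this series starts at 1 */
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = SKETCH_PERF_TYPE_PERSISTENT;
	attr->config = id;	/* unique event id, e.g. read from sysfs */
}

/* Returns an fd, or -1 with errno set. */
long persistent_event_open(uint64_t id, int cpu)
{
	struct perf_event_attr attr;

	persistent_attr_init(&attr, id);
	/* pid = -1 selects a per-cpu event; no group leader, no flags */
	return syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}
```

For the mce_record example further down, a caller would pass id 106
and the cpu whose buffer it wants to read.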
Persistent event buffers may be accessed with mmap() in the same way
as for any other event. Since the buffers may be used by multiple
processes at the same time, there is only read-only access to them.
Currently there is only support for per-cpu events, thus root access
is needed too.
Persistent events are visible in sysfs. They are added or removed
dynamically. From the information in sysfs, userland knows how to
set up the perf_event attribute of a persistent event. Since a
persistent event always has the persistent flag set, a way is needed
to express this in sysfs. A new syntax is used for this. With
'attr<num>:<mask>' any bit in the attribute structure may be set in a
similar way as using 'config<num>', but <num> is an index that points
to the u64 value to change within the attribute.
For persistent events the persistent flag (bit 23 of the flags field in
struct perf_event_attr) needs to be set, which is expressed in sysfs
with "attr5:23". E.g. the mce_record event is described in sysfs as
follows:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
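To make the 'attr<num>:<mask>' semantics concrete, here is a minimal
sketch of how a tool might apply such a format term. It treats the
attribute as a plain array of u64 words; the function name and the
lo/hi range handling are assumptions for illustration, not the perf
tools implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store `value` into bits lo..hi (inclusive) of the num-th u64 word of
 * an attribute that is viewed as an array of u64 values.
 */
void attr_set_bits(uint64_t *words, unsigned int num,
		   unsigned int lo, unsigned int hi, uint64_t value)
{
	uint64_t width_mask;

	assert(lo <= hi && hi < 64);
	/* avoid an undefined 64-bit shift when the range covers the word */
	width_mask = (hi - lo == 63) ? ~0ULL : ((1ULL << (hi - lo + 1)) - 1);
	words[num] &= ~(width_mask << lo);
	words[num] |= (value & width_mask) << lo;
}
```

With this, "attr5:23" corresponds to attr_set_bits(words, 5, 23, 23, 1),
which sets only bit 23 of the sixth u64 word.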
Note that perf tools need to support the 'attr<num>' syntax, which is
added in a separate patch set. With it we are able to run perf tool
commands to read persistent events, e.g.:
# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/
In general, the new syntax is flexible enough to describe in sysfs any
event to be set up by perf tools.
There are ioctl functions to control persistent events. They can be
used to detach an event from a process or attach it to one. The
PERF_EVENT_IOC_DETACH ioctl call makes an event persistent. The
perf_event_open() syscall can be used to re-open the event by any
process. The PERF_EVENT_IOC_ATTACH ioctl attaches the event again so
that it is removed after closing the event's fd.
The patches are based on the original work by Borislav Petkov.
This version 3 of the patch set is a complete rework of the code.
There are the following major changes:
* new event type PERF_TYPE_PERSISTENT introduced,
* support for all types of events,
* unique event ids,
* improvements in reference counting and locking,
* ioctl functions are added to control persistency,
* the sysfs implementation now uses variable list size.
This should address most issues discussed during the last review of
version 2. The following is still unresolved and can be added later on
top of these patches, if necessary:
* support for per-task events (also allowing non-root access),
* creation of persistent events for disabled cpus,
* make event persistent with already open (mmap'ed) buffers,
* make event persistent while creating it.
The first patches contain some rework of the perf mmap code to reuse it
for persistent events.
Also note that patch 12 (ioctl functions to control persistency) is
RFC and untested. A perf tools implementation for this is missing, and
ideas are needed on how it could be integrated, especially into
something like perf trace.
All patches can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v3
Note: I will resend the perf tools patch necessary to use persistent
events.
-Robert
Borislav Petkov (1):
mce, x86: Enable persistent events
Robert Richter (11):
perf, mmap: Factor out ring_buffer_detach_all()
perf, mmap: Factor out try_get_event()/put_event()
perf, mmap: Factor out perf_alloc/free_rb()
perf, mmap: Factor out perf_get_fd()
perf: Add persistent events
perf, persistent: Implementing a persistent pmu
perf, persistent: Exposing persistent events using sysfs
perf, persistent: Use unique event ids
perf, persistent: Implement reference counter for events
perf, persistent: Dynamically resize list of sysfs entries
[RFC] perf, persistent: ioctl functions to control persistency
.../testing/sysfs-bus-event_source-devices-format | 43 +-
arch/x86/kernel/cpu/mcheck/mce.c | 19 +
include/linux/perf_event.h | 12 +-
include/uapi/linux/perf_event.h | 6 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 210 +++++---
kernel/events/internal.h | 20 +
kernel/events/persistent.c | 563 +++++++++++++++++++++
8 files changed, 779 insertions(+), 96 deletions(-)
create mode 100644 kernel/events/persistent.c
--
1.8.3.2
From: Robert Richter <[email protected]>
Implement try_get_event() as the counterpart to put_event(). Put both
in internal.h to make them available to other perf files.
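The get/put pairing can be illustrated outside the kernel with C11
atomics. The sketch below is a userspace analogue, not kernel code:
struct obj and the obj_*() names are hypothetical, and the
compare-exchange loop stands in for atomic_long_inc_not_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_long refcount;
	bool released;		/* set once the last reference is gone */
};

/* Take a reference unless the count already dropped to zero. */
bool obj_try_get(struct obj *o)
{
	long old = atomic_load(&o->refcount);

	while (old != 0) {
		/* on failure, `old` is reloaded with the current value */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Slow path, only reached by whoever drops the last reference. */
void obj_release(struct obj *o)
{
	assert(atomic_load(&o->refcount) == 0);
	o->released = true;
}

/* Drop a reference; the release work runs exactly once. */
void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return;
	obj_release(o);
}
```

Splitting the put into an inline fast path and an out-of-line release
mirrors the put_event()/__put_event() split in the diff below.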
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 9 +++------
kernel/events/internal.h | 12 ++++++++++++
2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5dcc5fe..c9a5d4c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3242,13 +3242,10 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
/*
* Called when the last reference to the file is gone.
*/
-static void put_event(struct perf_event *event)
+void __put_event(struct perf_event *event)
{
struct task_struct *owner;
- if (!atomic_long_dec_and_test(&event->refcount))
- return;
-
rcu_read_lock();
owner = ACCESS_ONCE(event->owner);
/*
@@ -3781,7 +3778,7 @@ static void ring_buffer_detach_all(struct ring_buffer *rb)
again:
rcu_read_lock();
list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
- if (!atomic_long_inc_not_zero(&event->refcount)) {
+ if (!try_get_event(event)) {
/*
* This event is en-route to free_event() which will
* detach it and remove it from the list.
@@ -7445,7 +7442,7 @@ inherit_event(struct perf_event *parent_event,
if (IS_ERR(child_event))
return child_event;
- if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
+ if (!try_get_event(parent_event)) {
free_event(child_event);
return NULL;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ca65997..96a07d2 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -178,4 +178,16 @@ static inline bool arch_perf_have_user_stack_dump(void)
#define perf_user_stack_pointer(regs) 0
#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
+static inline bool try_get_event(struct perf_event *event)
+{
+ return atomic_long_inc_not_zero(&event->refcount) != 0;
+}
+extern void __put_event(struct perf_event *event);
+static inline void put_event(struct perf_event *event)
+{
+ if (!atomic_long_dec_and_test(&event->refcount))
+ return;
+ __put_event(event);
+}
+
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.8.3.2
From: Robert Richter <[email protected]>
Add the needed pieces for persistent events which make them
process-agnostic. Also, make their buffers read-only when mmapping them
from userspace.
Add a barebones implementation for registering persistent events with
perf. For that, we don't destroy the buffers when they're unmapped;
also, we map them read-only so that multiple agents can access them.
Also, we allocate the event buffers at event init time and not at mmap
time so that we can log samples into them regardless of whether there
are readers in userspace or not.
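From the reader's side, this suggests a read-only shared mapping whose
size matches the buffer the kernel already allocated. The sketch below
assumes the 512 KiB default (CPU_BUFFER_NR_PAGES in this patch) plus
the one control page that perf_mmap() expects in front of the data
pages; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* 512 KiB of data pages, mirroring CPU_BUFFER_NR_PAGES in the patch. */
#define SKETCH_BUFFER_BYTES (512 * 1024)

/* Length to pass to mmap(): one control page plus the data pages. */
size_t persistent_mmap_len(size_t page_size)
{
	assert(page_size != 0 && SKETCH_BUFFER_BYTES % page_size == 0);
	return page_size + SKETCH_BUFFER_BYTES;
}

/* Map the buffer of a persistent event fd; MAP_FAILED on error. */
void *persistent_mmap(int fd, size_t page_size)
{
	/* PROT_READ only: the patch returns -EACCES for writable mappings */
	return mmap(NULL, persistent_mmap_len(page_size), PROT_READ,
		    MAP_SHARED, fd, 0);
}
```

A caller would typically pass sysconf(_SC_PAGESIZE) as page_size.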
Multiple events from different cpus may map to a single persistent
event entry which has a unique identifier. The identifier allows
access to the persistent event with the perf_event_open() syscall. For
this, the new event type PERF_TYPE_PERSISTENT must be set with its id
specified in attr.config. Currently there is only support for per-cpu
events. Also, root access is required.
Since the buffers are shared, the set_output ioctl may not be used in
conjunction with persistent events.
This patch only supports tracepoints; support for all event types is
implemented in a later patch.
Based on patch set from Borislav Petkov <[email protected]>.
Cc: Borislav Petkov <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Jiri Olsa <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
include/linux/perf_event.h | 12 ++-
include/uapi/linux/perf_event.h | 4 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 37 +++++--
kernel/events/internal.h | 2 +
kernel/events/persistent.c | 221 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 266 insertions(+), 12 deletions(-)
create mode 100644 kernel/events/persistent.c
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c43f6ea..1a62a25 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -436,6 +436,8 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ struct list_head pevent_entry; /* persistent event */
+ int pevent_id;
#endif /* CONFIG_PERF_EVENTS */
};
@@ -765,7 +767,7 @@ extern void perf_event_enable(struct perf_event *event);
extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
-#else
+#else /* !CONFIG_PERF_EVENTS */
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
@@ -805,7 +807,7 @@ static inline void perf_event_enable(struct perf_event *event) { }
static inline void perf_event_disable(struct perf_event *event) { }
static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
-#endif
+#endif /* !CONFIG_PERF_EVENTS */
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
extern bool perf_event_can_stop_tick(void);
@@ -819,6 +821,12 @@ extern void perf_restore_debug_store(void);
static inline void perf_restore_debug_store(void) { }
#endif
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_EVENT_TRACING)
+extern int perf_add_persistent_tp(struct ftrace_event_call *tp);
+#else
+static inline int perf_add_persistent_tp(void *tp) { return -ENOENT; }
+#endif
+
#define perf_output_put(handle, x) perf_output_copy((handle), &(x), sizeof(x))
/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 62c25a2..2b84b97 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -32,6 +32,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE = 3,
PERF_TYPE_RAW = 4,
PERF_TYPE_BREAKPOINT = 5,
+ PERF_TYPE_PERSISTENT = 6,
PERF_TYPE_MAX, /* non-ABI */
};
@@ -275,8 +276,9 @@ struct perf_event_attr {
exclude_callchain_kernel : 1, /* exclude kernel callchains */
exclude_callchain_user : 1, /* exclude user callchains */
+ persistent : 1, /* always-on event */
- __reserved_1 : 41;
+ __reserved_1 : 40;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d1..70990d5 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = -pg
endif
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o persistent.o
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 932acc6..d9d6e67 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3982,6 +3982,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;
+ if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;
@@ -4007,6 +4010,11 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
goto unlock;
}
+ if (!event->rb->overwrite && vma->vm_flags & VM_WRITE) {
+ ret = -EACCES;
+ goto unlock;
+ }
+
if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
/*
* Raced against perf_mmap_close() through
@@ -5845,7 +5853,7 @@ static struct pmu perf_tracepoint = {
.event_idx = perf_swevent_event_idx,
};
-static inline void perf_tp_register(void)
+static inline void perf_register_tp(void)
{
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
}
@@ -5875,18 +5883,14 @@ static void perf_event_free_filter(struct perf_event *event)
#else
-static inline void perf_tp_register(void)
-{
-}
+static inline void perf_register_tp(void) { }
static int perf_event_set_filter(struct perf_event *event, void __user *arg)
{
return -ENOENT;
}
-static void perf_event_free_filter(struct perf_event *event)
-{
-}
+static void perf_event_free_filter(struct perf_event *event) { }
#endif /* CONFIG_EVENT_TRACING */
@@ -6574,6 +6578,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
INIT_LIST_HEAD(&event->rb_entry);
+ INIT_LIST_HEAD(&event->pevent_entry);
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
@@ -6831,6 +6836,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
goto unlock;
}
+ /* Don't redirect read-only (persistent) events. */
+ ret = -EACCES;
+ if (old_rb && !old_rb->overwrite)
+ goto unlock;
+ if (rb && !rb->overwrite)
+ goto unlock;
+
if (old_rb)
ring_buffer_detach(event, old_rb);
@@ -6888,6 +6900,14 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
+ /* return fd for an existing persistent event */
+ if (attr.type == PERF_TYPE_PERSISTENT)
+ return perf_get_persistent_event_fd(cpu, attr.config);
+
+ /* put event into persistent state (not yet supported) */
+ if (attr.persistent)
+ return -EOPNOTSUPP;
+
if (!attr.exclude_kernel) {
if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -7828,7 +7848,8 @@ void __init perf_event_init(void)
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
perf_pmu_register(&perf_cpu_clock, NULL, -1);
perf_pmu_register(&perf_task_clock, NULL, -1);
- perf_tp_register();
+ perf_register_tp();
+ perf_register_persistent();
perf_cpu_notifier(perf_cpu_notify);
register_reboot_notifier(&perf_reboot_notifier);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index d8708aa..94c3f73 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -193,5 +193,7 @@ static inline void put_event(struct perf_event *event)
extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
extern void perf_free_rb(struct perf_event *event);
extern int perf_get_fd(struct perf_event *event);
+extern int perf_get_persistent_event_fd(int cpu, int id);
+extern void __init perf_register_persistent(void);
#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
new file mode 100644
index 0000000..926654f
--- /dev/null
+++ b/kernel/events/persistent.c
@@ -0,0 +1,221 @@
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <linux/ftrace_event.h>
+
+#include "internal.h"
+
+/* 512 kiB: default perf tools memory size, see perf_evlist__mmap() */
+#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
+
+struct pevent {
+ char *name;
+ int id;
+};
+
+static DEFINE_PER_CPU(struct list_head, pevents);
+static DEFINE_PER_CPU(struct mutex, pevents_lock);
+
+/* Must be protected with pevents_lock. */
+static struct perf_event *__pevent_find(int cpu, int id)
+{
+ struct perf_event *event;
+
+ list_for_each_entry(event, &per_cpu(pevents, cpu), pevent_entry) {
+ if (event->pevent_id == id)
+ return event;
+ }
+
+ return NULL;
+}
+
+static int pevent_add(struct pevent *pevent, struct perf_event *event)
+{
+ int ret = -EEXIST;
+ int cpu = event->cpu;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+
+ if (__pevent_find(cpu, pevent->id))
+ goto unlock;
+
+ if (event->pevent_id)
+ goto unlock;
+
+ ret = 0;
+ event->pevent_id = pevent->id;
+ list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
+unlock:
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ return ret;
+}
+
+static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
+{
+ struct perf_event *event;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+
+ event = __pevent_find(cpu, pevent->id);
+ if (event) {
+ list_del(&event->pevent_entry);
+ event->pevent_id = 0;
+ }
+
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ return event;
+}
+
+static void persistent_event_release(struct perf_event *event)
+{
+ /*
+ * Safe since we hold &event->mmap_count. The ringbuffer is
+ * released with put_event() if there are no other references.
+ * In this case there are also no other mmaps.
+ */
+ atomic_dec(&event->rb->mmap_count);
+ atomic_dec(&event->mmap_count);
+ put_event(event);
+}
+
+static int persistent_event_open(int cpu, struct pevent *pevent,
+ struct perf_event_attr *attr, int nr_pages)
+{
+ struct perf_event *event;
+ int ret;
+
+ event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+ if (IS_ERR(event))
+ return PTR_ERR(event);
+
+ if (nr_pages < 0)
+ nr_pages = CPU_BUFFER_NR_PAGES;
+
+ ret = perf_alloc_rb(event, nr_pages, 0);
+ if (ret)
+ goto fail;
+
+ ret = pevent_add(pevent, event);
+ if (ret)
+ goto fail;
+
+ atomic_inc(&event->mmap_count);
+
+ /* All workie, enable event now */
+ perf_event_enable(event);
+
+ return ret;
+fail:
+ perf_event_release_kernel(event);
+ return ret;
+}
+
+static void persistent_event_close(int cpu, struct pevent *pevent)
+{
+ struct perf_event *event = pevent_del(pevent, cpu);
+ if (event)
+ persistent_event_release(event);
+}
+
+static int __maybe_unused
+persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
+{
+ struct pevent *pevent;
+ char id_buf[32];
+ int cpu;
+ int ret = 0;
+
+ pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+ if (!pevent)
+ return -ENOMEM;
+
+ pevent->id = attr->config;
+
+ if (!name) {
+ snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+ name = id_buf;
+ }
+
+ pevent->name = kstrdup(name, GFP_KERNEL);
+ if (!pevent->name) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ for_each_possible_cpu(cpu) {
+ ret = persistent_event_open(cpu, pevent, attr, nr_pages);
+ if (ret)
+ goto fail;
+ }
+
+ return 0;
+fail:
+ for_each_possible_cpu(cpu)
+ persistent_event_close(cpu, pevent);
+ kfree(pevent->name);
+ kfree(pevent);
+
+ pr_err("%s: Error adding persistent event: %d\n",
+ __func__, ret);
+
+ return ret;
+}
+
+#ifdef CONFIG_EVENT_TRACING
+
+int perf_add_persistent_tp(struct ftrace_event_call *tp)
+{
+ struct perf_event_attr attr;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.sample_period = 1;
+ attr.wakeup_events = 1;
+ attr.sample_type = PERF_SAMPLE_RAW;
+ attr.persistent = 1;
+ attr.config = tp->event.type;
+ attr.type = PERF_TYPE_TRACEPOINT;
+ attr.size = sizeof(attr);
+
+ return persistent_open(tp->name, &attr, -1);
+}
+
+#endif /* CONFIG_EVENT_TRACING */
+
+int perf_get_persistent_event_fd(int cpu, int id)
+{
+ struct perf_event *event;
+ int event_fd = 0;
+
+ if ((unsigned)cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ /* Must be root for persistent events */
+ if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+ event = __pevent_find(cpu, id);
+ if (!event || !try_get_event(event))
+ event_fd = -ENOENT;
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ if (event_fd)
+ return event_fd;
+
+ event_fd = perf_get_fd(event);
+ if (event_fd < 0)
+ put_event(event);
+
+ return event_fd;
+}
+
+void __init perf_register_persistent(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ INIT_LIST_HEAD(&per_cpu(pevents, cpu));
+ mutex_init(&per_cpu(pevents_lock, cpu));
+ }
+}
--
1.8.3.2
From: Robert Richter <[email protected]>
Tracepoints have a unique attr.config value. However, this is not
sufficient to support all event types. For this we need to generate
unique event ids.
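The allocation policy used here (lowest free id, starting at 1 so that
0 can keep meaning "not persistent") can be sketched in userspace as
follows. This is only an illustration with a small fixed table and no
locking; the patch itself uses the kernel's idr under a mutex, and the
sketch_*() names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_IDS 64

static bool id_used[SKETCH_MAX_IDS + 1];	/* index 0 stays unused */

/* Return the lowest free id >= 1, or -1 if the table is exhausted. */
int sketch_get_event_id(void)
{
	for (int id = 1; id <= SKETCH_MAX_IDS; id++) {
		if (!id_used[id]) {
			id_used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Give an id back so that a later allocation may reuse it. */
void sketch_put_event_id(int id)
{
	assert(id >= 1 && id <= SKETCH_MAX_IDS && id_used[id]);
	id_used[id] = false;
}
```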
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index aca1e98..f23270b 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -1,6 +1,7 @@
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <linux/ftrace_event.h>
+#include <linux/idr.h>
#include "internal.h"
@@ -13,10 +14,37 @@ struct pevent {
int id;
};
+static struct idr event_idr;
+static struct mutex event_lock;
static struct pmu persistent_pmu;
static DEFINE_PER_CPU(struct list_head, pevents);
static DEFINE_PER_CPU(struct mutex, pevents_lock);
+static inline struct pevent *find_event(int id)
+{
+ struct pevent *pevent;
+ rcu_read_lock();
+ pevent = idr_find(&event_idr, id);
+ rcu_read_unlock();
+ return pevent;
+}
+
+static inline int get_event_id(struct pevent *pevent)
+{
+ int event_id;
+ mutex_lock(&event_lock);
+ event_id = idr_alloc(&event_idr, pevent, 1, INT_MAX, GFP_KERNEL);
+ mutex_unlock(&event_lock);
+ return event_id;
+}
+
+static inline void put_event_id(int id)
+{
+ mutex_lock(&event_lock);
+ idr_remove(&event_idr, id);
+ mutex_unlock(&event_lock);
+}
+
/* Must be protected with pevents_lock. */
static struct perf_event *__pevent_find(int cpu, int id)
{
@@ -128,13 +156,16 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
struct pevent *pevent;
char id_buf[32];
int cpu;
- int ret = 0;
+ int ret;
pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
if (!pevent)
return -ENOMEM;
- pevent->id = attr->config;
+ ret = get_event_id(pevent);
+ if (ret < 0)
+ goto fail;
+ pevent->id = ret;
if (!name) {
snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
@@ -163,6 +194,9 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
fail:
for_each_possible_cpu(cpu)
persistent_event_close(cpu, pevent);
+
+ if (pevent->id)
+ put_event_id(pevent->id);
kfree(pevent->name);
kfree(pevent);
@@ -306,6 +340,8 @@ void __init perf_register_persistent(void)
{
int cpu;
+ idr_init(&event_idr);
+ mutex_init(&event_lock);
perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
for_each_possible_cpu(cpu) {
--
1.8.3.2
From: Robert Richter <[email protected]>
We want to use the kernel's pmu design to later expose persistent
events via sysfs to userland. Initially implement a persistent pmu.
A format syntax is introduced that allows setting bits anywhere in
struct perf_event_attr. This is used in this case to set the
persistent flag (attr5:23). The syntax is attr<num> where num is the
index into struct perf_event_attr viewed as an array of u64 values.
Otherwise the syntax is the same as for config<num>.
Patches that implement this functionality for perf tools are sent in a
separate patchset.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 926654f..ede95ab 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -12,6 +12,7 @@ struct pevent {
int id;
};
+static struct pmu persistent_pmu;
static DEFINE_PER_CPU(struct list_head, pevents);
static DEFINE_PER_CPU(struct mutex, pevents_lock);
@@ -210,10 +211,43 @@ int perf_get_persistent_event_fd(int cpu, int id)
return event_fd;
}
+PMU_FORMAT_ATTR(persistent, "attr5:23");
+
+static struct attribute *persistent_format_attrs[] = {
+ &format_attr_persistent.attr,
+ NULL,
+};
+
+static struct attribute_group persistent_format_group = {
+ .name = "format",
+ .attrs = persistent_format_attrs,
+};
+
+static const struct attribute_group *persistent_attr_groups[] = {
+ &persistent_format_group,
+ NULL,
+};
+
+static int persistent_pmu_init(struct perf_event *event)
+{
+ if (persistent_pmu.type != event->attr.type)
+ return -ENOENT;
+
+ /* Not a persistent event. */
+ return -EFAULT;
+}
+
+static struct pmu persistent_pmu = {
+ .event_init = persistent_pmu_init,
+ .attr_groups = persistent_attr_groups,
+};
+
void __init perf_register_persistent(void)
{
int cpu;
+ perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
+
for_each_possible_cpu(cpu) {
INIT_LIST_HEAD(&per_cpu(pevents, cpu));
mutex_init(&per_cpu(pevents_lock, cpu));
--
1.8.3.2
From: Robert Richter <[email protected]>
Implement ioctl functions to control persistent events. There are
functions to detach an event from a process or attach it to one. The
PERF_EVENT_IOC_DETACH ioctl call makes an event persistent. After
closing the event's fd it then runs in the background of the system
without the need of a controlling process. The perf_event_open()
syscall can be used by any process to reopen the event. The
PERF_EVENT_IOC_ATTACH ioctl attaches the event again so that it is
removed after closing the event's fd.
This is for Linux man-pages:
type ...
PERF_TYPE_PERSISTENT (Since Linux 3.xx)
This indicates a persistent event. There is a unique
identifier for each persistent event that needs to be
specified in the event's attribute config field.
Persistent events are listed under:
/sys/bus/event_source/devices/persistent/
...
persistent : 1, /* always-on event */
...
persistent: (Since Linux 3.xx)
Put event into persistent state after opening. After closing
the event's fd the event is persistent in the system and
continues to run.
perf_event ioctl calls
PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
Detach the event specified by the file descriptor from the
process and make it persistent in the system. After
closing the fd the event will continue to run. A unique
identifier for the persistent event is returned or an
error otherwise. The following allows reconnecting to the
event:
pe.type = PERF_TYPE_PERSISTENT;
pe.config = <pevent_id>;
...
fd = perf_event_open(...);
The event must be reopened on the same cpu.
PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
Attach the event specified by the file descriptor to the
current process. The event is no longer persistent in the
system and will be removed after all users have disconnected
from the event. Thus, if there are no other users, the
event will be closed too after closing its file
descriptor; the event then no longer exists.
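A userspace wrapper around the two ioctls might look like the sketch
below. The request numbers are copied from this RFC patch and are an
assumption: they are not defined in mainline headers, and mainline
kernels use these slots for other requests, so the calls are expected
to fail there. The wrapper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>

/* Assumption: request numbers as proposed in this RFC patch. */
#ifndef PERF_EVENT_IOC_DETACH
#define PERF_EVENT_IOC_DETACH	_IO('$', 8)
#endif
#ifndef PERF_EVENT_IOC_ATTACH
#define PERF_EVENT_IOC_ATTACH	_IO('$', 9)
#endif

/* Make the event persistent; returns its id (> 0) or -errno. */
int persistent_detach(int fd)
{
	int id = ioctl(fd, PERF_EVENT_IOC_DETACH);

	return id < 0 ? -errno : id;
}

/* Tie the event to its open fds again; returns 0 or -errno. */
int persistent_attach(int fd)
{
	int ret = ioctl(fd, PERF_EVENT_IOC_ATTACH);

	assert(ret <= 0);	/* the patch returns 0 or an error */
	return ret < 0 ? -errno : 0;
}
```

The id returned by persistent_detach() is the value to place in
attr.config when reopening the event with PERF_TYPE_PERSISTENT.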
Cc: Vince Weaver <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
include/uapi/linux/perf_event.h | 2 +
kernel/events/core.c | 6 ++
kernel/events/internal.h | 2 +
kernel/events/persistent.c | 178 +++++++++++++++++++++++++++++++++-------
4 files changed, 160 insertions(+), 28 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2b84b97..82a8244 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -324,6 +324,8 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, u64 *)
+#define PERF_EVENT_IOC_DETACH _IO ('$', 8)
+#define PERF_EVENT_IOC_ATTACH _IO ('$', 9)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d9d6e67..8d5c6e3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3622,6 +3622,12 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_DETACH:
+ return perf_event_detach(event);
+
+ case PERF_EVENT_IOC_ATTACH:
+ return perf_event_attach(event);
+
default:
return -ENOTTY;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 94c3f73..f9bc15f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -195,5 +195,7 @@ extern void perf_free_rb(struct perf_event *event);
extern int perf_get_fd(struct perf_event *event);
extern int perf_get_persistent_event_fd(int cpu, int id);
extern void __init perf_register_persistent(void);
+extern int perf_event_detach(struct perf_event *event);
+extern int perf_event_attach(struct perf_event *event);
#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index a0ef6d4..e156afe 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -59,6 +59,49 @@ static struct perf_event *__pevent_find(int cpu, int id)
return NULL;
}
+static void pevent_free(struct pevent *pevent)
+{
+ if (pevent->id)
+ put_event_id(pevent->id);
+
+ kfree(pevent->name);
+ kfree(pevent);
+}
+
+static struct pevent *pevent_alloc(char *name)
+{
+ struct pevent *pevent;
+ char id_buf[32];
+ int ret;
+
+ pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+ if (!pevent)
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&pevent->refcount, 1);
+
+ ret = get_event_id(pevent);
+ if (ret < 0)
+ goto fail;
+ pevent->id = ret;
+
+ if (!name) {
+ snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+ name = id_buf;
+ }
+
+ pevent->name = kstrdup(name, GFP_KERNEL);
+ if (!pevent->name) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ return pevent;
+fail:
+ pevent_free(pevent);
+ return ERR_PTR(ret);
+}
+
static int pevent_add(struct pevent *pevent, struct perf_event *event)
{
int ret = -EEXIST;
@@ -74,6 +117,7 @@ static int pevent_add(struct pevent *pevent, struct perf_event *event)
ret = 0;
event->pevent_id = pevent->id;
+ event->attr.persistent = 1;
list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
unlock:
mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -91,6 +135,7 @@ static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
if (event) {
list_del(&event->pevent_entry);
event->pevent_id = 0;
+ event->attr.persistent = 0;
}
mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -160,33 +205,12 @@ static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
{
struct pevent *pevent;
- char id_buf[32];
int cpu;
int ret;
- pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
- if (!pevent)
- return -ENOMEM;
-
- atomic_set(&pevent->refcount, 1);
-
- ret = get_event_id(pevent);
- if (ret < 0)
- goto fail;
- pevent->id = ret;
-
- if (!name) {
- snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
- name = id_buf;
- }
-
- pevent->name = kstrdup(name, GFP_KERNEL);
- if (!pevent->name) {
- ret = -ENOMEM;
- goto fail;
- }
-
- pevent->sysfs.id = pevent->id;
+ pevent = pevent_alloc(name);
+ if (IS_ERR(pevent))
+ return PTR_ERR(pevent);
for_each_possible_cpu(cpu) {
ret = persistent_event_open(cpu, pevent, attr, nr_pages);
@@ -206,10 +230,7 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
out:
if (atomic_dec_and_test(&pevent->refcount)) {
pevent_sysfs_unregister(pevent);
- if (pevent->id)
- put_event_id(pevent->id);
- kfree(pevent->name);
- kfree(pevent);
+ pevent_free(pevent);
}
return ret;
@@ -439,3 +460,104 @@ void __init perf_register_persistent(void)
mutex_init(&per_cpu(pevents_lock, cpu));
}
}
+
+/*
+ * Detach an event from a process. The event will remain in the system
+ * after closing the event's fd, it becomes persistent.
+ */
+int perf_event_detach(struct perf_event *event)
+{
+ struct pevent *pevent;
+ int cpu;
+ int ret;
+
+ if (!try_get_event(event))
+ return -ENOENT;
+
+ /* task events not yet supported: */
+ cpu = event->cpu;
+ if ((unsigned)cpu >= nr_cpu_ids) {
+ ret = -EINVAL;
+ goto fail_rb;
+ }
+
+ /*
+ * Avoid grabbing an id, later checked again in pevent_add()
+ * with mmap_mutex held.
+ */
+ if (event->pevent_id) {
+ ret = -EEXIST;
+ goto fail_rb;
+ }
+
+ mutex_lock(&event->mmap_mutex);
+ if (event->rb)
+ ret = -EBUSY;
+ else
+ ret = perf_alloc_rb(event, CPU_BUFFER_NR_PAGES, 0);
+ mutex_unlock(&event->mmap_mutex);
+
+ if (ret)
+ goto fail_rb;
+
+ pevent = pevent_alloc(NULL);
+ if (IS_ERR(pevent)) {
+ ret = PTR_ERR(pevent);
+ goto fail_pevent;
+ }
+
+ ret = pevent_add(pevent, event);
+ if (ret)
+ goto fail_add;
+
+ ret = pevent_sysfs_register(pevent);
+ if (ret)
+ goto fail_sysfs;
+
+ atomic_inc(&event->mmap_count);
+
+ return pevent->id;
+fail_sysfs:
+ pevent_del(pevent, cpu);
+fail_add:
+ pevent_free(pevent);
+fail_pevent:
+ mutex_lock(&event->mmap_mutex);
+ if (event->rb)
+ perf_free_rb(event);
+ mutex_unlock(&event->mmap_mutex);
+fail_rb:
+ put_event(event);
+ return ret;
+}
+
+/*
+ * Attach an event to a process. The event will be removed after all
+ * users disconnected from it, it's no longer persistent in the
+ * system.
+ */
+int perf_event_attach(struct perf_event *event)
+{
+ int cpu = event->cpu;
+ struct pevent *pevent;
+
+ if ((unsigned)cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ pevent = find_event(event->pevent_id);
+ if (!pevent)
+ return -EINVAL;
+
+ event = pevent_del(pevent, cpu);
+ if (!event)
+ return -EINVAL;
+
+ if (atomic_dec_and_test(&pevent->refcount)) {
+ pevent_sysfs_unregister(pevent);
+ pevent_free(pevent);
+ }
+
+ persistent_event_release(event);
+
+ return 0;
+}
--
1.8.3.2
From: Robert Richter <[email protected]>
We need this later for proper event removal.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index f23270b..70446ae 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -9,6 +9,7 @@
#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
struct pevent {
+ atomic_t refcount;
struct perf_pmu_events_attr sysfs;
char *name;
int id;
@@ -130,6 +131,7 @@ static int persistent_event_open(int cpu, struct pevent *pevent,
if (ret)
goto fail;
+ atomic_inc(&pevent->refcount);
atomic_inc(&event->mmap_count);
/* All workie, enable event now */
@@ -144,8 +146,11 @@ static int persistent_event_open(int cpu, struct pevent *pevent,
static void persistent_event_close(int cpu, struct pevent *pevent)
{
struct perf_event *event = pevent_del(pevent, cpu);
- if (event)
+ if (event) {
+ /* Safe, the caller holds &pevent->refcount too. */
+ atomic_dec(&pevent->refcount);
persistent_event_release(event);
+ }
}
static int pevent_sysfs_register(struct pevent *event);
@@ -162,6 +167,8 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
if (!pevent)
return -ENOMEM;
+ atomic_set(&pevent->refcount, 1);
+
ret = get_event_id(pevent);
if (ret < 0)
goto fail;
@@ -187,21 +194,21 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
}
ret = pevent_sysfs_register(pevent);
- if (ret)
- goto fail;
-
- return 0;
+ if (!ret)
+ goto out;
fail:
for_each_possible_cpu(cpu)
persistent_event_close(cpu, pevent);
- if (pevent->id)
- put_event_id(pevent->id);
- kfree(pevent->name);
- kfree(pevent);
-
pr_err("%s: Error adding persistent event: %d\n",
__func__, ret);
+out:
+ if (atomic_dec_and_test(&pevent->refcount)) {
+ if (pevent->id)
+ put_event_id(pevent->id);
+ kfree(pevent->name);
+ kfree(pevent);
+ }
return ret;
}
--
1.8.3.2
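
The reference counting introduced by the patch above can be sketched in
userspace C. This is only an illustration under stated assumptions:
struct pevent_model and its helpers are hypothetical names, C11 atomics
stand in for the kernel's atomic_t, and the "freed" flag exists purely
so the release can be observed. The creator holds the initial
reference, each per-cpu event takes one more, and whoever drops the
last reference frees the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct pevent. */
struct pevent_model {
	atomic_int refcount;
	bool *freed;	/* set when the object is released (demo only) */
};

/* Allocate with one reference, owned by the creator. */
static struct pevent_model *pevent_model_alloc(bool *freed)
{
	struct pevent_model *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	atomic_init(&p->refcount, 1);
	p->freed = freed;
	*freed = false;
	return p;
}

/* Take one more reference, e.g. for each per-cpu event opened. */
static void pevent_model_get(struct pevent_model *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

/*
 * Drop one reference. Returns true if this was the last one and the
 * object was freed, similar in spirit to atomic_dec_and_test().
 */
static bool pevent_model_put(struct pevent_model *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) != 1)
		return false;
	*p->freed = true;
	free(p);
	return true;
}
```

With two per-cpu references taken, the creator's put does not free the
object; only the put that follows the last per-cpu release does.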
From: Robert Richter <[email protected]>
The total number of persistent events that could be registered in
sysfs was limited because the list was not allocated dynamically.
This patch implements memory reallocation when an event is added to
or removed from the list.
While at it, also implement pevent_sysfs_unregister(), which we need
later for proper event removal.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 115 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 99 insertions(+), 16 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 70446ae..a0ef6d4 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -154,6 +154,7 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
}
static int pevent_sysfs_register(struct pevent *event);
+static void pevent_sysfs_unregister(struct pevent *event);
static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
@@ -204,6 +205,7 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
__func__, ret);
out:
if (atomic_dec_and_test(&pevent->refcount)) {
+ pevent_sysfs_unregister(pevent);
if (pevent->id)
put_event_id(pevent->id);
kfree(pevent->name);
@@ -273,13 +275,12 @@ static struct attribute_group persistent_format_group = {
.attrs = persistent_format_attrs,
};
-#define MAX_EVENTS 16
-
-static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+static struct mutex sysfs_lock;
+static int sysfs_nr_entries;
static struct attribute_group pevents_group = {
.name = "events",
- .attrs = pevents_attr,
+ .attrs = NULL, /* dynamically allocated */
};
static const struct attribute_group *persistent_attr_groups[] = {
@@ -288,6 +289,7 @@ static const struct attribute_group *persistent_attr_groups[] = {
NULL,
};
#define EVENTS_GROUP_PTR (&persistent_attr_groups[1])
+#define EVENTS_ATTRS_PTR (&pevents_group.attrs)
static ssize_t pevent_sysfs_show(struct device *dev,
struct device_attribute *__attr, char *page)
@@ -304,7 +306,9 @@ static int pevent_sysfs_register(struct pevent *pevent)
struct attribute *attr = &sysfs->attr.attr;
struct device *dev = persistent_pmu.dev;
const struct attribute_group **group = EVENTS_GROUP_PTR;
- int idx;
+ struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+ struct attribute **attrs;
+ int ret = 0;
sysfs->id = pevent->id;
sysfs->attr = (struct device_attribute)
@@ -312,21 +316,99 @@ static int pevent_sysfs_register(struct pevent *pevent)
attr->name = pevent->name;
sysfs_attr_init(attr);
- /* add sysfs attr to events: */
- for (idx = 0; idx < MAX_EVENTS; idx++) {
- if (!cmpxchg(pevents_attr + idx, NULL, attr))
- break;
+ mutex_lock(&sysfs_lock);
+
+ /*
+ * Keep old list if no new one is available. Need this for
+ * device_remove_attrs() if unregistering pmu.
+ */
+ attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 2) * sizeof(*attrs),
+ GFP_KERNEL);
+
+ if (!attrs) {
+ ret = -ENOMEM;
+ goto unlock;
}
- if (idx >= MAX_EVENTS)
- return -ENOSPC;
- if (!idx)
+ attrs[sysfs_nr_entries++] = attr;
+ attrs[sysfs_nr_entries] = NULL;
+
+ if (!*group)
*group = &pevents_group;
+
+ if (!dev)
+ goto out; /* sysfs not yet initialized */
+
+ if (sysfs_nr_entries == 1)
+ ret = sysfs_create_group(&dev->kobj, *group);
+ else
+ ret = sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+
+ if (ret) {
+ /* roll back */
+ sysfs_nr_entries--;
+ if (!sysfs_nr_entries)
+ *group = NULL;
+ if (*attrs_ptr != attrs)
+ kfree(attrs);
+ else
+ attrs[sysfs_nr_entries] = NULL;
+ goto unlock;
+ }
+out:
+ if (*attrs_ptr != attrs) {
+ kfree(*attrs_ptr);
+ *attrs_ptr = attrs;
+ }
+unlock:
+ mutex_unlock(&sysfs_lock);
+
+ return ret;
+}
+
+static void pevent_sysfs_unregister(struct pevent *pevent)
+{
+ struct attribute *attr = &pevent->sysfs.attr.attr;
+ struct device *dev = persistent_pmu.dev;
+ const struct attribute_group **group = EVENTS_GROUP_PTR;
+ struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+ struct attribute **attrs, **dest;
+
+ mutex_lock(&sysfs_lock);
+
+ for (dest = *attrs_ptr; *dest; dest++) {
+ if (*dest == attr)
+ break;
+ }
+
+ if (!*dest)
+ goto unlock;
+
+ sysfs_nr_entries--;
+
+ *dest = (*attrs_ptr)[sysfs_nr_entries];
+ (*attrs_ptr)[sysfs_nr_entries] = NULL;
+
if (!dev)
- return 0; /* sysfs not yet initialized */
- if (idx)
- return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
- return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+ goto out; /* sysfs not yet initialized */
+
+ if (!sysfs_nr_entries)
+ sysfs_remove_group(&dev->kobj, *group);
+ else
+ sysfs_remove_file_from_group(&dev->kobj, attr, (*group)->name);
+out:
+ if (!sysfs_nr_entries)
+ *group = NULL;
+
+ attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 1) * sizeof(*attrs),
+ GFP_KERNEL);
+
+ if (attrs && *attrs_ptr != attrs) {
+ kfree(*attrs_ptr);
+ *attrs_ptr = attrs;
+ }
+unlock:
+ mutex_unlock(&sysfs_lock);
}
static int persistent_pmu_init(struct perf_event *event)
@@ -349,6 +431,7 @@ void __init perf_register_persistent(void)
idr_init(&event_idr);
mutex_init(&event_lock);
+ mutex_init(&sysfs_lock);
perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
for_each_possible_cpu(cpu) {
--
1.8.3.2
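
The list handling in the patch above can be sketched in userspace C.
This is a sketch under assumptions, not the kernel code: struct
attr_list and both helpers are hypothetical, and malloc() plus copy
stands in for __krealloc() to get the same property that the old list
stays valid when allocation fails. Removal moves the last entry into
the freed slot, as pevent_sysfs_unregister() does, so the order of
entries is not preserved. Unlike the patch, the sketch does not shrink
the allocation on removal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the NULL-terminated attribute list. */
struct attr_list {
	const void **attrs;	/* NULL-terminated once allocated */
	size_t nr;		/* number of entries, terminator excluded */
};

/* Append an entry. On allocation failure the old list is kept. */
static int attr_list_add(struct attr_list *l, const void *attr)
{
	const void **n = malloc((l->nr + 2) * sizeof(*n));

	if (!n)
		return -1;
	if (l->nr)
		memcpy(n, l->attrs, l->nr * sizeof(*n));
	n[l->nr] = attr;
	n[l->nr + 1] = NULL;
	free(l->attrs);
	l->attrs = n;
	l->nr++;
	return 0;
}

/* Remove an entry by moving the last one into its slot. */
static int attr_list_del(struct attr_list *l, const void *attr)
{
	size_t i;

	for (i = 0; i < l->nr; i++) {
		if (l->attrs[i] == attr)
			break;
	}
	if (i == l->nr)
		return -1;	/* not found */
	l->nr--;
	l->attrs[i] = l->attrs[l->nr];
	l->attrs[l->nr] = NULL;
	return 0;
}
```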
From: Robert Richter <[email protected]>
Expose persistent events in the system to userland using sysfs. Perf
tools are able to read existing pmu events from sysfs. Now we use a
persistent pmu as an event container containing all registered
persistent events of the system. This patch adds dynamic
registration of persistent events to sysfs. E.g. something like this:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
Perf tools need to support the attr<num> syntax that is added in a
separate patch set. With it we are able to run perf tool commands to
read persistent events, e.g.:
# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/
[ Jiri: Document attr<index> syntax in sysfs ABI ]
[ Namhyung: Fix sysfs registration with lockdep enabled ]
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
.../testing/sysfs-bus-event_source-devices-format | 43 ++++++++++++----
kernel/events/persistent.c | 60 ++++++++++++++++++++++
2 files changed, 92 insertions(+), 11 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
index 77f47ff..2dbb911 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
@@ -1,13 +1,14 @@
-Where: /sys/bus/event_source/devices/<dev>/format
+Where: /sys/bus/event_source/devices/<pmu>/format/<name>
Date: January 2012
-Kernel Version: 3.3
+Kernel Version: 3.3
+ 3.xx (added attr<index>:<bits>)
Contact: Jiri Olsa <[email protected]>
-Description:
- Attribute group to describe the magic bits that go into
- perf_event_attr::config[012] for a particular pmu.
- Each attribute of this group defines the 'hardware' bitmask
- we want to export, so that userspace can deal with sane
- name/value pairs.
+
+Description: Define formats for bit ranges in perf_event_attr
+
+ Attribute group to describe the magic bits that go
+ into struct perf_event_attr for a particular pmu. Bit
+ range may be any bit mask of an u64 (bits 0 to 63).
Userspace must be prepared for the possibility that attributes
define overlapping bit ranges. For example:
@@ -15,6 +16,26 @@ Contact: Jiri Olsa <[email protected]>
attr2 = 'config:0-7'
attr3 = 'config:12-35'
- Example: 'config1:1,6-10,44'
- Defines contents of attribute that occupies bits 1,6-10,44 of
- perf_event_attr::config1.
+ Syntax Description
+
+ config[012]*:<bits> Each attribute of this group
+ defines the 'hardware' bitmask
+ we want to export, so that
+ userspace can deal with sane
+ name/value pairs.
+
+ attr<index>:<bits> Set any field of the event
+ attribute. The index is a
+ decimal number that specifies
+ the u64 value to be set within
+ struct perf_event_attr.
+
+ Examples:
+
+ 'config1:1,6-10,44' Defines contents of attribute
+ that occupies bits 1,6-10,44
+ of perf_event_attr::config1.
+
+ 'attr5:23' Define the persistent event
+ flag (bit 23 of the attribute
+ flags)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index ede95ab..aca1e98 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -8,6 +8,7 @@
#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
struct pevent {
+ struct perf_pmu_events_attr sysfs;
char *name;
int id;
};
@@ -119,6 +120,8 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
persistent_event_release(event);
}
+static int pevent_sysfs_register(struct pevent *event);
+
static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
{
@@ -144,12 +147,18 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
goto fail;
}
+ pevent->sysfs.id = pevent->id;
+
for_each_possible_cpu(cpu) {
ret = persistent_event_open(cpu, pevent, attr, nr_pages);
if (ret)
goto fail;
}
+ ret = pevent_sysfs_register(pevent);
+ if (ret)
+ goto fail;
+
return 0;
fail:
for_each_possible_cpu(cpu)
@@ -223,10 +232,61 @@ static struct attribute_group persistent_format_group = {
.attrs = persistent_format_attrs,
};
+#define MAX_EVENTS 16
+
+static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+
+static struct attribute_group pevents_group = {
+ .name = "events",
+ .attrs = pevents_attr,
+};
+
static const struct attribute_group *persistent_attr_groups[] = {
&persistent_format_group,
+ NULL, /* placeholder: &pevents_group */
NULL,
};
+#define EVENTS_GROUP_PTR (&persistent_attr_groups[1])
+
+static ssize_t pevent_sysfs_show(struct device *dev,
+ struct device_attribute *__attr, char *page)
+{
+ struct perf_pmu_events_attr *attr =
+ container_of(__attr, struct perf_pmu_events_attr, attr);
+ return sprintf(page, "persistent,config=%lld",
+ (unsigned long long)attr->id);
+}
+
+static int pevent_sysfs_register(struct pevent *pevent)
+{
+ struct perf_pmu_events_attr *sysfs = &pevent->sysfs;
+ struct attribute *attr = &sysfs->attr.attr;
+ struct device *dev = persistent_pmu.dev;
+ const struct attribute_group **group = EVENTS_GROUP_PTR;
+ int idx;
+
+ sysfs->id = pevent->id;
+ sysfs->attr = (struct device_attribute)
+ __ATTR(, 0444, pevent_sysfs_show, NULL);
+ attr->name = pevent->name;
+ sysfs_attr_init(attr);
+
+ /* add sysfs attr to events: */
+ for (idx = 0; idx < MAX_EVENTS; idx++) {
+ if (!cmpxchg(pevents_attr + idx, NULL, attr))
+ break;
+ }
+
+ if (idx >= MAX_EVENTS)
+ return -ENOSPC;
+ if (!idx)
+ *group = &pevents_group;
+ if (!dev)
+ return 0; /* sysfs not yet initialized */
+ if (idx)
+ return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+ return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+}
static int persistent_pmu_init(struct perf_event *event)
{
--
1.8.3.2
From: Borislav Petkov <[email protected]>
... for MCEs collection.
Signed-off-by: Borislav Petkov <[email protected]>
[ rric: Fix build error for no-tracepoints configs ]
[ rric: Return proper error code. ]
[ rric: No error message if perf is disabled. ]
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 87a65c9..ffa227b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1990,6 +1990,25 @@ int __init mcheck_init(void)
return 0;
}
+#ifdef CONFIG_EVENT_TRACING
+
+int __init mcheck_init_tp(void)
+{
+ int ret = perf_add_persistent_tp(&event_mce_record);
+
+ if (ret && ret != -ENOENT)
+ pr_err("Error adding MCE persistent event: %d\n", ret);
+
+ return ret;
+}
+/*
+ * We can't run earlier because persistent events use anon_inode_getfile and
+ * its anon_inode_mnt gets initialized as a fs_initcall.
+ */
+fs_initcall_sync(mcheck_init_tp);
+
+#endif /* CONFIG_EVENT_TRACING */
+
/*
* mce_syscore: PM support
*/
--
1.8.3.2
From: Robert Richter <[email protected]>
This new function creates a new fd for an event. We need this later to
get a fd from a persistent event.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 13 ++++++++-----
kernel/events/internal.h | 1 +
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 24810d5..932acc6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4100,6 +4100,11 @@ static const struct file_operations perf_fops = {
.fasync = perf_fasync,
};
+int perf_get_fd(struct perf_event *event)
+{
+ return anon_inode_getfd("[perf_event]", &perf_fops, event, O_RDWR);
+}
+
/*
* Perf event wakeup
*
@@ -6868,7 +6873,6 @@ SYSCALL_DEFINE5(perf_event_open,
struct perf_event *event, *sibling;
struct perf_event_attr attr;
struct perf_event_context *ctx;
- struct file *event_file = NULL;
struct fd group = {NULL, 0};
struct task_struct *task = NULL;
struct pmu *pmu;
@@ -7025,9 +7029,9 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, O_RDWR);
- if (IS_ERR(event_file)) {
- err = PTR_ERR(event_file);
+ event_fd = perf_get_fd(event);
+ if (event_fd < 0) {
+ err = event_fd;
goto err_context;
}
@@ -7093,7 +7097,6 @@ SYSCALL_DEFINE5(perf_event_open,
* perf_group_detach().
*/
fdput(group);
- fd_install(event_fd, event_file);
return event_fd;
err_context:
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 8ddaf57..d8708aa 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -192,5 +192,6 @@ static inline void put_event(struct perf_event *event)
extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
extern void perf_free_rb(struct perf_event *event);
+extern int perf_get_fd(struct perf_event *event);
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.8.3.2
From: Robert Richter <[email protected]>
Factor out code to allocate and deallocate ringbuffers. We need this
later to set up the sampling buffer for persistent events.
While at it, replace get_current_user() with get_uid(user).
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 75 +++++++++++++++++++++++++++++-------------------
kernel/events/internal.h | 3 ++
2 files changed, 48 insertions(+), 30 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c9a5d4c..24810d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3124,8 +3124,44 @@ static void free_event_rcu(struct rcu_head *head)
}
static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event, struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+/*
+ * Must be called with &event->mmap_mutex held. event->rb must be
+ * NULL. perf_alloc_rb() requires &event->mmap_count to be incremented
+ * on success which corresponds to &rb->mmap_count that is initialized
+ * with 1.
+ */
+int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags)
+{
+ struct ring_buffer *rb;
+
+ rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);
+ if (!rb)
+ return -ENOMEM;
+
+ atomic_set(&rb->mmap_count, 1);
+ ring_buffer_attach(event, rb);
+ rcu_assign_pointer(event->rb, rb);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+/* Must be called with &event->mmap_mutex held. event->rb must be set. */
+void perf_free_rb(struct perf_event *event)
+{
+ struct ring_buffer *rb = event->rb;
+
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb);
+}
+
static void unaccount_event_cpu(struct perf_event *event, int cpu)
{
if (event->parent)
@@ -3177,6 +3213,7 @@ static void __free_event(struct perf_event *event)
call_rcu(&event->rcu_head, free_event_rcu);
}
+
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3184,8 +3221,6 @@ static void free_event(struct perf_event *event)
unaccount_event(event);
if (event->rb) {
- struct ring_buffer *rb;
-
/*
* Can happen when we close an event with re-directed output.
*
@@ -3193,12 +3228,8 @@ static void free_event(struct perf_event *event)
* over us; possibly making our ring_buffer_put() the last.
*/
mutex_lock(&event->mmap_mutex);
- rb = event->rb;
- if (rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* could be last */
- }
+ if (event->rb)
+ perf_free_rb(event);
mutex_unlock(&event->mmap_mutex);
}
@@ -3798,11 +3829,8 @@ static void ring_buffer_detach_all(struct ring_buffer *rb)
* still restart the iteration to make sure we're not now
* iterating the wrong list.
*/
- if (event->rb == rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* can't be last, we still have one */
- }
+ if (event->rb == rb)
+ perf_free_rb(event);
mutex_unlock(&event->mmap_mutex);
put_event(event);
@@ -3938,7 +3966,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
unsigned long locked, lock_limit;
- struct ring_buffer *rb;
unsigned long vma_size;
unsigned long nr_pages;
long user_extra, extra;
@@ -4022,27 +4049,15 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;
- rb = rb_alloc(nr_pages,
- event->attr.watermark ? event->attr.wakeup_watermark : 0,
- event->cpu, flags);
-
- if (!rb) {
- ret = -ENOMEM;
+ ret = perf_alloc_rb(event, nr_pages, flags);
+ if (ret)
goto unlock;
- }
- atomic_set(&rb->mmap_count, 1);
- rb->mmap_locked = extra;
- rb->mmap_user = get_current_user();
+ event->rb->mmap_locked = extra;
+ event->rb->mmap_user = get_uid(user);
atomic_long_add(user_extra, &user->locked_vm);
vma->vm_mm->pinned_vm += extra;
-
- ring_buffer_attach(event, rb);
- rcu_assign_pointer(event->rb, rb);
-
- perf_event_update_userpage(event);
-
unlock:
if (!ret)
atomic_inc(&event->mmap_count);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 96a07d2..8ddaf57 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -190,4 +190,7 @@ static inline void put_event(struct perf_event *event)
__put_event(event);
}
+extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
+extern void perf_free_rb(struct perf_event *event);
+
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.8.3.2
From: Robert Richter <[email protected]>
Factor out a function to detach all events from a ringbuffer. No
functional changes.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 82 ++++++++++++++++++++++++++++------------------------
1 file changed, 44 insertions(+), 38 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 928fae7c..5dcc5fe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3775,6 +3775,49 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
spin_unlock_irqrestore(&rb->event_lock, flags);
}
+static void ring_buffer_detach_all(struct ring_buffer *rb)
+{
+ struct perf_event *event;
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ if (!atomic_long_inc_not_zero(&event->refcount)) {
+ /*
+ * This event is en-route to free_event() which will
+ * detach it and remove it from the list.
+ */
+ continue;
+ }
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ /*
+ * Check we didn't race with perf_event_set_output() which can
+ * swizzle the rb from under us while we were waiting to
+ * acquire mmap_mutex.
+ *
+ * If we find a different rb; ignore this event, a next
+ * iteration will no longer find it on the list. We have to
+ * still restart the iteration to make sure we're not now
+ * iterating the wrong list.
+ */
+ if (event->rb == rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ }
+ mutex_unlock(&event->mmap_mutex);
+ put_event(event);
+
+ /*
+ * Restart the iteration; either we're on the wrong list or
+ * destroyed its integrity by doing a deletion.
+ */
+ goto again;
+ }
+ rcu_read_unlock();
+}
+
static void ring_buffer_wakeup(struct perf_event *event)
{
struct ring_buffer *rb;
@@ -3867,44 +3910,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
* into the now unreachable buffer. Somewhat complicated by the
* fact that rb::event_lock otherwise nests inside mmap_mutex.
*/
-again:
- rcu_read_lock();
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
- if (!atomic_long_inc_not_zero(&event->refcount)) {
- /*
- * This event is en-route to free_event() which will
- * detach it and remove it from the list.
- */
- continue;
- }
- rcu_read_unlock();
-
- mutex_lock(&event->mmap_mutex);
- /*
- * Check we didn't race with perf_event_set_output() which can
- * swizzle the rb from under us while we were waiting to
- * acquire mmap_mutex.
- *
- * If we find a different rb; ignore this event, a next
- * iteration will no longer find it on the list. We have to
- * still restart the iteration to make sure we're not now
- * iterating the wrong list.
- */
- if (event->rb == rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* can't be last, we still have one */
- }
- mutex_unlock(&event->mmap_mutex);
- put_event(event);
-
- /*
- * Restart the iteration; either we're on the wrong list or
- * destroyed its integrity by doing a deletion.
- */
- goto again;
- }
- rcu_read_unlock();
+ ring_buffer_detach_all(rb);
/*
* It could be there's still a few 0-ref events on the list; they'll
--
1.8.3.2
On Thu, 22 Aug 2013, Robert Richter wrote:
> From: Robert Richter <[email protected]>
>
> Expose persistent events in the system to userland using sysfs. Perf
> tools are able to read existing pmu events from sysfs. Now we use a
> persistent pmu as an event container containing all registered
> persistent events of the system. This patch adds dynamically
> registration of persistent events to sysfs. E.g. something like this:
>
> /sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
> /sys/bus/event_source/devices/persistent/format/persistent:attr5:23
...
> @@ -15,6 +16,26 @@ Contact: Jiri Olsa <[email protected]>
> attr2 = 'config:0-7'
> attr3 = 'config:12-35'
>
> - Example: 'config1:1,6-10,44'
> - Defines contents of attribute that occupies bits 1,6-10,44 of
> - perf_event_attr::config1.
> + Syntax Description
> +
> + config[012]*:<bits> Each attribute of this group
> + defines the 'hardware' bitmask
> + we want to export, so that
> + userspace can deal with sane
> + name/value pairs.
> +
> + attr<index>:<bits> Set any field of the event
> + attribute. The index is a
> + decimal number that specifies
> + the u64 value to be set within
> + struct perf_event_attr.
Ugh this is ugly. You might also want to specify that the "index"
value starts at 0 which threw me for a bit when I was trying to figure
out what was going on. You might also want to clarify the previous
part of the document which uses "attr" to mean something else.
Is this endian clean? Will attr5:23 point to the same bit on a big-endian
machine as on little-endian?
If we're going to have to have an ugly interface like this we might
as well do something more human readable, since anything that parses this
is going to have to rebuild the struct perf_event_attr by hand anyway
(unless you propose people blindly set bits at offsets using pointer math
which just sounds like a bad idea).
For example, just give up and let someone specify the actual field name
like "persistent" and have the tools handle that.
/sys/bus/event_source/devices/persistent/events/mce_record
persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent
attr_persistent
That way you could also add things like
/sys/bus/event_source/devices/persistent/format/precise_ip
attr_precise_ip:0-1
Although I still think exposing the full huge attr struct via sysfs is just
silly. Isn't there some better way? I'm not aware of any other syscall
that exports things like this.
Vince
On Thu, 22 Aug 2013, Robert Richter wrote:
> From: Robert Richter <[email protected]>
>
> Implementing ioctl functions to control persistent events. There are
> functions to detach or attach an event to or from a process. The
> PERF_EVENT_IOC_DETACH ioctl call makes an event persistent. After
> closing the event's fd it runs then in the background of the system
> without the need of a controlling process. The perf_event_open()
> syscall can be used to reopen the event by any process. The
> PERF_EVENT_IOC_ATTACH ioctl attaches the event again so that it is
> removed after closing the event's fd.
> PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
> PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
I think these aren't very good names for the ioctls. Maybe something
like
PERF_EVENT_IOC_MAKE_PERSISTENT
PERF_EVENT_IOC_UNPERSIST
I know that last one's not a real word but I can't think of what the
proper term would be. Maybe
PERF_EVENT_IOC_RELEASE_PERSISTENT
PERF_EVENT_IOC_RECLAIM_PERSISTENT
> This is for Linux man-pages:
Thanks, though you're missing out by not learning all about troff
formatting.
> type ...
>
> PERF_TYPE_PERSISTENT (Since Linux 3.xx)
>
> This indicates a persistent event. There is a unique
> identifier for each persistent event that needs to be
> specified in the event's attribute config field.
> Persistent events are listed under:
>
> /sys/bus/event_source/devices/persistent/
Wait, so the first time you create a persistent event you do *not*
set type PERF_TYPE_PERSISTENT? You only do that if you're
"attaching" to an existing event? You might want to clarify that.
> persistent: (Since Linux 3.xx)
>
> Put event into persistent state after opening. After closing
> the event's fd the event is persistent in the system and
> continues to run.
Will there be some sort of tool that will let you kill runaway persistent
events? Or will you have to perf_event_open() / ioctl() them
by hand somehow?
> PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
>
> Detach the event specified by the file descriptor from the
> process and make it persistent in the system. After
> closing the fd the event will continue to run. An unique
> identifier for the persistent event is returned or an
> error otherwise. The following allows to connect to the
> event again:
You might want to re-order things so it's clear you get the unique ID
at ioctl time and not after the close happens.
> PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
>
> Attach the event specified by the file descriptor to the
> current process. The event is no longer persistent in the
> system and will be removed after all users disconnected
> from the event. Thus, if there are no other users the
> event will be closed too after closing its file
> descriptor, the event then no longer exists.
Vince
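
The lifecycle discussed above can be modelled in a few lines of C. This
is a toy state model, not the proposed kernel interface: struct
event_model and the model_*() helpers are hypothetical, and the
-ENOENT returned for an unknown id is an assumption. The -EEXIST and
-EINVAL cases follow perf_event_detach() and perf_event_attach() in the
patch. The model shows that the id is available as soon as the detach
call returns, before the fd is closed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one event and its persistence state. */
struct event_model {
	int persistent_id;	/* 0 while attached to a process */
	int open_fds;		/* number of fds referring to the event */
	bool exists;		/* false once the event has been removed */
};

/* Models PERF_EVENT_IOC_DETACH: returns the id at call time. */
static int model_detach(struct event_model *e, int new_id)
{
	if (e->persistent_id)
		return -EEXIST;
	e->persistent_id = new_id;
	return new_id;
}

/* Models PERF_EVENT_IOC_ATTACH: the event stops being persistent. */
static int model_attach(struct event_model *e)
{
	if (!e->persistent_id)
		return -EINVAL;
	e->persistent_id = 0;
	return 0;
}

/* Models close(fd): only a non-persistent event goes away. */
static void model_close(struct event_model *e)
{
	if (e->open_fds > 0)
		e->open_fds--;
	if (!e->open_fds && !e->persistent_id)
		e->exists = false;
}

/* Models reopening by id; the error code here is an assumption. */
static int model_open_by_id(struct event_model *e, int id)
{
	if (!id || !e->exists || e->persistent_id != id)
		return -ENOENT;
	e->open_fds++;
	return 0;
}
```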
On Thu, Aug 22, 2013 at 02:18:06PM -0400, Vince Weaver wrote:
> > PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
> > PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
>
> I think these aren't very good names for the ioctls. Maybe something
> like
> PERF_EVENT_IOC_MAKE_PERSISTENT
> PERF_EVENT_IOC_UNPERSIST
> I know that last one's not a real word but I can't think of what the
> proper term would be. Maybe
> PERF_EVENT_IOC_RELEASE_PERSISTENT
> PERF_EVENT_IOC_RECLAIM_PERSISTENT
"aren't very good names" is not really an argument I can work with. Why
not? What if you want to attach/detach to events but not be persistent?
Which also raises the question: how long are we persistent? The whole
system runtime or until the user decides to detach?
So ATTACH/DETACH in the sense of attaching processes to events for an
arbitrary amount of time *and* *not* for the duration of the tracee
as we do it currently implicitly, is much more generic wrt usage than
specifying that specific persistent case.
Thanks.
On 22.08.13 14:00:51, Vince Weaver wrote:
> On Thu, 22 Aug 2013, Robert Richter wrote:
> > + attr<index>:<bits> Set any field of the event
> > + attribute. The index is a
> > + decimal number that specifies
> > + the u64 value to be set within
> > + struct perf_event_attr.
>
> Ugh this is ugly.
It's not intended to be used by humans. ;) It is more for format
definitions that are provided by the kernel and that are parsed by
(perf) tools. Of course, a human could dig into it to figure out
what's going on.
> You might also want to specify that the "index"
> value starts at 0 which threw me for a bit when I was trying to figure
> out what was going on. You might also want to clarify the previous
> part of the document which uses "attr" to mean something else.
I thought it would be clear enough to refer to struct perf_event_attr.
Since the index usually starts with 0 as in the config fields, I
assumed this was clear in this case too. Though this can be documented
better.
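
To make the zero-based indexing concrete, here is a minimal sketch (not code from the patch set) that views the attribute as an array of u64 words and stores a value into a bit range of one of them. The helper name attr_set_range() is made up for illustration, and the caller is assumed to supply the word array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store 'value' into bits lo..hi (inclusive) of the
 * u64 word 'index' of an attribute viewed as an array of u64 words.
 * The index is zero-based, so "attr5" names the sixth u64.
 * Returns 0 on success, -1 if the arguments are out of range.
 */
int attr_set_range(uint64_t *words, size_t nwords, size_t index,
                   unsigned lo, unsigned hi, uint64_t value)
{
    uint64_t mask;

    if (index >= nwords || lo > hi || hi > 63)
        return -1;

    /* Build a mask with bits lo..hi set; avoid an undefined shift by 64. */
    if (hi - lo == 63)
        mask = ~0ULL;
    else
        mask = ((1ULL << (hi - lo + 1)) - 1) << lo;

    /* Clear the old field contents, then merge in the new value. */
    words[index] = (words[index] & ~mask) | ((value << lo) & mask);
    return 0;
}
```

With this, "attr5:23" corresponds to attr_set_range(words, nwords, 5, 23, 23, 1), and "precise_ip=1" under a format of "attr5:15-16" corresponds to attr_set_range(words, nwords, 5, 15, 16, 1).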
> Is this endian clean? Will attr5:23 point to the same bit on a big-endian
> machine as on little-endian?
It is the endianness used in the syscall. Handled in the same way as
for the config fields. I don't see where this could be an issue.
> If we're going to have to have an ugly interface like this we might
> as well do something more human readable, since anything that parses this
> is going to have to rebuild the struct perf_event_attr by hand anyway
> (unless you propose people blindly set bits at offsets using pointer math
> which just sounds like a bad idea).
The format directories in /sys/bus/event_source/devices/*/format/ are
already there to make it human readable. A user never has to deal
directly with the syntax provided there and may already use the
abstractions for the event syntax.
> For example, just give up and let someone specify the actual field name
> like "persistent" and have the tools handle that.
>
> /sys/bus/event_source/devices/persistent/events/mce_record
> persistent,config=106
>
> /sys/bus/event_source/devices/persistent/format/persistent
> attr_persistent
>
> That way you could also add things like
> /sys/bus/event_source/devices/persistent/format/precise_ip
> attr_precise_ip:0-1
The problem with this is that you have to implement this in the event
parser of perf tools. Thus, you need to update the parser for any
other syntax you want to use. This is not necessary with my
implementation. It already provides the above. The pmu driver just
needs to add the sysfs entry.
.../format/precise_ip = attr5:15-16
Then, -e cycles,precise_ip=1 is the same as -e cycles:p. Looks very
human readable?
All this without updating perf tools.
Simply test this as follows:
# cp -rp /sys/devices/cpu/format/ /dev/shm/
# echo attr5:15-16 > /dev/shm/format/precise_ip
# mount --bind /dev/shm/format/ /sys/devices/cpu/format/
# find /sys/devices/cpu/format/
/sys/devices/cpu/format/
/sys/devices/cpu/format/precise_ip
/sys/devices/cpu/format/umask
/sys/devices/cpu/format/event
/sys/devices/cpu/format/cmask
/sys/devices/cpu/format/edge
/sys/devices/cpu/format/inv
# perf record -e cycles,precise_ip=1 sleep 1
Works out-of-the-box...
It's the whole intention of the new syntax that the event parser never
needs to be modified again for new attribute flags or any other
settings of the attributes. Now the syntax is also capable of describing
any event setup. Also consider that different architectures may
provide different syntax. In this case there is no need for
arch-specific code in the tools implementation, all is just brought by
sysfs.
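
As an illustration of why a generic parser is enough, here is a minimal sketch of reading one format string into a field selector and a bit mask. It is not the perf tools parser; the names fmt_kind, fmt_spec and parse_format() are hypothetical, and only the 'config<N>:<bits>' and 'attr<N>:<bits>' forms discussed in this thread are handled.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum fmt_kind { FMT_CONFIG, FMT_ATTR };

struct fmt_spec {
    enum fmt_kind kind;  /* config<N> or attr<N> */
    unsigned index;      /* N; a bare "config" is treated as index 0 */
    uint64_t mask;       /* one bit set per listed bit position */
};

/* Parse e.g. "attr5:15-16" or "config1:1,6-10,44". Returns 0 or -1. */
int parse_format(const char *s, struct fmt_spec *out)
{
    const char *p;
    char *end;

    if (strncmp(s, "config", 6) == 0) {
        out->kind = FMT_CONFIG;
        p = s + 6;
    } else if (strncmp(s, "attr", 4) == 0) {
        out->kind = FMT_ATTR;
        p = s + 4;
    } else {
        return -1;
    }

    out->index = 0;
    if (isdigit((unsigned char)*p)) {
        out->index = (unsigned)strtoul(p, &end, 10);
        p = end;
    } else if (out->kind == FMT_ATTR) {
        return -1;              /* attr always needs an explicit index */
    }
    if (*p++ != ':')
        return -1;

    out->mask = 0;
    for (;;) {
        unsigned long lo, hi, b;

        if (!isdigit((unsigned char)*p))
            return -1;
        lo = hi = strtoul(p, &end, 10);
        p = end;
        if (*p == '-') {        /* a range "lo-hi" */
            p++;
            if (!isdigit((unsigned char)*p))
                return -1;
            hi = strtoul(p, &end, 10);
            p = end;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (b = lo; b <= hi; b++)
            out->mask |= 1ULL << b;
        if (*p != ',')
            break;
        p++;
    }
    if (*p == '\n')             /* tolerate a trailing newline from sysfs */
        p++;
    return *p == '\0' ? 0 : -1;
}
```

A new sysfs entry such as format/precise_ip containing "attr5:15-16" then needs no parser change: the same routine yields kind FMT_ATTR, index 5 and mask 0x18000.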
> Although I still think exposing the full huge attr stuct via sysfs is just
> silly. Isn't there some better way? I'm not aware of any other syscall
> that exports things like this.
That's a different story. Guess there is no way back anymore now.
Though we are in a state where all this is manageable and covered by
perf tools.
-Robert
On 23.08.13 11:11:28, Borislav Petkov wrote:
> On Thu, Aug 22, 2013 at 02:18:06PM -0400, Vince Weaver wrote:
> > > PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
> > > PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
> >
> > I think these aren't very good names for the ioctls. Maybe something
> > like
> > PERF_EVENT_IOC_MAKE_PERSISTENT
> > PERF_EVENT_IOC_UNPERSIST
> > I know that last one's not a real word but I can't think of what the
> > proper term would be. Maybe
> > PERF_EVENT_IOC_RELEASE_PERSISTENT
> > PERF_EVENT_IOC_RECLAIM_PERSISTENT
>
> "aren't very good names" is not really an argument I can work with. Why
> not? What if you want to attach/detach to events but not be persistent.
> Which also begs the question how long are we persistent? The whole
> system runtime or until the user decides to detach.
>
> So ATTACH/DETACH in the sense of attaching processes to events for an
> arbitrary amount of time *and* *not* for the duration of the tracee
> as we do it currently implicitly, is much more generic wrt usage than
> specifying that specific persistent case.
Ok, for clarification, my intention was to say something like 'detach
event from the current process controlling it', or 'attach the event
to the current process that holds the fd'.
Whatever term is best for these ioctls, I am fine with it. The
above terms look a bit long.
The problem with detach/attach is rather that it would be more
logical to attach first and detach afterwards. This is not the case
here, it's vice versa.
-Robert
On 22.08.13 14:18:06, Vince Weaver wrote:
> On Thu, 22 Aug 2013, Robert Richter wrote:
> > This is for Linux man-pages:
>
> Thanks, though you're missing out by not learning all about troff
> formatting.
Thanks for documenting perf_event_open(), it's a great help. The troff
thing I may learn in case I write a patch for man-pages. ;)
>
> > type ...
> >
> > PERF_TYPE_PERSISTENT (Since Linux 3.xx)
> >
> > This indicates a persistent event. There is a unique
> > identifier for each persistent event that needs to be
> > specified in the event's attribute config field.
> > Persistent events are listed under:
> >
> > /sys/bus/event_source/devices/persistent/
>
> Wait, so the first time you create a persistent event you do *not*
> set type PERF_TYPE_PERSISTENT? You only do that if you're
> "attaching" to an existing event? You might want to clarify that.
You just *open* an existing persistent event and may access buffers
with mmap(). After opening it you also may attach the event to the
process holding the fd with the ioctl. Then, the event is no longer
persistent and removed after all users finished using the event.
Without doing the attach-ioctl the event stays persistent in the
system.
> > persistent: (Since Linux 3.xx)
> >
> > Put event into persistent state after opening. After closing
> > the event's fd the event is persistent in the system and
> > continues to run.
>
> will there be some sort of tool that will let you kill runaway persistent
> events? Or will you have to manually perf_event_open() / ioctl() them
> by hand somehow?
The instance opening the event should also be responsible for removing
it. But there could be a perf tool for controlling persistent events
(create, list, remove, etc).
> > PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
> >
> > Detach the event specified by the file descriptor from the
> > process and make it persistent in the system. After
> > closing the fd the event will continue to run. An unique
> > identifier for the persistent event is returned or an
> > error otherwise. The following allows to connect to the
> > event again:
>
> You might want to re-order things so it's clear you get the unique ID
> at ioctl time and not after the close happens.
Ah, yes, indeed:
Detach the event specified by the file descriptor from the
process and make it persistent in the system. A unique
identifier for the persistent event is returned or an
error otherwise. After closing the fd the event will
continue to run. The following allows to connect to the
event again:
> > PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
> >
> > Attach the event specified by the file descriptor to the
> > current process. The event is no longer persistent in the
> > system and will be removed after all users disconnected
> > from the event. Thus, if there are no other users the
> > event will be closed too after closing its file
> > descriptor, the event then no longer exists.
Thanks for review, Vince.
-Robert
On 23.08.13 11:45:56, Robert Richter wrote:
> On 23.08.13 11:11:28, Borislav Petkov wrote:
> > On Thu, Aug 22, 2013 at 02:18:06PM -0400, Vince Weaver wrote:
> > > PERF_EVENT_IOC_MAKE_PERSISTENT
> > > PERF_EVENT_IOC_UNPERSIST
Maybe this?
PERF_EVENT_IOC_PERSIST
PERF_EVENT_IOC_UNPERSIST
-Robert
On Fri, Aug 23, 2013 at 12:44:41PM +0200, Robert Richter wrote:
> On 23.08.13 11:45:56, Robert Richter wrote:
> > On 23.08.13 11:11:28, Borislav Petkov wrote:
> > > On Thu, Aug 22, 2013 at 02:18:06PM -0400, Vince Weaver wrote:
> > > > PERF_EVENT_IOC_MAKE_PERSISTENT
> > > > PERF_EVENT_IOC_UNPERSIST
>
> Maybe this?
>
> PERF_EVENT_IOC_PERSIST
> PERF_EVENT_IOC_UNPERSIST
No, ATTACH/DETACH actually describes what you do with the fds and is
most generic. "PERSIST*" is a special use case of attaching/detaching
events.
Thanks.
On Fri, 23 Aug 2013, Robert Richter wrote:
> I thought it would be clear enough to refer to struct perf_event_attr.
> Since the index usually starts with 0 as in the config fields, I
> assumed this was clear in this case too. Though this can be documented
> better.
Make no assumptions when documenting. When I as a user have to dig around
the kernel source tree to find out what is going on then the documentation
is lacking.
> > Is this endian clean? Will attr5:23 point to the same bit on a big-endian
> > machine as on little-endian?
>
> It is the endianness used in the syscall. Handled in the same way as
> for the config fields. I don't see where this could be an issue.
C bitfields go opposite ways on big-endian vs little-endian systems.
This has come up with some of the bitfields in the sample buffers.
It doesn't matter if you just use the bitfields, but if you're trying
to poke single bits into opaque 64-bit blobs it might be an issue.
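
A small sketch of the concern, for illustration only: the struct below is a toy stand-in for the flags word (it is not the real struct perf_event_attr), with 23 earlier flag bits followed by the flag of interest. Bit-field allocation order is implementation-defined in C, so the raw u64 value that results from setting the named field can differ between ABIs.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a 64-bit flags word declared as C bit-fields. */
struct toy_flags {
    uint64_t earlier    : 23;  /* the flags that come before */
    uint64_t persistent : 1;   /* the flag we want to locate */
    uint64_t reserved   : 40;
};

/* Set the named bit-field and return the word as an opaque u64. */
uint64_t persistent_flag_as_mask(void)
{
    struct toy_flags f;
    uint64_t raw;

    memset(&f, 0, sizeof(f));
    f.persistent = 1;
    memcpy(&raw, &f, sizeof(raw));
    return raw;
}
```

With GCC or Clang on a little-endian target the result is 1ULL << 23. Big-endian ABIs typically allocate bit-fields starting from the most significant bit, which would give 1ULL << 40 instead, so a fixed "attr5:23" would not name the same field there.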
> The format directories in /sys/bus/event_source/devices/*/format/ are
> already there to make it human readable. A user never has to deal
> directly with syntax provided there and may use already the
> abstractions for the event syntax.
The format directory for now was only for the "config" fields which
traditionally were needed to specify an event, that is all.
Things get a lot more complex if arbitrary subsets of the perf_attr
structure start getting exported.
> The problem with this is that you have to implement this in the event
> parser of perf tools. Thus, you need to update the parser for any
> other syntax you want to use. This is not necessary with my
> implementation. It already provides the above. The pmu driver just
> need to add the sysfs entry.
I see. Are you going to update the parsers for programs that
also try to read these values, like trinity?
Or is the perf tool special because it is in the kernel?
> All this without updating perf tools.
So are we going to admit the ABI is "it doesn't break perf" after all?
> It's the whole intention of the new syntax that the event parser never
> needs to be modified again
the *perf* event parser never needs to be modified again maybe.
In any case, Andi Kleen also has some patches to this effect so you might
want to co-ordinate your efforts. In his case it was the "precise" field
he was exporting in events.
Vince
(yes, I still think events should be defined in a library or else a
user parsable file in userspace. Putting them in sysfs just complicates
everything for no good reason)
On Fri, 23 Aug 2013, Borislav Petkov wrote:
> On Fri, Aug 23, 2013 at 12:44:41PM +0200, Robert Richter wrote:
> > On 23.08.13 11:45:56, Robert Richter wrote:
> > > On 23.08.13 11:11:28, Borislav Petkov wrote:
> > > > On Thu, Aug 22, 2013 at 02:18:06PM -0400, Vince Weaver wrote:
> > > > > PERF_EVENT_IOC_MAKE_PERSISTENT
> > > > > PERF_EVENT_IOC_UNPERSIST
> >
> > Maybe this?
> >
> > PERF_EVENT_IOC_PERSIST
> > PERF_EVENT_IOC_UNPERSIST
>
> No, ATTACH/DETACH actually describes what you do with the fds and is
> most generic. "PERSIST*" is a special use case of attaching/detaching
> events.
I agree with what Robert said elsewhere in this thread:
"The problem with detach/attach is more that it's actually more
logically to attach first and afterwards detach. This is not the case
here, it's vise versa."
My main confusion is that with some other performance tools, such as PAPI,
"attach" specifically means to take an event and attach it to a
process (much like the -p option to strace or gdb) or a cpu. Now
perf_event handles this differently (you do that at open time) but I still
think it gets things backwards.
Since ATTACH is usually a transitive verb in English I'd think the name
should specify what two things are being attached.
If I had not read the man-page fragment and saw a
result=ioctl(fd,PERF_EVENT_IOC_ATTACH,0);
I'd have no clue what it was doing (attach? attach to what?)
whereas if I saw
result=ioctl(fd,PERF_EVENT_IOC_UNPERSIST,0);
it's a little clearer and also indicates that the ioctl is only
valid if you're dealing with a persistent event.
Vince
On Fri, Aug 23, 2013 at 01:07:54PM -0400, Vince Weaver wrote:
> If I had not read the man-page fragment and saw a
> result=ioctl(fd,PERF_EVENT_IOC_ATTACH,0);
> I'd have no clue what it was doing (attach? attach to what?)
> wheras if I saw
> result=ioctl(fd,PERF_EVENT_IOC_UNPERSIST,0);
> it's a little more clearer and also indicates that the ioctl is only
> valid if you're dealing with a persistent event.
Maybe this makes it more understandable for you but this is beside the
point.
The main and the most important idea here is that we want to
attach/detach file descriptors to perf events and the resources
(buffers, etc) associated with them so that those events can be made
independent from processes.
But I have to say the reversed thing above does sound confusing, now
that I'm looking at the code. Actually, at the time we discussed this,
my idea was to do it like this:
1. we open a perf event and get its file descriptor
2. ioctl ATTACH to it so that it is attached to the process.
... do some tracing and collecting and fiddling...
3. ioctl DETACH from it so that it is "forked in the background" so to
speak, very similar to a background job in the shell.
4. The rest of the code continues and deallocates the event *BUT* (and
this is the key thing!) the counter/tracepoint remains operational in
the kernel, running all the time.
5. Now, after a certain point, you come back and ioctl ATTACH to this
already opened event and read/collect its buffers again.
And here's the deal - if you don't DETACH from the event at step 3, it
gets destroyed on process exit, i.e. what the current perf behavior is.
Robert, I think the above is more straight-forward and intuitive, no?
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Fri, 23 Aug 2013, Borislav Petkov wrote:
> Maybe this makes it more understandable for you but this is beside the
> point.
Understandability doesn't matter?
> But I have to say the reversed thing above does sound confusing, now
> that I'm looking at the code. Actually, at the time we discussed this,
> my idea was to do it like this:
>
> 1. we open a perf event and get its file descriptor
> 2. ioctl ATTACH to it so that it is attached to the process.
>
> ... do some tracing and collecting and fiddling...
>
> 3. ioctl DETACH from it so that it is "forked in the background" so to
> speak, very similar to a background job in the shell.
Would it make sense to actually fork a kernel thread that "owns" the
event?
The way it is now events can "get loose" if either the user
forgets about them or the tool that opened them crashes, and it's
impossible to kill these events with normal tools. You possibly
wouldn't even know one was running (except you'd have one fewer
counter to work with) unless you poked around under /sys.
> 4. The rest of the code continues and deallocates the event *BUT* (and
> this is the key thing!) the counter/tracepoint remains operational in
> the kernel, running all the time.
>
> 5. Now, after a certain point, you come back and ioctl ATTACH to this
> already opened event and read/collect its buffers again.
Vince
On Fri, Aug 23, 2013 at 05:08:11PM -0400, Vince Weaver wrote:
> > 3. ioctl DETACH from it so that it is "forked in the background" so to
> > speak, very similar to a background job in the shell.
>
> Would it make sense to actually fork a kernel thread that "owns" the
> event?
>
> The way it is now events can "get loose" if either the user
> forgets about them or the tool that opened them crashes, and it's
> impossible to kill these events with normal tools. You possibly
> wouldn't even know one was running (except you'd have one fewer
> counter to work with) unless you poked around under /sys.
Actually, the idea is to be able to reattach and control that event with
perf tool too, in addition to some specialized daemon or whatever. So
whatever else "lost" it, using perf tool you should be able to get it
back.
--
Regards/Gruss,
Boris.
--
Yep,
this text is very nicely written and should be in a README somewhere.
:-)
On Thu, Aug 22, 2013 at 04:13:15PM +0200, Robert Richter wrote:
> This patch set implements the necessary kernel changes for persistent
> events.
>
> Persistent events run standalone in the system without the need of a
> controlling process that holds an event's file descriptor. The events
> are always enabled and collect data samples in a ring buffer.
> Processes may connect to existing persistent events using the
> perf_event_open() syscall. For this the syscall must be configured
> using the new PERF_TYPE_PERSISTENT event type and a unique event
> identifier specified in attr.config. The id is propagated in sysfs or
> using ioctl (see below).
>
> Persistent event buffers may be accessed with mmap() in the same way
> as for any other event. Since the buffers may be used by multiple
> processes at the same time, there is only read-only access to them.
> Currently there is only support for per-cpu events, thus root access
> is needed too.
>
> Persistent events are visible in sysfs. They are added or removed
> dynamically. With the information in sysfs userland knows about how to
> setup the perf_event attribute of a persistent event. Since a
> persistent event always has the persistent flag set, a way is needed
> to express this in sysfs. A new syntax is used for this. With
> 'attr<num>:<mask>' any bit in the attribute structure may be set in a
> similar way as using 'config<num>', but <num> is an index that points
> to the u64 value to change within the attribute.
>
> For persistent events the persistent flag (bit 23 of flag field in
> struct perf_event_attr) needs to be set which is expressed in sysfs
> with "attr5:23". E.g. the mce_record event is described in sysfs as
> follows:
>
> /sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
> /sys/bus/event_source/devices/persistent/format/persistent:attr5:23
>
> Note that perf tools need to support the 'attr<num>' syntax that is
> added in a separate patch set. With it we are able to run perf tool
> commands to read persistent events, e.g.:
>
> # perf record -e persistent/mce_record/ sleep 10
> # perf top -e persistent/mce_record/
>
> In general the new syntax is flexible to describe with sysfs any event
> to be setup by perf tools.
>
> There are ioctl functions to control persistent events that can be
> used to detach or attach an event to or from a process. The
> PERF_EVENT_IOC_DETACH ioctl call makes an event persistent.
Yeah, we probably want to abstract this a step further by allowing
to attach/detach fds to/from events. Then "persistent" is only one
incarnation of us always detaching from the event during its lifetime.
If we close an event while it is attached, it gets destroyed - i.e.,
current functionality, etc. See the other thread.
But we probably need more input on this from people. Ingo, Peter?
Thanks.
--
Regards/Gruss,
Boris.
--
On 23.08.13 12:39:53, Vince Weaver wrote:
> On Fri, 23 Aug 2013, Robert Richter wrote:
> Make no assumptions when documenting. When I as a user have to dig around
> the kernel source tree to find out what is going on then the documentation
> is lacking.
>
> > > Is this endian clean? Will attr5:23 point to the same bit on a big-endian
> > > machine as on little-endian?
> >
> > It is the endianness used in the syscall. Handled in the same way as
> > for the config fields. I don't see where this could be an issue.
>
> C bitfields go opposite ways on big-endian vs little-endian systems.
> This has come up with some of the bitfields in the sample buffers.
> It doesn't matter if you just use the bitfields, but if you're trying
> to poke single bits into opaque 64-bit blobs it might be an issue.
Ok, let's improve documentation, how about the patch below?
> > The format directories in /sys/bus/event_source/devices/*/format/ are
> > already there to make it human readable. A user never has to deal
> > directly with syntax provided there and may use already the
> > abstractions for the event syntax.
>
> The format directory for now was only for the "config" fields which
> traditionally were needed to specify an event, that is all.
>
> Things get a lot more complex if arbitrary subsets of the the perf_attr
> structure start getting exported.
I don't see anything special with config fields, why not expand the
same functionality to other fields in perf_attr too? Your example for
precise_ip shows this could be useful.
And again, a while ago there was the decision to use sysfs to specify
events (I didn't like the approach either). Now, we must be able to set
up *any* kind of event in that way, not just those that can be set up
using only config fields.
> > The problem with this is that you have to implement this in the event
> > parser of perf tools. Thus, you need to update the parser for any
> > other syntax you want to use. This is not necessary with my
> > implementation. It already provides the above. The pmu driver just
> > need to add the sysfs entry.
>
> I see. Are you going to update the parsers for programs that
> also try to read these values, like trinity?
>
> Or is the perf tool special because it is in the kernel?
First of all, the change is backward compatible for any tool out there.
If a tool does not implement the new syntax, its parser throws an error
and the event with the new format cannot be set up via sysfs. If other
tools do not use such events, there is no need to update anything.
The event parser became a bit complex already, thus it might be useful
to split the code and put it in some library e.g. liblk or so. There
are plans to do this.
-Robert
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
index 77f47ff..d8f348f 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
@@ -1,20 +1,50 @@
-Where: /sys/bus/event_source/devices/<dev>/format
+Where: /sys/bus/event_source/devices/<pmu>/format/<name>
Date: January 2012
-Kernel Version: 3.3
+Kernel Version: 3.3
+ 3.xx (added attr<index>:<bits>)
Contact: Jiri Olsa <[email protected]>
-Description:
- Attribute group to describe the magic bits that go into
- perf_event_attr::config[012] for a particular pmu.
- Each attribute of this group defines the 'hardware' bitmask
- we want to export, so that userspace can deal with sane
- name/value pairs.
-
- Userspace must be prepared for the possibility that attributes
+
+Description: Define formats for bit ranges in perf_event_attr
+
+ Specify a format to describe the magic bits that go
+ into struct perf_event_attr for a particular pmu.
+ Userspace can deal then with sane name/value pairs.
+
+ Bit range may be any bit mask of a u64 value (bits 0
+ to 63). The bit range may vary depending on the
+ system's endianness if the underlying field is a u32,
+ for example the format is for bp_type:
+
+ attr7:32-63 ... little endian
+ attr7:0-31 ... big endian
+
+ Userspace must be prepared for the possibility that formats
define overlapping bit ranges. For example:
- attr1 = 'config:0-23'
- attr2 = 'config:0-7'
- attr3 = 'config:12-35'
- Example: 'config1:1,6-10,44'
- Defines contents of attribute that occupies bits 1,6-10,44 of
- perf_event_attr::config1.
+ format1 = 'config:0-23'
+ format2 = 'config:0-7'
+ format3 = 'config:12-35'
+
+ Syntax Description
+
+ config[012]*:<bits>.... Format that defines any
+ 'hardware' bitmask in one of
+ the config fields.
+
+ attr<index>:<bits>..... Format that defines any
+ bitmask in any u64 field in
+ the perf_event_attr structure.
+ The index is a decimal number
+ specifying the u64 value to be
+ pointed to in perf_event_attr.
+ Index starts at zero.
+
+ Examples:
+
+ 'config1:1,6-10,44' Defines contents of attribute
+ that occupies bits 1,6-10,44
+ of perf_event_attr::config1.
+
+ 'attr5:23' Define the persistent event
+ flag (bit 23 of the attribute
+ flags)
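
For a non-contiguous format such as 'config1:1,6-10,44', a value has to be spread over the listed bit positions. The sketch below shows one way to do that, under the assumption that the lowest value bit goes to the lowest set mask bit; scatter_bits() is a made-up name and this is not taken from the perf tools sources.

```c
#include <stdint.h>

/*
 * Distribute the low-order bits of 'value' over the set bits of 'mask',
 * lowest mask bit first. Bits of 'value' beyond the mask width are dropped.
 */
uint64_t scatter_bits(uint64_t mask, uint64_t value)
{
    uint64_t out = 0;
    unsigned src = 0;   /* next bit of 'value' to place */
    unsigned bit;

    for (bit = 0; bit < 64; bit++) {
        if (!(mask & (1ULL << bit)))
            continue;
        if (value & (1ULL << src))
            out |= 1ULL << bit;
        src++;
    }
    return out;
}
```

For the mask of '1,6-10,44' (seven bits wide), the value 0x43 has bits 0, 1 and 6 set and therefore lands on bits 1, 6 and 44 of the target word.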
On 23.08.13 17:08:11, Vince Weaver wrote:
> On Fri, 23 Aug 2013, Borislav Petkov wrote:
>
> > Maybe this makes it more understandable for you but this is beside the
> > point.
>
> Understandability doesn't matter?
I agree with Vince on this, the naming should be intuitive. Esp. since
'attach' is used in tracing in a different context, we should avoid
using the term here to reduce confusion.
I got another idea for this, what about UNCLAIM and CLAIM? It is
exactly what it is. A process unclaims an event, telling that it doesn't
care anymore. Another process comes and claims the event, meaning the
process wants the event to no longer be shared with others and to be
released after closing.
> > But I have to say the reversed thing above does sound confusing, now
> > that I'm looking at the code. Actually, at the time we discussed this,
> > my idea was to do it like this:
> >
> > 1. we open a perf event and get its file descriptor
> > 2. ioctl ATTACH to it so that it is attached to the process.
> >
> > ... do some tracing and collecting and fiddling...
You don't need to attach to a persistent event, you can just use the
event with perf_event_open, mmap, close.
> > 3. ioctl DETACH from it so that it is "forked in the background" so to
> > speak, very similar to a background job in the shell.
With 'detach' we move the event into the 'background' so that it is
still available after opening.
> Would it make sense to actually fork a kernel thread that "owns" the
> event?
There is no need for a kernel thread, there is nothing to do. We just
increase the refcount so that the event is not destroyed.
> The way it is now events can "get loose" if either the user
> forgets about them or the tool that opened them crashes, and it's
> impossible to kill these events with normal tools. You possibly
> wouldn't even know one was running (except you'd have one fewer
> counter to work with) unless you poked around under /sys.
As Boris said, there could be some sort of 'kill' tool for events.
That's what we need sysfs for, it tells which events are running.
> > 4. The rest of the code continues and deallocates the event *BUT* (and
> > this is the key thing!) the counter/tracepoint remains operational in
> > the kernel, running all the time.
The event remains operational esp. after closing it or killing/
terminating the process owning it.
> > 5. Now, after a certain point, you come back and ioctl ATTACH to this
> > already opened event and read/collect its buffers again.
-Robert
> > On Thu, 22 Aug 2013, Robert Richter wrote:
> > > This is for Linux man-pages:
Updated description below.
-Robert
Author: Robert Richter <[email protected]>
Date: Tue Aug 13 11:22:22 2013 +0200
[RFC] perf, persistent: ioctl functions to control persistency
Implementing ioctl functions to control persistent events. There are
functions to detach or attach an event to or from a process. The
PERF_EVENT_IOC_DETACH ioctl call makes an event persistent. After
closing the event's fd it runs then in the background of the system
without the need of a controlling process. The perf_event_open()
syscall can be used to reopen the event by any process. The
PERF_EVENT_IOC_ATTACH ioctl attaches the event again so that it is
removed after closing the event's fd.
This is for Linux man-pages:
type ...
PERF_TYPE_PERSISTENT (Since Linux 3.xx)
Open a persistent event that is already running in the
background of the system. There is a unique identifier for
each persistent event that needs to be specified in the
event's attribute config field. Persistent events are
listed under:
/sys/bus/event_source/devices/persistent/
See PERF_EVENT_IOC_DETACH how to create a persistent
event. The instance creating such an event should also be
responsible for removing it.
...
persistent : 41, /* always-on event */
...
persistent: (Since Linux 3.xx)
Put event into persistent state after opening. After closing
the event's fd the event is persistent in the system and
continues to run.
perf_event ioctl calls
PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
Any event that was opened with the perf_event_open()
syscall may become a persistent event. This is done by
detaching the event from the controlling process that
holds the event's file descriptor. This ioctl can be used
for doing this. After detaching it, the event is
persistent in the system. A unique identifier for the
persistent event is returned or an error otherwise. After
closing the fd the event will continue to run. The
following allows to connect to the event again:
pe.type = PERF_TYPE_PERSISTENT;
pe.config = <pevent_id>;
...
fd = perf_event_open(...);
The event must be reopened on the same cpu.
PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
Attach the event specified by the file descriptor to the
current process. The event is no longer persistent in the
system and will be removed after all users disconnected
from the event. Thus, if there are no other users the
event will be closed too after closing its file
descriptor, the event then no longer exists.
Cc: Vince Weaver <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
On Tue, Aug 27, 2013 at 01:54:22PM +0200, Robert Richter wrote:
> I got another idea for this, what about UNCLAIM and CLAIM? It is
> exactly, what it is. A process unclaims an event telling it doesn't
> care anymore. Another process comes and claims the event, meaning the
> process wants the event no longer to be shared with others and being
> released after closing.
This still doesn't pan out because with claiming the event, you state
that the event is *owned* by this process but with persistent events
we want to be able to state that they can have multiple users and thus
multiple buffer consumers, concurrently.
> > > 3. ioctl DETACH from it so that it is "forked in the background" so to
> > > speak, very similar to a background job in the shell.
>
> With 'detach' we move the event into the 'background' so that it is
> still available after opening.
Ok, maybe ATTACH/DETACH is not the perfect naming for this after all.
Maybe when we want to state the fact that the event is going to continue
existing after closing the buffer consumer, we want to do ioctl(event,
DONT_DESTROY) and when we want to actually get rid of it, one of the
processes does ioctl(event, DESTROY).
Which reminds me, what do we do when one process destroys the event but
there are other consumers accessing it concurrently? Refcounting?
Thanks.
--
Regards/Gruss,
Boris.
On 24.08.13 11:38:26, Borislav Petkov wrote:
> this text is very nicely written and should be in a README somewhere.
> :-)
Yeah, will put it under tools/perf/Documentation/...
> > There are ioctl functions to control persistent events that can be
> > used to detach or attach an event to or from a process. The
> > PERF_EVENT_IOC_DETACH ioctl call makes an event persistent.
>
> Yeah, we probably want to abstract this a step further by allowing
> to attach/detach fds to/from events. Then "persistent" is only one
> incarnation of us always detaching from the event during its lifetime.
>
> If we close an event while it is attached, it gets destroyed - i.e.,
> current functionality, etc. See the other thread.
I don't know what you mean here exactly, please explain.
> But we probably need more input on this from people. Ingo, Peter?
Yes, that would be great.
Thanks,
-Robert
On Tue, Aug 27, 2013 at 02:27:21PM +0200, Robert Richter wrote:
> > > There are ioctl functions to control persistent events that can be
> > > used to detach or attach an event to or from a process. The
> > > PERF_EVENT_IOC_DETACH ioctl call makes an event persistent.
> >
> > Yeah, we probably want to abstract this a step further by allowing
> > to attach/detach fds to/from events. Then "persistent" is only one
> > incarnation of us always detaching from the event during its lifetime.
> >
> > If we close an event while it is attached, it gets destroyed - i.e.,
> > current functionality, etc. See the other thread.
>
> I don't know what you mean here exactly, please explain.
Basically that detaching an event shouldn't make it persistent
explicitly - it simply continues running in the background. When we
reattach to it and die with the event attached, then it gets destroyed
too.
Which means, we can have arbitrary life periods of events, persistency
being only a special case of it.
IOW, as long as an event is detached in the background, it counts.
When something attaches to it and that something exits, the event gets
destroyed too, as part of the process teardown.
And this is probably the most generic way to look at it.
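To make the lifetime rule above concrete, here is a toy user-space model
in C. It is only a sketch of the semantics described in this mail, not
kernel code and not what the patch set implements; the names bg_event,
bg_attach, bg_detach, bg_process_exit and NO_OWNER are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)   /* hypothetical marker: event runs in the background */

/* Toy model of one event; names and fields are assumptions for this sketch. */
struct bg_event {
    int  owner;      /* pid the event is attached to, or NO_OWNER if detached */
    bool destroyed;  /* set once the event has been torn down */
};

/* Detach: the event keeps running but no process owns it any more. */
void bg_detach(struct bg_event *ev)
{
    ev->owner = NO_OWNER;
}

/* Attach: the given process takes over the event's lifetime. */
void bg_attach(struct bg_event *ev, int pid)
{
    if (!ev->destroyed)
        ev->owner = pid;
}

/*
 * Process teardown: only an event that is attached to the exiting
 * process is destroyed; a detached event is left alone.
 */
void bg_process_exit(struct bg_event *ev, int pid)
{
    if (!ev->destroyed && ev->owner == pid)
        ev->destroyed = true;
}
```

In this model a "persistent" event is simply one that nobody ever
reattaches to: a process that opens it, detaches and exits leaves it
running, while a later process that attaches and then exits takes the
event down with it.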
--
Regards/Gruss,
Boris.
Ok, starting with your question at the end first since it might
explain the following better.
If the event is no longer persistent and if there are multiple users,
the event remains open and running until all mmaps are munmap'ed
and the last fd is closed. After that it is released. This is done
with refcounts and already implemented.
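As a rough illustration of that rule, here is a toy user-space model in
C. It is not the kernel implementation; the names toy_event, toy_get,
toy_put and toy_unpersist are made up for this sketch, and it assumes
one reference per open fd and one per mmap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; struct layout and names are assumptions for this sketch. */
struct toy_event {
    int  refcount;    /* assumed: one reference per open fd and per mmap */
    bool persistent;  /* while set, the event survives having no users */
    bool released;    /* set once the event has been freed */
};

/* Take a reference, e.g. on open or mmap. */
void toy_get(struct toy_event *ev)
{
    assert(!ev->released);
    ev->refcount++;
}

/* Drop a reference, e.g. on close or munmap. */
void toy_put(struct toy_event *ev)
{
    assert(ev->refcount > 0);
    if (--ev->refcount == 0 && !ev->persistent)
        ev->released = true;
}

/*
 * Make the event non-persistent again. It is released right away only
 * if no user is left; otherwise the last toy_put() releases it.
 */
void toy_unpersist(struct toy_event *ev)
{
    ev->persistent = false;
    if (ev->refcount == 0)
        ev->released = true;
}
```

With two processes holding three references in total, dropping the
persistent state does nothing visible; the event goes away only when
the third reference is put.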
On 27.08.13 14:22:42, Borislav Petkov wrote:
> On Tue, Aug 27, 2013 at 01:54:22PM +0200, Robert Richter wrote:
> > I got another idea for this, what about UNCLAIM and CLAIM? It is
> > exactly, what it is. A process unclaims an event telling it doesn't
> > care anymore. Another process comes and claims the event, meaning the
> > process wants the event no longer to be shared with others and being
> > released after closing.
>
> This still doesn't pan out because with claiming the event, you state
> that the event is *owned* by this process but with persistent events
> we want to be able to state that they can have multiple users and thus
> multiple buffer consumers, concurrently.
We still have multiple users after 'claiming' the event. The only
thing is that the event will be destroyed after all users went away.
A process (that claimed the event) was responsible for this.
> > > > 3. ioctl DETACH from it so that it is "forked in the background" so to
> > > > speak, very similar to a background job in the shell.
> >
> > With 'detach' we move the event into the 'background' so that it is
> > still available after opening.
>
> Ok, maybe ATTACH/DETACH is not the perfect naming for this after all.
> Maybe when we want to state the fact that the event is going to continue
> existing after closing the buffer consumer, we want to do ioctl(event,
> DONT_DESTROY) and when we want to actually get rid of it, one of the
> processes does ioctl(event, DESTROY).
I still prefer claim/unclaim. ;)
> Which reminds me, what do we do when one process destroys the event but
> there are other consumers accessing it concurrently? Refcounting?
See above.
-Robert
On Tue, Aug 27, 2013 at 02:41:46PM +0200, Robert Richter wrote:
> We still have multiple users after 'claiming' the event. The only
> thing is that the event will be destroyed after all users went away. A
> process (that claimed the event) was responsible for this.
Ok, if the "claimer" process which owns the event dies for some reason,
then I'm guessing some other process can claim it on its own, i.e.
processes can claim events among each other...
> I still prefer claim/unclaim. ;)
Oh well, whatever, as long as it is not misleading :-)
--
Regards/Gruss,
Boris.