This patch set implements out-of-the-box support for persistent events
in the perf tools. For this the kernel must expose the necessary
information about existing persistent events to userland via sysfs.
Persistent events are provided by the kernel with read-only event
buffers. To allow any process to use an event buffer independently
without limiting other processes, multiple users of a single event must
be handled too.
The basic concept is to use a pmu as an event container for persistent
events. The pmu registers events in sysfs and provides format and event
information to userland. The persistent event framework requires adding
events to the pmu dynamically.
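Condensed from the 'Implementing a persistent pmu' patch below, the
container pmu boils down to this (sketch only):

        static struct pmu persistent_pmu = {
                .event_init     = persistent_pmu_init,
                .attr_groups    = persistent_attr_groups, /* format/, events/ */
        };

        perf_pmu_register(&persistent_pmu, "persistent", -1);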
With the information in sysfs, userland knows how to set up the
perf_event attribute of a persistent event. Since a persistent event
always has the persistent flag set, a way is needed to express this in
sysfs. A new syntax is introduced for this: with 'attr<num>:<mask>' any
bit in the attribute structure may be set, similar to 'config<num>',
but <num> is an index that selects the u64 value to change within the
attribute.
For persistent events the persistent flag (bit 23 of the flag field in
struct perf_event_attr) needs to be set, which is expressed in sysfs
as "attr5:23". E.g. the mce_record event is described in sysfs as
follows:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
Note that the perf tools need to support the 'attr<num>' syntax, which
is added in a separate patch set. With it we are able to run perf tool
commands to read persistent events, e.g.:
# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/
In general, the new syntax is flexible enough to describe in sysfs any
event to be set up by perf tools.
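As an illustration (attr_set_bit() is a made-up helper; the real
parser lands in the perf tools patch set), a tool could apply an
'attr<num>:<bit>' term roughly like this:

        /* Set bit <bit> in the <idx>-th u64 of the event attribute. */
        static void attr_set_bit(struct perf_event_attr *attr,
                                 unsigned int idx, unsigned int bit)
        {
                ((__u64 *)attr)[idx] |= 1ULL << bit;
        }

        attr_set_bit(&attr, 5, 23);     /* 'attr5:23': persistent flag */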
The first patches also contain fixes and reworks made after code
review; they could be applied separately.
The patches are based on Boris' patches, which I have rebased onto the
latest tip/perf/core. All patches can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2
I wonder if this patch set could be applied to a tip/perf topic
branch? This would avoid reposting already reviewed patches.
Note: The perf tools patch set does not need to be reposted.
-Robert
Changes for V2:
* Merged minor changes into Boris' patches
* Included Boris' patches for review
* Documented attr<index> syntax in sysfs ABI
* Added cpu check to perf_get_persistent_event_fd()
* Rebased to latest tip/perf/core
Borislav Petkov (4):
perf, ring_buffer: Use same prefix
perf: Add persistent events
perf: Add persistent event facilities
MCE: Enable persistent event
Robert Richter (10):
perf, persistent: Rework struct pers_event_desc
perf, persistent: Remove rb_put()
perf, persistent: Introduce get_persistent_event()
perf, persistent: Reworking perf_get_persistent_event_fd()
perf, persistent: Protect event lists with mutex
perf, persistent: Avoid adding identical events
perf, persistent: Implementing a persistent pmu
perf, persistent: Name each persistent event
perf, persistent: Exposing persistent events using sysfs
perf, persistent: Allow multiple users for an event
.../testing/sysfs-bus-event_source-devices-format | 43 ++-
arch/x86/kernel/cpu/mcheck/mce.c | 19 ++
include/linux/perf_event.h | 9 +-
include/uapi/linux/perf_event.h | 3 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 56 ++--
kernel/events/internal.h | 3 +
kernel/events/persistent.c | 320 +++++++++++++++++++++
kernel/events/ring_buffer.c | 7 +-
9 files changed, 419 insertions(+), 43 deletions(-)
create mode 100644 kernel/events/persistent.c
--
1.8.1.1
From: Borislav Petkov <[email protected]>
Add the needed pieces for persistent events, which make them
process-agnostic. Also, make their buffers read-only when mmapping them
from userspace.
While at it, do not return a void function, as caught by Fengguang's
build robot.
Changes made by Robert Richter <[email protected]>:
* mmap should return an EACCES error if the fd cannot be mapped
writable. This error code also helps userland fall back to mapping
buffers read-only.
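A sketch of the assumed userland pattern (not part of this patch): try
a shared writable mapping first and fall back to read-only when the
kernel refuses with EACCES:

        size_t len = (nr_pages + 1) * page_size; /* header page + data */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED && errno == EACCES)
                /* persistent event: buffer is read-only for userspace */
                buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);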
Signed-off-by: Borislav Petkov <[email protected]>
[ Return -EACCES if mapped buffers must be readonly ]
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 10 +++++++++-
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fb104e5..6032361 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -272,8 +272,9 @@ struct perf_event_attr {
exclude_callchain_kernel : 1, /* exclude kernel callchains */
exclude_callchain_user : 1, /* exclude user callchains */
+ persistent : 1, /* always-on event */
- __reserved_1 : 41;
+ __reserved_1 : 40;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b790ab6..a13e457 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3713,6 +3713,11 @@ static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ if (event->attr.persistent) {
+ atomic_dec(&event->mmap_count);
+ return;
+ }
+
if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
unsigned long size = perf_data_size(event->rb);
struct user_struct *user = event->mmap_user;
@@ -3756,9 +3761,12 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (event->cpu == -1 && event->attr.inherit)
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_SHARED) && !event->attr.persistent)
return -EINVAL;
+ if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;
--
1.8.1.1
From: Borislav Petkov <[email protected]>
... for collecting MCEs.
Changes made by Robert Richter <[email protected]>:
The mce_record tracepoint requires tracepoints to be enabled in the
kernel. Fixing the build error for no-tracepoints configs.
Signed-off-by: Borislav Petkov <[email protected]>
[ Fix build error for no-tracepoints configs ]
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9239504..d421937 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1987,6 +1987,24 @@ int __init mcheck_init(void)
return 0;
}
+#ifdef CONFIG_TRACEPOINTS
+
+int __init mcheck_init_tp(void)
+{
+ if (perf_add_persistent_event_by_id(event_mce_record.event.type)) {
+ pr_err("Error adding MCE persistent event.\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+/*
+ * We can't run earlier because persistent events use anon_inode_getfile and
+ * its anon_inode_mnt gets initialized as a fs_initcall.
+ */
+fs_initcall_sync(mcheck_init_tp);
+
+#endif /* CONFIG_TRACEPOINTS */
+
/*
* mce_syscore: PM support
*/
--
1.8.1.1
From: Robert Richter <[email protected]>
We want to use the kernel's pmu design to later expose persistent
events to userland via sysfs. Initially implement a persistent pmu.
A format syntax is introduced that allows setting bits anywhere in
struct perf_event_attr. Here it is used to set the persistent flag
(attr5:23). The syntax is attr<num>, where <num> is the index into
struct perf_event_attr viewed as an array of u64 values. Otherwise the
syntax is the same as for config<num>.
Patches that implement this functionality for perf tools are sent in a
separate patchset.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 4fcd071..97c57c9 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -175,10 +175,45 @@ out:
return event_fd;
}
+PMU_FORMAT_ATTR(persistent, "attr5:23");
+
+static struct attribute *persistent_format_attrs[] = {
+ &format_attr_persistent.attr,
+ NULL,
+};
+
+static struct attribute_group persistent_format_group = {
+ .name = "format",
+ .attrs = persistent_format_attrs,
+};
+
+static const struct attribute_group *persistent_attr_groups[] = {
+ &persistent_format_group,
+ NULL,
+};
+
+static struct pmu persistent_pmu;
+
+static int persistent_pmu_init(struct perf_event *event)
+{
+ if (persistent_pmu.type != event->attr.type)
+ return -ENOENT;
+
+ /* Not a persistent event. */
+ return -EFAULT;
+}
+
+static struct pmu persistent_pmu = {
+ .event_init = persistent_pmu_init,
+ .attr_groups = persistent_attr_groups,
+};
+
void __init persistent_events_init(void)
{
int cpu;
+ perf_pmu_register(&persistent_pmu, "persistent", -1);
+
for_each_possible_cpu(cpu) {
INIT_LIST_HEAD(&per_cpu(pers_events, cpu));
mutex_init(&per_cpu(pers_events_lock, cpu));
--
1.8.1.1
From: Robert Richter <[email protected]>
Expose persistent events in the system to userland using sysfs. Perf
tools are able to read existing pmu events from sysfs. We now use a
persistent pmu as an event container holding all registered persistent
events of the system. This patch adds dynamic registration of
persistent events to sysfs. E.g. something like this:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
Perf tools need to support the attr<num> syntax that is added in a
separate patch set. With it we are able to run perf tool commands to
read persistent events, e.g.:
# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/
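The dynamic part boils down to creating the pmu's events group for the
first registered event and updating it for later ones (condensed from
the patch below):

        if (first_event)
                sysfs_create_group(&persistent_pmu.dev->kobj,
                                   EVENTS_GROUP);
        else
                sysfs_update_group(&persistent_pmu.dev->kobj,
                                   EVENTS_GROUP);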
[ Document attr<index> syntax in sysfs ABI ]
Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
.../testing/sysfs-bus-event_source-devices-format | 43 ++++++++++++-----
kernel/events/persistent.c | 55 +++++++++++++++++++++-
2 files changed, 86 insertions(+), 12 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
index 77f47ff..47b7353 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
@@ -1,13 +1,14 @@
-Where: /sys/bus/event_source/devices/<dev>/format
+Where: /sys/bus/event_source/devices/<pmu>/format/<name>
Date: January 2012
-Kernel Version: 3.3
+Kernel Version: 3.3
+ 3.12 (added attr<index>:<bits>)
Contact: Jiri Olsa <[email protected]>
-Description:
- Attribute group to describe the magic bits that go into
- perf_event_attr::config[012] for a particular pmu.
- Each attribute of this group defines the 'hardware' bitmask
- we want to export, so that userspace can deal with sane
- name/value pairs.
+
+Description: Define formats for bit ranges in perf_event_attr
+
+ Attribute group to describe the magic bits that go
+ into struct perf_event_attr for a particular pmu. Bit
+ range may be any bit mask of an u64 (bits 0 to 63).
Userspace must be prepared for the possibility that attributes
define overlapping bit ranges. For example:
@@ -15,6 +16,26 @@ Description:
attr2 = 'config:0-7'
attr3 = 'config:12-35'
- Example: 'config1:1,6-10,44'
- Defines contents of attribute that occupies bits 1,6-10,44 of
- perf_event_attr::config1.
+ Syntax Description
+
+ config[012]*:<bits> Each attribute of this group
+ defines the 'hardware' bitmask
+ we want to export, so that
+ userspace can deal with sane
+ name/value pairs.
+
+ attr<index>:<bits> Set any field of the event
+ attribute. The index is a
+ decimal number that specifies
+ the u64 value to be set within
+ struct perf_event_attr.
+
+ Examples:
+
+ 'config1:1,6-10,44' Defines contents of attribute
+ that occupies bits 1,6-10,44
+ of perf_event_attr::config1.
+
+ 'attr5:23' Define the persistent event
+ flag (bit 23 of the attribute
+ flags)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 96201c1..8be7c05 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -17,8 +17,10 @@ struct pers_event_desc {
struct pers_event {
char *name;
struct perf_event_attr attr;
+ struct perf_pmu_events_attr sysfs;
};
+static struct pmu persistent_pmu;
static DEFINE_PER_CPU(struct list_head, pers_events);
static DEFINE_PER_CPU(struct mutex, pers_events_lock);
@@ -137,6 +139,8 @@ unwind:
return PTR_ERR(event);
}
+static int pers_event_sysfs_register(struct pers_event *event);
+
int perf_add_persistent_event_by_id(char* name, int id)
{
struct pers_event *event;
@@ -150,6 +154,8 @@ int perf_add_persistent_event_by_id(char* name, int id)
if (!event->name)
goto fail;
+ event->sysfs.id = id;
+
attr = &event->attr;
attr->sample_period = 1;
attr->wakeup_events = 1;
@@ -163,6 +169,8 @@ int perf_add_persistent_event_by_id(char* name, int id)
if (ret)
goto fail;
+ pers_event_sysfs_register(event);
+
return 0;
fail:
kfree(event->name);
@@ -207,12 +215,57 @@ static struct attribute_group persistent_format_group = {
.attrs = persistent_format_attrs,
};
+#define MAX_EVENTS 16
+
+static struct attribute *persistent_events_attr[MAX_EVENTS + 1] = { };
+
+static struct attribute_group persistent_events_group = {
+ .name = "events",
+ .attrs = persistent_events_attr,
+};
+
static const struct attribute_group *persistent_attr_groups[] = {
&persistent_format_group,
+ NULL, /* placeholder: &persistent_events_group */
NULL,
};
+#define EVENTS_GROUP (persistent_attr_groups[1])
-static struct pmu persistent_pmu;
+static ssize_t pers_event_sysfs_show(struct device *dev,
+ struct device_attribute *__attr, char *page)
+{
+ struct perf_pmu_events_attr *attr =
+ container_of(__attr, struct perf_pmu_events_attr, attr);
+ return sprintf(page, "persistent,config=%lld",
+ (unsigned long long)attr->id);
+}
+
+static int pers_event_sysfs_register(struct pers_event *event)
+{
+ struct device_attribute *attr = &event->sysfs.attr;
+ int idx;
+
+ *attr = (struct device_attribute)__ATTR(, 0444, pers_event_sysfs_show,
+ NULL);
+ attr->attr.name = event->name;
+
+ /* add sysfs attr to events: */
+ for (idx = 0; idx < MAX_EVENTS; idx++) {
+ if (!cmpxchg(persistent_events_attr + idx, NULL, &attr->attr))
+ break;
+ }
+
+ if (idx >= MAX_EVENTS)
+ return -ENOSPC;
+ if (!idx)
+ EVENTS_GROUP = &persistent_events_group;
+ if (!persistent_pmu.dev)
+ return 0; /* sysfs not yet initialized */
+ if (idx)
+ return sysfs_update_group(&persistent_pmu.dev->kobj,
+ EVENTS_GROUP);
+ return sysfs_create_group(&persistent_pmu.dev->kobj, EVENTS_GROUP);
+}
static int persistent_pmu_init(struct perf_event *event)
{
--
1.8.1.1
From: Robert Richter <[email protected]>
Usually a fd close leads to the release of the event too. For
persistent events this is different, as the events should stay
permanently enabled in the system. Use reference counting to avoid
releasing an event during a fd close. This also allows multiple users
(open file descriptors) for a single persistent event.
While at it, we don't need desc->fd any longer: the fd is attached to a
task and reference counting keeps the event alive. Remove desc->fd.
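Condensed, the lifetime rule used below:

        /* fd open: take a reference, but only while the event is live */
        if (!atomic_long_inc_not_zero(&event->refcount))
                goto out;       /* event is already going away */

        /* list removal: drop the list's reference, free on last user */
        if (atomic_long_dec_and_test(&event->refcount))
                release_persistent_event(event);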
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 46 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 36 insertions(+), 10 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 8be7c05..dd20b55 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -11,7 +11,6 @@
struct pers_event_desc {
struct perf_event *event;
struct list_head plist;
- int fd;
};
struct pers_event {
@@ -88,6 +87,18 @@ out:
return event;
}
+static void detach_persistent_event(struct pers_event_desc *desc)
+{
+ list_del(&desc->plist);
+ kfree(desc);
+}
+
+static void release_persistent_event(struct perf_event *event)
+{
+ perf_event_disable(event);
+ perf_event_release_kernel(event);
+}
+
static void del_persistent_event(int cpu, struct perf_event_attr *attr)
{
struct pers_event_desc *desc;
@@ -100,12 +111,14 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
goto out;
event = desc->event;
- list_del(&desc->plist);
-
- perf_event_disable(event);
- perf_event_release_kernel(event);
- put_unused_fd(desc->fd);
- kfree(desc);
+ /*
+ * We primarily want to remove desc from the list. If there
+ * are no open files, the refcount is 0 and we need to release
+ * the event too.
+ */
+ detach_persistent_event(desc);
+ if (atomic_long_dec_and_test(&event->refcount))
+ release_persistent_event(event);
out:
mutex_unlock(&per_cpu(pers_events_lock, cpu));
}
@@ -182,6 +195,7 @@ fail:
int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
{
struct pers_event_desc *desc;
+ struct perf_event *event;
int event_fd = -ENODEV;
if (cpu >= (unsigned)nr_cpu_ids)
@@ -190,13 +204,25 @@ int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
mutex_lock(&per_cpu(pers_events_lock, cpu));
desc = get_persistent_event(cpu, attr);
- if (!desc)
+
+ /* Increment refcount to keep event on put_event() */
+ if (!desc || !atomic_long_inc_not_zero(&desc->event->refcount))
goto out;
event_fd = anon_inode_getfd("[pers_event]", &perf_fops,
desc->event, O_RDONLY);
- if (event_fd >= 0)
- desc->fd = event_fd;
+
+ if (event_fd < 0) {
+ event = desc->event;
+ if (WARN_ON(atomic_long_dec_and_test(&event->refcount))) {
+ /*
+ * May not happen since decrementing refcount is
+ * protected by pers_events_lock.
+ */
+ detach_persistent_event(desc);
+ release_persistent_event(event);
+ }
+ }
out:
mutex_unlock(&per_cpu(pers_events_lock, cpu));
--
1.8.1.1
From: Robert Richter <[email protected]>
To later add persistent events to sysfs we need a name for each event.
Add a name to each persistent event.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
include/linux/perf_event.h | 4 ++--
kernel/events/persistent.c | 30 +++++++++++++++++++++++++-----
3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d421937..833eb7a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1991,7 +1991,8 @@ int __init mcheck_init(void)
int __init mcheck_init_tp(void)
{
- if (perf_add_persistent_event_by_id(event_mce_record.event.type)) {
+ if (perf_add_persistent_event_by_id(event_mce_record.name,
+ event_mce_record.event.type)) {
pr_err("Error adding MCE persistent event.\n");
return -EINVAL;
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dc72c93..06b4357b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -751,7 +751,7 @@ extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
extern int perf_add_persistent_event(struct perf_event_attr *, unsigned);
-extern int perf_add_persistent_event_by_id(int id);
+extern int perf_add_persistent_event_by_id(char *name, int id);
#else /* !CONFIG_PERF_EVENTS */
static inline void
perf_event_task_sched_in(struct task_struct *prev,
@@ -794,7 +794,7 @@ static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
static inline int perf_add_persistent_event(struct perf_event_attr *attr,
unsigned nr_pages) { return -EINVAL; }
-static inline int perf_add_persistent_event_by_id(int id) { return -EINVAL; }
+static inline int perf_add_persistent_event_by_id(char *name, int id) { return -EINVAL; }
#endif /* !CONFIG_PERF_EVENTS */
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 97c57c9..96201c1 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -14,6 +14,11 @@ struct pers_event_desc {
int fd;
};
+struct pers_event {
+ char *name;
+ struct perf_event_attr attr;
+};
+
static DEFINE_PER_CPU(struct list_head, pers_events);
static DEFINE_PER_CPU(struct mutex, pers_events_lock);
@@ -132,14 +137,20 @@ unwind:
return PTR_ERR(event);
}
-int perf_add_persistent_event_by_id(int id)
+int perf_add_persistent_event_by_id(char* name, int id)
{
- struct perf_event_attr *attr;
+ struct pers_event *event;
+ struct perf_event_attr *attr;
+ int ret = -ENOMEM;
- attr = kzalloc(sizeof(*attr), GFP_KERNEL);
- if (!attr)
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
return -ENOMEM;
+ event->name = kstrdup(name, GFP_KERNEL);
+ if (!event->name)
+ goto fail;
+ attr = &event->attr;
attr->sample_period = 1;
attr->wakeup_events = 1;
attr->sample_type = PERF_SAMPLE_RAW;
@@ -148,7 +159,16 @@ int perf_add_persistent_event_by_id(int id)
attr->type = PERF_TYPE_TRACEPOINT;
attr->size = sizeof(*attr);
- return perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
+ ret = perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
+ if (ret)
+ goto fail;
+
+ return 0;
+fail:
+ kfree(event->name);
+ kfree(event);
+
+ return ret;
}
int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
--
1.8.1.1
From: Robert Richter <[email protected]>
Reduce duplicate code by introducing get_persistent_event(), which
returns the descriptor of a persistent event.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index e6a7664..7d91871 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -16,6 +16,19 @@ struct pers_event_desc {
static DEFINE_PER_CPU(struct list_head, pers_events);
+static struct pers_event_desc
+*get_persistent_event(int cpu, struct perf_event_attr *attr)
+{
+ struct pers_event_desc *desc;
+
+ list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
+ if (desc->event->attr.config == attr->config)
+ return desc;
+ }
+
+ return NULL;
+}
+
static struct perf_event *
add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
unsigned nr_pages)
@@ -58,18 +71,13 @@ out:
static void del_persistent_event(int cpu, struct perf_event_attr *attr)
{
- struct pers_event_desc *desc, *tmp;
- struct perf_event *event = NULL;
-
- list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
- if (desc->event->attr.config == attr->config) {
- event = desc->event;
- break;
- }
- }
+ struct pers_event_desc *desc;
+ struct perf_event *event;
- if (!event)
+ desc = get_persistent_event(cpu, attr);
+ if (!desc)
return;
+ event = desc->event;
list_del(&desc->plist);
@@ -161,11 +169,11 @@ int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
if (cpu >= (unsigned)nr_cpu_ids)
return -EINVAL;
- list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist)
- if (desc->event->attr.config == attr->config)
- return __alloc_persistent_event_fd(desc);
+ desc = get_persistent_event(cpu, attr);
+ if (!desc)
+ return -ENODEV;
- return -ENODEV;
+ return __alloc_persistent_event_fd(desc);
}
--
1.8.1.1
From: Robert Richter <[email protected]>
There is already a function, anon_inode_getfd(), that does all the
work. Rework and simplify the code accordingly.
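For reference, anon_inode_getfd() bundles roughly this sequence,
including the error handling the removed helper open-coded (sketch):

        fd = get_unused_fd();
        file = anon_inode_getfile(name, fops, priv, flags);
        fd_install(fd, file);
        return fd;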
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 35 +++++++----------------------------
1 file changed, 7 insertions(+), 28 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 7d91871..16ed47c 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -87,33 +87,6 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
kfree(desc);
}
-static int __alloc_persistent_event_fd(struct pers_event_desc *desc)
-{
- struct file *event_file = NULL;
- int event_fd = -1;
-
- event_fd = get_unused_fd();
- if (event_fd < 0)
- goto out;
-
- event_file = anon_inode_getfile("[pers_event]", &perf_fops,
- desc->event, O_RDONLY);
- if (IS_ERR(event_file))
- goto err_event_file;
-
- desc->fd = event_fd;
- fd_install(event_fd, event_file);
-
- goto out;
-
-
- err_event_file:
- put_unused_fd(event_fd);
-
- out:
- return event_fd;
-}
-
/*
* Create and enable the persistent version of the perf event described by
* @attr.
@@ -165,6 +138,7 @@ int perf_add_persistent_event_by_id(int id)
int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
{
struct pers_event_desc *desc;
+ int event_fd;
if (cpu >= (unsigned)nr_cpu_ids)
return -EINVAL;
@@ -173,7 +147,12 @@ int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
if (!desc)
return -ENODEV;
- return __alloc_persistent_event_fd(desc);
+ event_fd = anon_inode_getfd("[pers_event]", &perf_fops,
+ desc->event, O_RDONLY);
+ if (event_fd >= 0)
+ desc->fd = event_fd;
+
+ return event_fd;
}
--
1.8.1.1
From: Robert Richter <[email protected]>
Check if an event already exists before adding it.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 586cea5..4fcd071 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -40,6 +40,12 @@ add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
mutex_lock(&per_cpu(pers_events_lock, cpu));
+ desc = get_persistent_event(cpu, attr);
+ if (desc) {
+ event = ERR_PTR(-EEXIST);
+ goto out;
+ }
+
desc = kzalloc(sizeof(*desc), GFP_KERNEL);
if (!desc) {
event = ERR_PTR(-ENOMEM);
--
1.8.1.1
From: Robert Richter <[email protected]>
Protect especially accesses to struct pers_event_desc *desc. Race
conditions are possible where the descriptor could be removed from the
list while it is still in use.
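An illustration of the kind of race being closed (not taken from the
patch):

        /* task A */ desc = get_persistent_event(cpu, attr);
        /* task B */ del_persistent_event(cpu, attr);  /* kfree(desc) */
        /* task A */ fd = anon_inode_getfd("[pers_event]", &perf_fops,
                                           desc->event, O_RDONLY);
                     /* use-after-free */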
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/persistent.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 16ed47c..586cea5 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -15,6 +15,7 @@ struct pers_event_desc {
};
static DEFINE_PER_CPU(struct list_head, pers_events);
+static DEFINE_PER_CPU(struct mutex, pers_events_lock);
static struct pers_event_desc
*get_persistent_event(int cpu, struct perf_event_attr *attr)
@@ -37,9 +38,13 @@ add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
struct pers_event_desc *desc;
struct ring_buffer *buf;
+ mutex_lock(&per_cpu(pers_events_lock, cpu));
+
desc = kzalloc(sizeof(*desc), GFP_KERNEL);
- if (!desc)
- return ERR_PTR(-ENOMEM);
+ if (!desc) {
+ event = ERR_PTR(-ENOMEM);
+ goto out;
+ }
event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
if (IS_ERR(event))
@@ -66,6 +71,7 @@ err_rb:
err_event:
kfree(desc);
out:
+ mutex_unlock(&per_cpu(pers_events_lock, cpu));
return event;
}
@@ -74,9 +80,11 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
struct pers_event_desc *desc;
struct perf_event *event;
+ mutex_lock(&per_cpu(pers_events_lock, cpu));
+
desc = get_persistent_event(cpu, attr);
if (!desc)
- return;
+ goto out;
event = desc->event;
list_del(&desc->plist);
@@ -85,6 +93,8 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
perf_event_release_kernel(event);
put_unused_fd(desc->fd);
kfree(desc);
+out:
+ mutex_unlock(&per_cpu(pers_events_lock, cpu));
}
/*
@@ -138,28 +148,33 @@ int perf_add_persistent_event_by_id(int id)
int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
{
struct pers_event_desc *desc;
- int event_fd;
+ int event_fd = -ENODEV;
if (cpu >= (unsigned)nr_cpu_ids)
return -EINVAL;
+ mutex_lock(&per_cpu(pers_events_lock, cpu));
+
desc = get_persistent_event(cpu, attr);
if (!desc)
- return -ENODEV;
+ goto out;
event_fd = anon_inode_getfd("[pers_event]", &perf_fops,
desc->event, O_RDONLY);
if (event_fd >= 0)
desc->fd = event_fd;
+out:
+ mutex_unlock(&per_cpu(pers_events_lock, cpu));
return event_fd;
}
-
void __init persistent_events_init(void)
{
- int i;
+ int cpu;
- for_each_possible_cpu(i)
- INIT_LIST_HEAD(&per_cpu(pers_events, i));
+ for_each_possible_cpu(cpu) {
+ INIT_LIST_HEAD(&per_cpu(pers_events, cpu));
+ mutex_init(&per_cpu(pers_events_lock, cpu));
+ }
}
--
1.8.1.1
From: Robert Richter <[email protected]>
rb_put() is already called in perf_event_release_kernel(), so there is
no need to do the same in del_persistent_event(). After a rework we
also don't need it in add_persistent_event_on_cpu(). Since there are no
users of rb_put() anymore, we can make it private to events/core.c
again.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 4 +++-
kernel/events/internal.h | 1 -
kernel/events/persistent.c | 29 +++++++++++------------------
3 files changed, 14 insertions(+), 20 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a9b6470..8f85caa 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3021,6 +3021,8 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}
+static void rb_put(struct ring_buffer *rb);
+
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3680,7 +3682,7 @@ static struct ring_buffer *rb_get(struct perf_event *event)
return rb;
}
-void rb_put(struct ring_buffer *rb)
+static void rb_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3b481be..3647289 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -38,7 +38,6 @@ struct ring_buffer {
extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
-extern void rb_put(struct ring_buffer *rb);
extern void perf_event_wakeup(struct perf_event *event);
extern void
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index d7aaf95..e6a7664 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -20,22 +20,22 @@ static struct perf_event *
add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
unsigned nr_pages)
{
- struct perf_event *event = ERR_PTR(-ENOMEM);
+ struct perf_event *event;
struct pers_event_desc *desc;
struct ring_buffer *buf;
desc = kzalloc(sizeof(*desc), GFP_KERNEL);
if (!desc)
- goto out;
-
- buf = rb_alloc(nr_pages, 0, cpu, 0);
- if (!buf)
- goto err_rb;
+ return ERR_PTR(-ENOMEM);
event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
if (IS_ERR(event))
goto err_event;
+ buf = rb_alloc(nr_pages, 0, cpu, 0);
+ if (!buf)
+ goto err_rb;
+
rcu_assign_pointer(event->rb, buf);
desc->event = event;
@@ -47,14 +47,12 @@ add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
perf_event_enable(event);
goto out;
-
- err_event:
- rb_put(buf);
-
- err_rb:
+err_rb:
+ perf_event_release_kernel(event);
+ event = ERR_PTR(-ENOMEM);
+err_event:
kfree(desc);
-
- out:
+out:
return event;
}
@@ -76,11 +74,6 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
list_del(&desc->plist);
perf_event_disable(event);
- if (event->rb) {
- rb_put(event->rb);
- rcu_assign_pointer(event->rb, NULL);
- }
-
perf_event_release_kernel(event);
put_unused_fd(desc->fd);
kfree(desc);
--
1.8.1.1
From: Robert Richter <[email protected]>
Struct pers_event_desc is only used in kernel/events/persistent.c, so
move it there. Also remove the attr member, as it merely duplicates
event->attr.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
include/linux/perf_event.h | 7 -------
kernel/events/persistent.c | 12 ++++++++----
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a42b7..dc72c93 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -518,13 +518,6 @@ struct perf_output_handle {
int page;
};
-struct pers_event_desc {
- struct perf_event_attr *attr;
- struct perf_event *event;
- struct list_head plist;
- int fd;
-};
-
#ifdef CONFIG_PERF_EVENTS
extern int perf_pmu_register(struct pmu *pmu, char *name, int type);
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 53411b4..d7aaf95 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -8,6 +8,12 @@
/* 512 kiB: default perf tools memory size, see perf_evlist__mmap() */
#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
+struct pers_event_desc {
+ struct perf_event *event;
+ struct list_head plist;
+ int fd;
+};
+
static DEFINE_PER_CPU(struct list_head, pers_events);
static struct perf_event *
@@ -33,7 +39,6 @@ add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
rcu_assign_pointer(event->rb, buf);
desc->event = event;
- desc->attr = attr;
INIT_LIST_HEAD(&desc->plist);
list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
@@ -59,7 +64,7 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
struct perf_event *event = NULL;
list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
- if (desc->attr->config == attr->config) {
+ if (desc->event->attr.config == attr->config) {
event = desc->event;
break;
}
@@ -78,7 +83,6 @@ static void del_persistent_event(int cpu, struct perf_event_attr *attr)
perf_event_release_kernel(event);
put_unused_fd(desc->fd);
- kfree(desc->attr);
kfree(desc);
}
@@ -165,7 +169,7 @@ int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
return -EINVAL;
list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist)
- if (desc->attr->config == attr->config)
+ if (desc->event->attr.config == attr->config)
return __alloc_persistent_event_fd(desc);
return -ENODEV;
--
1.8.1.1
From: Borislav Petkov <[email protected]>
Add a barebones implementation for registering persistent events with
perf. For that, we don't destroy the buffers when they're unmapped;
also, we map them read-only so that multiple agents can access them.
Also, we allocate the event buffers at event init time and not at mmap
time so that we can log samples into them regardless of whether there
are readers in userspace or not.
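A minimal usage sketch, assuming a known tracepoint id (the
perf_event_open() call is illustrative; this patch routes
attr.persistent to perf_get_persistent_event_fd()):

        /* kernel side: make a tracepoint event persistent */
        perf_add_persistent_event_by_id(id);

        /* userland side: fetch a read-only fd for one cpu's buffer */
        struct perf_event_attr attr = {
                .size           = sizeof(attr),
                .type           = PERF_TYPE_TRACEPOINT,
                .config         = id,   /* tracepoint id */
                .persistent     = 1,
        };
        int fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);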
Changes made by Robert Richter <[email protected]>:
* Fixing wrongly determined attribute size.
* The default buffer size used to setup event buffers with perf tools
is 512k. Using the same buffer size for persistent events. This also
avoids failed mmap calls due to different buffer sizes.
* Improve error reporting.
* Returning -ENODEV if no file descriptor is found. An error code of
-1 (-EPERM) is misleading in this case.
* Adding cpu check to perf_get_persistent_event_fd()
[ make percpu variable static ]
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
[ Fix attr size ]
[ Setting default buffer size to 512k as in perf tools ]
[ Print error code on failure when adding events ]
[ Return reasonable error code ]
[ Adding cpu check to perf_get_persistent_event_fd() ]
Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
include/linux/perf_event.h | 16 +++-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 13 ++--
kernel/events/internal.h | 4 +
kernel/events/persistent.c | 181 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 207 insertions(+), 9 deletions(-)
create mode 100644 kernel/events/persistent.c
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6fddac1..d2a42b7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -518,6 +518,13 @@ struct perf_output_handle {
int page;
};
+struct pers_event_desc {
+ struct perf_event_attr *attr;
+ struct perf_event *event;
+ struct list_head plist;
+ int fd;
+};
+
#ifdef CONFIG_PERF_EVENTS
extern int perf_pmu_register(struct pmu *pmu, char *name, int type);
@@ -750,7 +757,9 @@ extern void perf_event_enable(struct perf_event *event);
extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
-#else
+extern int perf_add_persistent_event(struct perf_event_attr *, unsigned);
+extern int perf_add_persistent_event_by_id(int id);
+#else /* !CONFIG_PERF_EVENTS */
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
@@ -790,7 +799,10 @@ static inline void perf_event_enable(struct perf_event *event) { }
static inline void perf_event_disable(struct perf_event *event) { }
static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
-#endif
+static inline int perf_add_persistent_event(struct perf_event_attr *attr,
+ unsigned nr_pages) { return -EINVAL; }
+static inline int perf_add_persistent_event_by_id(int id) { return -EINVAL; }
+#endif /* !CONFIG_PERF_EVENTS */
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
extern bool perf_event_can_stop_tick(void);
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d1..70990d5 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = -pg
endif
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o persistent.o
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a13e457..a9b6470 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3021,8 +3021,6 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}
-static void rb_put(struct ring_buffer *rb);
-
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3398,8 +3396,6 @@ unlock:
return ret;
}
-static const struct file_operations perf_fops;
-
static inline int perf_fget_light(int fd, struct fd *p)
{
struct fd f = fdget(fd);
@@ -3684,7 +3680,7 @@ static struct ring_buffer *rb_get(struct perf_event *event)
return rb;
}
-static void rb_put(struct ring_buffer *rb)
+void rb_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;
@@ -3866,7 +3862,7 @@ static int perf_fasync(int fd, struct file *filp, int on)
return 0;
}
-static const struct file_operations perf_fops = {
+const struct file_operations perf_fops = {
.llseek = no_llseek,
.release = perf_release,
.read = perf_read,
@@ -6623,6 +6619,9 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
+ if (attr.persistent)
+ return perf_get_persistent_event_fd(cpu, &attr);
+
if (!attr.exclude_kernel) {
if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -7579,6 +7578,8 @@ void __init perf_event_init(void)
*/
BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
!= 1024);
+
+ persistent_events_init();
}
static int __init perf_event_sysfs_init(void)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..3b481be 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -38,6 +38,7 @@ struct ring_buffer {
extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
+extern void rb_put(struct ring_buffer *rb);
extern void perf_event_wakeup(struct perf_event *event);
extern void
@@ -174,4 +175,7 @@ static inline bool arch_perf_have_user_stack_dump(void)
#define perf_user_stack_pointer(regs) 0
#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
+extern const struct file_operations perf_fops;
+extern int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr);
+extern void __init persistent_events_init(void);
#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
new file mode 100644
index 0000000..53411b4
--- /dev/null
+++ b/kernel/events/persistent.c
@@ -0,0 +1,181 @@
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/perf_event.h>
+#include <linux/anon_inodes.h>
+
+#include "internal.h"
+
+/* 512 kiB: default perf tools memory size, see perf_evlist__mmap() */
+#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
+
+static DEFINE_PER_CPU(struct list_head, pers_events);
+
+static struct perf_event *
+add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
+ unsigned nr_pages)
+{
+ struct perf_event *event = ERR_PTR(-ENOMEM);
+ struct pers_event_desc *desc;
+ struct ring_buffer *buf;
+
+ desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+ if (!desc)
+ goto out;
+
+ buf = rb_alloc(nr_pages, 0, cpu, 0);
+ if (!buf)
+ goto err_rb;
+
+ event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+ if (IS_ERR(event))
+ goto err_event;
+
+ rcu_assign_pointer(event->rb, buf);
+
+ desc->event = event;
+ desc->attr = attr;
+
+ INIT_LIST_HEAD(&desc->plist);
+ list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
+
+ /* All workie, enable event now */
+ perf_event_enable(event);
+
+ goto out;
+
+ err_event:
+ rb_put(buf);
+
+ err_rb:
+ kfree(desc);
+
+ out:
+ return event;
+}
+
+static void del_persistent_event(int cpu, struct perf_event_attr *attr)
+{
+ struct pers_event_desc *desc, *tmp;
+ struct perf_event *event = NULL;
+
+ list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
+ if (desc->attr->config == attr->config) {
+ event = desc->event;
+ break;
+ }
+ }
+
+ if (!event)
+ return;
+
+ list_del(&desc->plist);
+
+ perf_event_disable(event);
+ if (event->rb) {
+ rb_put(event->rb);
+ rcu_assign_pointer(event->rb, NULL);
+ }
+
+ perf_event_release_kernel(event);
+ put_unused_fd(desc->fd);
+ kfree(desc->attr);
+ kfree(desc);
+}
+
+static int __alloc_persistent_event_fd(struct pers_event_desc *desc)
+{
+ struct file *event_file = NULL;
+ int event_fd = -1;
+
+ event_fd = get_unused_fd();
+ if (event_fd < 0)
+ goto out;
+
+ event_file = anon_inode_getfile("[pers_event]", &perf_fops,
+ desc->event, O_RDONLY);
+ if (IS_ERR(event_file))
+ goto err_event_file;
+
+ desc->fd = event_fd;
+ fd_install(event_fd, event_file);
+
+ goto out;
+
+
+ err_event_file:
+ put_unused_fd(event_fd);
+
+ out:
+ return event_fd;
+}
+
+/*
+ * Create and enable the persistent version of the perf event described by
+ * @attr.
+ *
+ * @attr: perf event descriptor
+ * @nr_pages: size in pages
+ */
+int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
+{
+ struct perf_event *event;
+ int i;
+
+ for_each_possible_cpu(i) {
+ event = add_persistent_event_on_cpu(i, attr, nr_pages);
+ if (IS_ERR(event))
+ goto unwind;
+ }
+ return 0;
+
+unwind:
+ pr_err("%s: Error adding persistent event on cpu %d: %ld\n",
+ __func__, i, PTR_ERR(event));
+
+ while (--i >= 0)
+ del_persistent_event(i, attr);
+
+ return PTR_ERR(event);
+}
+
+int perf_add_persistent_event_by_id(int id)
+{
+ struct perf_event_attr *attr;
+
+ attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+ if (!attr)
+ return -ENOMEM;
+
+ attr->sample_period = 1;
+ attr->wakeup_events = 1;
+ attr->sample_type = PERF_SAMPLE_RAW;
+ attr->persistent = 1;
+ attr->config = id;
+ attr->type = PERF_TYPE_TRACEPOINT;
+ attr->size = sizeof(*attr);
+
+ return perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
+}
+
+int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
+{
+ struct pers_event_desc *desc;
+
+ if (cpu >= (unsigned)nr_cpu_ids)
+ return -EINVAL;
+
+ list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist)
+ if (desc->attr->config == attr->config)
+ return __alloc_persistent_event_fd(desc);
+
+ return -ENODEV;
+}
+
+
+void __init persistent_events_init(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(pers_events, i));
+}
--
1.8.1.1
From: Borislav Petkov <[email protected]>
Rename ring_buffer-handling functions consistently with the "rb_"
prefix.
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
kernel/events/core.c | 37 +++++++++++++++++--------------------
kernel/events/ring_buffer.c | 7 +++----
2 files changed, 20 insertions(+), 24 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a0780b3..b790ab6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -198,8 +198,7 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);
-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb);
+static void rb_attach(struct perf_event *event, struct ring_buffer *rb);
void __weak perf_event_print_debug(void) { }
@@ -3022,7 +3021,7 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}
-static void ring_buffer_put(struct ring_buffer *rb);
+static void rb_put(struct ring_buffer *rb);
static void free_event(struct perf_event *event)
{
@@ -3054,7 +3053,7 @@ static void free_event(struct perf_event *event)
}
if (event->rb) {
- ring_buffer_put(event->rb);
+ rb_put(event->rb);
event->rb = NULL;
}
@@ -3299,8 +3298,8 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
* t0: T1, rb = rcu_dereference(event->rb)
* t1: T2, old_rb = event->rb
* t2: T2, event->rb = new rb
- * t3: T2, ring_buffer_detach(old_rb)
- * t4: T1, ring_buffer_attach(rb1)
+ * t3: T2, rb_detach(old_rb)
+ * t4: T1, rb_attach(rb1)
* t5: T1, poll_wait(event->waitq)
*
* To avoid this problem, we grab mmap_mutex in perf_poll()
@@ -3312,7 +3311,7 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
rcu_read_lock();
rb = rcu_dereference(event->rb);
if (rb) {
- ring_buffer_attach(event, rb);
+ rb_attach(event, rb);
events = atomic_xchg(&rb->poll, 0);
}
rcu_read_unlock();
@@ -3617,8 +3616,7 @@ unlock:
return ret;
}
-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb)
+static void rb_attach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;
@@ -3634,8 +3632,7 @@ unlock:
spin_unlock_irqrestore(&rb->event_lock, flags);
}
-static void ring_buffer_detach(struct perf_event *event,
- struct ring_buffer *rb)
+static void rb_detach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;
@@ -3648,7 +3645,7 @@ static void ring_buffer_detach(struct perf_event *event,
spin_unlock_irqrestore(&rb->event_lock, flags);
}
-static void ring_buffer_wakeup(struct perf_event *event)
+static void rb_wakeup(struct perf_event *event)
{
struct ring_buffer *rb;
@@ -3672,7 +3669,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
rb_free(rb);
}
-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+static struct ring_buffer *rb_get(struct perf_event *event)
{
struct ring_buffer *rb;
@@ -3687,7 +3684,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}
-static void ring_buffer_put(struct ring_buffer *rb)
+static void rb_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;
@@ -3724,10 +3721,10 @@ static void perf_mmap_close(struct vm_area_struct *vma)
atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
vma->vm_mm->pinned_vm -= event->mmap_locked;
rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
+ rb_detach(event, rb);
mutex_unlock(&event->mmap_mutex);
- ring_buffer_put(rb);
+ rb_put(rb);
free_uid(user);
}
}
@@ -3881,7 +3878,7 @@ static const struct file_operations perf_fops = {
void perf_event_wakeup(struct perf_event *event)
{
- ring_buffer_wakeup(event);
+ rb_wakeup(event);
if (event->pending_kill) {
kill_fasync(&event->fasync, SIGIO, event->pending_kill);
@@ -6567,7 +6564,7 @@ set:
if (output_event) {
/* get the rb we want to redirect to */
- rb = ring_buffer_get(output_event);
+ rb = rb_get(output_event);
if (!rb)
goto unlock;
}
@@ -6575,13 +6572,13 @@ set:
old_rb = event->rb;
rcu_assign_pointer(event->rb, rb);
if (old_rb)
- ring_buffer_detach(event, old_rb);
+ rb_detach(event, old_rb);
ret = 0;
unlock:
mutex_unlock(&event->mmap_mutex);
if (old_rb)
- ring_buffer_put(old_rb);
+ rb_put(old_rb);
out:
return ret;
}
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..b514c56 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -212,8 +212,7 @@ void perf_output_end(struct perf_output_handle *handle)
rcu_read_unlock();
}
-static void
-ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
+static void rb_init(struct ring_buffer *rb, long watermark, int flags)
{
long max_size = perf_data_size(rb);
@@ -290,7 +289,7 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
rb->nr_pages = nr_pages;
- ring_buffer_init(rb, watermark, flags);
+ rb_init(rb, watermark, flags);
return rb;
@@ -395,7 +394,7 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
rb->page_order = ilog2(nr_pages);
rb->nr_pages = !!nr_pages;
- ring_buffer_init(rb, watermark, flags);
+ rb_init(rb, watermark, flags);
return rb;
--
1.8.1.1
Hi Robert and Boris,
On Tue, 11 Jun 2013 18:42:29 +0200, Robert Richter wrote:
> From: Borislav Petkov <[email protected]>
>
> Add a barebones implementation for registering persistent events with
> perf. For that, we don't destroy the buffers when they're unmapped;
> also, we map them read-only so that multiple agents can access them.
>
> Also, we allocate the event buffers at event init time and not at mmap
> time so that we can log samples into them regardless of whether there
> are readers in userspace or not.
>
> Changes made by Robert Richter <[email protected]>:
>
> * Fixing wrongly determined attribute size.
>
> * The default buffer size used to setup event buffers with perf tools
> is 512k. Using the same buffer size for persistent events. This also
> avoids failed mmap calls due to different buffer sizes.
>
> * Improve error reporting.
>
> * Returning -ENODEV if no file descriptor is found. An error code of
> -1 (-EPERM) is misleading in this case.
>
> * Adding cpu check to perf_get_persistent_event_fd()
>
> [ make percpu variable static ]
> Reported-by: Fengguang Wu <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> [ Fix attr size ]
> [ Setting default buffer size to 512k as in perf tools ]
> [ Print error code on failure when adding events ]
> [ Return reasonable error code ]
> [ Adding cpu check to perf_get_persistent_event_fd() ]
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
[SNIP]
> +int perf_add_persistent_event_by_id(int id)
> +{
> + struct perf_event_attr *attr;
> +
> + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
> + if (!attr)
> + return -ENOMEM;
> +
> + attr->sample_period = 1;
> + attr->wakeup_events = 1;
> + attr->sample_type = PERF_SAMPLE_RAW;
> + attr->persistent = 1;
> + attr->config = id;
> + attr->type = PERF_TYPE_TRACEPOINT;
> + attr->size = sizeof(*attr);
> +
> + return perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
> +}
> +
> +int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
> +{
> + struct pers_event_desc *desc;
> +
> + if (cpu >= (unsigned)nr_cpu_ids)
> + return -EINVAL;
> +
> + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist)
> + if (desc->attr->config == attr->config)
> + return __alloc_persistent_event_fd(desc);
> +
So it only supports tracepoint events. Don't we need to add other types
of event?
Thanks,
Namhyung
> + return -ENODEV;
> +}
> +
> +
> +void __init persistent_events_init(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i)
> + INIT_LIST_HEAD(&per_cpu(pers_events, i));
> +}
On Tue, 11 Jun 2013 18:42:39 +0200, Robert Richter wrote:
> From: Robert Richter <[email protected]>
>
> Expose persistent events in the system to userland using sysfs. Perf
> tools are able to read existing pmu events from sysfs. We now use a
> persistent pmu as an event container holding all registered persistent
> events of the system. This patch adds dynamic registration of
> persistent events to sysfs. E.g. something like this:
>
> /sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
> /sys/bus/event_source/devices/persistent/format/persistent:attr5:23
>
> Perf tools need to support the attr<num> syntax that is added in a
> separate patch set. With it we are able to run perf tool commands to
> read persistent events, e.g.:
>
> # perf record -e persistent/mce_record/ sleep 10
> # perf top -e persistent/mce_record/
>
> [ Document attr<index> syntax in sysfs ABI ]
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
[SNIP]
> +static int pers_event_sysfs_register(struct pers_event *event)
> +{
> + struct device_attribute *attr = &event->sysfs.attr;
> + int idx;
> +
> + *attr = (struct device_attribute)__ATTR(, 0444, pers_event_sysfs_show,
> + NULL);
> + attr->attr.name = event->name;
When I added another persistent event with this API, I got a WARNING
from lockdep like this:
[ 0.432506] BUG: key ffff88040946f140 not in .data!
[ 0.432581] ------------[ cut here ]------------
[ 0.432656] WARNING: at /home/namhyung/project/linux/kernel/lockdep.c:2987 lockdep_init_map+0x53d/0x570()
[ 0.432763] DEBUG_LOCKS_WARN_ON(1)
I guess we need the following line here:
sysfs_attr_init(&attr->attr);
Thanks,
Namhyung
> +
> + /* add sysfs attr to events: */
> + for (idx = 0; idx < MAX_EVENTS; idx++) {
> + if (!cmpxchg(persistent_events_attr + idx, NULL, &attr->attr))
> + break;
> + }
> +
> + if (idx >= MAX_EVENTS)
> + return -ENOSPC;
> + if (!idx)
> + EVENTS_GROUP = &persistent_events_group;
> + if (!persistent_pmu.dev)
> + return 0; /* sysfs not yet initialized */
> + if (idx)
> + return sysfs_update_group(&persistent_pmu.dev->kobj,
> + EVENTS_GROUP);
> + return sysfs_create_group(&persistent_pmu.dev->kobj, EVENTS_GROUP);
> +}
>
> static int persistent_pmu_init(struct perf_event *event)
> {
On 14.06.13 11:15:13, Namhyung Kim wrote:
> > +int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
> > +{
> > + struct pers_event_desc *desc;
> > +
> > + if (cpu >= (unsigned)nr_cpu_ids)
> > + return -EINVAL;
> > +
> > + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist)
> > + if (desc->attr->config == attr->config)
> > + return __alloc_persistent_event_fd(desc);
> > +
>
> So it only supports tracepoint events. Don't we need to add other types
> of event?
We have chosen tracepoints for the initial implementation since they
are our first use case and their specifiers are already unique.
So this is for the initial implementation only. We can easily implement
support for other event types later. The only thing we need for this
are unique identifiers for each persistent event. Since all events are
described in sysfs, this change can be done later without changing the
ABI or breaking userland.
Support for all event types is definitely on our list. We also need it
for a later implementation of registering all system-wide events as
persistent events. This allows sharing events between processes and is
one of Ingo's use cases for enabling persistent events at runtime.
-Robert
On 14.06.13 11:36:00, Namhyung Kim wrote:
> > +static int pers_event_sysfs_register(struct pers_event *event)
> > +{
> > + struct device_attribute *attr = &event->sysfs.attr;
> > + int idx;
> > +
> > + *attr = (struct device_attribute)__ATTR(, 0444, pers_event_sysfs_show,
> > + NULL);
> > + attr->attr.name = event->name;
>
> When I added another persistent event with this API, I got a WARNING
> from lockdep like this:
>
> [ 0.432506] BUG: key ffff88040946f140 not in .data!
> [ 0.432581] ------------[ cut here ]------------
> [ 0.432656] WARNING: at /home/namhyung/project/linux/kernel/lockdep.c:2987 lockdep_init_map+0x53d/0x570()
> [ 0.432763] DEBUG_LOCKS_WARN_ON(1)
>
>
> I guess we need the following line here:
>
> sysfs_attr_init(&attr->attr);
Yes, I added your change. Thanks Namhyung for reviewing and testing.
-Robert
On Tue, Jun 11, 2013 at 06:42:28PM +0200, Robert Richter wrote:
> From: Borislav Petkov <[email protected]>
>
> Add the needed pieces for persistent events, which make them
> process-agnostic. Also, make their buffers read-only when mmapping them
> from userspace.
>
> While at it, do not return a void function, as caught by Fengguang's
> build robot.
>
> Changes made by Robert Richter <[email protected]>:
>
> * mmap should return EACCES error if fd can not be opened writable.
> This error code also helps userland to map buffers readonly on
> failure.
>
> Signed-off-by: Borislav Petkov <[email protected]>
> [ Return -EACCES if mapped buffers must be readonly ]
> Signed-off-by: Robert Richter <[email protected]>
> ---
> include/uapi/linux/perf_event.h | 3 ++-
> kernel/events/core.c | 10 +++++++++-
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index fb104e5..6032361 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -272,8 +272,9 @@ struct perf_event_attr {
>
> exclude_callchain_kernel : 1, /* exclude kernel callchains */
> exclude_callchain_user : 1, /* exclude user callchains */
> + persistent : 1, /* always-on event */
>
> - __reserved_1 : 41;
> + __reserved_1 : 40;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b790ab6..a13e457 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3713,6 +3713,11 @@ static void perf_mmap_close(struct vm_area_struct *vma)
> {
> struct perf_event *event = vma->vm_file->private_data;
>
> + if (event->attr.persistent) {
> + atomic_dec(&event->mmap_count);
> + return;
> + }
> +
> if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
> unsigned long size = perf_data_size(event->rb);
> struct user_struct *user = event->mmap_user;
> @@ -3756,9 +3761,12 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> if (event->cpu == -1 && event->attr.inherit)
> return -EINVAL;
>
> - if (!(vma->vm_flags & VM_SHARED))
> + if (!(vma->vm_flags & VM_SHARED) && !event->attr.persistent)
> return -EINVAL;
>
> + if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
> + return -EACCES;
> +
> vma_size = vma->vm_end - vma->vm_start;
> nr_pages = (vma_size / PAGE_SIZE) - 1;
I find this patch incomplete for it's asymmetric; there's a dec, but not an
implied inc.
On Tue, Jun 11, 2013 at 06:42:29PM +0200, Robert Richter wrote:
> +static struct perf_event *
> +add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
> + unsigned nr_pages)
> +{
> + struct perf_event *event = ERR_PTR(-ENOMEM);
> + struct pers_event_desc *desc;
> + struct ring_buffer *buf;
> +
> + desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> + if (!desc)
> + goto out;
> +
> + buf = rb_alloc(nr_pages, 0, cpu, 0);
> + if (!buf)
> + goto err_rb;
> +
> + event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
> + if (IS_ERR(event))
> + goto err_event;
> +
> + rcu_assign_pointer(event->rb, buf);
> +
> + desc->event = event;
> + desc->attr = attr;
> +
> + INIT_LIST_HEAD(&desc->plist);
> + list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
> +
> + /* All workie, enable event now */
> + perf_event_enable(event);
> +
> + goto out;
> +
> + err_event:
> + rb_put(buf);
> +
> + err_rb:
> + kfree(desc);
> +
> + out:
> + return event;
> +}
I generally disapprove of indented labels. None of the perf code has
that and the tools are easy to 'fix'.
My quiltrc contains:
QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
my .gitconfig contains:
[diff "default"]
xfuncname = "^[[:alpha:]$_].*[^:]$"
Both avoid diff thinking labels are function names.
> +static void del_persistent_event(int cpu, struct perf_event_attr *attr)
> +{
> + struct pers_event_desc *desc, *tmp;
> + struct perf_event *event = NULL;
> +
> + list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
> + if (desc->attr->config == attr->config) {
attr->config might not be enough; you might also want to compare type,
config1, config2, and possibly bp_type as well; although I think we want
to avoid having hwbp style persistent events, but then what do I know.
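A sketch of the kind of match function I mean (field names straight from
struct perf_event_attr; untested):

	static bool persistent_event_match(struct perf_event_attr *a,
					   struct perf_event_attr *b)
	{
		return a->type    == b->type    &&
		       a->config  == b->config  &&
		       a->config1 == b->config1 &&
		       a->config2 == b->config2;
	}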
> + event = desc->event;
> + break;
> + }
> + }
> +
> + if (!event)
> + return;
> +
> + list_del(&desc->plist);
> +
> + perf_event_disable(event);
> + if (event->rb) {
> + rb_put(event->rb);
> + rcu_assign_pointer(event->rb, NULL);
> + }
> +
> + perf_event_release_kernel(event);
> + put_unused_fd(desc->fd);
> + kfree(desc->attr);
> + kfree(desc);
> +}
On Tue, Jun 11, 2013 at 06:42:29PM +0200, Robert Richter wrote:
> +int perf_add_persistent_event_by_id(int id)
> +{
> + struct perf_event_attr *attr;
> +
> + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
> + if (!attr)
> + return -ENOMEM;
> +
> + attr->sample_period = 1;
> + attr->wakeup_events = 1;
> + attr->sample_type = PERF_SAMPLE_RAW;
> + attr->persistent = 1;
> + attr->config = id;
> + attr->type = PERF_TYPE_TRACEPOINT;
> + attr->size = sizeof(*attr);
> +
> + return perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
> +}
I would call this what it is: perf_add_persistent_tracepoint(), or so
:-)
On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
> [...]
> Patches base on Boris' patches which I have rebased to latest
> tip/perf/core. All patches can be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2
>
OK, so I gave up on reading the patches :/ and went and looked at the git
tree. There's just too much needless churn in the patches.
As already said somewhere else; get_persistent_event() -- although that
wasn't the name at the time -- needs a better match function.
+ int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
+ {
+ struct perf_event *event;
+ int i;
+
+ for_each_possible_cpu(i) {
+ event = add_persistent_event_on_cpu(i, attr, nr_pages);
+ if (IS_ERR(event))
+ goto unwind;
+ }
+ return 0;
+
+ unwind:
+ pr_err("%s: Error adding persistent event on cpu %d: %ld\n",
+ __func__, i, PTR_ERR(event));
+
+ while (--i >= 0)
+ del_persistent_event(i, attr);
+
+ return PTR_ERR(event);
+ }
This assumes cpu_possible_mask doesn't have holes in it; I'm fairly sure we
cannot assume that.
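A sketch of an unwind that only tears down the cpus already set up, without
assuming the mask is dense (untested):

	int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
	{
		struct perf_event *event;
		int cpu, i;

		for_each_possible_cpu(cpu) {
			event = add_persistent_event_on_cpu(cpu, attr, nr_pages);
			if (IS_ERR(event))
				goto unwind;
		}
		return 0;

	unwind:
		pr_err("%s: Error adding persistent event on cpu %d: %ld\n",
		       __func__, cpu, PTR_ERR(event));

		/* tear down only the cpus iterated before the failing one */
		for_each_possible_cpu(i) {
			if (i == cpu)
				break;
			del_persistent_event(i, attr);
		}

		return PTR_ERR(event);
	}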
Furthermore, while the function is exposed and thus looks like a generically
usable thing, it appears to 'forget' making a sysfs entry, which makes it
look incomplete.
Only the ill-named perf_add_persistent_event_by_id() function looks like
something that's usable outside of this.
What's with MAX_EVENTS ?
On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
> Note that perf tools need to support the 'attr<num>' syntax that is
> added in a separate patch set. With it we are able to run perf tool
> commands to read persistent events, e.g.:
where is this patch? I can't find it.
I also find attr<num>:<bit> a bit weird. So far we've used
<perf_event_attr::fieldname>:<bits>, so while initializing anonymous
unions is a bit difficult with 'older' GCCs and we cannot actually do:
struct perf_event_attr {
...
union {
u64 flags;
u64 disabled : 1,
...
__reserved_1 : X;
}
...
};
We could fake it in userspace by allowing things like: flags:23. It
would not be a much worse hack than attr<num>:<bit> I suppose.
On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
> The basic concept is to use a pmu as an event container for persistent
> events. The pmu registers events in sysfs and provides format and
> event information for the userland. The persistent event framework
> requires to add events to the pmu dynamically.
>
> [...]
> /sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
> /sys/bus/event_source/devices/persistent/format/persistent:attr5:23
>
Ok so it took me a little to 'get' this PMU 'abuse', but I suppose I can
go along with it ;-)
On the specifics though; it again seems to hard-assume persistent events
are for tracepoints only -- either make this very explicit or remove it.
On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
Oh and what Boris and Ingo said; persistent events should 'persist' and
not be tied to particular processes. I'm not sure about the entire
eventfs thing; but the proposed sysfs thing should definitely work for
now.
On Mon, Jun 24, 2013 at 11:28:07AM +0200, Peter Zijlstra wrote:
> I find this patch incomplete for it's asymmetric; there's a dec, but not an
> implied inc.
There's one in perf_mmap_open.
Basically the idea was to pull up the ->mmap_count decrement (haha,
sounds like excrement :-)) and not deallocate the ring buffers for a
persistent event.
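For reference, the inc side in mainline looks roughly like this (a sketch
from memory, modulo details in the tree this is based on):

	static void perf_mmap_open(struct vm_area_struct *vma)
	{
		struct perf_event *event = vma->vm_file->private_data;

		atomic_inc(&event->mmap_count);
	}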
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Mon, Jun 24, 2013 at 11:48:30AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2013 at 06:42:29PM +0200, Robert Richter wrote:
> > +int perf_add_persistent_event_by_id(int id)
> > +{
> > + struct perf_event_attr *attr;
> > +
> > + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
> > + if (!attr)
> > + return -ENOMEM;
> > +
> > + attr->sample_period = 1;
> > + attr->wakeup_events = 1;
> > + attr->sample_type = PERF_SAMPLE_RAW;
> > + attr->persistent = 1;
> > + attr->config = id;
> > + attr->type = PERF_TYPE_TRACEPOINT;
> > + attr->size = sizeof(*attr);
> > +
> > + return perf_add_persistent_event(attr, CPU_BUFFER_NR_PAGES);
> > +}
>
> I would call this what it is: perf_add_persistent_tracepoint(), or so
> :-)
While we're at it, can we agree on a shortened variant for "persistent"
- it is too much to type and makes function names uglily¹ long.
¹ look at me making up adverbs.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Peter Zijlstra <[email protected]> wrote:
> On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
>
> Oh and what Boris and Ingo said; persistent events should 'persist' and
> not be tied to particular processes. I'm not sure about the entire
> eventfs thing; but the proposed sysfs thing should definitely work for
> now.
I'm fine with a sysfs approach as well, as long as it's correct and
obvious enough to use.
Thanks,
Ingo
On Mon, Jun 24, 2013 at 09:26:16PM +0200, Borislav Petkov wrote:
> On Mon, Jun 24, 2013 at 11:48:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 11, 2013 at 06:42:29PM +0200, Robert Richter wrote:
> > > +int perf_add_persistent_event_by_id(int id)
> > > [...]
> >
> > I would call this what it is: perf_add_persistent_tracepoint(), or so
> > :-)
>
> While we're at it, can we agree on a shortened variant for "persistent"
> - it is too much to type and makes function names uglily¹ long.
Elsewhere in this series you use 'pers' to shorten things; it reads a
bit odd to me because 'pers' is the Dutch word for press (both meanings
transfer) but that is just something I'll have to live with isn't it ;-)
As for tracepoint, it seems common to shorten that to tp; which always
reminds me of toilet paper, but I suppose more people suffer from that.
Yielding: perf_add_pers_tp()
which I read as adding pressed toilet paper.. ah well :-)
On 24.06.13 11:44:58, Peter Zijlstra wrote:
> > +static void del_persistent_event(int cpu, struct perf_event_attr *attr)
> > +{
> > + struct pers_event_desc *desc, *tmp;
> > + struct perf_event *event = NULL;
> > +
> > + list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
> > + if (desc->attr->config == attr->config) {
>
> attr->config might not be enough; you might also want to compare type,
> config1, config2, and possibly bp_type as well; although I think we want
> to avoid having hwbp style persistent events, but then what do I know.
[ I would call this what it is: perf_add_persistent_tracepoint(), or
so :-) ]
We chose a tracepoint-only implementation as a first try, since this
is our use case. The goal is to have support for any event type.
I was also thinking of comparing the whole attr structure. But since
the attrs are only similar (or almost identical), you can't compare
each member of attr and need exceptions (e.g. the cpu number). These
exceptions are ugly and depend on the specific use case.
I rather tend to assign unique identifiers to each persistent event
and then use attr->config to select the event, e.g.:
if (desc->unique_id == attr->config) ...
Tracepoints already carry such a unique id, hence the tracepoint-only start.
The change can be made without breaking ABI, since the event setup is
described in sysfs.
-Robert
>
> > + event = desc->event;
> > + break;
> > + }
> > + }
On 24.06.13 21:24:04, Borislav Petkov wrote:
> On Mon, Jun 24, 2013 at 11:28:07AM +0200, Peter Zijlstra wrote:
> > I find this patch incomplete for it's asymmetric; there's a dec, but not an
> > implied inc.
>
> There's one in perf_mmap_open.
>
> Basically the idea was to pull up the ->mmap_count decrement (haha,
> sounds like excrement :-)) and not deallocate the ring buffers for a
> persistent event.
The asymmetry comes from rb_alloc() in add_persistent_event_on_cpu(),
which is usually done in the mmap functions. We had better improve the
functions that attach/detach events and buffers to get more symmetric
code.
-Robert
On 24.06.13 11:32:04, Peter Zijlstra wrote:
> I generally disapprove of indented labels. None of the perf code has
> that and the tools are easy to 'fix'.
>
> My quiltrc contains:
>
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
>
> my .gitconfig contains:
>
> [diff "default"]
> xfuncname = "^[[:alpha:]$_].*[^:]$"
>
> Both avoid diff thinking labels are function names.
Right, we will use your coding style. ;)
-Robert
On 25.06.13 09:44:01, Peter Zijlstra wrote:
> Elsewhere in this series you use 'pers' to shorten things; it reads a
> bit odd to me because 'pers' is the Dutch word for press (both meanings
> transfer) but that is just something I'll have to live with isn't it ;-)
>
> As for tracepoint, it seems common to shorten that to tp; which always
> reminds me of toilet paper, but I suppose more people suffer from that.
>
> Yielding: perf_add_pers_tp()
>
> which I read as adding pressed toilet paper.. ah well :-)
Better we avoid this, we don't want you to misread the code. ;)
I also see 'pers_' not as an optimum since it could be mixed-up easily
with 'perf_'. Maybe we take 'persist_' instead?
-Robert
On Tue, Jun 25, 2013 at 11:24:39AM +0200, Robert Richter wrote:
> On 25.06.13 09:44:01, Peter Zijlstra wrote:
> > Elsewhere in this series you use 'pers' to shorten things; it reads a
> > bit odd to me because 'pers' is the Dutch word for press (both meanings
> > transfer) but that is just something I'll have to live with isn't it ;-)
> >
> > As for tracepoint, it seems common to shorten that to tp; which always
> > reminds me of toilet paper, but I suppose more people suffer from that.
> >
> > Yielding: perf_add_pers_tp()
> >
> > which I read as adding pressed toilet paper.. ah well :-)
>
> Better we avoid this, we don't want you to misread the code. ;)
>
> I also see 'pers_' not as an optimum since it could be mixed-up easily
> with 'perf_'. Maybe we take 'persist_' instead?
Yep, although it reads wrong:
perf_add_persist_event
Another not really optimal idea would be to call them sticky events:
perf_add_sticky_event.
Also funny.
Then there's
perf_add_prstnt_event
Also, not really great.
Hmmm.. I got nothing so far...
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 24.06.13 12:08:19, Peter Zijlstra wrote:
> > git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2
> >
>
> OK, so I gave up on reading the patches :/ and went and looked at the git
> tree. There's just too much needless churn in the patches.
Ok, we will rework the patches to have a final patch set hiding all
reworks. I was hoping we could work on a public branch since this
eases developing code with multiple people working on it. This also
shows how code matures. But this seems not to fly.
> + int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
> + {
> + struct perf_event *event;
> + int i;
> +
> + for_each_possible_cpu(i) {
> + event = add_persistent_event_on_cpu(i, attr, nr_pages);
> + if (IS_ERR(event))
> + goto unwind;
> + }
> + return 0;
> +
> + unwind:
> + pr_err("%s: Error adding persistent event on cpu %d: %ld\n",
> + __func__, i, PTR_ERR(event));
> +
> + while (--i >= 0)
> + del_persistent_event(i, attr);
> +
> + return PTR_ERR(event);
> + }
>
> This assumes cpu_possible_mask doesn't have holes in it; I'm fairly sure we
> cannot assume that.
Right, needs to be fixed.
> Furthermore, while the function is exposed and thus looks like a generically
> usable thing, it appears to 'forget' making a sysfs entry, which makes it
> look incomplete.
>
> Only the ill-named perf_add_persistent_event_by_id() function looks like
> something that's usable outside of this.
Right, the name should actually be perf_add_persistent_tp() or so.
And most of the function should be merged with
perf_add_persistent_event(). Maybe perf_add_persistent_event_by_id()
could be removed completely, leaving perf_add_persistent_tp() as a
wrapper for tracepoints.
> What's with MAX_EVENTS ?
It is the total number of sysfs entries that can currently be
registered dynamically. The problem with an unlimited event number is
that the sysfs attr list is a fixed array. We would need to resize the
array, which involves copying entries, locking, etc. This was left
for later. ;)
Btw, another thing left for later is cpu hotplug code. I know this
needs to be implemented. But we want to see something running first.
Do you know what happens to system-wide events if a cpu is offline? It
looks like cpu_function_call() in perf_install_in_context() simply
skips installing an event on an offline cpu. Not sure whether it is
enabled later when the cpu is brought online. A quick look at the code
suggests it is not.
-Robert
On 25.06.13 11:37:06, Borislav Petkov wrote:
> On Tue, Jun 25, 2013 at 11:24:39AM +0200, Robert Richter wrote:
> > I also see 'pers_' not as an optimum since it could be mixed-up easily
> > with 'perf_'. Maybe we take 'persist_' instead?
>
> Yep, although it reads wrong:
>
> perf_add_persist_event
We could use persist_ as prefix for static functions and use the long
versions for the interface only.
But all this is a bit of bikeshedding. I am sure we will find something.
-Robert
On Tue, Jun 25, 2013 at 12:51:35PM +0200, Robert Richter wrote:
> On 25.06.13 11:37:06, Borislav Petkov wrote:
> > On Tue, Jun 25, 2013 at 11:24:39AM +0200, Robert Richter wrote:
> > > I also see 'pers_' not as an optimum since it could be mixed-up easily
> > > with 'perf_'. Maybe we take 'persist_' instead?
> >
> > Yep, although it reads wrong:
> >
> > perf_add_persist_event
>
> We could use persist_ as prefix for static functions and use the long
> versions for the interface only.
>
> But all this is a bit bikeshedding. I am sure we find something.
How about
perf_add_pevent?
It is nice and short, although maybe too short but "pevent" is almost
like a special term and the name shows that it is a special type of
event, i.e. a p-event.
:-)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 25.06.13 17:29:41, Borislav Petkov wrote:
> How about
>
> perf_add_pevent?
>
> It is nice and short, although maybe too short but "pevent" is almost
> like a special term and the name shows that it is a special type of
> event, i.e. a p-event.
Fine with me, though it's more for internal usage. It is also in use
as a name for event pointers. And it is used in perf tools for the
event parser structure. But otherwise it's ok.
-Robert
On 24.06.13 12:22:00, Peter Zijlstra wrote:
> On Tue, Jun 11, 2013 at 06:42:26PM +0200, Robert Richter wrote:
> > Note that perf tools need to support the 'attr<num>' syntax that is
> > added in a separate patch set. With it we are able to run perf tool
> > commands to read persistent events, e.g.:
>
> where is this patch? I can't find it.
I posted it with that subject:
[PATCH 0/4] perf tools: Persistent events, changes for perf tool integration
> I also find attr<num>:<bit> a bit weird. So far we've used
> <perf_event_attr::fieldname>:<bits>, so while initializing anonymous
> unions is a bit difficult with 'older' GCCs and we cannot actually do:
>
> struct perf_event_attr {
> ...
> union {
> u64 flags;
> u64 disabled : 1,
> ...
> __reserved_1 : X;
> }
> ...
> };
>
> We could fake it in userspace by allowing things like: flags:23. It
> would not be a much worse hack than attr<num>:<bit> I suppose.
A goal is to be able to entirely describe the complete attr structure
with sysfs. perf tools should work out-of-the-box for every kind of
event. For this particular case a 'flags' syntax would be ok, but not
if you want to fill in other attr members.
With attr<num>:<bit> you have the flexibility to put *everything* into
attr. And for better readability there are abstractions as formats,
like:
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
Not sure if the event parser handles recursion already, but we could
also support the 'flag' format you describe that way:
/sys/bus/event_source/devices/persistent/format/flags:attr5:0-63
/sys/bus/event_source/devices/persistent/format/persistent:flags:23
We could even list events for other pmus with small changes in the
kernel, e.g.:
/sys/bus/event_source/devices/persistent/format/type:attr0:0-31
/sys/bus/event_source/devices/persistent/events/some_tracepoint:type=2,config=<tp_id>
The example is a bit weird. But in general we can translate struct
perf_event_attr to a string and vice versa. And perf tools understand
how to set up every bit of the event attr structure.
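To illustrate the tool side (a rough sketch; it relies on the flags
bitfield being the 6th u64 of struct perf_event_attr, which is what
makes it 'attr5'):

	struct perf_event_attr attr = { .size = sizeof(attr) };
	__u64 *word = (__u64 *)&attr;

	word[5] |= 1ULL << 23;	/* 'attr5:23' -> set the persistent flag */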
-Robert
On 24.06.13 21:45:11, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > Oh and what Boris and Ingo said; persistent events should 'persist' and
> > not be tied to particular processes.
Fine with me too. But we need an answer for how to create/release
persistent events.
It's easy to add them as in-kernel events in the kernel. But how could
persistent events be created from userspace without inventing too much
that is new? As for standard events, this could be done with the perf
syscall too. The event is created by a process and buffers are also
set up using the event's file descriptor. After the setup we detach the
event from the process. The event is persistent now. We could release
an event the same way by attaching it to a process and then closing it.
Detaching an event from a process basically means the event is not
released after the process closes the event or is itself killed.
Sharing an event between processes is very important for this. Sharing
requires the event buffer to be accessed readonly and the event's ref
count needs to keep track of all attached processes. As long as a
process is using the event it is not released.
But what options are there to detach the event from all processes and
make it persistent? We could just put the creating process to sleep
for as long as the event should be persistent. This seems not to be an
option for you. There could be other ways to just increase the
refcount. We could use an ftrace-like approach and modify the refcount
by:
Detach:
# echo opened_event > /sys/bus/event_source/devices/persistent/detached
Release:
# echo '!opened_event' > /sys/bus/event_source/devices/persistent/detached
And Ingo's proposal using eventfs somehow:
/sys/fs/events/user/PID/
And/or cgroups?
Are there other suggestions?
> > I'm not sure about the entire
> > eventfs thing; but the proposed sysfs thing should definitely work for
> > now.
>
> I'm fine with a sysfs approach as well, as long as it's correct and
> obvious enough to use.
Glad to hear this. :)
Thanks,
-Robert
On Tue, Jun 25, 2013 at 07:57:29PM +0200, Robert Richter wrote:
> But what options are there to detach the event from all processes and
> make it persistent?
Something like this:
ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
I guess this could simply set the persistent flag so that the rest of
the perf code knows not to destroy event buffers etc.
I don't have an idea about the reattaching though because you don't have
a file descriptor there.
Maybe for that we could really use the sys_perf_event_open() with flags
set to PERF_FLAG_PERSISTENT to note that we want to reattach to the
persistent event instead of opening a new one.
Something to that effect...
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 25.06.13 21:16:54, Borislav Petkov wrote:
> On Tue, Jun 25, 2013 at 07:57:29PM +0200, Robert Richter wrote:
> > But what options are there to detach the event from all processes and
> > make it persistent?
>
> Something like this:
>
> ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
>
> I guess this could simply set the persistent flag so that the rest of
> the perf code knows not to destroy event buffers etc.
>
> I don't have an idea about the reattaching though because you don't have
> a file descriptor there.
>
> Maybe for that we could really use the sys_perf_event_open() with flags
> set to PERF_FLAG_PERSISTENT to note that we want to reattach to the
> persistent event instead of opening a new one.
We get a new fd by opening the persistent event with the syscall.
There would be 2 new ioctls:
ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
This would be fine and reuses existing infrastructure.
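In userspace that would look roughly like this (with the two ioctls
above being the proposed ones, not existing API):

	/* create it, make it persistent, and let it run detached: */
	fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
	close(fd);			/* event and buffer stay alive */

	/* later, from any process: reattach and release it: */
	fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
	close(fd);			/* now the event goes away */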
-Robert
On Wed, Jun 26, 2013 at 10:12:23AM +0200, Robert Richter wrote:
> We get a new fd by opening the persistent event with the syscall.
> There would be 2 new ioctls:
>
> ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
> ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
>
> This would be fine and reuses existing infrastructure.
Well, how are you going to say that you want to open an already existing
persistent event or you want to create exactly the same persistent
event? Are we even going to allow identical persistent events to
coexist?
The answers to those questions give us the implementation.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Borislav Petkov <[email protected]> wrote:
> On Wed, Jun 26, 2013 at 10:12:23AM +0200, Robert Richter wrote:
> > We get a new fd by opening the persistent event with the syscall.
> > There would be 2 new ioctls:
> >
> > ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
> > ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
> >
> > This would be fine and reuses existing infrastructure.
>
> Well, how are you going to say that you want to open an already existing
> persistent event or you want to create exactly the same persistent
> event? Are we even going to allow identical persistent events to
> coexist?
If already existing persistent events show up somewhere in sysfs (or in a
separate pseudofilesystem) then an open() of them [given sufficient
privileges of the caller, etc.] could attach to them.
Thanks,
Ingo
On Wed, Jun 26, 2013 at 11:46:34AM +0200, Ingo Molnar wrote:
> If already existing persistent events show up somewhere in sysfs
> (or in a separate pseudofilesystem) then an open() of them [given
> sufficient privileges of the caller, etc.] could attach to them.
Yep, I think we want to say sys_perf_event_open(attr.persistent=1) which
would give you a file descriptor to an already existing persistent event
with attr.config=<id>.
If it doesn't exist, it creates it.
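So the attr would carry everything needed, along these lines (with
'persistent' being the proposed flag bit):

	struct perf_event_attr attr = {
		.size		= sizeof(attr),
		.type		= PERF_TYPE_TRACEPOINT,
		.config		= id,	/* unique id, taken from sysfs */
		.persistent	= 1,
	};

	fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);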
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 26.06.13 10:24:08, Borislav Petkov wrote:
> On Wed, Jun 26, 2013 at 10:12:23AM +0200, Robert Richter wrote:
> > We get a new fd by opening the persistent event with the syscall.
> > There would be 2 new ioctls:
> >
> > ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
> > ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
> >
> > This would be fine and reuses existing infrastructure.
>
> Well, how are you going to say that you want to open an already existing
> persistent event or you want to create exactly the same persistent
> event? Are we even going to allow identical persistent events to
> coexist?
Here is the scenario:
Creating a persistent event from userspace:
* A process opens a system-wide event with the syscall and gets a fd.
* The process mmaps the buffer.
* The process does an ioctl to detach the event, which increases the
event's and buffer's refcounts. The event is listed as 'persistent' in
sysfs with a unique id.
* The process closes the fd. Event and buffer remain in the system
since the refcounts are not zero.
Opening a persistent event:
* A process scans sysfs for persistent events.
* To open the event it sets up the event attr according to sysfs.
* The persistent event is opened with the syscall, the process gets a
new fd of the event.
* The process attaches to the event buffer with mmap.
Releasing a persistent event:
* A process opens a persistent event and gets a fd.
* The process does an ioctl to attach the event, which decreases the
refcounts. The sysfs entry is removed.
* The process closes the fd.
* After all processes that are tied to the event have closed their
event fds, the persistent event and its buffer are released.
Sounds like a plan?
-Robert
* Robert Richter <[email protected]> wrote:
> On 26.06.13 10:24:08, Borislav Petkov wrote:
> > On Wed, Jun 26, 2013 at 10:12:23AM +0200, Robert Richter wrote:
> > > We get a new fd by opening the persistent event with the syscall.
> > > There would be 2 new ioctls:
> > >
> > > ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
> > > ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
> > >
> > > This would be fine and reuses existing infrastructure.
> >
> > Well, how are you going to say that you want to open an already existing
> > > persistent event or you want to create exactly the same persistent
> > event? Are we even going to allow identical persistent events to
> > coexist?
>
> Here is the scenario:
Looks mostly good - with a few suggestions:
>
> Creating a persistent event from userspace:
>
> * A process opens a system-wide event with the syscall and gets a fd.
Should this really be limited to system-wide events?
> * The process mmaps the buffer.
> * The process does an ioctl to detach the event, which increases the
> event's and buffer's refcounts. The event is listed as 'persistent' in
> sysfs with a unique id.
> * The process closes the fd. Event and buffer remain in the system
> since the refcounts are not zero.
>
> Opening a persistent event:
>
> * A process scans sysfs for persistent events.
> * To open the event it sets up the event attr according to sysfs.
Basically it would just put some ID (found in sysfs) into the attr and set
attr.persistent=1 - no other information, right?
If it knows the ID straight away (the user told it, or it remembers it
from some other file such as a temporary file, etc.) then it does not even
have to scan sysfs.
[ How about this additional logic: attr.persistent=1 && attr.config==0 means
a new persistent event is created straight away - no ioctl is needed to
detach it explicitly. ]
> * The persistent event is opened with the syscall, the process gets a
> new fd of the event.
> * The process attaches to the event buffer with mmap.
Yes. And gets the pre-existing event and mmap buffer.
> Releasing a persistent event:
>
> * A process opens a persistent event and gets a fd.
> * The process does an ioctl to attach the event, which decreases the
> refcounts. The sysfs entry is removed.
> * The process closes the fd.
> * After all processes that are tied to the event have closed their
> event fds, the persistent event and its buffer are released.
>
> Sounds like a plan?
It does :-)
I'm sure there will be some details going down that path, but it looks
workable at first glance.
Note, for tracing the PERF_FLAG_FD_OUTPUT method of multiplexing multiple
events onto a single mmap buffer is probably useful (also usable via the
PERF_EVENT_IOC_SET_OUTPUT ioctl()), so please make sure the scheme works
naturally with that model as well, not just with 1:1 event+buffer
mappings.
See the uses of PERF_EVENT_IOC_SET_OUTPUT in tools/perf/.
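I.e. the usual pattern:

	/* route event 2's samples into event 1's ring buffer: */
	ioctl(fd2, PERF_EVENT_IOC_SET_OUTPUT, fd1);

	/* only fd1's buffer gets mmap'ed; fd2 needs no buffer of its own */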
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> Note, for tracing the PERF_FLAG_FD_OUTPUT method of multiplexing
> multiple events onto a single mmap buffer is probably useful (also
> usable via the PERF_EVENT_IOC_SET_OUTPUT ioctl()), so please make sure
> the scheme works naturally with that model as well, not just with 1:1
> event+buffer mappings.
>
> See the uses of PERF_EVENT_IOC_SET_OUTPUT in tools/perf/.
Note that another facility that would be very useful for tracing is
PeterZ's and tglx's patch that enables multiple tracepoints to be attached
to a single event.
See the 2+ years old (bitrotten and unfinished) WIP patch below.
It adds a PERF_EVENT_IOC_ADD_TP ioctl() that adds a new tracepoint to an
existing event. This makes perf-based tracing scale up to an arbitrary
number of tracepoints, in essence.
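Usage would look roughly like this (per the WIP patch below, where
attr.config == ~0ULL means 'no tracepoint attached yet'):

	attr.type   = PERF_TYPE_TRACEPOINT;
	attr.config = ~0ULL;
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	ioctl(fd, PERF_EVENT_IOC_ADD_TP, tp_id1);
	ioctl(fd, PERF_EVENT_IOC_ADD_TP, tp_id2);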
Thanks,
Ingo
------------------>
Subject: perf-tracepoint-idr.patch
From: Thomas Gleixner <[email protected]>
Date: Wed, 24 Nov 2010 12:09:26 +0100
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/ftrace_event.h | 10
include/linux/perf_event.h | 9
include/linux/sched.h | 9
include/trace/ftrace.h | 4
kernel/events/core.c | 407 ++++++++++++++++++++++++++++++++++++++--
kernel/trace/trace_event_perf.c | 95 +++------
kernel/trace/trace_kprobe.c | 10
kernel/trace/trace_output.c | 116 +++--------
kernel/trace/trace_syscalls.c | 8
9 files changed, 498 insertions(+), 170 deletions(-)
Index: linux/include/linux/ftrace_event.h
===================================================================
--- linux.orig/include/linux/ftrace_event.h
+++ linux/include/linux/ftrace_event.h
@@ -87,8 +87,6 @@ struct trace_event_functions {
};
struct trace_event {
- struct hlist_node node;
- struct list_head list;
int type;
struct trace_event_functions *funcs;
};
@@ -194,7 +192,6 @@ struct ftrace_event_call {
#ifdef CONFIG_PERF_EVENTS
int perf_refcount;
- struct hlist_head __percpu *perf_events;
#endif
};
@@ -263,8 +260,9 @@ struct perf_event;
DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);
-extern int perf_trace_init(struct perf_event *event);
+extern int perf_trace_init(struct perf_event *event, int event_id);
extern void perf_trace_destroy(struct perf_event *event);
+extern void perf_trace_destroy_id(int id);
extern int perf_trace_add(struct perf_event *event, int flags);
extern void perf_trace_del(struct perf_event *event, int flags);
extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
@@ -275,9 +273,9 @@ extern void *perf_trace_buf_prepare(int
static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
- u64 count, struct pt_regs *regs, void *head)
+ u64 count, struct pt_regs *regs, int id)
{
- perf_tp_event(addr, count, raw_data, size, regs, head, rctx);
+ perf_tp_event(addr, count, raw_data, size, regs, rctx, id);
}
#endif
Index: linux/include/linux/perf_event.h
===================================================================
--- linux.orig/include/linux/perf_event.h
+++ linux/include/linux/perf_event.h
@@ -247,6 +247,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
+#define PERF_EVENT_IOC_ADD_TP _IO ('$', 7)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -568,6 +569,11 @@ struct hw_perf_event {
struct task_struct *bp_target;
};
#endif
+ /*
+ * Same fudge as for breakpoints, trace-events needs
+ * it too,.. convert the bp crap over..
+ */
+ struct task_struct *event_target;
};
int state;
local64_t prev_count;
@@ -859,6 +865,7 @@ struct perf_event {
#ifdef CONFIG_EVENT_TRACING
struct ftrace_event_call *tp_event;
struct event_filter *filter;
+ struct perf_tp_idr tp_idr;
#endif
#ifdef CONFIG_CGROUP_PERF
@@ -1133,7 +1140,7 @@ static inline bool perf_paranoid_kernel(
extern void perf_event_init(void);
extern void perf_tp_event(u64 addr, u64 count, void *record,
int entry_size, struct pt_regs *regs,
- struct hlist_head *head, int rctx);
+ int rctx, int id);
extern void perf_bp_event(struct perf_event *event, void *data);
#ifndef perf_misc_flags
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -82,6 +82,7 @@ struct sched_param {
#include <linux/rculist.h>
#include <linux/rtmutex.h>
+#include <linux/idr.h>
#include <linux/time.h>
#include <linux/param.h>
#include <linux/resource.h>
@@ -1199,6 +1200,11 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+struct perf_tp_idr {
+ struct mutex lock;
+ struct idr idr;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1485,6 +1491,9 @@ struct task_struct {
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
struct mutex perf_event_mutex;
struct list_head perf_event_list;
+#ifdef CONFIG_EVENT_TRACING
+ struct perf_tp_idr *perf_tp_idr;
+#endif
#endif
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy; /* Protected by alloc_lock */
Index: linux/include/trace/ftrace.h
===================================================================
--- linux.orig/include/trace/ftrace.h
+++ linux/include/trace/ftrace.h
@@ -708,7 +708,6 @@ perf_trace_##call(void *__data, proto)
struct ftrace_raw_##call *entry; \
struct pt_regs __regs; \
u64 __addr = 0, __count = 1; \
- struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
@@ -733,9 +732,8 @@ perf_trace_##call(void *__data, proto)
\
{ assign; } \
\
- head = this_cpu_ptr(event_call->perf_events); \
perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
- __count, &__regs, head); \
+ __count, &__regs, event_call->event.type); \
}
/*
Index: linux/kernel/events/core.c
===================================================================
--- linux.orig/kernel/events/core.c
+++ linux/kernel/events/core.c
@@ -823,6 +823,7 @@ list_add_event(struct perf_event *event,
ctx->nr_events++;
if (event->attr.inherit_stat)
ctx->nr_stat++;
+ ++ctx->generation;
}
/*
@@ -976,6 +977,7 @@ list_del_event(struct perf_event *event,
*/
if (event->state > PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_OFF;
+ ++ctx->generation;
}
static void perf_group_detach(struct perf_event *event)
@@ -1894,6 +1896,12 @@ static void perf_event_context_sched_out
if (!cpuctx->task_ctx)
return;
+#if 0
+ /*
+ * Need to sort out how to make task_struct::perf_tp_idr
+ * work with this fancy switching stuff.. tracepoints could be
+ * in multiple contexts due to the software event muck.
+ */
rcu_read_lock();
parent = rcu_dereference(ctx->parent_ctx);
next_ctx = next->perf_event_ctxp[ctxn];
@@ -1927,6 +1935,7 @@ static void perf_event_context_sched_out
raw_spin_unlock(&ctx->lock);
}
rcu_read_unlock();
+#endif
if (do_switch) {
ctx_sched_out(ctx, cpuctx, EVENT_ALL);
@@ -3261,6 +3270,7 @@ static struct perf_event *perf_fget_ligh
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static int perf_event_add_tp(struct perf_event *event, int tp_id);
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
@@ -3307,6 +3317,9 @@ static long perf_ioctl(struct file *file
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_ADD_TP:
+ return perf_event_add_tp(event, arg);
+
default:
return -ENOTTY;
}
@@ -5471,6 +5484,9 @@ static struct pmu perf_swevent = {
#ifdef CONFIG_EVENT_TRACING
+#include <linux/ftrace_event.h>
+#include "../trace/trace_output.h"
+
static int perf_tp_filter_match(struct perf_event *event,
struct perf_sample_data *data)
{
@@ -5485,8 +5501,9 @@ static int perf_tp_event_match(struct pe
struct perf_sample_data *data,
struct pt_regs *regs)
{
- if (event->hw.state & PERF_HES_STOPPED)
+ if (event->state != PERF_EVENT_STATE_ACTIVE)
return 0;
+
/*
* All tracepoints are from kernel-space.
*/
@@ -5499,8 +5516,60 @@ static int perf_tp_event_match(struct pe
return 1;
}
+static void perf_tp_idr_init(struct perf_tp_idr *idr)
+{
+ idr_init(&idr->idr);
+ mutex_init(&idr->lock);
+}
+
+static DEFINE_PER_CPU(struct perf_tp_idr, perf_tp_idr);
+
+struct perf_tp_node {
+ struct list_head list;
+ struct perf_event *event;
+ struct rcu_head rcu;
+};
+
+static void do_perf_tp_event(struct perf_event *event, u64 count,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ if (perf_tp_event_match(event, data, regs))
+ perf_swevent_event(event, count, 1, data, regs);
+}
+
+static void perf_tp_idr_event(struct perf_tp_idr *tp_idr,
+ int id, u64 count,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct perf_tp_node *tp_node, *node;
+ struct perf_event *event;
+
+ if (!tp_idr)
+ return;
+
+ /*
+ * Most of this is done under rcu_read_lock_sched(), which doesn't
+ * exclude regular RCU grace periods, but the IDR code uses call_rcu()
+ * so we have to use rcu_read_lock() here as well.
+ */
+ rcu_read_lock();
+ tp_node = idr_find(&tp_idr->idr, id);
+ rcu_read_unlock();
+
+ if (!tp_node)
+ return;
+
+ event = tp_node->event;
+
+ do_perf_tp_event(event, count, data, regs);
+ list_for_each_entry_rcu(node, &tp_node->list, list)
+ do_perf_tp_event(node->event, count, data, regs);
+}
+
void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
- struct pt_regs *regs, struct hlist_head *head, int rctx)
+ struct pt_regs *regs, int rctx, int id)
{
struct perf_sample_data data;
struct perf_event *event;
@@ -5514,18 +5583,197 @@ void perf_tp_event(u64 addr, u64 count,
perf_sample_data_init(&data, addr);
data.raw = &raw;
- hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
- if (perf_tp_event_match(event, &data, regs))
- perf_swevent_event(event, count, 1, &data, regs);
- }
+ perf_tp_idr_event(&__get_cpu_var(perf_tp_idr), id, count, &data, regs);
+ perf_tp_idr_event(current->perf_tp_idr, id, count, &data, regs);
perf_swevent_put_recursion_context(rctx);
}
EXPORT_SYMBOL_GPL(perf_tp_event);
+static struct perf_tp_idr *
+perf_tp_init_task(struct perf_event *event, struct task_struct *task)
+{
+ struct perf_tp_idr *idr;
+
+ mutex_lock(&task->perf_event_mutex);
+ idr = task->perf_tp_idr;
+ if (idr)
+ goto unlock;
+
+ idr = kzalloc(sizeof(struct perf_tp_idr), GFP_KERNEL);
+ if (!idr)
+ goto unlock;
+
+ perf_tp_idr_init(idr);
+
+ task->perf_tp_idr = idr;
+unlock:
+ mutex_unlock(&task->perf_event_mutex);
+
+ return idr;
+}
+
+static struct perf_tp_idr *perf_event_idr(struct perf_event *event, bool create)
+{
+ struct perf_tp_idr *tp_idr;
+ struct task_struct *task;
+
+ if (event->attach_state & PERF_ATTACH_TASK) {
+ task = event->hw.event_target;
+ tp_idr = task->perf_tp_idr;
+ if (!tp_idr && create)
+ tp_idr = perf_tp_init_task(event, task);
+ } else
+ tp_idr = &per_cpu(perf_tp_idr, event->cpu);
+
+ return tp_idr;
+}
+
+static void perf_tp_free_node(struct rcu_head *rcu)
+{
+ struct perf_tp_node *node = container_of(rcu, struct perf_tp_node, rcu);
+
+ kfree(node);
+}
+
+static int perf_tp_remove_idr(int id, void *p, void *data)
+{
+ struct perf_tp_node *node = p;
+ struct perf_tp_node *first, *next;
+ struct perf_tp_idr *tp_idr = data;
+
+ if (!tp_idr)
+ goto no_idr;
+
+ mutex_lock(&tp_idr->lock);
+ first = idr_find(&tp_idr->idr, id);
+ if (first == node) {
+ next = list_first_entry(&first->list, struct perf_tp_node, list);
+ if (next != first)
+ idr_replace(&tp_idr->idr, next, id);
+ else
+ idr_remove(&tp_idr->idr, id);
+ }
+ list_del_rcu(&node->list);
+ mutex_unlock(&tp_idr->lock);
+
+no_idr:
+ perf_trace_destroy_id(id);
+ call_rcu_sched(&node->rcu, perf_tp_free_node);
+ return 0;
+}
+
static void tp_perf_event_destroy(struct perf_event *event)
{
- perf_trace_destroy(event);
+ /*
+ * Since this is the free path, the fd is gone and there
+ * can be no concurrency on event->tp_idr.
+ */
+
+ idr_for_each(&event->tp_idr.idr, perf_tp_remove_idr,
+ perf_event_idr(event, false));
+
+ idr_remove_all(&event->tp_idr.idr);
+ idr_destroy(&event->tp_idr.idr);
+}
+
+static int __perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ struct perf_tp_node *node, *first;
+ struct perf_tp_idr *idr;
+ int tmp_id, err, ret = -ENOMEM;
+
+ node = kmalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ goto out;
+
+ node->event = event;
+ INIT_LIST_HEAD(&node->list);
+
+ /*
+ * Insert the node into the event->idr, this idr tracks the
+ * tracepoints we're interested in, it has a 1:1 relation
+ * with the node.
+ */
+ idr = &event->tp_idr;
+ mutex_lock(&idr->lock);
+ err = idr_pre_get(&idr->idr, GFP_KERNEL);
+ if (!err) {
+ ret = -ENOMEM;
+ goto free_node;
+ }
+
+ ret = idr_get_new_above(&idr->idr, node, tp_id, &tmp_id);
+ if (ret)
+ goto free_node;
+
+ if (WARN_ON(tp_id != tmp_id)) {
+ printk(KERN_ERR "fail: %d %d\n" , tp_id, tmp_id);
+ ret = -EBUSY;
+ goto free_idr1;
+ }
+ mutex_unlock(&idr->lock);
+
+ /*
+ * Insert the node into the task/cpu idr, this idr tracks
+ * all active tracepoints for the task/cpu, it has a 1:n relation
+ * with the node.
+ */
+ idr = perf_event_idr(event, true);
+ if (!idr) {
+ if (event->attach_state & PERF_ATTACH_CONTEXT)
+ ret = -ENOMEM;
+ else
+ ret = -ESRCH;
+ goto free_idr1_set;
+ }
+ mutex_lock(&idr->lock);
+ first = idr_find(&idr->idr, tp_id);
+ if (first) {
+ list_add_rcu(&node->list, &first->list);
+ goto unlock;
+ }
+
+ err = idr_pre_get(&idr->idr, GFP_KERNEL);
+ if (!err) {
+ ret = -ENOMEM;
+ goto free_idr1_set_unlock;
+ }
+
+ ret = idr_get_new_above(&idr->idr, node, tp_id, &tmp_id);
+ if (ret)
+ goto free_idr1_set;
+
+ if (WARN_ON(tp_id != tmp_id)) {
+ ret = -EBUSY;
+ goto free_idr2;
+ }
+unlock:
+ mutex_unlock(&idr->lock);
+
+ ret = perf_trace_init(event, tp_id);
+ if (ret)
+ goto free_all;
+
+out:
+ return ret;
+
+free_all:
+ mutex_lock(&idr->lock);
+free_idr2:
+ idr_remove(&idr->idr, tmp_id);
+free_idr1_set_unlock:
+ mutex_unlock(&idr->lock);
+free_idr1_set:
+ idr = &event->tp_idr;
+ tmp_id = tp_id;
+ mutex_lock(&idr->lock);
+free_idr1:
+ idr_remove(&idr->idr, tmp_id);
+free_node:
+ mutex_unlock(&idr->lock);
+ kfree(node);
+ goto out;
}
static int perf_tp_event_init(struct perf_event *event)
@@ -5535,21 +5783,35 @@ static int perf_tp_event_init(struct per
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -ENOENT;
- err = perf_trace_init(event);
- if (err)
- return err;
+ perf_tp_idr_init(&event->tp_idr);
event->destroy = tp_perf_event_destroy;
+ if (event->attr.config != ~0ULL) {
+ err = __perf_event_add_tp(event, event->attr.config);
+ if (err)
+ return err;
+ }
+
return 0;
}
+static int perf_tp_event_add(struct perf_event *event, int flags)
+{
+ event->hw.state = flags & PERF_EF_START ? 0 : PERF_HES_STOPPED;
+ return 0;
+}
+
+static void perf_tp_event_del(struct perf_event *event, int flags)
+{
+}
+
static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
- .add = perf_trace_add,
- .del = perf_trace_del,
+ .add = perf_tp_event_add,
+ .del = perf_tp_event_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
@@ -5557,6 +5819,11 @@ static struct pmu perf_tracepoint = {
static inline void perf_tp_register(void)
{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ perf_tp_idr_init(&per_cpu(perf_tp_idr, cpu));
+
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
}
@@ -5565,7 +5832,8 @@ static int perf_event_set_filter(struct
char *filter_str;
int ret;
- if (event->attr.type != PERF_TYPE_TRACEPOINT)
+ if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+ event->attr.config == ~0ULL)
return -EINVAL;
filter_str = strndup_user(arg, PAGE_SIZE);
@@ -5583,6 +5851,74 @@ static void perf_event_free_filter(struc
ftrace_profile_free_filter(event);
}
+static int perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+ event->attr.config != ~0ULL)
+ return -EINVAL;
+
+ return __perf_event_add_tp(event, tp_id);
+}
+
+/*
+ * Called from the exit path, _after_ all events have been detached from it.
+ */
+static void perf_tp_event_exit(struct task_struct *tsk)
+{
+ struct perf_tp_idr *idr = tsk->perf_tp_idr;
+
+ if (!idr)
+ return;
+
+ idr_remove_all(&idr->idr);
+ idr_destroy(&idr->idr);
+}
+
+static void perf_tp_event_delayed_put(struct task_struct *tsk)
+{
+ struct perf_tp_idr *idr = tsk->perf_tp_idr;
+
+ tsk->perf_tp_idr = NULL;
+ kfree(idr);
+}
+
+static int perf_tp_inherit_idr(int id, void *p, void *data)
+{
+ struct perf_event *child = data;
+
+ return __perf_event_add_tp(child, id);
+}
+
+static int perf_tp_event_inherit(struct perf_event *parent_event,
+ struct perf_event *child_event)
+{
+ int ret;
+
+ if (parent_event->attr.type != PERF_TYPE_TRACEPOINT ||
+ parent_event->attr.config != ~0ULL)
+ return 0;
+
+ /*
+ * The child is not yet exposed, hence no need to serialize things
+ * on that side.
+ */
+ mutex_lock(&parent_event->tp_idr.lock);
+ ret = idr_for_each(&parent_event->tp_idr.idr,
+ perf_tp_inherit_idr,
+ child_event);
+ mutex_unlock(&parent_event->tp_idr.lock);
+
+ return ret;
+}
+
+static void perf_tp_event_init_task(struct task_struct *child)
+{
+ /*
+ * Clear the idr pointer copied from the parent.
+ */
+ child->perf_tp_idr = NULL;
+}
+
#else
static inline void perf_tp_register(void)
@@ -5598,6 +5934,29 @@ static void perf_event_free_filter(struc
{
}
+static int perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ return -ENOENT;
+}
+
+static void perf_tp_event_exit(struct task_struct *tsk)
+{
+}
+
+static void perf_tp_event_delayed_put(struct task_struct *tsk)
+{
+}
+
+static int perf_tp_event_inherit(struct perf_event *parent_event,
+ struct perf_event *child_event)
+{
+ return 0;
+}
+
+static void perf_tp_event_init_task(struct task_struct *child)
+{
+}
+
#endif /* CONFIG_EVENT_TRACING */
#ifdef CONFIG_HAVE_HW_BREAKPOINT
@@ -6173,6 +6532,9 @@ perf_event_alloc(struct perf_event_attr
INIT_LIST_HEAD(&event->sibling_list);
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
+#ifdef CONFIG_EVENT_TRACING
+ perf_tp_idr_init(&event->tp_idr);
+#endif
mutex_init(&event->mmap_mutex);
@@ -6191,6 +6553,7 @@ perf_event_alloc(struct perf_event_attr
if (task) {
event->attach_state = PERF_ATTACH_TASK;
+ event->hw.event_target = task;
#ifdef CONFIG_HAVE_HW_BREAKPOINT
/*
* hw_breakpoint is a bit difficult here..
@@ -6236,7 +6599,7 @@ done:
if (err) {
if (event->ns)
put_pid_ns(event->ns);
- kfree(event);
+ free_event(event);
return ERR_PTR(err);
}
@@ -6604,7 +6967,6 @@ SYSCALL_DEFINE5(perf_event_open,
}
perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
@@ -6681,7 +7043,6 @@ perf_event_create_kernel_counter(struct
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
@@ -6858,6 +7219,8 @@ void perf_event_exit_task(struct task_st
for_each_task_context_nr(ctxn)
perf_event_exit_task_context(child, ctxn);
+
+ perf_tp_event_exit(child);
}
static void perf_free_event(struct perf_event *event,
@@ -6920,6 +7283,8 @@ void perf_event_delayed_put(struct task_
for_each_task_context_nr(ctxn)
WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+
+ perf_tp_event_delayed_put(task);
}
/*
@@ -6935,6 +7300,7 @@ inherit_event(struct perf_event *parent_
{
struct perf_event *child_event;
unsigned long flags;
+ int ret;
/*
* Instead of creating recursive hierarchies of events,
@@ -6952,6 +7318,13 @@ inherit_event(struct perf_event *parent_
NULL);
if (IS_ERR(child_event))
return child_event;
+
+ ret = perf_tp_event_inherit(parent_event, child_event);
+ if (ret) {
+ free_event(child_event);
+ return ERR_PTR(ret);
+ }
+
get_ctx(child_ctx);
/*
@@ -7177,6 +7550,8 @@ int perf_event_init_task(struct task_str
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
+ perf_tp_event_init_task(child);
+
for_each_task_context_nr(ctxn) {
ret = perf_event_init_context(child, ctxn);
if (ret)
Index: linux/kernel/trace/trace_event_perf.c
===================================================================
--- linux.orig/kernel/trace/trace_event_perf.c
+++ linux/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/kprobes.h>
#include "trace.h"
+#include "trace_output.h"
static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
@@ -47,9 +48,7 @@ static int perf_trace_event_perm(struct
static int perf_trace_event_init(struct ftrace_event_call *tp_event,
struct perf_event *p_event)
{
- struct hlist_head __percpu *list;
int ret;
- int cpu;
ret = perf_trace_event_perm(tp_event, p_event);
if (ret)
@@ -61,15 +60,6 @@ static int perf_trace_event_init(struct
ret = -ENOMEM;
- list = alloc_percpu(struct hlist_head);
- if (!list)
- goto fail;
-
- for_each_possible_cpu(cpu)
- INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
-
- tp_event->perf_events = list;
-
if (!total_ref_count) {
char __percpu *buf;
int i;
@@ -100,63 +90,40 @@ fail:
}
}
- if (!--tp_event->perf_refcount) {
- free_percpu(tp_event->perf_events);
- tp_event->perf_events = NULL;
- }
+ --tp_event->perf_refcount;
return ret;
}
-int perf_trace_init(struct perf_event *p_event)
+int perf_trace_init(struct perf_event *p_event, int event_id)
{
struct ftrace_event_call *tp_event;
- int event_id = p_event->attr.config;
+ struct trace_event *t_event;
int ret = -EINVAL;
+ trace_event_read_lock();
+ t_event = ftrace_find_event(event_id);
+ if (!t_event)
+ goto out;
+
+ tp_event = container_of(t_event, struct ftrace_event_call, event);
+
mutex_lock(&event_mutex);
- list_for_each_entry(tp_event, &ftrace_events, list) {
- if (tp_event->event.type == event_id &&
- tp_event->class && tp_event->class->reg &&
- try_module_get(tp_event->mod)) {
- ret = perf_trace_event_init(tp_event, p_event);
- if (ret)
- module_put(tp_event->mod);
- break;
- }
+ if (tp_event->class && tp_event->class->reg &&
+ try_module_get(tp_event->mod)) {
+ ret = perf_trace_event_init(tp_event, p_event);
+ if (ret)
+ module_put(tp_event->mod);
}
mutex_unlock(&event_mutex);
+out:
+ trace_event_read_unlock();
return ret;
}
-int perf_trace_add(struct perf_event *p_event, int flags)
-{
- struct ftrace_event_call *tp_event = p_event->tp_event;
- struct hlist_head __percpu *pcpu_list;
- struct hlist_head *list;
-
- pcpu_list = tp_event->perf_events;
- if (WARN_ON_ONCE(!pcpu_list))
- return -EINVAL;
-
- if (!(flags & PERF_EF_START))
- p_event->hw.state = PERF_HES_STOPPED;
-
- list = this_cpu_ptr(pcpu_list);
- hlist_add_head_rcu(&p_event->hlist_entry, list);
-
- return 0;
-}
-
-void perf_trace_del(struct perf_event *p_event, int flags)
-{
- hlist_del_rcu(&p_event->hlist_entry);
-}
-
-void perf_trace_destroy(struct perf_event *p_event)
+static void __perf_trace_destroy(struct ftrace_event_call *tp_event)
{
- struct ftrace_event_call *tp_event = p_event->tp_event;
int i;
mutex_lock(&event_mutex);
@@ -171,9 +138,6 @@ void perf_trace_destroy(struct perf_even
*/
tracepoint_synchronize_unregister();
- free_percpu(tp_event->perf_events);
- tp_event->perf_events = NULL;
-
if (!--total_ref_count) {
for (i = 0; i < PERF_NR_CONTEXTS; i++) {
free_percpu(perf_trace_buf[i]);
@@ -185,6 +149,27 @@ out:
mutex_unlock(&event_mutex);
}
+void perf_trace_destroy(struct perf_event *p_event)
+{
+ __perf_trace_destroy(p_event->tp_event);
+}
+
+void perf_trace_destroy_id(int event_id)
+{
+ struct ftrace_event_call *tp_event;
+ struct trace_event *t_event;
+
+ trace_event_read_lock();
+ t_event = ftrace_find_event(event_id);
+ if (!t_event)
+ goto unlock;
+
+ tp_event = container_of(t_event, struct ftrace_event_call, event);
+ __perf_trace_destroy(tp_event);
+unlock:
+ trace_event_read_unlock();
+}
+
__kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs *regs, int *rctxp)
{
Index: linux/kernel/trace/trace_kprobe.c
===================================================================
--- linux.orig/kernel/trace/trace_kprobe.c
+++ linux/kernel/trace/trace_kprobe.c
@@ -1659,7 +1659,6 @@ static __kprobes void kprobe_perf_func(s
struct trace_probe *tp = container_of(kp, struct trace_probe, rp.kp);
struct ftrace_event_call *call = &tp->call;
struct kprobe_trace_entry_head *entry;
- struct hlist_head *head;
int size, __size, dsize;
int rctx;
@@ -1679,8 +1678,8 @@ static __kprobes void kprobe_perf_func(s
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), tp, regs, (u8 *)&entry[1], dsize);
- head = this_cpu_ptr(call->perf_events);
- perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+ perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs,
+ call->event.type);
}
/* Kretprobe profile handler */
@@ -1690,7 +1689,6 @@ static __kprobes void kretprobe_perf_fun
struct trace_probe *tp = container_of(ri->rp, struct trace_probe, rp);
struct ftrace_event_call *call = &tp->call;
struct kretprobe_trace_entry_head *entry;
- struct hlist_head *head;
int size, __size, dsize;
int rctx;
@@ -1710,8 +1708,8 @@ static __kprobes void kretprobe_perf_fun
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), tp, regs, (u8 *)&entry[1], dsize);
- head = this_cpu_ptr(call->perf_events);
- perf_trace_buf_submit(entry, size, rctx, entry->ret_ip, 1, regs, head);
+ perf_trace_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+ regs, call->event.type);
}
static int probe_perf_enable(struct ftrace_event_call *call)
Index: linux/kernel/trace/trace_output.c
===================================================================
--- linux.orig/kernel/trace/trace_output.c
+++ linux/kernel/trace/trace_output.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/ftrace.h>
+#include <linux/idr.h>
#include "trace_output.h"
@@ -16,9 +17,9 @@
DECLARE_RWSEM(trace_event_mutex);
-static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly;
+static const int first_event_type = __TRACE_LAST_TYPE + 1;
-static int next_event_type = __TRACE_LAST_TYPE + 1;
+static DEFINE_IDR(trace_type_idr);
int trace_print_seq(struct seq_file *m, struct trace_seq *s)
{
@@ -664,58 +665,43 @@ static int task_state_char(unsigned long
*/
struct trace_event *ftrace_find_event(int type)
{
- struct trace_event *event;
- struct hlist_node *n;
- unsigned key;
-
- key = type & (EVENT_HASHSIZE - 1);
-
- hlist_for_each_entry(event, n, &event_hash[key], node) {
- if (event->type == type)
- return event;
- }
-
- return NULL;
+ return idr_find(&trace_type_idr, type);
}
-static LIST_HEAD(ftrace_event_list);
+void trace_event_read_lock(void)
+{
+ down_read(&trace_event_mutex);
+}
-static int trace_search_list(struct list_head **list)
+void trace_event_read_unlock(void)
{
- struct trace_event *e;
- int last = __TRACE_LAST_TYPE;
+ up_read(&trace_event_mutex);
+}
- if (list_empty(&ftrace_event_list)) {
- *list = &ftrace_event_list;
- return last + 1;
- }
+static int register_event(struct trace_event *event, int id, bool strict)
+{
+ int ret, type;
- /*
- * We used up all possible max events,
- * lets see if somebody freed one.
- */
- list_for_each_entry(e, &ftrace_event_list, list) {
- if (e->type != last + 1)
- break;
- last++;
- }
+ ret = idr_pre_get(&trace_type_idr, GFP_KERNEL);
+ if (!ret)
+ return 0;
- /* Did we used up all 65 thousand events??? */
- if ((last + 1) > FTRACE_MAX_EVENT)
+ ret = idr_get_new_above(&trace_type_idr, event, id, &type);
+ if (ret)
return 0;
- *list = &e->list;
- return last + 1;
-}
+ if (strict && id != type) {
+ idr_remove(&trace_type_idr, type);
+ return 0;
+ }
-void trace_event_read_lock(void)
-{
- down_read(&trace_event_mutex);
-}
+ if (type > FTRACE_MAX_EVENT) {
+ idr_remove(&trace_type_idr, type);
+ return 0;
+ }
-void trace_event_read_unlock(void)
-{
- up_read(&trace_event_mutex);
+ event->type = type;
+ return type;
}
/**
@@ -735,7 +721,6 @@ void trace_event_read_unlock(void)
*/
int register_ftrace_event(struct trace_event *event)
{
- unsigned key;
int ret = 0;
down_write(&trace_event_mutex);
@@ -746,35 +731,18 @@ int register_ftrace_event(struct trace_e
if (WARN_ON(!event->funcs))
goto out;
- INIT_LIST_HEAD(&event->list);
-
if (!event->type) {
- struct list_head *list = NULL;
-
- if (next_event_type > FTRACE_MAX_EVENT) {
-
- event->type = trace_search_list(&list);
- if (!event->type)
- goto out;
-
- } else {
-
- event->type = next_event_type++;
- list = &ftrace_event_list;
- }
-
- if (WARN_ON(ftrace_find_event(event->type)))
+ ret = register_event(event, first_event_type, false);
+ if (!ret)
goto out;
-
- list_add_tail(&event->list, list);
-
- } else if (event->type > __TRACE_LAST_TYPE) {
- printk(KERN_WARNING "Need to add type to trace.h\n");
- WARN_ON(1);
- goto out;
} else {
- /* Is this event already used */
- if (ftrace_find_event(event->type))
+ if (event->type > __TRACE_LAST_TYPE) {
+ printk(KERN_WARNING "Need to add type to trace.h\n");
+ WARN_ON(1);
+ goto out;
+ }
+ ret = register_event(event, event->type, true);
+ if (!ret)
goto out;
}
@@ -787,11 +755,6 @@ int register_ftrace_event(struct trace_e
if (event->funcs->binary == NULL)
event->funcs->binary = trace_nop_print;
- key = event->type & (EVENT_HASHSIZE - 1);
-
- hlist_add_head(&event->node, &event_hash[key]);
-
- ret = event->type;
out:
up_write(&trace_event_mutex);
@@ -804,8 +767,7 @@ EXPORT_SYMBOL_GPL(register_ftrace_event)
*/
int __unregister_ftrace_event(struct trace_event *event)
{
- hlist_del(&event->node);
- list_del(&event->list);
+ idr_remove(&trace_type_idr, event->type);
return 0;
}
Index: linux/kernel/trace/trace_syscalls.c
===================================================================
--- linux.orig/kernel/trace/trace_syscalls.c
+++ linux/kernel/trace/trace_syscalls.c
@@ -499,7 +499,6 @@ static void perf_syscall_enter(void *ign
{
struct syscall_metadata *sys_data;
struct syscall_trace_enter *rec;
- struct hlist_head *head;
int syscall_nr;
int rctx;
int size;
@@ -530,8 +529,7 @@ static void perf_syscall_enter(void *ign
syscall_get_arguments(current, regs, 0, sys_data->nb_args,
(unsigned long *)&rec->args);
- head = this_cpu_ptr(sys_data->enter_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, rec->ent.type);
}
int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -573,7 +571,6 @@ static void perf_syscall_exit(void *igno
{
struct syscall_metadata *sys_data;
struct syscall_trace_exit *rec;
- struct hlist_head *head;
int syscall_nr;
int rctx;
int size;
@@ -606,8 +603,7 @@ static void perf_syscall_exit(void *igno
rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);
- head = this_cpu_ptr(sys_data->exit_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, rec->ent.type);
}
int perf_sysexit_enable(struct ftrace_event_call *call)
On 26.06.13 13:45:38, Ingo Molnar wrote:
> * Robert Richter <[email protected]> wrote:
> > Creating a persistent event from userspace:
> >
> > * A process opens a system-wide event with the syscall and gets a fd.
>
> Should this really be limited to system-wide events?
It does not necessarily have to be restricted to system-wide events.
Limiting it is just to make things easier in the beginning: we don't
need to think about what happens if a process dies, about permissions
in the case of per-task events, etc. (haven't thought about it yet ;).
Also, a persistent event is currently per-system, meaning there is
only a single entry for the same kind of event, which is scheduled on
all cpus. This keeps event handling simple (e.g. no need to export the
cpu in the event's sysfs entry, just the flag and the id) but also has
some drawbacks (handling of multiple events per entry). Probably a 1:1
mapping would be better.
> > * The process mmaps the buffer.
> > * The process does an ioctl to detach the process which increases the
> > events and buffers refcount. The event is listed as 'persistent' in
> > sysfs with a unique id.
> > * The process closes the fd. Event and buffer remain in the system
> > since the refcounts are not zero.
> >
> > Opening a persistent event:
> >
> > * A process scans sysfs for persistent events.
> > * To open the event it sets up the event attr according to sysfs.
>
> Basically it would just put some ID (found in sysfs) into the attr and set
> attr.persistent=1 - not any other information, right?
>
> If it knows the ID straight away (the user told it, or it remembers it
> from some other file such as a temporary file, etc.) then it does not even
> have to scan sysfs.
Yes, there is a unique id which we could also return with the ioctl
or the like. sysfs exists especially to let perf tools and the event
parser know how to set up the events. It might also be useful if the
syscall setup for this kind of event changes in the future; then we
just modify the sysfs entry.
> [ How about additional logic: attr.persistent=1 && attr.config==0 means
> a new persistent event is created straight away - no ioctl is needed to
> detach it explicitly. ]
That's correct. We could also do the following:
To connect to an existing event:
attr.type=<persistent-pmu> && attr.config==<event-id>
(This might be harder to implement unless the persistent event pmu
type is fixed, e.g. PERF_TYPE_PERSISTENT=6.)
To create a new persistent event:
attr.persistent=1 && attr=<some event setup: pmu, config, flags, etc>
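For concreteness, a minimal userspace sketch of the two setups (the
pmu type value 6 and the attr.persistent bit are taken from this
discussion and the patch set, not settled ABI; building it needs the
patched linux/perf_event.h):

	#define _GNU_SOURCE
	#include <string.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				       int cpu, int group_fd, unsigned long flags)
	{
		return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
	}

	/* Connect to an existing persistent event by its unique id. */
	static int open_persistent(int event_id, int cpu)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size   = sizeof(attr);
		attr.type   = 6;		/* assumed PERF_TYPE_PERSISTENT */
		attr.config = event_id;		/* unique id from sysfs */

		return sys_perf_event_open(&attr, -1, cpu, -1, 0);
	}

	/* Create a new persistent event from a regular event setup. */
	static int create_persistent(__u32 type, __u64 config, int cpu)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size       = sizeof(attr);
		attr.type       = type;		/* pmu of the underlying event */
		attr.config     = config;
		attr.persistent = 1;		/* bitfield added by the patches */

		return sys_perf_event_open(&attr, -1, cpu, -1, 0);
	}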
> > * The persistent event is opened with the syscall, the process gets a
> > new fd of the event.
> > * The process attaches to the event buffer with mmap.
>
> Yes. And gets the pre-existing event and mmap buffer.
That's what I mean.
A problem here is that the mmap'ed buffer size (number of pages) must
be equal to the pre-existing buffer size and thus has to be known somehow.
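For illustration, attaching read-only to the pre-existing buffer would
then look roughly like this (a sketch which simply assumes nr_pages is
already known, e.g. hardcoded or exported somewhere):

	#include <unistd.h>
	#include <sys/mman.h>

	/* One header page (struct perf_event_mmap_page) plus nr_pages data
	 * pages; PROT_READ only, since persistent buffers are read-only. */
	static void *attach_buffer(int fd, unsigned long nr_pages)
	{
		size_t len = (nr_pages + 1) * sysconf(_SC_PAGESIZE);

		return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	}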
> > Releasing a persistent event:
> >
> > * A process opens a persistent event and gets a fd.
> > * The process does an ioctl to attach the process which decreases the
> > refcounts. The sysfs entry is removed.
> > * The process closes the fd.
> > * After all processes that are tied to the event closed their event's
> > fds, the persistent event and its buffer is released.
> >
> > Sounds like a plan?
>
> It does :-)
>
> I'm sure there will be some details going down that path, but it looks
> workable at first glance.
Yes, there will be some 'implementation details', but it should work.
> Note, for tracing the PERF_FLAG_FD_OUTPUT method of multiplexing multiple
> events onto a single mmap buffer is probably useful (also usable via the
> PERF_EVENT_IOC_SET_OUTPUT ioctl()), so please make sure the scheme works
> naturally with that model as well, not just with 1:1 event+buffer
> mappings.
>
> See the uses of PERF_EVENT_IOC_SET_OUTPUT in tools/perf/.
Yes, thanks for this hint. I wasn't aware of this feature yet.
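For reference, redirecting one event's output into another event's
buffer uses the existing ioctl like this (fd1/fd2 are illustrative
event fds on the same cpu):

	#include <sys/ioctl.h>
	#include <linux/perf_event.h>

	/* Route fd2's samples into fd1's ring buffer; only fd1 gets
	 * mmap'ed. Both events must be on the same cpu resp. task. */
	static int route_into(int fd1, int fd2)
	{
		return ioctl(fd2, PERF_EVENT_IOC_SET_OUTPUT, fd1);
	}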
Thanks for your comments. I will start reworking the patches in this
direction.
-Robert
Hi Robert,
On Wed, 26 Jun 2013 14:44:24 +0200, Robert Richter wrote:
> On 26.06.13 13:45:38, Ingo Molnar wrote:
>> [ How about additional logic: attr.persistent=1 && attr.config==0 means
>> a new persistent event is created straight away - no ioctl is needed to
>> detach it explicitly. ]
>
> That's correct. We could also do the following:
>
> To connect to an existing event:
>
> attr.type=<persistent-pmu> && attr.config==<event-id>
>
> (This might be harder to implement unless the persistent event pmu
> type is fixed, e.g. PERF_TYPE_PERSISTENT=6.)
>
> To create a new persistent event:
>
> attr.persistent=1 && attr=<some event setup: pmu, config, flags, etc>
How about using 2 bits for the persistent flag: 1 for connecting to an
existing event, 2 for creating a new one?
>> > * The persistent event is opened with the syscall, the process gets a
>> > new fd of the event.
>> > * The process attaches to the event buffer with mmap.
>>
>> Yes. And gets the pre-existing event and mmap buffer.
>
> That's what I mean.
>
> A problem here is that the mmap'ed buffer size (number of pages) must
> be equal to the pre-existing buffer size and thus has to be known somehow.
What about also exporting the buffer size via the sysfs pmu directory?
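For illustration, userland could then size its mmap from such a file -
a sketch assuming a hypothetical per-pmu "nr_pages" attribute, which
does not exist in the current patches:

	#include <stdio.h>

	/* Returns the buffer size in pages, or -1 on error; the sysfs
	 * path is made up for this example. */
	static long read_nr_pages(void)
	{
		long nr_pages;
		FILE *f = fopen("/sys/bus/event_source/devices/persistent/nr_pages", "r");

		if (!f)
			return -1;
		if (fscanf(f, "%ld", &nr_pages) != 1)
			nr_pages = -1;
		fclose(f);
		return nr_pages;
	}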
Thanks,
Namhyung
On Thu, Jun 27, 2013 at 02:46:36PM +0900, Namhyung Kim wrote:
> How about using 2 bits for perfsistent flag, 1 for connecting to an
> existing one, 2 for creating new one.
No need since persistent events don't need to be duplicated. Think of a
tracepoint: the samples you get there are the same, no matter how many
times you enable it.
So, if you open a persistent event which doesn't exist, it gets created.
If you open an existing one, you get read-only access to its buffers. No
need for two bits. Actually, with the detach/attach ioctl we don't even
need a single flag, but having one makes the implementation a lot simpler.
> > A problem here is that the mmap'ed buffer size (number of pages) must
> > be equal to the pre-existing buffer size and thus has to be known somehow.
>
> What about also exporting the buffer size via sysfs pmu directory?
Yes, we've been discussing buffer size. The simplest thing is to
hardcode the buffer size so that it is implicitly known to all agents.
However, I don't know whether there is a use case where different buffer
sizes actually make sense. I'd tend to do the simplest thing initially.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Borislav Petkov <[email protected]> wrote:
> > > A problem here is that the mmap'ed buffer size (number of pages)
> > > must be equal to the pre-existing buffer size and thus has to be
> > > known somehow.
> >
> > What about also exporting the buffer size via sysfs pmu directory?
>
> Yes, we've been discussing buffer size. The simplest thing is to
> hardcode the buffer size so that it is implicitly known to all agents.
> However, I don't know whether there is a use case where different buffer
> sizes actually make sense. I'd tend to do the simplest thing initially.
Btw., in terms of testing and design focus I'd suggest concentrating not
on rare and relatively singular events like RAS MCE events, but on a more
'everyday tracing' flow of events:
- large, per cpu trace buffers
- all events output into a single trace buffer to get a coherent trace
- [ possibly revive the tracepoint-multiplexing patch from peterz/tglx,
to be able to get a rich selection of tracepoints. ]
That is, I think, the workflow most people will be interested in:
- they have a workload to analyze and they want to do some 'tracing' to
understand it better or to pinpoint a problem.
- based on the problem they want to trace a selection of tracepoints, as
easily and quickly as possible.
- we could possibly provide a few 'groups' of tracepoints for typical
uses (for example scheduler tracing, MM tracing, or IO tracing,
etc.), so people wouldn't have to specify a dozen tracepoints but could
symbolically refer to any of these pre-cooked groups. [this is probably
a tooling detail.]
- they want to persistently trace into a generously sized trace buffer,
without any tool running while the collection period lasts.
- to refine the result they'd like to stop/start tracing, reset/clear the
tracebuffer for failed attempts, and generally see how large the
tracebuffer is.
- and extract/analyze traces - both in a detailed form and in summary
fashion.
- they possibly want to save it to a perf.data file and have equivalent
analysis facilities.
If that workflow works well then the RAS angle will work well too.
Thanks,
Ingo