From: Borislav Petkov <[email protected]>
Yeah,
here's a refresh of the persistent events deal; accessing those is much
cleaner now. Here's how:
So kernel code initializes and enables the event at its convenience
(during boot, whenever) and userspace goes and says:
sys_perf_event_open(pattr,...)
with pattr.persistent = 1. Userspace gets the persistent buffer file
descriptor to read from. Without that, we get a normal perf file
descriptor for the duration of the tracing.
This saves all the diddling of trying to hand down file descriptors
through debugfs or whatever. Instead, current perf code simply can use
it.
This is still RFC but things are starting to fall into place slowly. As
always, any and all comments/suggestions are welcome.
Borislav Petkov (3):
perf: Add persistent events
perf: Add persistent event facilities
MCE: Enable persistent event
arch/x86/kernel/cpu/mcheck/mce.c | 25 +++++++
include/linux/perf_event.h | 14 +++-
include/uapi/linux/perf_event.h | 3 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 27 +++++---
kernel/events/internal.h | 4 ++
kernel/events/persistent.c | 141 +++++++++++++++++++++++++++++++++++++++
7 files changed, 202 insertions(+), 14 deletions(-)
create mode 100644 kernel/events/persistent.c
--
1.8.1.3.535.ga923c31
Add the needed pieces for persistent events which make them
process-agnostic. Also, make their buffers read-only when mmapping them
from userspace.
Signed-off-by: Borislav Petkov <[email protected]>
---
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 8 +++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9fa9c622a7f4..4a4ae56195e1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -270,8 +270,9 @@ struct perf_event_attr {
exclude_callchain_kernel : 1, /* exclude kernel callchains */
exclude_callchain_user : 1, /* exclude user callchains */
+ persistent : 1, /* always-on event */
- __reserved_1 : 41;
+ __reserved_1 : 40;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd86501c30..7c712b0c3b9a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3560,6 +3560,9 @@ static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ if (event->attr.persistent)
+ return atomic_dec(&event->mmap_count);
+
if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
unsigned long size = perf_data_size(event->rb);
struct user_struct *user = event->mmap_user;
@@ -3603,7 +3606,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (event->cpu == -1 && event->attr.inherit)
return -EINVAL;
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_SHARED) && !event->attr.persistent)
+ return -EINVAL;
+
+ if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
return -EINVAL;
vma_size = vma->vm_end - vma->vm_start;
--
1.8.1.3.535.ga923c31
Add a barebones implementation for registering persistent events with
perf. For that, we don't destroy the buffers when they're unmapped;
also, we map them read-only so that multiple agents can access them.
Also, we allocate the event buffers at event init time and not at mmap
time so that we can log samples into them regardless of whether there
are readers in userspace or not.
Signed-off-by: Borislav Petkov <[email protected]>
---
include/linux/perf_event.h | 14 ++++-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 19 +++---
kernel/events/internal.h | 4 ++
kernel/events/persistent.c | 148 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 175 insertions(+), 12 deletions(-)
create mode 100644 kernel/events/persistent.c
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ee462c2f2..e3e4b64c9286 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -531,6 +531,13 @@ struct perf_output_handle {
int page;
};
+struct pers_event_desc {
+ struct perf_event_attr *attr;
+ struct perf_event *event;
+ struct list_head plist;
+ int fd;
+};
+
#ifdef CONFIG_PERF_EVENTS
extern int perf_pmu_register(struct pmu *pmu, char *name, int type);
@@ -758,7 +765,8 @@ extern void perf_event_enable(struct perf_event *event);
extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
-#else
+extern int perf_add_persistent_event(struct perf_event_attr *, unsigned);
+#else /* !CONFIG_PERF_EVENTS */
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
@@ -797,7 +805,9 @@ static inline void perf_event_enable(struct perf_event *event) { }
static inline void perf_event_disable(struct perf_event *event) { }
static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
-#endif
+static inline int perf_add_persistent_event(struct perf_event_attr *attr,
+ unsigned nr_pages) { return -EINVAL; }
+#endif /* !CONFIG_PERF_EVENTS */
#define perf_output_put(handle, x) perf_output_copy((handle), &(x), sizeof(x))
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d147b2f..70990d5a2037 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = -pg
endif
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o persistent.o
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c712b0c3b9a..dd8696959ca7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2866,8 +2866,6 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}
-static void ring_buffer_put(struct ring_buffer *rb);
-
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -2898,7 +2896,7 @@ static void free_event(struct perf_event *event)
}
if (event->rb) {
- ring_buffer_put(event->rb);
+ perf_ring_buffer_put(event->rb);
event->rb = NULL;
}
@@ -3243,8 +3241,6 @@ unlock:
return ret;
}
-static const struct file_operations perf_fops;
-
static inline int perf_fget_light(int fd, struct fd *p)
{
struct fd f = fdget(fd);
@@ -3531,7 +3527,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}
-static void ring_buffer_put(struct ring_buffer *rb)
+void perf_ring_buffer_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;
@@ -3574,7 +3570,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
ring_buffer_detach(event, rb);
mutex_unlock(&event->mmap_mutex);
- ring_buffer_put(rb);
+ perf_ring_buffer_put(rb);
free_uid(user);
}
}
@@ -3711,7 +3707,7 @@ static int perf_fasync(int fd, struct file *filp, int on)
return 0;
}
-static const struct file_operations perf_fops = {
+const struct file_operations perf_fops = {
.llseek = no_llseek,
.release = perf_release,
.read = perf_read,
@@ -6428,7 +6424,7 @@ unlock:
mutex_unlock(&event->mmap_mutex);
if (old_rb)
- ring_buffer_put(old_rb);
+ perf_ring_buffer_put(old_rb);
out:
return ret;
}
@@ -6465,6 +6461,9 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
+ if (attr.persistent)
+ return perf_get_persistent_event_fd(cpu, &attr);
+
if (!attr.exclude_kernel) {
if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -7422,6 +7421,8 @@ void __init perf_event_init(void)
*/
BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
!= 1024);
+
+ persistent_events_init();
}
static int __init perf_event_sysfs_init(void)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index d56a64c99a8b..85594b2f00e3 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -38,6 +38,7 @@ struct ring_buffer {
extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
+extern void perf_ring_buffer_put(struct ring_buffer *rb);
extern void perf_event_wakeup(struct perf_event *event);
extern void
@@ -174,4 +175,7 @@ static inline bool arch_perf_have_user_stack_dump(void)
#define perf_user_stack_pointer(regs) 0
#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
+extern const struct file_operations perf_fops;
+extern int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr);
+extern void __init persistent_events_init(void);
#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
new file mode 100644
index 000000000000..bda2dde6862b
--- /dev/null
+++ b/kernel/events/persistent.c
@@ -0,0 +1,148 @@
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/perf_event.h>
+#include <linux/anon_inodes.h>
+
+#include "internal.h"
+
+DEFINE_PER_CPU(struct list_head, pers_events);
+
+static struct perf_event *
+add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
+ unsigned nr_pages)
+{
+ struct perf_event *event = ERR_PTR(-EINVAL);
+ struct pers_event_desc *desc;
+ struct ring_buffer *buf;
+
+ desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+ if (!desc)
+ goto out;
+
+ event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+ if (IS_ERR(event))
+ goto err_event;
+
+ buf = rb_alloc(nr_pages, 0, cpu, 0);
+ if (!buf)
+ goto err_event_file;
+
+ rcu_assign_pointer(event->rb, buf);
+
+ desc->event = event;
+ desc->attr = attr;
+
+ INIT_LIST_HEAD(&desc->plist);
+ list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
+
+ perf_event_enable(event);
+
+ goto out;
+
+ err_event_file:
+ perf_event_release_kernel(event);
+
+ err_event:
+ kfree(desc);
+
+ out:
+ return event;
+}
+
+static void rm_persistent_event(int cpu, struct perf_event_attr *attr)
+{
+ struct pers_event_desc *desc, *tmp;
+ struct perf_event *event = NULL;
+
+ list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
+ if (desc->attr->config == attr->config) {
+ event = desc->event;
+ break;
+ }
+ }
+
+ if (!event)
+ return;
+
+ __list_del(desc->plist.prev, desc->plist.next);
+
+ perf_event_disable(event);
+ if (event->rb) {
+ perf_ring_buffer_put(event->rb);
+ rcu_assign_pointer(event->rb, NULL);
+ }
+
+ perf_event_release_kernel(event);
+ put_unused_fd(desc->fd);
+ kfree(desc);
+}
+
+int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
+{
+ struct pers_event_desc *desc;
+ struct file *event_file = NULL;
+ int event_fd = -1;
+
+ list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
+
+ if (desc->attr->config != attr->config)
+ continue;
+
+ event_fd = get_unused_fd();
+ if (event_fd < 0)
+ goto out;
+
+ event_file = anon_inode_getfile("[pers_event]", &perf_fops,
+ desc->event, O_RDONLY);
+ if (IS_ERR(event_file))
+ goto err_event_file;
+
+ desc->fd = event_fd;
+ fd_install(event_fd, event_file);
+
+ return event_fd;
+ }
+
+ err_event_file:
+ put_unused_fd(event_fd);
+
+out:
+ return event_fd;
+}
+
+/*
+ * Create and enable the persistent version of the perf event described by
+ * @attr.
+ *
+ * @attr: perf event descriptor
+ * @nr_pages: size in pages
+ */
+int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
+{
+ struct perf_event *event;
+ int i;
+
+ for_each_possible_cpu(i) {
+ event = add_persistent_event_on_cpu(i, attr, nr_pages);
+ if (IS_ERR(event)) {
+ pr_err("%s: Error adding persistent event on cpu %d\n",
+ __func__, i);
+ goto unwind;
+ }
+ }
+ return 0;
+
+unwind:
+ while (--i >= 0)
+ rm_persistent_event(i, attr);
+
+ return -EINVAL;
+}
+
+void __init persistent_events_init(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(pers_events, i));
+}
--
1.8.1.3.535.ga923c31
From: Borislav Petkov <[email protected]>
... for collecting MCEs.
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7bc126346ace..4f44991b38b5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1987,6 +1987,31 @@ int __init mcheck_init(void)
return 0;
}
+static struct perf_event_attr pattr = {
+ .type = PERF_TYPE_TRACEPOINT,
+ .size = sizeof(pattr),
+ .sample_type = PERF_SAMPLE_RAW,
+ .persistent = 1,
+};
+
+int __init mcheck_init_tp(void)
+{
+
+ pattr.config = event_mce_record.event.type;
+ pattr.sample_period = 1;
+ pattr.wakeup_events = 1;
+
+ if (perf_add_persistent_event(&pattr, 4))
+ pr_err("Error adding MCE persistent event.\n");
+
+ return 0;
+}
+/*
+ * We can't run earlier because persistent events uses anon_inode_getfile and
+ * its anon_inode_mnt gets initialized as a fs_initcall.
+ */
+fs_initcall_sync(mcheck_init_tp);
+
/*
* mce_syscore: PM support
*/
--
1.8.1.3.535.ga923c31
Hi Borislav,
On Fri, 15 Mar 2013 14:06:26 +0100, Borislav Petkov wrote:
> From: Borislav Petkov <[email protected]>
>
> Yeah,
>
> here's a refresh of the persistent events deal; accessing those is much
> cleaner now. Here's how:
>
> So kernel code initializes and enables the event at its convenience
> (during boot, whenever) and userspace goes and says:
>
> sys_perf_event_open(pattr,...)
>
> with pattr.persistent = 1. Userspace gets the persistent buffer file
> descriptor to read from. Without that, we get a normal perf file
> descriptor for the duration of the tracing.
>
> This saves all the diddling of trying to hand down file descriptors
> through debugfs or whatever. Instead, current perf code simply can use
> it.
Interesting, it'll be helpful for profiling boot-time behaviors.
So my question is: how can a user know which persistent events are
available in her system?
Thanks,
Namhyung
>
> This is still RFC but things are starting to fall into place slowly. As
> always, any and all comments/suggestions are welcome.
>
> Borislav Petkov (3):
> perf: Add persistent events
> perf: Add persistent event facilities
> MCE: Enable persistent event
>
> arch/x86/kernel/cpu/mcheck/mce.c | 25 +++++++
> include/linux/perf_event.h | 14 +++-
> include/uapi/linux/perf_event.h | 3 +-
> kernel/events/Makefile | 2 +-
> kernel/events/core.c | 27 +++++---
> kernel/events/internal.h | 4 ++
> kernel/events/persistent.c | 141 +++++++++++++++++++++++++++++++++++++++
> 7 files changed, 202 insertions(+), 14 deletions(-)
> create mode 100644 kernel/events/persistent.c
* Borislav Petkov <[email protected]> wrote:
> From: Borislav Petkov <[email protected]>
>
> Yeah,
>
> here's a refresh of the persistent events deal; accessing those is much
> cleaner now. Here's how:
>
> So kernel code initializes and enables the event at its convenience
> (during boot, whenever) and userspace goes and says:
>
> sys_perf_event_open(pattr,...)
>
> with pattr.persistent = 1. Userspace gets the persistent buffer file
> descriptor to read from. Without that, we get a normal perf file
> descriptor for the duration of the tracing.
>
> This saves all the diddling of trying to hand down file descriptors
> through debugfs or whatever. Instead, current perf code simply can use
> it.
>
> This is still RFC but things are starting to fall into place slowly. As
> always, any and all comments/suggestions are welcome.
That definitely looks interesting and desirable. It would be nice to have
more generic/flexible semantics by using the VFS for tracing context
discovery.
That would allow 'stateful tracing', and not just in a kernel initiated
fashion: we could basically do ftrace-alike tracing, into persistent,
VFS-named buffers.
The question is, how are the individual buffers identified when and after
they have been created? An option would be to use cgroups for that -
cgroups already has its own VFS and syscall interfaces. But maybe some
other, explicit interface is needed (eventfs).
All the usecases we talked about in the past would work fine that way:
- the MCE events would show up as an already created set of buffers,
discoverable via the VFS interface.
- user-space could generate more 'tracing/profiling contexts' runtime.
- a boot tracer would activate via a boot option, and it would create a
tracing context - visible via the VFS interface.
- modern RAS daemon replacing mcelog
If you make that work, via a new perf tool side as well that allows the
creation of a tracing context (and a separate extraction as well), via a
modified 'perf trace' or a new subcommand, that would be a major,
upstream-worthy perf feature IMO which would go way beyond the RAS
usecase ...
Such a feature would become a popular instrumentation tool pretty quickly.
Thanks,
Ingo
* Namhyung Kim <[email protected]> wrote:
> So my question is: how can a user know which persistent events are
> available in her system?
I think we need VFS enumeration for that: directories give a high-level
structure (allowing things like per-user contexts) while readdir will
give a list of specific persistent buffer contexts.
A sensible naming convention would be needed to make things easy to
discover and list - and for active buffers to not be forgotten about.
cgroups or a new 'eventfs' filesystem would be an option.
Thanks,
Ingo
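Purely to make the enumeration idea concrete, a hypothetical layout (none of these paths exist; neither cgroups integration nor an eventfs has been written at this point):

```
/eventfs/                     hypothetical mount point and names
    mce/                      kernel-created context, present at boot
        cpu0 ... cpuN         per-cpu persistent buffers
    boot/                     boot tracer context
    users/
        joe/
            kvm-trace/        user-created tracing context
```

readdir on the top level would then answer Namhyung's question of which persistent events a system currently offers.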
On Fri, 15 Mar 2013 14:06:28 +0100, Borislav Petkov wrote:
> Add a barebones implementation for registering persistent events with
> perf. For that, we don't destroy the buffers when they're unmapped;
> also, we map them read-only so that multiple agents can access them.
>
> Also, we allocate the event buffers at event init time and not at mmap
> time so that we can log samples into them regardless of whether there
> are readers in userspace or not.
[SNIP]
> -static void ring_buffer_put(struct ring_buffer *rb)
> +void perf_ring_buffer_put(struct ring_buffer *rb)
Why did you rename this function?
> {
> struct perf_event *event, *n;
> unsigned long flags;
[SNIP]
> +static struct perf_event *
> +add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
> + unsigned nr_pages)
> +{
> + struct perf_event *event = ERR_PTR(-EINVAL);
Looks like -ENOMEM is more appropriate code for the below kzalloc.
> + struct pers_event_desc *desc;
> + struct ring_buffer *buf;
> +
> + desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> + if (!desc)
> + goto out;
> +
> + event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
> + if (IS_ERR(event))
> + goto err_event;
> +
> + buf = rb_alloc(nr_pages, 0, cpu, 0);
> + if (!buf)
> + goto err_event_file;
> +
> + rcu_assign_pointer(event->rb, buf);
> +
> + desc->event = event;
> + desc->attr = attr;
> +
> + INIT_LIST_HEAD(&desc->plist);
> + list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
> +
> + perf_event_enable(event);
> +
> + goto out;
> +
> + err_event_file:
> + perf_event_release_kernel(event);
It needs to reset event to ERR_PTR(-ENOMEM) ?
> +
> + err_event:
> + kfree(desc);
> +
> + out:
> + return event;
> +}
> +
> +static void rm_persistent_event(int cpu, struct perf_event_attr *attr)
> +{
> + struct pers_event_desc *desc, *tmp;
> + struct perf_event *event = NULL;
> +
> + list_for_each_entry_safe(desc, tmp, &per_cpu(pers_events, cpu), plist) {
> + if (desc->attr->config == attr->config) {
> + event = desc->event;
> + break;
> + }
> + }
> +
> + if (!event)
> + return;
> +
> + __list_del(desc->plist.prev, desc->plist.next);
Why not using list_del(&desc->plist) ?
> +
> + perf_event_disable(event);
> + if (event->rb) {
> + perf_ring_buffer_put(event->rb);
> + rcu_assign_pointer(event->rb, NULL);
> + }
> +
> + perf_event_release_kernel(event);
> + put_unused_fd(desc->fd);
> + kfree(desc);
> +}
> +
> +int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
> +{
> + struct pers_event_desc *desc;
> + struct file *event_file = NULL;
> + int event_fd = -1;
> +
> + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
> +
> + if (desc->attr->config != attr->config)
> + continue;
> +
> + event_fd = get_unused_fd();
> + if (event_fd < 0)
> + goto out;
> +
> + event_file = anon_inode_getfile("[pers_event]", &perf_fops,
> + desc->event, O_RDONLY);
> + if (IS_ERR(event_file))
> + goto err_event_file;
> +
> + desc->fd = event_fd;
> + fd_install(event_fd, event_file);
> +
> + return event_fd;
> + }
> +
> + err_event_file:
> + put_unused_fd(event_fd);
Isn't it safe to have an event_fd of -1 in case it's not found? Anyway,
if it's returned to user space directly, it'd be better to have a more
meaningful error code IMHO.
> +
> +out:
> + return event_fd;
> +}
Thanks,
Namhyung
Hi,
On Mon, Mar 18, 2013 at 06:13:58PM +0900, Namhyung Kim wrote:
> > -static void ring_buffer_put(struct ring_buffer *rb)
> > +void perf_ring_buffer_put(struct ring_buffer *rb)
>
> Why did you rename this function?
Yeah, actually that's not needed.
However, the perf ring buffer function naming is kinda inconsistent:
ring_buffer_*
rb_*
Since we're keeping those internal to perf, they maybe should have the
same prefix, no?
I vote for "rb_" because it is shorter. :)
> > + err_event_file:
> > + perf_event_release_kernel(event);
>
> It needs to reset event to ERR_PTR(-ENOMEM) ?
Yeah, the error path in this function is kinda clumsy, I'll do some more
staring at how to make it simpler/better.
[ … ]
> > + __list_del(desc->plist.prev, desc->plist.next);
>
> Why not using list_del(&desc->plist) ?
Will do.
[ … ]
> > + err_event_file:
> > + put_unused_fd(event_fd);
>
> Isn't it safe to have an event_fd of -1 in case it's not found? Anyway,
> if it's returned to user space directly, it'd be better to have a more
> meaningful error code IMHO.
Yeah, I rewrote that one in the meantime to use a helper and it is
cleaner now.
Thanks for the review!
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Mon, Mar 18, 2013 at 09:40:08AM +0100, Ingo Molnar wrote:
> That definitely looks interesting and desirable. It would be nice to
> have more generic/flexible semantics by using the VFS for tracing
> context discovery.
>
> That would allow 'stateful tracing', and not just in a kernel
> initiated fashion: we could basically do ftrace-alike tracing, into
> persistent, VFS-named buffers.
>
> The question is, how are the individual buffers identified when and
> after they have been created? An option would be to use cgroups for
> that - cgroups already has its own VFS and syscall interfaces. But
> maybe some other, explicit interface is needed (eventfs).
My latest knowledge is that Steve needs an events filesystem too because
he wants to do mkdir in debugfs. So maybe we should really do an eventfs
finally - it has been asked for so many times now. :)
> All the usecases we talked about in the past would work fine that way:
>
> - the MCE events would show up as an already created set of buffers,
> discoverable via the VFS interface.
Right, in the MCE case though, we want to enable them as early as
possible, i.e. long before we can even manipulate something through the
VFS. So my current thinking is that we need both - a way to enable a
persistent event from within the kernel and then the eventfs thing.
> - user-space could generate more 'tracing/profiling contexts' runtime.
Sure.
> - a boot tracer would activate via a boot option, and it would create
> a tracing context - visible via the VFS interface.
Right, and this one you still want to enable as early as possible, long
before userspace can access something through VFS. Btw, how are we going
to boottrace stuff which happens before perf initialization? Cache it
into buffers somewhere?
> - modern RAS daemon replacing mcelog
>
> If you make that work, via a new perf tool side as well that allows
> the creation of a tracing context (and a separate extraction as well),
> via a modified 'perf trace' or a new subcommand, that would be a major,
> upstream-worthy perf feature IMO which would go way beyond the RAS
> usecase.
Right, ok, so we could start working towards that too but I'd like to
not delay the persistent events stuff so that we can finally do the RAS
daemon, and do it properly.
Besides, having the eventfs and persistence would be a different aspect
of manipulating perf events, and it can be developed independently and
in parallel.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 18.03.13 09:46:38, Ingo Molnar wrote:
>
> * Namhyung Kim <[email protected]> wrote:
>
> > So my question is: how can a user know which persistent events are
> > available in her system?
>
> I think we need VFS enumeration for that: directories give a high-level
> structure (allowing things like per-user contexts) while readdir will
> give a list of specific persistent buffer contexts.
>
> A sensible naming convention would be needed to make things easy to
> discover and list - and for active buffers to not be forgotten about.
>
> cgroups or a new 'eventfs' filesystem would be an option.
An option would be to attach the persistent events to a hosting pmu
(e.g. 'ras' in this case) and provide the events via sysfs as already
done by other pmus:
/sys/bus/event_source/devices/ras/events/
/sys/bus/event_source/devices/ras/events/mce_record
...
perf tools then work out-of-the-box with -e ras/mce_record/.
The event is selected by the 'ras' pmu and then routed to the original
pmu that might be e.g. 'tracepoint'. So we attach each persistent
event to a 'virtual' pmu which does the grouping in the perf sysfs and
the forwarding to its actual pmu.
-Robert
Boris,
in general your approach looks good. Just a question and some comments
below.
On 15.03.13 14:06:28, Borislav Petkov wrote:
> Add a barebones implementation for registering persistent events with
> perf. For that, we don't destroy the buffers when they're unmapped;
> also, we map them read-only so that multiple agents can access them.
>
> Also, we allocate the event buffers at event init time and not at mmap
> time so that we can log samples into them regardless of whether there
> are readers in userspace or not.
The mmap'ed region is already allocated by the kernel. How does a user
know the buffer size of the mmap'ed region?
Comments below of what I found so far.
Also, I wouldn't make too much use of -EINVAL, this should only be
used if the syscall contains *wrong* data.
-Robert
> +static struct perf_event *
> +add_persistent_event_on_cpu(unsigned int cpu, struct perf_event_attr *attr,
> + unsigned nr_pages)
> +{
> + struct perf_event *event = ERR_PTR(-EINVAL);
> + struct pers_event_desc *desc;
> + struct ring_buffer *buf;
> +
> + desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> + if (!desc)
> + goto out;
> +
> + event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
> + if (IS_ERR(event))
> + goto err_event;
> +
> + buf = rb_alloc(nr_pages, 0, cpu, 0);
> + if (!buf)
> + goto err_event_file;
> +
> + rcu_assign_pointer(event->rb, buf);
> +
> + desc->event = event;
> + desc->attr = attr;
> +
> + INIT_LIST_HEAD(&desc->plist);
> + list_add_tail(&desc->plist, &per_cpu(pers_events, cpu));
> +
> + perf_event_enable(event);
> +
> + goto out;
> +
> + err_event_file:
> + perf_event_release_kernel(event);
event must be set to an error code here.
Better swap the order of rb_alloc() and
perf_event_create_kernel_counter(). Makes things easier.
> +
> + err_event:
> + kfree(desc);
> +
> + out:
> + return event;
> +}
> +
> +static void rm_persistent_event(int cpu, struct perf_event_attr *attr)
Would rather prefer del_... as this is actually used for deleting
events in perf.
[...]
> +int perf_get_persistent_event_fd(unsigned cpu, struct perf_event_attr *attr)
> +{
> + struct pers_event_desc *desc;
> + struct file *event_file = NULL;
> + int event_fd = -1;
> +
> + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
> +
> + if (desc->attr->config != attr->config)
> + continue;
Umm, the attr->config is not sufficient as a selector since it must be
unique, which is not guaranteed (of course it works for one event only).
> +
> + event_fd = get_unused_fd();
> + if (event_fd < 0)
> + goto out;
> +
> + event_file = anon_inode_getfile("[pers_event]", &perf_fops,
> + desc->event, O_RDONLY);
> + if (IS_ERR(event_file))
> + goto err_event_file;
> +
> + desc->fd = event_fd;
> + fd_install(event_fd, event_file);
> +
> + return event_fd;
> + }
> +
> + err_event_file:
> + put_unused_fd(event_fd);
> +
> +out:
> + return event_fd;
> +}
> +
> +/*
> + * Create and enable the persistent version of the perf event described by
> + * @attr.
> + *
> + * @attr: perf event descriptor
> + * @nr_pages: size in pages
> + */
> +int perf_add_persistent_event(struct perf_event_attr *attr, unsigned nr_pages)
> +{
> + struct perf_event *event;
> + int i;
> +
> + for_each_possible_cpu(i) {
> + event = add_persistent_event_on_cpu(i, attr, nr_pages);
> + if (IS_ERR(event)) {
> + pr_err("%s: Error adding persistent event on cpu %d\n",
> + __func__, i);
> + goto unwind;
> + }
> + }
> + return 0;
> +
> +unwind:
> + while (--i >= 0)
> + rm_persistent_event(i, attr);
> +
> + return -EINVAL;
Should return the actual error.
> +}
On Thu, Mar 28, 2013 at 04:52:29PM +0100, Robert Richter wrote:
> An option would be to attach the persistent events to a hosting pmu
> (e.g. 'ras' in this case) and provide the events via sysfs as already
> done by other pmus:
>
> /sys/bus/event_source/devices/ras/events/
> /sys/bus/event_source/devices/ras/events/mce_record
> ...
>
> perf tools work then out-the-box with -e ras/mce_record/.
>
> The event is selected by the 'ras' pmu and then routed to the original
> pmu that might be e.g. 'tracepoint'. So we attach each persistent
> event to a 'virtual' pmu which does the grouping in the perf sysfs and
> the forwarding to its actual pmu.
As I told you already on IRC, persistent events should be completely
generic and not only related to RAS. IOW, *all* possible events we have
now should also be able to be created as persistent. What I mean is:
perf -e kvm:kvm_page_fault:P ... <target_task>
should mean that I want to collect *all* kvm page faulting events during
the system's lifetime not only during target_task executes.
And the "P" I've chosen arbitrarily to mean persistent.
Which raises a couple more questions like what would happen with those
events once the perf tool call above exits. Obviously, the event should
stay enabled. Also, a subsequent call of this would mean "do not open
another persistent event of this type but give the persistent file
descriptor instead" and so on and so on.
Now, this probably can be solved with a virtual 'persistent' pmu which
'hosts' all persistent events. The question is, do we want to do it this
way?
What do the others think?
P.S. Btw, some of the concerns you've raised in your other mail have
already been addressed here:
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git persistent5
and I'll look at the remaining ones when I get back.
Thanks.
On Thu, Mar 28, 2013 at 07:15:16PM +0100, Robert Richter wrote:
> The mmap'ed region is already allocated by the kernel. How does a user
> know the buffer size of the mmap'ed region?
Right, so normal perf events get a buffer allocated at mmap time. The
size of that buffer is determined from the size of the vma which, in
turn, gets determined by perf (there's a default of 512K there).
Now, currently, the size of the percpu buffers of a persistent event is
determined by the caller, and this needs to be controlled better, e.g.
by limiting them to a max size even when allocated from within the
kernel. Also, maybe use perf's default when allocating the event from
perf. We could make them a default size and keep them that way, which
would mean perf would need to know about this.
whether some persistent buffers would actually generate more samples and
need bigger buffers. Hmm...
We probably need to discuss this more though...
> Also, I wouldn't make too much use of -EINVAL, this should only be
> used if the syscall contains *wrong* data.
Ok.
> event must be set to an error code here.
>
> Better swap the order of rb_alloc() and
> perf_event_create_kernel_counter(). Makes things easier.
Hehe, already done. I told you not to review that version of the
patches, remember? :-)
> > +static void rm_persistent_event(int cpu, struct perf_event_attr *attr)
>
> Would rather prefer del_... as this is actually used for deleting
> events in perf.
Done.
> > + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
> > +
> > + if (desc->attr->config != attr->config)
> > + continue;
>
> Umm, attr->config is not sufficient as a selector since it is not
> guaranteed to be unique (of course it works for one event only).
Right, so the tracepoints are enumerated by tracing code at boot
time. But not the hw events, for example. How can we select events
unambiguously?
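One candidate, sketched below with a simplified stand-in struct
(fake_attr and its fields are illustrative only, not the real struct
perf_event_attr), would be to match on the type *plus* all the config
words instead of config alone:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for struct perf_event_attr.  Keying the lookup
 * on type and all config words narrows the ambiguity discussed above;
 * this is a sketch of one option, not what the patches do.
 */
struct fake_attr {
	uint32_t type;		/* PERF_TYPE_TRACEPOINT, PERF_TYPE_HARDWARE, ... */
	uint64_t config;
	uint64_t config1;
	uint64_t config2;
};

static bool attrs_match(const struct fake_attr *a, const struct fake_attr *b)
{
	return a->type    == b->type    &&
	       a->config  == b->config  &&
	       a->config1 == b->config1 &&
	       a->config2 == b->config2;
}
```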
> > +unwind:
> > + while (--i >= 0)
> > + rm_persistent_event(i, attr);
> > +
> > + return -EINVAL;
>
> Should return the actual error.
Done.
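For reference, the pattern now looks roughly like this (add_one() and
del_one() are stand-ins for the real per-cpu add/remove helpers, and
the injected -ENOMEM failure is fake, just to exercise the unwind
path):

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the per-cpu add/remove helpers in the patch. */
static int add_one(int cpu)  { return cpu == 2 ? -ENOMEM : 0; /* fake failure */ }
static void del_one(int cpu) { (void)cpu; }

static int add_all(int ncpus)
{
	int i, ret;

	for (i = 0; i < ncpus; i++) {
		ret = add_one(i);
		if (ret)
			goto unwind;
	}
	return 0;

unwind:
	while (--i >= 0)
		del_one(i);
	return ret;		/* the actual error, e.g. -ENOMEM, not -EINVAL */
}
```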
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
Boris,
On 15.03.13 14:06:27, Borislav Petkov wrote:
> Add the needed pieces for persistent events which makes them
> process-agnostic. Also, make their buffers read-only when mmaping them
> from userspace.
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 9fa9c622a7f4..4a4ae56195e1 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -270,8 +270,9 @@ struct perf_event_attr {
>
> exclude_callchain_kernel : 1, /* exclude kernel callchains */
> exclude_callchain_user : 1, /* exclude user callchains */
> + persistent : 1, /* always-on event */
Based on the discussion we had a couple of days ago, I got the
idea that we do not necessarily need the persistent flag. The purpose
of the persistent flag is to indicate that an existing event buffer
that is already enabled should be used. This means the buffer is
shared by multiple consumers. Enabled events with the following
conditions allow this too:
* buffer mappings are readonly,
* event is systemwide,
* struct perf_event_attr is identical.
This means that an event with readonly buffers (PROT_WRITE unset,
over-write unread data) and the same struct perf_event_attr can be
simply reused. So, we don't need the flag: When setting up a
systemwide and readonly event we just check if there is already one
event running. If so we increment the use count of that event and
share its buffer, otherwise we create a new event.
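In pseudo-C, the lookup could look like this (a toy model with a fixed
table and config as the only key; the real thing would of course
compare the whole perf_event_attr and live in perf core):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the sharing scheme: a table of live systemwide events,
 * each with a use count.  Opening an event with an identical config
 * either bumps the count (the buffer is shared) or takes a free slot
 * (a new event is created).
 */
#define MAX_EVENTS 8

struct shared_event {
	uint64_t config;
	int	 users;		/* 0 == slot free */
};

static struct shared_event table[MAX_EVENTS];

static struct shared_event *open_shared(uint64_t config)
{
	int i;

	for (i = 0; i < MAX_EVENTS; i++)
		if (table[i].users && table[i].config == config) {
			table[i].users++;	/* reuse: share the buffer */
			return &table[i];
		}

	for (i = 0; i < MAX_EVENTS; i++)
		if (!table[i].users) {
			table[i].config = config;	/* create a new event */
			table[i].users = 1;
			return &table[i];
		}

	return NULL;				/* table full */
}
```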
This also has the advantage that the syscall i/f does not change. We
simply setup the event in the same way as we do already. Maybe the
only thing we need is a sysfs extension to report existing systemwide
events to userland.
Now, to have persistent events in the sense that the event is running
since system startup, we simply enable an in-kernel-counter. If
userland later sets up the same event, the kernel is aware of the
already running counter and redirects it to its buffer and all
(persistent) samples of the current buffer are now available to
userland.
The above also has the advantage of sharing systemwide events which
saves system resources, e.g. hardware counters. Suppose two processes
are collecting cpu-clocks, currently there are two hw counters and
buffers used for this, but only one counter would be sufficient.
I currently see the following problem with the approach. If we map to
an already running counter, there might be samples in the buffer that
had been collected before the syscall was made. But the profiling
tool might be interested only in newer samples, so userland must be
aware of this and be able to drop those samples. This should also be
backward compatible with the existing syscall i/f.
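Userland could handle this roughly like this (toy model; head/tail are
plain byte counters here instead of the real data_head/data_tail
fields of struct perf_event_mmap_page):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the "drop old samples" idea: remember the ring buffer's
 * head position at attach time and only consume bytes written after
 * it.  Everything already in the buffer at that point is treated as
 * pre-existing (persistent) data and skipped.
 */
struct reader {
	uint64_t tail;		/* our read position, in bytes */
};

static void reader_attach(struct reader *r, uint64_t current_head)
{
	r->tail = current_head;	/* ignore everything already buffered */
}

static uint64_t reader_pending(const struct reader *r, uint64_t head)
{
	return head - r->tail;	/* bytes of *new* samples only */
}
```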
Another problem is different buffer sizes. I don't know if it is
possible to simply resize mmap'ed regions to the biggest buffer
requested. Though this problem already exists with your code.
Maybe it's worth looking at this approach. I need to review more perf
code for this.
-Robert
On 03.04.13 19:27:12, Borislav Petkov wrote:
> > > + list_for_each_entry(desc, &per_cpu(pers_events, cpu), plist) {
> > > +
> > > + if (desc->attr->config != attr->config)
> > > + continue;
> >
> > Umm, attr->config is not sufficient as a selector since it is not
> > guaranteed to be unique (of course it works for one event only).
>
> Right, so the tracepoints are enumerated by tracing code at boot
> time. But not the hw events, for example. How can we select events
> unambiguously?
As we discussed personally, in a first version we should allow only
the tracepoint pmu to be used for persistent events. The config value
is then unique and the above works.
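I.e., something like this (sketch only; PERF_TYPE_TRACEPOINT == 2 is
the value from the uapi header, and the error code choice is mine):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Value from include/uapi/linux/perf_event.h (enum perf_type_id). */
#define PERF_TYPE_TRACEPOINT	2

/*
 * Sketch of the first-version restriction: reject persistent events
 * on any pmu but the tracepoint one, so attr->config is guaranteed
 * unique as a selector.
 */
static int check_persistent_type(uint32_t type)
{
	if (type != PERF_TYPE_TRACEPOINT)
		return -EOPNOTSUPP;	/* only tracepoints in v1 */
	return 0;
}
```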
-Robert
On Sat, Apr 06, 2013 at 05:53:17PM +0200, Robert Richter wrote:
> Boris,
>
> On 15.03.13 14:06:27, Borislav Petkov wrote:
> > Add the needed pieces for persistent events which makes them
> > process-agnostic. Also, make their buffers read-only when mmaping them
> > from userspace.
>
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index 9fa9c622a7f4..4a4ae56195e1 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -270,8 +270,9 @@ struct perf_event_attr {
> >
> > exclude_callchain_kernel : 1, /* exclude kernel callchains */
> > exclude_callchain_user : 1, /* exclude user callchains */
> > + persistent : 1, /* always-on event */
>
> Based on the discussion we had a couple of days ago, I got the
> idea that we do not necessarily need the persistent flag. The purpose
> of the persistent flag is to indicate that an existing event buffer
> that is already enabled should be used. This means the buffer is
> shared by multiple consumers. Enabled events with the following
> conditions allow this too:
>
> * buffer mappings are readonly,
> * event is systemwide,
> * struct perf_event_attr is identical.
>
> This means that an event with readonly buffers (PROT_WRITE unset,
> over-write unread data) and the same struct perf_event_attr can be
> simply reused. So, we don't need the flag: When setting up a
> systemwide and readonly event we just check if there is already one
> event running. If so we increment the use count of that event and
> share its buffer, otherwise we create a new event.
>
> This also has the advantage that the syscall i/f does not change. We
> simply setup the event in the same way as we do already. Maybe the
> only thing we need is a sysfs extension to report existing systemwide
> events to userland.
I'm not entirely sure why we'd want that. Just for informational
purposes? I mean, if we enable the events on the kernel command line, we
already know which ones are the persistent ones.
> Now, to have persistent events in the sense that the event is running
> since system startup, we simply enable an in-kernel-counter. If
> userland later sets up the same event, the kernel is aware of the
> already running counter and redirects it to its buffer and all
> (persistent) samples of the current buffer are now available to
> userland.
>
> The above also has the advantage of sharing systemwide events which
> saves system resources, e.g. hardware counters. Suppose two processes
> are collecting cpu-clocks, currently there are two hw counters and
> buffers used for this, but only one counter would be sufficient.
I'm all up for efficiency. :)
> I currently see the following problem with the approach. If we map to
> an already running counter, there might be samples in the buffer that
> had been collected before the syscall was made. But the profiling
> tool might be interested only in newer samples, so userland must be
> aware of this and be able to drop those samples. This should also be
> backward compatible with the existing syscall i/f.
Hmm, so the perf tool should actually know about persistent events
after all. If we open the persistent event from userspace, we get *all*
samples. If we simply do a normal perf session, we get only the new
samples.
> Another problem is different buffer sizes. I don't know if it is
> possible to simply resize mmap'ed regions to the biggest buffer
> requested. Though this problem already exists with your code.
Hmm, that needs more code staring I guess.
> Maybe it's worth looking at this approach. I need to review more perf
> code for this.
Yeah, me too.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--