2014-04-07 15:04:58

by Jean Pihet

[permalink] [raw]
Subject: [PATCH v4 00/16] perf, persistent: Add persistent events

Jean Pihet <[email protected]>:
- In order to restart the discussion on the topic, here is a rebased
version of Robert's latest patches (v3) on acme/perf/core.
It has been compiled and lightly tested on ARM64.

- From the latest discussion on ML the ioctls are renamed from
PERF_EVENT_IOC_ATTACH/DETACH to PERF_EVENT_IOC_CLAIM/UNCLAIM.


Robert Richter <[email protected]>:
This patch set implements the necessary kernel changes for persistent
events.

Persistent events run standalone in the system without the need of a
controlling process that holds an event's file descriptor. The events
are always enabled and collect data samples in a ring buffer.
Processes may connect to existing persistent events using the
perf_event_open() syscall. For this the syscall must be configured
using the new PERF_TYPE_PERSISTENT event type and a unique event
identifier specified in attr.config. The id is propagated in sysfs or
using ioctl (see below).

Persistent event buffers may be accessed with mmap() in the same way
as for any other event. Since the buffers may be used by multiple
processes at the same time, there is only read-only access to them.
Currently there is only support for per-cpu events, thus root access
is needed too.

Persistent events are visible in sysfs. They are added or removed
dynamically. With the information in sysfs userland knows about how to
setup the perf_event attribute of a persistent event. Since a
persistent event always has the persistent flag set, a way is needed
to express this in sysfs. A new syntax is used for this. With
'attr<num>:<mask>' any bit in the attribute structure may be set in a
similar way as using 'config<num>', but <num> is an index that points
to the u64 value to change within the attribute.

For persistent events the persistent flag (bit 24 of flag field in
struct perf_event_attr) needs to be set which is expressed in sysfs
with "attr5:24". E.g. the mce_record event is described in sysfs as
follows:

/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:24

Note that perf tools need to support the 'attr<num>' syntax that is
added in a separate patch set. With it we are able to run perf tool
commands to read persistent events, e.g.:

# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/

In general the new syntax is flexible to describe with sysfs any event
to be setup by perf tools.

There are ioctl functions to control persistent events that can be
used to detach or attach an event to or from a process. The
PERF_EVENT_IOC_UNCLAIM ioctl call makes an event persistent. The
perf_event_open() syscall can be used to re-open the event by any
process. The PERF_EVENT_IOC_CLAIM ioctl attaches the event again so
that it is removed after closing the event's fd.

The patches base on the originally work from Borislav Petkov.

This version 3 of the patch set is a complete rework of the code.
There are the following major changes:

* new event type PERF_TYPE_PERSISTENT introduced,

* support for all type of events,

* unique event ids,

* improvements in reference counting and locking,

* ioctl functions are added to control persistency,

* the sysfs implementation now uses variable list size.

This should address most issues discussed during last review of
version 2. The following is unresolved yet and can be added later on
top of this patches, if necessary:

* support for per-task events (also allowing non-root access),

* creation of persistent events for disabled cpus,

* make event persistent with already open (mmap'ed) buffers,

* make event persistent while creating it.

First patches contain some rework of the perf mmap code to reuse it
for persistent events.

Also note that patch 12 (ioctl functions to control persistency) is
RFC and untested. A perf tools implementation for this is missing and
some ideas are needed how this could be integrated, esp. in something
like perf trace or so.

All v3 patches can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v3

Note: I will resent the perf tools patch necessary to use persistent
events.

-Robert


Borislav Petkov (1):
mce, x86: Enable persistent events

Robert Richter (15):
perf, mmap: Factor out ring_buffer_detach_all()
perf, mmap: Factor out try_get_event()/put_event()
perf, mmap: Factor out perf_alloc/free_rb()
perf, mmap: Factor out perf_get_fd()
perf: Add persistent events
perf, persistent: Implementing a persistent pmu
perf, persistent: Exposing persistent events using sysfs
perf, persistent: Use unique event ids
perf, persistent: Implement reference counter for events
perf, persistent: Dynamically resize list of sysfs entries
perf, persistent: ioctl functions to control persistency
perf tools: Rename flex conditions to avoid name conflicts
perf tools: Modify event parser to update event attribute by index
perf tools: Add attr<num> syntax to event parser
perf tools: Retry mapping buffers readonly on EACCES

.../testing/sysfs-bus-event_source-devices-format | 43 +-
arch/x86/kernel/cpu/mcheck/mce.c | 19 +
include/linux/perf_event.h | 12 +-
include/uapi/linux/perf_event.h | 6 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 214 +++++---
kernel/events/internal.h | 20 +
kernel/events/persistent.c | 563 +++++++++++++++++++++
tools/perf/builtin-record.c | 11 +-
tools/perf/builtin-top.c | 8 +-
tools/perf/perf.h | 1 +
tools/perf/tests/parse-events.c | 12 +-
tools/perf/util/parse-events.c | 59 +--
tools/perf/util/parse-events.h | 12 +-
tools/perf/util/parse-events.l | 58 ++-
tools/perf/util/parse-events.y | 24 +-
tools/perf/util/pmu.c | 32 +-
tools/perf/util/pmu.h | 9 +-
tools/perf/util/pmu.l | 1 +
tools/perf/util/pmu.y | 18 +-
20 files changed, 911 insertions(+), 213 deletions(-)
create mode 100644 kernel/events/persistent.c

--
1.7.11.7


2014-04-07 15:05:07

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 02/16] perf, mmap: Factor out try_get_event()/put_event()

From: Robert Richter <[email protected]>

Implement try_get_event() as counter part to put_event(). Put both in
internal.h to make it available to other perf files.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/core.c | 9 +++------
kernel/events/internal.h | 12 ++++++++++++
2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8867236..5eaba42 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3311,13 +3311,10 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
/*
* Called when the last reference to the file is gone.
*/
-static void put_event(struct perf_event *event)
+void __put_event(struct perf_event *event)
{
struct task_struct *owner;

- if (!atomic_long_dec_and_test(&event->refcount))
- return;
-
rcu_read_lock();
owner = ACCESS_ONCE(event->owner);
/*
@@ -3884,7 +3881,7 @@ static void ring_buffer_detach_all(struct ring_buffer *rb)
again:
rcu_read_lock();
list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
- if (!atomic_long_inc_not_zero(&event->refcount)) {
+ if (!try_get_event(event)) {
/*
* This event is en-route to free_event() which will
* detach it and remove it from the list.
@@ -7606,7 +7603,7 @@ inherit_event(struct perf_event *parent_event,
if (IS_ERR(child_event))
return child_event;

- if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
+ if (!try_get_event(parent_event)) {
free_event(child_event);
return NULL;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..3bd89d4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -195,4 +195,16 @@ static inline bool arch_perf_have_user_stack_dump(void)
#define perf_user_stack_pointer(regs) 0
#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */

+static inline bool try_get_event(struct perf_event *event)
+{
+ return atomic_long_inc_not_zero(&event->refcount) != 0;
+}
+extern void __put_event(struct perf_event *event);
+static inline void put_event(struct perf_event *event)
+{
+ if (!atomic_long_dec_and_test(&event->refcount))
+ return;
+ __put_event(event);
+}
+
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.7.11.7

2014-04-07 15:05:31

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 11/16] perf, persistent: Dynamically resize list of sysfs entries

From: Robert Richter <[email protected]>

There was a limitation of the total number of persistent events to be
registered in sysfs due to the lack of dynamically list allocation.
This patch implements memory reallocation in case an event is added or
removed from the list.

While at this also implement pevent_sysfs_unregister() which we need
later for proper event removal.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/persistent.c | 115 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 99 insertions(+), 16 deletions(-)

diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index fcbb4f8..49bf889 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -154,6 +154,7 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
}

static int pevent_sysfs_register(struct pevent *event);
+static void pevent_sysfs_unregister(struct pevent *event);

static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
@@ -204,6 +205,7 @@ fail:
__func__, ret);
out:
if (atomic_dec_and_test(&pevent->refcount)) {
+ pevent_sysfs_unregister(pevent);
if (pevent->id)
put_event_id(pevent->id);
kfree(pevent->name);
@@ -273,13 +275,12 @@ static struct attribute_group persistent_format_group = {
.attrs = persistent_format_attrs,
};

-#define MAX_EVENTS 16
-
-static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+static struct mutex sysfs_lock;
+static int sysfs_nr_entries;

static struct attribute_group pevents_group = {
.name = "events",
- .attrs = pevents_attr,
+ .attrs = NULL, /* dynamically allocated */
};

static const struct attribute_group *persistent_attr_groups[] = {
@@ -288,6 +289,7 @@ static const struct attribute_group *persistent_attr_groups[] = {
NULL,
};
#define EVENTS_GROUP_PTR (&persistent_attr_groups[1])
+#define EVENTS_ATTRS_PTR (&pevents_group.attrs)

static ssize_t pevent_sysfs_show(struct device *dev,
struct device_attribute *__attr, char *page)
@@ -304,7 +306,9 @@ static int pevent_sysfs_register(struct pevent *pevent)
struct attribute *attr = &sysfs->attr.attr;
struct device *dev = persistent_pmu.dev;
const struct attribute_group **group = EVENTS_GROUP_PTR;
- int idx;
+ struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+ struct attribute **attrs;
+ int ret = 0;

sysfs->id = pevent->id;
sysfs->attr = (struct device_attribute)
@@ -312,21 +316,99 @@ static int pevent_sysfs_register(struct pevent *pevent)
attr->name = pevent->name;
sysfs_attr_init(attr);

- /* add sysfs attr to events: */
- for (idx = 0; idx < MAX_EVENTS; idx++) {
- if (!cmpxchg(pevents_attr + idx, NULL, attr))
- break;
+ mutex_lock(&sysfs_lock);
+
+ /*
+ * Keep old list if no new one is available. Need this for
+ * device_remove_attrs() if unregistering pmu.
+ */
+ attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 2) * sizeof(*attrs),
+ GFP_KERNEL);
+
+ if (!attrs) {
+ ret = -ENOMEM;
+ goto unlock;
}

- if (idx >= MAX_EVENTS)
- return -ENOSPC;
- if (!idx)
+ attrs[sysfs_nr_entries++] = attr;
+ attrs[sysfs_nr_entries] = NULL;
+
+ if (!*group)
*group = &pevents_group;
+
+ if (!dev)
+ goto out; /* sysfs not yet initialized */
+
+ if (sysfs_nr_entries == 1)
+ ret = sysfs_create_group(&dev->kobj, *group);
+ else
+ ret = sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+
+ if (ret) {
+ /* roll back */
+ sysfs_nr_entries--;
+ if (!sysfs_nr_entries)
+ *group = NULL;
+ if (*attrs_ptr != attrs)
+ kfree(attrs);
+ else
+ attrs[sysfs_nr_entries] = NULL;
+ goto unlock;
+ }
+out:
+ if (*attrs_ptr != attrs) {
+ kfree(*attrs_ptr);
+ *attrs_ptr = attrs;
+ }
+unlock:
+ mutex_unlock(&sysfs_lock);
+
+ return ret;
+}
+
+static void pevent_sysfs_unregister(struct pevent *pevent)
+{
+ struct attribute *attr = &pevent->sysfs.attr.attr;
+ struct device *dev = persistent_pmu.dev;
+ const struct attribute_group **group = EVENTS_GROUP_PTR;
+ struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+ struct attribute **attrs, **dest;
+
+ mutex_lock(&sysfs_lock);
+
+ for (dest = *attrs_ptr; *dest; dest++) {
+ if (*dest == attr)
+ break;
+ }
+
+ if (!*dest)
+ goto unlock;
+
+ sysfs_nr_entries--;
+
+ *dest = (*attrs_ptr)[sysfs_nr_entries];
+ (*attrs_ptr)[sysfs_nr_entries] = NULL;
+
if (!dev)
- return 0; /* sysfs not yet initialized */
- if (idx)
- return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
- return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+ goto out; /* sysfs not yet initialized */
+
+ if (!sysfs_nr_entries)
+ sysfs_remove_group(&dev->kobj, *group);
+ else
+ sysfs_remove_file_from_group(&dev->kobj, attr, (*group)->name);
+out:
+ if (!sysfs_nr_entries)
+ *group = NULL;
+
+ attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 1) * sizeof(*attrs),
+ GFP_KERNEL);
+
+ if (!attrs && *attrs_ptr != attrs) {
+ kfree(*attrs_ptr);
+ *attrs_ptr = attrs;
+ }
+unlock:
+ mutex_unlock(&sysfs_lock);
}

static int persistent_pmu_init(struct perf_event *event)
@@ -349,6 +431,7 @@ void __init perf_register_persistent(void)

idr_init(&event_idr);
mutex_init(&event_lock);
+ mutex_init(&sysfs_lock);
perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);

for_each_possible_cpu(cpu) {
--
1.7.11.7

2014-04-07 15:05:37

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 14/16] perf tools: Modify event parser to update event attribute by index

From: Robert Richter <[email protected]>

In a later patch we want to introduce a syntax that allows updating
attribute fields by an index pointing to a certain u64 entry of struct
perf_event_attr. We need this to expose any event via sysfs that is
available in the system where especially flag fields need to be set.

Reworking the event parser to use an attribute index. This is done by
introducing type PARSE_EVENTS__TERM_TYPE_ATTR for all numeric values
to be added to the event attribute fields. We use an index to specify
the corresponding u64 attr value.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
tools/perf/tests/parse-events.c | 12 ++++++---
tools/perf/util/parse-events.c | 59 +++++++++++++++++++----------------------
tools/perf/util/parse-events.h | 12 ++++-----
tools/perf/util/parse-events.l | 18 ++++++-------
tools/perf/util/parse-events.y | 24 ++++++++++-------
5 files changed, 65 insertions(+), 60 deletions(-)

diff --git a/tools/perf/tests/parse-events.c b/tools/perf/tests/parse-events.c
index 8605ff5..c15660c 100644
--- a/tools/perf/tests/parse-events.c
+++ b/tools/perf/tests/parse-events.c
@@ -463,7 +463,9 @@ static int test__checkterms_simple(struct list_head *terms)
/* config=10 */
term = list_entry(terms->next, struct parse_events_term, list);
TEST_ASSERT_VAL("wrong type term",
- term->type_term == PARSE_EVENTS__TERM_TYPE_CONFIG);
+ term->type_term == PARSE_EVENTS__TERM_TYPE_ATTR);
+ TEST_ASSERT_VAL("wrong type idx",
+ term->idx == 1);
TEST_ASSERT_VAL("wrong type val",
term->type_val == PARSE_EVENTS__TERM_TYPE_NUM);
TEST_ASSERT_VAL("wrong val", term->val.num == 10);
@@ -472,7 +474,9 @@ static int test__checkterms_simple(struct list_head *terms)
/* config1 */
term = list_entry(term->list.next, struct parse_events_term, list);
TEST_ASSERT_VAL("wrong type term",
- term->type_term == PARSE_EVENTS__TERM_TYPE_CONFIG1);
+ term->type_term == PARSE_EVENTS__TERM_TYPE_ATTR);
+ TEST_ASSERT_VAL("wrong type idx",
+ term->idx == 7);
TEST_ASSERT_VAL("wrong type val",
term->type_val == PARSE_EVENTS__TERM_TYPE_NUM);
TEST_ASSERT_VAL("wrong val", term->val.num == 1);
@@ -481,7 +485,9 @@ static int test__checkterms_simple(struct list_head *terms)
/* config2=3 */
term = list_entry(term->list.next, struct parse_events_term, list);
TEST_ASSERT_VAL("wrong type term",
- term->type_term == PARSE_EVENTS__TERM_TYPE_CONFIG2);
+ term->type_term == PARSE_EVENTS__TERM_TYPE_ATTR);
+ TEST_ASSERT_VAL("wrong type idx",
+ term->idx == 8);
TEST_ASSERT_VAL("wrong type val",
term->type_val == PARSE_EVENTS__TERM_TYPE_NUM);
TEST_ASSERT_VAL("wrong val", term->val.num == 3);
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 1e15df1..8fd2542 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -541,6 +541,19 @@ int parse_events_add_breakpoint(struct list_head *list, int *idx,
return add_event(list, idx, &attr, NULL);
}

+int parse_events__set_attr(struct perf_event_attr *__attr, u64 idx, u64 val)
+{
+ __u64 *attr = (__u64 *)__attr;
+
+ if (idx * sizeof(*attr) >= sizeof(*__attr))
+ return -EINVAL;
+
+ attr += idx;
+ *attr |= val;
+
+ return 0;
+}
+
static int config_term(struct perf_event_attr *attr,
struct parse_events_term *term)
{
@@ -551,36 +564,17 @@ do { \
} while (0)

switch (term->type_term) {
- case PARSE_EVENTS__TERM_TYPE_CONFIG:
+ case PARSE_EVENTS__TERM_TYPE_ATTR:
CHECK_TYPE_VAL(NUM);
- attr->config = term->val.num;
- break;
- case PARSE_EVENTS__TERM_TYPE_CONFIG1:
- CHECK_TYPE_VAL(NUM);
- attr->config1 = term->val.num;
- break;
- case PARSE_EVENTS__TERM_TYPE_CONFIG2:
- CHECK_TYPE_VAL(NUM);
- attr->config2 = term->val.num;
- break;
- case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
- CHECK_TYPE_VAL(NUM);
- attr->sample_period = term->val.num;
- break;
- case PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE:
- /*
- * TODO uncomment when the field is available
- * attr->branch_sample_type = term->val.num;
- */
- break;
+ return parse_events__set_attr(attr, term->idx, term->val.num);
case PARSE_EVENTS__TERM_TYPE_NAME:
CHECK_TYPE_VAL(STR);
- break;
+ return 0;
default:
- return -EINVAL;
+ break;
}

- return 0;
+ return -EINVAL;
#undef CHECK_TYPE_VAL
}

@@ -1277,7 +1271,7 @@ int parse_events__is_hardcoded_term(struct parse_events_term *term)

static int new_term(struct parse_events_term **_term, int type_val,
int type_term, char *config,
- char *str, u64 num)
+ char *str, u64 num, u64 idx)
{
struct parse_events_term *term;

@@ -1288,7 +1282,8 @@ static int new_term(struct parse_events_term **_term, int type_val,
INIT_LIST_HEAD(&term->list);
term->type_val = type_val;
term->type_term = type_term;
- term->config = config;
+ term->config = config;
+ term->idx = idx;

switch (type_val) {
case PARSE_EVENTS__TERM_TYPE_NUM:
@@ -1307,17 +1302,17 @@ static int new_term(struct parse_events_term **_term, int type_val,
}

int parse_events_term__num(struct parse_events_term **term,
- int type_term, char *config, u64 num)
+ int type_term, char *config, u64 num, u64 idx)
{
return new_term(term, PARSE_EVENTS__TERM_TYPE_NUM, type_term,
- config, NULL, num);
+ config, NULL, num, idx);
}

int parse_events_term__str(struct parse_events_term **term,
int type_term, char *config, char *str)
{
return new_term(term, PARSE_EVENTS__TERM_TYPE_STR, type_term,
- config, str, 0);
+ config, str, 0, 0);
}

int parse_events_term__sym_hw(struct parse_events_term **term,
@@ -1331,18 +1326,18 @@ int parse_events_term__sym_hw(struct parse_events_term **term,
if (config)
return new_term(term, PARSE_EVENTS__TERM_TYPE_STR,
PARSE_EVENTS__TERM_TYPE_USER, config,
- (char *) sym->symbol, 0);
+ (char *) sym->symbol, 0, 0);
else
return new_term(term, PARSE_EVENTS__TERM_TYPE_STR,
PARSE_EVENTS__TERM_TYPE_USER,
- (char *) "event", (char *) sym->symbol, 0);
+ (char *) "event", (char *) sym->symbol, 0, 0);
}

int parse_events_term__clone(struct parse_events_term **new,
struct parse_events_term *term)
{
return new_term(new, term->type_val, term->type_term, term->config,
- term->val.str, term->val.num);
+ term->val.str, term->val.num, term->idx);
}

void parse_events__free_terms(struct list_head *terms)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index f1cb4c4..61bbe88 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -35,6 +35,8 @@ extern int parse_events_terms(struct list_head *terms, const char *str);
extern int parse_filter(const struct option *opt, const char *str, int unset);

#define EVENTS_HELP_MAX (128*1024)
+#define PERF_ATTR_IDX(MEMBER) \
+ (offsetof(struct perf_event_attr, MEMBER) / sizeof(__u64))

enum {
PARSE_EVENTS__TERM_TYPE_NUM,
@@ -43,12 +45,8 @@ enum {

enum {
PARSE_EVENTS__TERM_TYPE_USER,
- PARSE_EVENTS__TERM_TYPE_CONFIG,
- PARSE_EVENTS__TERM_TYPE_CONFIG1,
- PARSE_EVENTS__TERM_TYPE_CONFIG2,
+ PARSE_EVENTS__TERM_TYPE_ATTR,
PARSE_EVENTS__TERM_TYPE_NAME,
- PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD,
- PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE,
};

struct parse_events_term {
@@ -59,6 +57,7 @@ struct parse_events_term {
} val;
int type_val;
int type_term;
+ u64 idx;
struct list_head list;
};

@@ -74,7 +73,7 @@ struct parse_events_terms {

int parse_events__is_hardcoded_term(struct parse_events_term *term);
int parse_events_term__num(struct parse_events_term **_term,
- int type_term, char *config, u64 num);
+ int type_term, char *config, u64 num, u64 idx);
int parse_events_term__str(struct parse_events_term **_term,
int type_term, char *config, char *str);
int parse_events_term__sym_hw(struct parse_events_term **term,
@@ -97,6 +96,7 @@ int parse_events_add_breakpoint(struct list_head *list, int *idx,
int parse_events_add_pmu(struct list_head *list, int *idx,
char *pmu , struct list_head *head_config);
void parse_events__set_leader(char *name, struct list_head *list);
+int parse_events__set_attr(struct perf_event_attr *__attr, u64 idx, u64 val);
void parse_events_update_lists(struct list_head *list_event,
struct list_head *list_all);
void parse_events_error(void *data, void *scanner, char const *msg);
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index a5808d1..8ef3f6c 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -59,12 +59,12 @@ static int sym(yyscan_t scanner, int type, int config)
return type == PERF_TYPE_HARDWARE ? PE_VALUE_SYM_HW : PE_VALUE_SYM_SW;
}

-static int term(yyscan_t scanner, int type)
+static int attr(yyscan_t scanner, u64 idx)
{
YYSTYPE *yylval = parse_events_get_lval(scanner);

- yylval->num = type;
- return PE_TERM;
+ yylval->num = idx;
+ return PE_TERM_ATTR;
}

%}
@@ -127,12 +127,12 @@ modifier_bp [rwx]{1,3}
}

<cond_config>{
-config { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG); }
-config1 { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG1); }
-config2 { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG2); }
-name { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_NAME); }
-period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD); }
-branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE); }
+config { return attr(yyscanner, PERF_ATTR_IDX(config)); }
+config1 { return attr(yyscanner, PERF_ATTR_IDX(config1)); }
+config2 { return attr(yyscanner, PERF_ATTR_IDX(config2)); }
+period { return attr(yyscanner, PERF_ATTR_IDX(sample_period)); }
+branch_type { return attr(yyscanner, PERF_ATTR_IDX(branch_sample_type)); }
+name { return PE_TERM_NAME; }
, { return ','; }
"/" { BEGIN(INITIAL); return '/'; }
{name_minus} { return str(yyscanner, PE_NAME); }
diff --git a/tools/perf/util/parse-events.y b/tools/perf/util/parse-events.y
index 4eb67ec..2bd285e 100644
--- a/tools/perf/util/parse-events.y
+++ b/tools/perf/util/parse-events.y
@@ -40,7 +40,8 @@ static inc_group_count(struct list_head *list,
%}

%token PE_START_EVENTS PE_START_TERMS
-%token PE_VALUE PE_VALUE_SYM_HW PE_VALUE_SYM_SW PE_RAW PE_TERM
+%token PE_VALUE PE_VALUE_SYM_HW PE_VALUE_SYM_SW PE_RAW
+%token PE_TERM_NAME PE_TERM_ATTR
%token PE_EVENT_NAME
%token PE_NAME
%token PE_MODIFIER_EVENT PE_MODIFIER_BP
@@ -51,7 +52,7 @@ static inc_group_count(struct list_head *list,
%type <num> PE_VALUE_SYM_HW
%type <num> PE_VALUE_SYM_SW
%type <num> PE_RAW
-%type <num> PE_TERM
+%type <num> PE_TERM_ATTR
%type <str> PE_NAME
%type <str> PE_NAME_CACHE_TYPE
%type <str> PE_NAME_CACHE_OP_RESULT
@@ -375,7 +376,7 @@ PE_NAME '=' PE_VALUE
struct parse_events_term *term;

ABORT_ON(parse_events_term__num(&term, PARSE_EVENTS__TERM_TYPE_USER,
- $1, $3));
+ $1, $3, 0));
$$ = term;
}
|
@@ -393,7 +394,7 @@ PE_NAME
struct parse_events_term *term;

ABORT_ON(parse_events_term__num(&term, PARSE_EVENTS__TERM_TYPE_USER,
- $1, 1));
+ $1, 1, 0));
$$ = term;
}
|
@@ -406,27 +407,30 @@ PE_VALUE_SYM_HW
$$ = term;
}
|
-PE_TERM '=' PE_NAME
+PE_TERM_NAME '=' PE_NAME
{
struct parse_events_term *term;

- ABORT_ON(parse_events_term__str(&term, (int)$1, NULL, $3));
+ ABORT_ON(parse_events_term__str(&term, PARSE_EVENTS__TERM_TYPE_NAME,
+ NULL, $3));
$$ = term;
}
|
-PE_TERM '=' PE_VALUE
+PE_TERM_ATTR '=' PE_VALUE
{
struct parse_events_term *term;

- ABORT_ON(parse_events_term__num(&term, (int)$1, NULL, $3));
+ ABORT_ON(parse_events_term__num(&term, PARSE_EVENTS__TERM_TYPE_ATTR,
+ NULL, $3, $1));
$$ = term;
}
|
-PE_TERM
+PE_TERM_ATTR
{
struct parse_events_term *term;

- ABORT_ON(parse_events_term__num(&term, (int)$1, NULL, 1));
+ ABORT_ON(parse_events_term__num(&term, PARSE_EVENTS__TERM_TYPE_ATTR,
+ NULL, 1, $1));
$$ = term;
}

--
1.7.11.7

2014-04-07 15:06:06

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 16/16] perf tools: Retry mapping buffers readonly on EACCES

From: Robert Richter <[email protected]>

Persistent event buffers may only be mmapped readonly. Thus, retry
mapping it readonly if mmap returns EACCES after trying to mmap
writable.

[ namhyung: Don't write to readonly mmap'ed buffers. ]
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
tools/perf/builtin-record.c | 11 ++++++++---
tools/perf/builtin-top.c | 8 ++++++--
tools/perf/perf.h | 1 +
3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index eb524f9..ede3af1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -133,8 +133,8 @@ static int record__mmap_read(struct record *rec, struct perf_mmap *md)
}

md->prev = old;
- perf_mmap__write_tail(md, old);
-
+ if (!rec->opts.mmap_ro)
+ perf_mmap__write_tail(md, old);
out:
return rc;
}
@@ -207,7 +207,12 @@ try_again:
goto out;
}

- if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
+try_again2:
+ if (perf_evlist__mmap(evlist, opts->mmap_pages, opts->mmap_ro) < 0) {
+ if (!opts->mmap_ro && errno == EACCES) {
+ opts->mmap_ro = true;
+ goto try_again2;
+ }
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
"Consider increasing "
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 65aaa5b..479356f 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -868,8 +868,12 @@ try_again:
goto out_err;
}
}
-
- if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
+try_again2:
+ if (perf_evlist__mmap(evlist, opts->mmap_pages, opts->mmap_ro) < 0) {
+ if (!opts->mmap_ro && errno == EACCES) {
+ opts->mmap_ro = true;
+ goto try_again2;
+ }
ui__error("Failed to mmap with %d (%s)\n",
errno, strerror(errno));
goto out_err;
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 1138c41..a8ffaba4 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -282,6 +282,7 @@ struct record_opts {
bool sample_weight;
bool sample_time;
bool period;
+ bool mmap_ro;
unsigned int freq;
unsigned int mmap_pages;
unsigned int user_freq;
--
1.7.11.7

2014-04-07 15:06:31

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 15/16] perf tools: Add attr<num> syntax to event parser

From: Robert Richter <[email protected]>

The event parser is limited to update only a subset of all fields in
struct perf_event_attr (config*, period, branch_type). We are not able
to set other attr fields, esp. flags.

Introducing a new syntax to set any field of the event attribute by
using an index to the u64 value to be used within struct
perf_event_attr. The new syntax attr<num> is similar to config<num>,
but <num> specifies the index to be used. E.g. attr5:24 sets bit 24 of
the flag field of attr.

The persistent event implementation is a use case of the above. In
this case sysfs provides:

/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:24

Persistent events are exposed via sysfs and need to set the persistent
flag (bit 24 of the flag field). With the sysfs entry above we are
able to define the persistent flag format and then may setup a
mce_record event with that flag set.

In general we are now flexible to describe with sysfs any event to be
setup by perf tools.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
tools/perf/util/parse-events.l | 14 ++++++++++++++
tools/perf/util/pmu.c | 32 ++++++--------------------------
tools/perf/util/pmu.h | 9 ++-------
tools/perf/util/pmu.l | 1 +
tools/perf/util/pmu.y | 18 ++++++++++++++----
5 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 8ef3f6c..707ce83 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -67,6 +67,19 @@ static int attr(yyscan_t scanner, u64 idx)
return PE_TERM_ATTR;
}

+static int attr_parse(yyscan_t scanner)
+{
+ YYSTYPE *yylval = parse_events_get_lval(scanner);
+ char *text = parse_events_get_text(scanner);
+
+ errno = 0;
+ yylval->num = strtoull(text + 4, NULL, 10);
+ if (errno)
+ return PE_ERROR;
+
+ return PE_TERM_ATTR;
+}
+
%}

%x cond_mem
@@ -127,6 +140,7 @@ modifier_bp [rwx]{1,3}
}

<cond_config>{
+attr[0-9]* { return attr_parse(yyscanner); }
config { return attr(yyscanner, PERF_ATTR_IDX(config)); }
config1 { return attr(yyscanner, PERF_ATTR_IDX(config1)); }
config2 { return attr(yyscanner, PERF_ATTR_IDX(config2)); }
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 00a7dcb..d36d750 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -21,8 +21,8 @@ struct perf_pmu_alias {
};

struct perf_pmu_format {
- char *name;
- int value;
+ char *name;
+ u64 idx;
DECLARE_BITMAP(bits, PERF_PMU_FORMAT_BITS);
struct list_head list;
};
@@ -512,7 +512,6 @@ static int pmu_config_term(struct list_head *formats,
struct parse_events_term *term)
{
struct perf_pmu_format *format;
- __u64 *vp;

/*
* Support only for hardcoded and numnerial terms.
@@ -529,27 +528,8 @@ static int pmu_config_term(struct list_head *formats,
if (!format)
return -EINVAL;

- switch (format->value) {
- case PERF_PMU_FORMAT_VALUE_CONFIG:
- vp = &attr->config;
- break;
- case PERF_PMU_FORMAT_VALUE_CONFIG1:
- vp = &attr->config1;
- break;
- case PERF_PMU_FORMAT_VALUE_CONFIG2:
- vp = &attr->config2;
- break;
- default:
- return -EINVAL;
- }
-
- /*
- * XXX If we ever decide to go with string values for
- * non-hardcoded terms, here's the place to translate
- * them into value.
- */
- *vp |= pmu_format_value(format->bits, term->val.num);
- return 0;
+ return parse_events__set_attr(attr, format->idx,
+ pmu_format_value(format->bits, term->val.num));
}

int perf_pmu__config_terms(struct list_head *formats,
@@ -678,7 +658,7 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
}

int perf_pmu__new_format(struct list_head *list, char *name,
- int config, unsigned long *bits)
+ __u64 idx, unsigned long *bits)
{
struct perf_pmu_format *format;

@@ -687,7 +667,7 @@ int perf_pmu__new_format(struct list_head *list, char *name,
return -ENOMEM;

format->name = strdup(name);
- format->value = config;
+ format->idx = idx;
memcpy(format->bits, bits, sizeof(format->bits));

list_add_tail(&format->list, list);
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 8b64125..10269f7 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -4,12 +4,7 @@
#include <linux/bitops.h>
#include <linux/perf_event.h>
#include <stdbool.h>
-
-enum {
- PERF_PMU_FORMAT_VALUE_CONFIG,
- PERF_PMU_FORMAT_VALUE_CONFIG1,
- PERF_PMU_FORMAT_VALUE_CONFIG2,
-};
+#include "parse-events.h"

#define PERF_PMU_FORMAT_BITS 64

@@ -36,7 +31,7 @@ int perf_pmu_wrap(void);
void perf_pmu_error(struct list_head *list, char *name, char const *msg);

int perf_pmu__new_format(struct list_head *list, char *name,
- int config, unsigned long *bits);
+ __u64 idx, unsigned long *bits);
void perf_pmu__set_format(unsigned long *bits, long from, long to);
int perf_pmu__format_parse(char *dir, struct list_head *head);

diff --git a/tools/perf/util/pmu.l b/tools/perf/util/pmu.l
index a15d9fb..9d5aa62 100644
--- a/tools/perf/util/pmu.l
+++ b/tools/perf/util/pmu.l
@@ -26,6 +26,7 @@ num_dec [0-9]+
%%

{num_dec} { return value(10); }
+attr { return PP_ATTR; }
config { return PP_CONFIG; }
config1 { return PP_CONFIG1; }
config2 { return PP_CONFIG2; }
diff --git a/tools/perf/util/pmu.y b/tools/perf/util/pmu.y
index bfd7e85..fb86df8 100644
--- a/tools/perf/util/pmu.y
+++ b/tools/perf/util/pmu.y
@@ -20,7 +20,7 @@ do { \

%}

-%token PP_CONFIG PP_CONFIG1 PP_CONFIG2
+%token PP_ATTR PP_CONFIG PP_CONFIG1 PP_CONFIG2
%token PP_VALUE PP_ERROR
%type <num> PP_VALUE
%type <bits> bit_term
@@ -40,24 +40,34 @@ format format_term
format_term

format_term:
+PP_ATTR ':' bits
+{
+ ABORT_ON(perf_pmu__new_format(format, name, 0, $3));
+}
+|
+PP_ATTR PP_VALUE ':' bits
+{
+ ABORT_ON(perf_pmu__new_format(format, name, $2, $4));
+}
+|
PP_CONFIG ':' bits
{
ABORT_ON(perf_pmu__new_format(format, name,
- PERF_PMU_FORMAT_VALUE_CONFIG,
+ PERF_ATTR_IDX(config),
$3));
}
|
PP_CONFIG1 ':' bits
{
ABORT_ON(perf_pmu__new_format(format, name,
- PERF_PMU_FORMAT_VALUE_CONFIG1,
+ PERF_ATTR_IDX(config1),
$3));
}
|
PP_CONFIG2 ':' bits
{
ABORT_ON(perf_pmu__new_format(format, name,
- PERF_PMU_FORMAT_VALUE_CONFIG2,
+ PERF_ATTR_IDX(config2),
$3));
}

--
1.7.11.7

2014-04-07 15:07:06

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 13/16] perf tools: Rename flex conditions to avoid name conflicts

From: Robert Richter <[email protected]>

These define's may cause conflicts with other definitions:

#define INITIAL 0
#define mem 1
#define config 2
#define event 3

Prefix them with cond_* to avoid this.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
tools/perf/util/parse-events.l | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 3432995..a5808d1 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -69,9 +69,9 @@ static int term(yyscan_t scanner, int type)

%}

-%x mem
-%s config
-%x event
+%x cond_mem
+%s cond_config
+%x cond_event

group [^,{}/]*[{][^}]*[}][^,{}/]*
event_pmu [^,{}/]+[/][^/]*[/][^,{}/]*
@@ -95,9 +95,9 @@ modifier_bp [rwx]{1,3}
start_token = parse_events_get_extra(yyscanner);

if (start_token == PE_START_TERMS)
- BEGIN(config);
+ BEGIN(cond_config);
else if (start_token == PE_START_EVENTS)
- BEGIN(event);
+ BEGIN(cond_event);

if (start_token) {
parse_events_set_extra(NULL, yyscanner);
@@ -106,7 +106,7 @@ modifier_bp [rwx]{1,3}
}
%}

-<event>{
+<cond_event>{

{group} {
BEGIN(INITIAL); yyless(0);
@@ -126,7 +126,7 @@ modifier_bp [rwx]{1,3}

}

-<config>{
+<cond_config>{
config { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG); }
config1 { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG1); }
config2 { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG2); }
@@ -138,7 +138,7 @@ branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE
{name_minus} { return str(yyscanner, PE_NAME); }
}

-<mem>{
+<cond_mem>{
{modifier_bp} { return str(yyscanner, PE_MODIFIER_BP); }
: { return ':'; }
{num_dec} { return value(yyscanner, 10); }
@@ -146,7 +146,7 @@ branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE
/*
* We need to separate 'mem:' scanner part, in order to get specific
* modifier bits parsed out. Otherwise we would need to handle PE_NAME
- * and we'd need to parse it manually. During the escape from <mem>
+ * and we'd need to parse it manually. During the escape from <cond_mem>
* state we need to put the escaping char back, so we dont miss it.
*/
. { unput(*yytext); BEGIN(INITIAL); }
@@ -193,18 +193,18 @@ speculative-read|speculative-load |
refs|Reference|ops|access |
misses|miss { return str(yyscanner, PE_NAME_CACHE_OP_RESULT); }

-mem: { BEGIN(mem); return PE_PREFIX_MEM; }
+mem: { BEGIN(cond_mem); return PE_PREFIX_MEM; }
r{num_raw_hex} { return raw(yyscanner); }
{num_dec} { return value(yyscanner, 10); }
{num_hex} { return value(yyscanner, 16); }

{modifier_event} { return str(yyscanner, PE_MODIFIER_EVENT); }
{name} { return str(yyscanner, PE_NAME); }
-"/" { BEGIN(config); return '/'; }
+"/" { BEGIN(cond_config); return '/'; }
- { return '-'; }
-, { BEGIN(event); return ','; }
+, { BEGIN(cond_event); return ','; }
: { return ':'; }
-"{" { BEGIN(event); return '{'; }
+"{" { BEGIN(cond_event); return '{'; }
"}" { return '}'; }
= { return '='; }
\n { }
--
1.7.11.7

2014-04-07 15:07:46

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 12/16] perf, persistent: ioctl functions to control persistency

From: Robert Richter <[email protected]>

Implementing ioctl functions to control persistent events. There are
functions to unclaim or claim an event to or from a process. The
PERF_EVENT_IOC_UNCLAIM ioctl call makes an event persistent. After
closing the event's fd it runs then in the background of the system
without the need of a controlling process. The perf_event_open()
syscall can be used to reopen the event by any process. The
PERF_EVENT_IOC_CLAIM ioctl attaches the event again so that it is
removed after closing the event's fd.

This is for Linux man-pages:

type ...

PERF_TYPE_PERSISTENT (Since Linux 3.xx)

This indicates a persistent event. There is a unique
identifier for each persistent event that needs to be
specified in the event's attribute config field.
Persistent events are listed under:

/sys/bus/event_source/devices/persistent/

...
persistent : 41, /* always-on event */
...

persistent: (Since Linux 3.xx)

Put event into persistent state after opening. After closing
the event's fd the event is persistent in the system and
continues to run.

perf_event ioctl calls

PERF_EVENT_IOC_UNCLAIM (Since Linux 3.xx)

Unclaim the event specified by the file descriptor from the
process and make it persistent in the system. After
closing the fd the event will continue to run. An unique
identifier for the persistent event is returned or an
error otherwise. The following allows to connect to the
event again:

pe.type = PERF_TYPE_PERSISTENT;
pe.config = <pevent_id>;
...
fd = perf_event_open(...);

The event must be reopened on the same cpu.

PERF_EVENT_IOC_CLAIM (Since Linux 3.xx)

Claim the event specified by the file descriptor to the
current process. The event is no longer persistent in the
system and will be removed after all users disconnected
from the event. Thus, if there are no other users the
event will be closed too after closing its file
descriptor, the event then no longer exists.

[ Jean Pihet: renamed PERF_EVENT_IOC_ATTACH/DETACH to PERF_EVENT_IOC_CLAIM/UNCLAIM ]
Cc: Vince Weaver <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
include/uapi/linux/perf_event.h | 2 +
kernel/events/core.c | 6 ++
kernel/events/internal.h | 2 +
kernel/events/persistent.c | 178 +++++++++++++++++++++++++++++++++-------
4 files changed, 160 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a3f2761..b2f1943 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -350,6 +350,8 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_UNCLAIM _IO ('$', 8)
+#define PERF_EVENT_IOC_CLAIM _IO ('$', 9)

enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80ada8e..bdf3895 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3707,6 +3707,12 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);

+ case PERF_EVENT_IOC_UNCLAIM:
+ return perf_event_unclaim(event);
+
+ case PERF_EVENT_IOC_CLAIM:
+ return perf_event_claim(event);
+
default:
return -ENOTTY;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 6b9a11d..9a871b5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -212,5 +212,7 @@ extern void perf_free_rb(struct perf_event *event);
extern int perf_get_fd(struct perf_event *event, int f_flags);
extern int perf_get_persistent_event_fd(int cpu, int id);
extern void __init perf_register_persistent(void);
+extern int perf_event_unclaim(struct perf_event *event);
+extern int perf_event_claim(struct perf_event *event);

#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 49bf889..0d796c6 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -59,6 +59,49 @@ static struct perf_event *__pevent_find(int cpu, int id)
return NULL;
}

+static void pevent_free(struct pevent *pevent)
+{
+ if (pevent->id)
+ put_event_id(pevent->id);
+
+ kfree(pevent->name);
+ kfree(pevent);
+}
+
+static struct pevent *pevent_alloc(char *name)
+{
+ struct pevent *pevent;
+ char id_buf[32];
+ int ret;
+
+ pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+ if (!pevent)
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&pevent->refcount, 1);
+
+ ret = get_event_id(pevent);
+ if (ret < 0)
+ goto fail;
+ pevent->id = ret;
+
+ if (!name) {
+ snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+ name = id_buf;
+ }
+
+ pevent->name = kstrdup(name, GFP_KERNEL);
+ if (!pevent->name) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ return pevent;
+fail:
+ pevent_free(pevent);
+ return ERR_PTR(ret);
+}
+
static int pevent_add(struct pevent *pevent, struct perf_event *event)
{
int ret = -EEXIST;
@@ -74,6 +117,7 @@ static int pevent_add(struct pevent *pevent, struct perf_event *event)

ret = 0;
event->pevent_id = pevent->id;
+ event->attr.persistent = 1;
list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
unlock:
mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -91,6 +135,7 @@ static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
if (event) {
list_del(&event->pevent_entry);
event->pevent_id = 0;
+ event->attr.persistent = 0;
}

mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -160,33 +205,12 @@ static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
{
struct pevent *pevent;
- char id_buf[32];
int cpu;
int ret;

- pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
- if (!pevent)
- return -ENOMEM;
-
- atomic_set(&pevent->refcount, 1);
-
- ret = get_event_id(pevent);
- if (ret < 0)
- goto fail;
- pevent->id = ret;
-
- if (!name) {
- snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
- name = id_buf;
- }
-
- pevent->name = kstrdup(name, GFP_KERNEL);
- if (!pevent->name) {
- ret = -ENOMEM;
- goto fail;
- }
-
- pevent->sysfs.id = pevent->id;
+ pevent = pevent_alloc(name);
+ if (IS_ERR(pevent))
+ return PTR_ERR(pevent);

for_each_possible_cpu(cpu) {
ret = persistent_event_open(cpu, pevent, attr, nr_pages);
@@ -206,10 +230,7 @@ fail:
out:
if (atomic_dec_and_test(&pevent->refcount)) {
pevent_sysfs_unregister(pevent);
- if (pevent->id)
- put_event_id(pevent->id);
- kfree(pevent->name);
- kfree(pevent);
+ pevent_free(pevent);
}

return ret;
@@ -439,3 +460,104 @@ void __init perf_register_persistent(void)
mutex_init(&per_cpu(pevents_lock, cpu));
}
}
+
+/*
+ * Unclaim an event from a process. The event will remain in the system
+ * after closing the event's fd, it becomes persistent.
+ */
+int perf_event_unclaim(struct perf_event *event)
+{
+ struct pevent *pevent;
+ int cpu;
+ int ret;
+
+ if (!try_get_event(event))
+ return -ENOENT;
+
+ /* task events not yet supported: */
+ cpu = event->cpu;
+ if ((unsigned)cpu >= nr_cpu_ids) {
+ ret = -EINVAL;
+ goto fail_rb;
+ }
+
+ /*
+ * Avoid grabbing an id, later checked again in pevent_add()
+ * with mmap_mutex held.
+ */
+ if (event->pevent_id) {
+ ret = -EEXIST;
+ goto fail_rb;
+ }
+
+ mutex_lock(&event->mmap_mutex);
+ if (event->rb)
+ ret = -EBUSY;
+ else
+ ret = perf_alloc_rb(event, CPU_BUFFER_NR_PAGES, 0);
+ mutex_unlock(&event->mmap_mutex);
+
+ if (ret)
+ goto fail_rb;
+
+ pevent = pevent_alloc(NULL);
+ if (IS_ERR(pevent)) {
+ ret = PTR_ERR(pevent);
+ goto fail_pevent;
+ }
+
+ ret = pevent_add(pevent, event);
+ if (ret)
+ goto fail_add;
+
+ ret = pevent_sysfs_register(pevent);
+ if (ret)
+ goto fail_sysfs;
+
+ atomic_inc(&event->mmap_count);
+
+ return pevent->id;
+fail_sysfs:
+ pevent_del(pevent, cpu);
+fail_add:
+ pevent_free(pevent);
+fail_pevent:
+ mutex_lock(&event->mmap_mutex);
+ if (event->rb)
+ perf_free_rb(event);
+ mutex_unlock(&event->mmap_mutex);
+fail_rb:
+ put_event(event);
+ return ret;
+}
+
+/*
+ * Claim an event from a process. The event will be removed after all
+ * users disconnected from it, it's no longer persistent in the
+ * system.
+ */
+int perf_event_claim(struct perf_event *event)
+{
+ int cpu = event->cpu;
+ struct pevent *pevent;
+
+ if ((unsigned)cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ pevent = find_event(event->pevent_id);
+ if (!pevent)
+ return -EINVAL;
+
+ event = pevent_del(pevent, cpu);
+ if (!event)
+ return -EINVAL;
+
+ if (atomic_dec_and_test(&pevent->refcount)) {
+ pevent_sysfs_unregister(pevent);
+ pevent_free(pevent);
+ }
+
+ persistent_event_release(event);
+
+ return 0;
+}
--
1.7.11.7

2014-04-07 15:05:22

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 09/16] perf, persistent: Use unique event ids

From: Robert Richter <[email protected]>

Tracepoints have a unique attr.config value. But, this is not
sufficient to support all event types. For this we need to generate
unique event ids.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/persistent.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 6578acf..d04827c 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -1,6 +1,7 @@
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <linux/ftrace_event.h>
+#include <linux/idr.h>

#include "internal.h"

@@ -13,10 +14,37 @@ struct pevent {
int id;
};

+static struct idr event_idr;
+static struct mutex event_lock;
static struct pmu persistent_pmu;
static DEFINE_PER_CPU(struct list_head, pevents);
static DEFINE_PER_CPU(struct mutex, pevents_lock);

+static inline struct pevent *find_event(int id)
+{
+ struct pevent *pevent;
+ rcu_read_lock();
+ pevent = idr_find(&event_idr, id);
+ rcu_read_lock();
+ return pevent;
+}
+
+static inline int get_event_id(struct pevent *pevent)
+{
+ int event_id;
+ mutex_lock(&event_lock);
+ event_id = idr_alloc(&event_idr, pevent, 1, INT_MAX, GFP_KERNEL);
+ mutex_unlock(&event_lock);
+ return event_id;
+}
+
+static inline void put_event_id(int id)
+{
+ mutex_lock(&event_lock);
+ idr_remove(&event_idr, id);
+ mutex_unlock(&event_lock);
+}
+
/* Must be protected with pevents_lock. */
static struct perf_event *__pevent_find(int cpu, int id)
{
@@ -128,13 +156,16 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
struct pevent *pevent;
char id_buf[32];
int cpu;
- int ret = 0;
+ int ret;

pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
if (!pevent)
return -ENOMEM;

- pevent->id = attr->config;
+ ret = get_event_id(pevent);
+ if (ret < 0)
+ goto fail;
+ pevent->id = ret;

if (!name) {
snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
@@ -163,6 +194,9 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
fail:
for_each_possible_cpu(cpu)
persistent_event_close(cpu, pevent);
+
+ if (pevent->id)
+ put_event_id(pevent->id);
kfree(pevent->name);
kfree(pevent);

@@ -306,6 +340,8 @@ void __init perf_register_persistent(void)
{
int cpu;

+ idr_init(&event_idr);
+ mutex_init(&event_lock);
perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);

for_each_possible_cpu(cpu) {
--
1.7.11.7

2014-04-07 15:05:16

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 06/16] mce, x86: Enable persistent events

From: Borislav Petkov <[email protected]>

... for MCEs collection.

Signed-off-by: Borislav Petkov <[email protected]>
[ rric: Fix build error for no-tracepoints configs ]
[ rric: Return proper error code. ]
[ rric: No error message if perf is disabled. ]
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 4d5419b..92b0d45 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2018,6 +2018,25 @@ int __init mcheck_init(void)
return 0;
}

+#ifdef CONFIG_EVENT_TRACING
+
+int __init mcheck_init_tp(void)
+{
+ int ret = perf_add_persistent_tp(&event_mce_record);
+
+ if (ret && ret != -ENOENT)
+ pr_err("Error adding MCE persistent event: %d\n", ret);
+
+ return ret;
+}
+/*
+ * We can't run earlier because persistent events uses anon_inode_getfile and
+ * its anon_inode_mnt gets initialized as a fs_initcall.
+ */
+fs_initcall_sync(mcheck_init_tp);
+
+#endif /* CONFIG_EVENT_TRACING */
+
/*
* mce_syscore: PM support
*/
--
1.7.11.7

2014-04-07 15:08:27

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 10/16] perf, persistent: Implement reference counter for events

From: Robert Richter <[email protected]>

We need this later for proper event removal.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/persistent.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index d04827c..fcbb4f8 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -9,6 +9,7 @@
#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)

struct pevent {
+ atomic_t refcount;
struct perf_pmu_events_attr sysfs;
char *name;
int id;
@@ -130,6 +131,7 @@ static int persistent_event_open(int cpu, struct pevent *pevent,
if (ret)
goto fail;

+ atomic_inc(&pevent->refcount);
atomic_inc(&event->mmap_count);

/* All workie, enable event now */
@@ -144,8 +146,11 @@ fail:
static void persistent_event_close(int cpu, struct pevent *pevent)
{
struct perf_event *event = pevent_del(pevent, cpu);
- if (event)
+ if (event) {
+ /* Safe, the caller holds &pevent->refcount too. */
+ atomic_dec(&pevent->refcount);
persistent_event_release(event);
+ }
}

static int pevent_sysfs_register(struct pevent *event);
@@ -162,6 +167,8 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
if (!pevent)
return -ENOMEM;

+ atomic_set(&pevent->refcount, 1);
+
ret = get_event_id(pevent);
if (ret < 0)
goto fail;
@@ -187,21 +194,21 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
}

ret = pevent_sysfs_register(pevent);
- if (ret)
- goto fail;
-
- return 0;
+ if (!ret)
+ goto out;
fail:
for_each_possible_cpu(cpu)
persistent_event_close(cpu, pevent);

- if (pevent->id)
- put_event_id(pevent->id);
- kfree(pevent->name);
- kfree(pevent);
-
pr_err("%s: Error adding persistent event: %d\n",
__func__, ret);
+out:
+ if (atomic_dec_and_test(&pevent->refcount)) {
+ if (pevent->id)
+ put_event_id(pevent->id);
+ kfree(pevent->name);
+ kfree(pevent);
+ }

return ret;
}
--
1.7.11.7

2014-04-07 15:08:47

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 08/16] perf, persistent: Exposing persistent events using sysfs

From: Robert Richter <[email protected]>

Expose persistent events in the system to userland using sysfs. Perf
tools are able to read existing pmu events from sysfs. Now we use a
persistent pmu as an event container containing all registered
persistent events of the system. This patch adds dynamically
registration of persistent events to sysfs. E.g. something like this:

/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23

Perf tools need to support the attr<num> syntax that is added in a
separate patch set. With it we are able to run perf tool commands to
read persistent events, e.g.:

# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/

[ Jiri: Document attr<index> syntax in sysfs ABI ]
[ Namhyung: Fix sysfs registration with lockdep enabled ]
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
.../testing/sysfs-bus-event_source-devices-format | 43 ++++++++++++----
kernel/events/persistent.c | 60 ++++++++++++++++++++++
2 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
index 77f47ff..744d53a 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
@@ -1,13 +1,14 @@
-Where: /sys/bus/event_source/devices/<dev>/format
+Where: /sys/bus/event_source/devices/<pmu>/format/<name>
Date: January 2012
-Kernel Version: 3.3
+Kernel Version: 3.3
+ 3.xx (added attr<index>:<bits>)
Contact: Jiri Olsa <[email protected]>
-Description:
- Attribute group to describe the magic bits that go into
- perf_event_attr::config[012] for a particular pmu.
- Each attribute of this group defines the 'hardware' bitmask
- we want to export, so that userspace can deal with sane
- name/value pairs.
+
+Description: Define formats for bit ranges in perf_event_attr
+
+ Attribute group to describe the magic bits that go
+ into struct perf_event_attr for a particular pmu. Bit
+ range may be any bit mask of an u64 (bits 0 to 63).

Userspace must be prepared for the possibility that attributes
define overlapping bit ranges. For example:
@@ -15,6 +16,26 @@ Description:
attr2 = 'config:0-7'
attr3 = 'config:12-35'

- Example: 'config1:1,6-10,44'
- Defines contents of attribute that occupies bits 1,6-10,44 of
- perf_event_attr::config1.
+ Syntax Description
+
+ config[012]*:<bits> Each attribute of this group
+ defines the 'hardware' bitmask
+ we want to export, so that
+ userspace can deal with sane
+ name/value pairs.
+
+ attr<index>:<bits> Set any field of the event
+ attribute. The index is a
+ decimal number that specifies
+ the u64 value to be set within
+ struct perf_event_attr.
+
+ Examples:
+
+ 'config1:1,6-10,44' Defines contents of attribute
+ that occupies bits 1,6-10,44
+ of perf_event_attr::config1.
+
+ 'attr5:24' Define the persistent event
+ flag (bit 24 of the attribute
+ flags)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 6b039f5..6578acf 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -8,6 +8,7 @@
#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)

struct pevent {
+ struct perf_pmu_events_attr sysfs;
char *name;
int id;
};
@@ -119,6 +120,8 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
persistent_event_release(event);
}

+static int pevent_sysfs_register(struct pevent *event);
+
static int __maybe_unused
persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
{
@@ -144,12 +147,18 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
goto fail;
}

+ pevent->sysfs.id = pevent->id;
+
for_each_possible_cpu(cpu) {
ret = persistent_event_open(cpu, pevent, attr, nr_pages);
if (ret)
goto fail;
}

+ ret = pevent_sysfs_register(pevent);
+ if (ret)
+ goto fail;
+
return 0;
fail:
for_each_possible_cpu(cpu)
@@ -223,10 +232,61 @@ static struct attribute_group persistent_format_group = {
.attrs = persistent_format_attrs,
};

+#define MAX_EVENTS 16
+
+static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+
+static struct attribute_group pevents_group = {
+ .name = "events",
+ .attrs = pevents_attr,
+};
+
static const struct attribute_group *persistent_attr_groups[] = {
&persistent_format_group,
+ NULL, /* placeholder: &pevents_group */
NULL,
};
+#define EVENTS_GROUP_PTR (&persistent_attr_groups[1])
+
+static ssize_t pevent_sysfs_show(struct device *dev,
+ struct device_attribute *__attr, char *page)
+{
+ struct perf_pmu_events_attr *attr =
+ container_of(__attr, struct perf_pmu_events_attr, attr);
+ return sprintf(page, "persistent,config=%lld",
+ (unsigned long long)attr->id);
+}
+
+static int pevent_sysfs_register(struct pevent *pevent)
+{
+ struct perf_pmu_events_attr *sysfs = &pevent->sysfs;
+ struct attribute *attr = &sysfs->attr.attr;
+ struct device *dev = persistent_pmu.dev;
+ const struct attribute_group **group = EVENTS_GROUP_PTR;
+ int idx;
+
+ sysfs->id = pevent->id;
+ sysfs->attr = (struct device_attribute)
+ __ATTR(, 0444, pevent_sysfs_show, NULL);
+ attr->name = pevent->name;
+ sysfs_attr_init(attr);
+
+ /* add sysfs attr to events: */
+ for (idx = 0; idx < MAX_EVENTS; idx++) {
+ if (!cmpxchg(pevents_attr + idx, NULL, attr))
+ break;
+ }
+
+ if (idx >= MAX_EVENTS)
+ return -ENOSPC;
+ if (!idx)
+ *group = &pevents_group;
+ if (!dev)
+ return 0; /* sysfs not yet initialized */
+ if (idx)
+ return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+ return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+}

static int persistent_pmu_init(struct perf_event *event)
{
--
1.7.11.7

2014-04-07 15:09:21

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 07/16] perf, persistent: Implementing a persistent pmu

From: Robert Richter <[email protected]>

We want to use the kernel's pmu design to later expose persistent
events via sysfs to userland. Initially implement a persistent pmu.

The format syntax is introduced allowing to set bits anywhere in
struct perf_event_attr. This is used in this case to set the
persistent flag (attr5:24). The syntax is attr<num> where num is the
index of the u64 array in struct perf_event_attr. Otherwise syntax is
same as for config<num>.

Patches that implement this functionality for perf tools are sent in a
separate patchset.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/persistent.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 78fdd74..6b039f5 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -12,6 +12,7 @@ struct pevent {
int id;
};

+static struct pmu persistent_pmu;
static DEFINE_PER_CPU(struct list_head, pevents);
static DEFINE_PER_CPU(struct mutex, pevents_lock);

@@ -210,10 +211,43 @@ int perf_get_persistent_event_fd(int cpu, int id)
return event_fd;
}

+PMU_FORMAT_ATTR(persistent, "attr5:24");
+
+static struct attribute *persistent_format_attrs[] = {
+ &format_attr_persistent.attr,
+ NULL,
+};
+
+static struct attribute_group persistent_format_group = {
+ .name = "format",
+ .attrs = persistent_format_attrs,
+};
+
+static const struct attribute_group *persistent_attr_groups[] = {
+ &persistent_format_group,
+ NULL,
+};
+
+static int persistent_pmu_init(struct perf_event *event)
+{
+ if (persistent_pmu.type != event->attr.type)
+ return -ENOENT;
+
+ /* Not a persistent event. */
+ return -EFAULT;
+}
+
+static struct pmu persistent_pmu = {
+ .event_init = persistent_pmu_init,
+ .attr_groups = persistent_attr_groups,
+};
+
void __init perf_register_persistent(void)
{
int cpu;

+ perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
+
for_each_possible_cpu(cpu) {
INIT_LIST_HEAD(&per_cpu(pevents, cpu));
mutex_init(&per_cpu(pevents_lock, cpu));
--
1.7.11.7

2014-04-07 15:09:45

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 05/16] perf: Add persistent events

From: Robert Richter <[email protected]>

Add the needed pieces for persistent events which makes them
process-agnostic. Also, make their buffers read-only when mmaping them
from userspace.

Add a barebones implementation for registering persistent events with
perf. For that, we don't destroy the buffers when they're unmapped;
also, we map them read-only so that multiple agents can access them.

Also, we allocate the event buffers at event init time and not at mmap
time so that we can log samples into them regardless of whether there
are readers in userspace or not.

Multiple events from different cpus may map to a single persistent
event entry which has a unique identifier. The identifier allows to
access the persistent event with the perf_event_open() syscall. For
this the new event type PERF_TYPE_PERSISTENT must be set with its id
specified in attr.config. Currently there is only support for per-cpu
events. Also, root access is required.

Since the buffers are shared, the set_output ioctl may not be used in
conjunction with persistent events.

This patch only supports trace_points, support for all event types is
implemented in a later patch.

Based on patch set from Borislav Petkov <[email protected]>.

Cc: Borislav Petkov <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Jiri Olsa <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
include/linux/perf_event.h | 12 ++-
include/uapi/linux/perf_event.h | 4 +-
kernel/events/Makefile | 2 +-
kernel/events/core.c | 38 +++++--
kernel/events/internal.h | 2 +
kernel/events/persistent.c | 221 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 266 insertions(+), 13 deletions(-)
create mode 100644 kernel/events/persistent.c

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e56b07f..c368e9c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -436,6 +436,8 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ struct list_head pevent_entry; /* persistent event */
+ int pevent_id;

#endif /* CONFIG_PERF_EVENTS */
};
@@ -770,7 +772,7 @@ extern void perf_event_enable(struct perf_event *event);
extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
-#else
+#else /* !CONFIG_PERF_EVENTS */
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
@@ -810,7 +812,7 @@ static inline void perf_event_enable(struct perf_event *event) { }
static inline void perf_event_disable(struct perf_event *event) { }
static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
-#endif
+#endif /* !CONFIG_PERF_EVENTS */

#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
extern bool perf_event_can_stop_tick(void);
@@ -824,6 +826,12 @@ extern void perf_restore_debug_store(void);
static inline void perf_restore_debug_store(void) { }
#endif

+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_EVENT_TRACING)
+extern int perf_add_persistent_tp(struct ftrace_event_call *tp);
+#else
+static inline int perf_add_persistent_tp(void *tp) { return -ENOENT; }
+#endif
+
#define perf_output_put(handle, x) perf_output_copy((handle), &(x), sizeof(x))

/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 853bc1c..a3f2761 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -32,6 +32,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE = 3,
PERF_TYPE_RAW = 4,
PERF_TYPE_BREAKPOINT = 5,
+ PERF_TYPE_PERSISTENT = 6,

PERF_TYPE_MAX, /* non-ABI */
};
@@ -301,8 +302,9 @@ struct perf_event_attr {
exclude_callchain_kernel : 1, /* exclude kernel callchains */
exclude_callchain_user : 1, /* exclude user callchains */
mmap2 : 1, /* include mmap with inode data */
+ persistent : 1, /* always-on event */

- __reserved_1 : 40;
+ __reserved_1 : 39;

union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d1..70990d5 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_core.o = -pg
endif

-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o persistent.o

obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9857475..80ada8e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4087,6 +4087,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED))
return -EINVAL;

+ if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
vma_size = vma->vm_end - vma->vm_start;
nr_pages = (vma_size / PAGE_SIZE) - 1;

@@ -4112,6 +4115,11 @@ again:
goto unlock;
}

+ if (!event->rb->overwrite && vma->vm_flags & VM_WRITE) {
+ ret = -EACCES;
+ goto unlock;
+ }
+
if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
/*
* Raced against perf_mmap_close() through
@@ -5995,7 +6003,7 @@ static struct pmu perf_tracepoint = {
.event_idx = perf_swevent_event_idx,
};

-static inline void perf_tp_register(void)
+static inline void perf_register_tp(void)
{
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
}
@@ -6025,18 +6033,14 @@ static void perf_event_free_filter(struct perf_event *event)

#else

-static inline void perf_tp_register(void)
-{
-}
+static inline void perf_register_tp(void) { }

static int perf_event_set_filter(struct perf_event *event, void __user *arg)
{
return -ENOENT;
}

-static void perf_event_free_filter(struct perf_event *event)
-{
-}
+static void perf_event_free_filter(struct perf_event *event) { }

#endif /* CONFIG_EVENT_TRACING */

@@ -6729,7 +6733,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->rb_entry);
INIT_LIST_HEAD(&event->active_entry);
INIT_HLIST_NODE(&event->hlist_entry);
-
+ INIT_LIST_HEAD(&event->pevent_entry);

init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
@@ -6991,6 +6995,13 @@ set:
goto unlock;
}

+ /* Don't redirect read-only (persistent) events. */
+ ret = -EACCES;
+ if (old_rb && !old_rb->overwrite)
+ goto unlock;
+ if (rb && !rb->overwrite)
+ goto unlock;
+
if (old_rb)
ring_buffer_detach(event, old_rb);

@@ -7049,6 +7060,14 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;

+ /* return fd for an existing persistent event */
+ if (attr.type == PERF_TYPE_PERSISTENT)
+ return perf_get_persistent_event_fd(cpu, attr.config);
+
+ /* put event into persistent state (not yet supported) */
+ if (attr.persistent)
+ return -EOPNOTSUPP;
+
if (!attr.exclude_kernel) {
if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -7990,7 +8009,8 @@ void __init perf_event_init(void)
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
perf_pmu_register(&perf_cpu_clock, NULL, -1);
perf_pmu_register(&perf_task_clock, NULL, -1);
- perf_tp_register();
+ perf_register_tp();
+ perf_register_persistent();
perf_cpu_notifier(perf_cpu_notify);
register_reboot_notifier(&perf_reboot_notifier);

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5f1f92d..6b9a11d 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -210,5 +210,7 @@ static inline void put_event(struct perf_event *event)
extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
extern void perf_free_rb(struct perf_event *event);
extern int perf_get_fd(struct perf_event *event, int f_flags);
+extern int perf_get_persistent_event_fd(int cpu, int id);
+extern void __init perf_register_persistent(void);

#endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
new file mode 100644
index 0000000..78fdd74
--- /dev/null
+++ b/kernel/events/persistent.c
@@ -0,0 +1,221 @@
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <linux/ftrace_event.h>
+
+#include "internal.h"
+
+/* 512 kiB: default perf tools memory size, see perf_evlist__mmap() */
+#define CPU_BUFFER_NR_PAGES ((512 * 1024) / PAGE_SIZE)
+
+struct pevent {
+ char *name;
+ int id;
+};
+
+static DEFINE_PER_CPU(struct list_head, pevents);
+static DEFINE_PER_CPU(struct mutex, pevents_lock);
+
+/* Must be protected with pevents_lock. */
+static struct perf_event *__pevent_find(int cpu, int id)
+{
+ struct perf_event *event;
+
+ list_for_each_entry(event, &per_cpu(pevents, cpu), pevent_entry) {
+ if (event->pevent_id == id)
+ return event;
+ }
+
+ return NULL;
+}
+
+static int pevent_add(struct pevent *pevent, struct perf_event *event)
+{
+ int ret = -EEXIST;
+ int cpu = event->cpu;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+
+ if (__pevent_find(cpu, pevent->id))
+ goto unlock;
+
+ if (event->pevent_id)
+ goto unlock;
+
+ ret = 0;
+ event->pevent_id = pevent->id;
+ list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
+unlock:
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ return ret;
+}
+
+static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
+{
+ struct perf_event *event;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+
+ event = __pevent_find(cpu, pevent->id);
+ if (event) {
+ list_del(&event->pevent_entry);
+ event->pevent_id = 0;
+ }
+
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ return event;
+}
+
+static void persistent_event_release(struct perf_event *event)
+{
+ /*
+ * Safe since we hold &event->mmap_count. The ringbuffer is
+ * released with put_event() if there are no other references.
+ * In this case there are also no other mmaps.
+ */
+ atomic_dec(&event->rb->mmap_count);
+ atomic_dec(&event->mmap_count);
+ put_event(event);
+}
+
+static int persistent_event_open(int cpu, struct pevent *pevent,
+ struct perf_event_attr *attr, int nr_pages)
+{
+ struct perf_event *event;
+ int ret;
+
+ event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+ if (IS_ERR(event))
+ return PTR_ERR(event);
+
+ if (nr_pages < 0)
+ nr_pages = CPU_BUFFER_NR_PAGES;
+
+ ret = perf_alloc_rb(event, nr_pages, 0);
+ if (ret)
+ goto fail;
+
+ ret = pevent_add(pevent, event);
+ if (ret)
+ goto fail;
+
+ atomic_inc(&event->mmap_count);
+
+ /* All workie, enable event now */
+ perf_event_enable(event);
+
+ return ret;
+fail:
+ perf_event_release_kernel(event);
+ return ret;
+}
+
+static void persistent_event_close(int cpu, struct pevent *pevent)
+{
+ struct perf_event *event = pevent_del(pevent, cpu);
+ if (event)
+ persistent_event_release(event);
+}
+
+static int __maybe_unused
+persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
+{
+ struct pevent *pevent;
+ char id_buf[32];
+ int cpu;
+ int ret = 0;
+
+ pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+ if (!pevent)
+ return -ENOMEM;
+
+ pevent->id = attr->config;
+
+ if (!name) {
+ snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+ name = id_buf;
+ }
+
+ pevent->name = kstrdup(name, GFP_KERNEL);
+ if (!pevent->name) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ for_each_possible_cpu(cpu) {
+ ret = persistent_event_open(cpu, pevent, attr, nr_pages);
+ if (ret)
+ goto fail;
+ }
+
+ return 0;
+fail:
+ for_each_possible_cpu(cpu)
+ persistent_event_close(cpu, pevent);
+ kfree(pevent->name);
+ kfree(pevent);
+
+ pr_err("%s: Error adding persistent event: %d\n",
+ __func__, ret);
+
+ return ret;
+}
+
+#ifdef CONFIG_EVENT_TRACING
+
+int perf_add_persistent_tp(struct ftrace_event_call *tp)
+{
+ struct perf_event_attr attr;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.sample_period = 1;
+ attr.wakeup_events = 1;
+ attr.sample_type = PERF_SAMPLE_RAW;
+ attr.persistent = 1;
+ attr.config = tp->event.type;
+ attr.type = PERF_TYPE_TRACEPOINT;
+ attr.size = sizeof(attr);
+
+ return persistent_open(tp->name, &attr, -1);
+}
+
+#endif /* CONFIG_EVENT_TRACING */
+
+int perf_get_persistent_event_fd(int cpu, int id)
+{
+ struct perf_event *event;
+ int event_fd = 0;
+
+ if ((unsigned)cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ /* Must be root for persistent events */
+ if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ mutex_lock(&per_cpu(pevents_lock, cpu));
+ event = __pevent_find(cpu, id);
+ if (!event || !try_get_event(event))
+ event_fd = -ENOENT;
+ mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+ if (event_fd)
+ return event_fd;
+
+ event_fd = perf_get_fd(event, O_RDWR);
+ if (event_fd < 0)
+ put_event(event);
+
+ return event_fd;
+}
+
+void __init perf_register_persistent(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ INIT_LIST_HEAD(&per_cpu(pevents, cpu));
+ mutex_init(&per_cpu(pevents_lock, cpu));
+ }
+}
--
1.7.11.7

2014-04-07 15:10:08

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

From: Robert Richter <[email protected]>

This new function creates a new fd for an event. We need this later to
get a fd from a persistent event.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/core.c | 14 ++++++++------
kernel/events/internal.h | 1 +
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 22ec8f0..9857475 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4206,6 +4206,11 @@ static const struct file_operations perf_fops = {
.fasync = perf_fasync,
};

+int perf_get_fd(struct perf_event *event, int f_flags)
+{
+ return anon_inode_getfd("[perf_event]", &perf_fops, event, f_flags);
+}
+
/*
* Perf event wakeup
*
@@ -7028,7 +7033,6 @@ SYSCALL_DEFINE5(perf_event_open,
struct perf_event *event, *sibling;
struct perf_event_attr attr;
struct perf_event_context *ctx;
- struct file *event_file = NULL;
struct fd group = {NULL, 0};
struct task_struct *task = NULL;
struct pmu *pmu;
@@ -7189,10 +7193,9 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}

- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
- f_flags);
- if (IS_ERR(event_file)) {
- err = PTR_ERR(event_file);
+ event_fd = perf_get_fd(event, f_flags);
+ if (event_fd < 0) {
+ err = event_fd;
goto err_context;
}

@@ -7257,7 +7260,6 @@ SYSCALL_DEFINE5(perf_event_open,
* perf_group_detach().
*/
fdput(group);
- fd_install(event_fd, event_file);
return event_fd;

err_context:
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e9007ff..5f1f92d 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -209,5 +209,6 @@ static inline void put_event(struct perf_event *event)

extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
extern void perf_free_rb(struct perf_event *event);
+extern int perf_get_fd(struct perf_event *event, int f_flags);

#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.7.11.7

2014-04-07 15:11:13

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 03/16] perf, mmap: Factor out perf_alloc/free_rb()

From: Robert Richter <[email protected]>

Factor out code to allocate and deallocate ringbuffers. We need this
later to setup the sampling buffer for persistent events.

While at this, replacing get_current_user() with get_uid(user).

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/core.c | 77 +++++++++++++++++++++++++++++-------------------
kernel/events/internal.h | 3 ++
2 files changed, 50 insertions(+), 30 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5eaba42..22ec8f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3193,7 +3193,45 @@ static void free_event_rcu(struct rcu_head *head)
}

static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event, struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+static void perf_event_init_userpage(struct perf_event *event);
+
+/*
+ * Must be called with &event->mmap_mutex held. event->rb must be
+ * NULL. perf_alloc_rb() requires &event->mmap_count to be incremented
+ * on success which corresponds to &rb->mmap_count that is initialized
+ * with 1.
+ */
+int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags)
+{
+ struct ring_buffer *rb;
+
+ rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);
+ if (!rb)
+ return -ENOMEM;
+
+ atomic_set(&rb->mmap_count, 1);
+ ring_buffer_attach(event, rb);
+ rcu_assign_pointer(event->rb, rb);
+
+ perf_event_init_userpage(event);
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+/* Must be called with &event->mmap_mutex held. event->rb must be set. */
+void perf_free_rb(struct perf_event *event)
+{
+ struct ring_buffer *rb = event->rb;
+
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb);
+}

static void unaccount_event_cpu(struct perf_event *event, int cpu)
{
@@ -3246,6 +3284,7 @@ static void __free_event(struct perf_event *event)

call_rcu(&event->rcu_head, free_event_rcu);
}
+
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3253,8 +3292,6 @@ static void free_event(struct perf_event *event)
unaccount_event(event);

if (event->rb) {
- struct ring_buffer *rb;
-
/*
* Can happen when we close an event with re-directed output.
*
@@ -3262,12 +3299,8 @@ static void free_event(struct perf_event *event)
* over us; possibly making our ring_buffer_put() the last.
*/
mutex_lock(&event->mmap_mutex);
- rb = event->rb;
- if (rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* could be last */
- }
+ if (event->rb)
+ perf_free_rb(event);
mutex_unlock(&event->mmap_mutex);
}

@@ -3901,11 +3934,8 @@ again:
* still restart the iteration to make sure we're not now
* iterating the wrong list.
*/
- if (event->rb == rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* can't be last, we still have one */
- }
+ if (event->rb == rb)
+ perf_free_rb(event);
mutex_unlock(&event->mmap_mutex);
put_event(event);

@@ -4041,7 +4071,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
unsigned long locked, lock_limit;
- struct ring_buffer *rb;
unsigned long vma_size;
unsigned long nr_pages;
long user_extra, extra;
@@ -4125,28 +4154,16 @@ again:
if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;

- rb = rb_alloc(nr_pages,
- event->attr.watermark ? event->attr.wakeup_watermark : 0,
- event->cpu, flags);
-
- if (!rb) {
- ret = -ENOMEM;
+ ret = perf_alloc_rb(event, nr_pages, flags);
+ if (ret)
goto unlock;
- }

- atomic_set(&rb->mmap_count, 1);
- rb->mmap_locked = extra;
- rb->mmap_user = get_current_user();
+ event->rb->mmap_locked = extra;
+ event->rb->mmap_user = get_uid(user);

atomic_long_add(user_extra, &user->locked_vm);
vma->vm_mm->pinned_vm += extra;

- ring_buffer_attach(event, rb);
- rcu_assign_pointer(event->rb, rb);
-
- perf_event_init_userpage(event);
- perf_event_update_userpage(event);
-
unlock:
if (!ret)
atomic_inc(&event->mmap_count);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3bd89d4..e9007ff 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -207,4 +207,7 @@ static inline void put_event(struct perf_event *event)
__put_event(event);
}

+extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
+extern void perf_free_rb(struct perf_event *event);
+
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
1.7.11.7

2014-04-07 15:12:50

by Jean Pihet

[permalink] [raw]
Subject: [PATCH 01/16] perf, mmap: Factor out ring_buffer_detach_all()

From: Robert Richter <[email protected]>

Factor out a function to detach all events from a ringbuffer. No
functional changes.

Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jean Pihet <[email protected]>
---
kernel/events/core.c | 82 ++++++++++++++++++++++++++++------------------------
1 file changed, 44 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951a..8867236 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3878,6 +3878,49 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
spin_unlock_irqrestore(&rb->event_lock, flags);
}

+static void ring_buffer_detach_all(struct ring_buffer *rb)
+{
+ struct perf_event *event;
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ if (!atomic_long_inc_not_zero(&event->refcount)) {
+ /*
+ * This event is en-route to free_event() which will
+ * detach it and remove it from the list.
+ */
+ continue;
+ }
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ /*
+ * Check we didn't race with perf_event_set_output() which can
+ * swizzle the rb from under us while we were waiting to
+ * acquire mmap_mutex.
+ *
+ * If we find a different rb; ignore this event, a next
+ * iteration will no longer find it on the list. We have to
+ * still restart the iteration to make sure we're not now
+ * iterating the wrong list.
+ */
+ if (event->rb == rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ }
+ mutex_unlock(&event->mmap_mutex);
+ put_event(event);
+
+ /*
+ * Restart the iteration; either we're on the wrong list or
+ * destroyed its integrity by doing a deletion.
+ */
+ goto again;
+ }
+ rcu_read_unlock();
+}
+
static void ring_buffer_wakeup(struct perf_event *event)
{
struct ring_buffer *rb;
@@ -3970,44 +4013,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
* into the now unreachable buffer. Somewhat complicated by the
* fact that rb::event_lock otherwise nests inside mmap_mutex.
*/
-again:
- rcu_read_lock();
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
- if (!atomic_long_inc_not_zero(&event->refcount)) {
- /*
- * This event is en-route to free_event() which will
- * detach it and remove it from the list.
- */
- continue;
- }
- rcu_read_unlock();
-
- mutex_lock(&event->mmap_mutex);
- /*
- * Check we didn't race with perf_event_set_output() which can
- * swizzle the rb from under us while we were waiting to
- * acquire mmap_mutex.
- *
- * If we find a different rb; ignore this event, a next
- * iteration will no longer find it on the list. We have to
- * still restart the iteration to make sure we're not now
- * iterating the wrong list.
- */
- if (event->rb == rb) {
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- ring_buffer_put(rb); /* can't be last, we still have one */
- }
- mutex_unlock(&event->mmap_mutex);
- put_event(event);
-
- /*
- * Restart the iteration; either we're on the wrong list or
- * destroyed its integrity by doing a deletion.
- */
- goto again;
- }
- rcu_read_unlock();
+ ring_buffer_detach_all(rb);

/*
* It could be there's still a few 0-ref events on the list; they'll
--
1.7.11.7

2014-04-17 12:44:28

by Jean Pihet

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

Hi,

Ping on this series. Are there some opinions or interest in this?

Regards,
Jean

On 7 April 2014 17:04, Jean Pihet <[email protected]> wrote:
> Jean Pihet <[email protected]>:
> - In order to restart the discussion on the topic, here is a rebased
> version of Robert's latest patches (v3) on acme/perf/core.
> It has been compiled and lightly tested on ARM64.
>
> - From the latest discussion on ML the ioctls are renamed from
> PERF_EVENT_IOC_ATTACH/DETACH to PERF_EVENT_IOC_CLAIM/UNCLAIM.
>
>
> Robert Richter <[email protected]>:
> This patch set implements the necessary kernel changes for persistent
> events.
>
> Persistent events run standalone in the system without the need of a
> controlling process that holds an event's file descriptor. The events
> are always enabled and collect data samples in a ring buffer.
> Processes may connect to existing persistent events using the
> perf_event_open() syscall. For this the syscall must be configured
> using the new PERF_TYPE_PERSISTENT event type and a unique event
> identifier specified in attr.config. The id is propagated in sysfs or
> using ioctl (see below).
>
> Persistent event buffers may be accessed with mmap() in the same way
> as for any other event. Since the buffers may be used by multiple
> processes at the same time, there is only read-only access to them.
> Currently there is only support for per-cpu events, thus root access
> is needed too.
>
> Persistent events are visible in sysfs. They are added or removed
> dynamically. With the information in sysfs userland knows about how to
> setup the perf_event attribute of a persistent event. Since a
> persistent event always has the persistent flag set, a way is needed
> to express this in sysfs. A new syntax is used for this. With
> 'attr<num>:<mask>' any bit in the attribute structure may be set in a
> similar way as using 'config<num>', but <num> is an index that points
> to the u64 value to change within the attribute.
>
> For persistent events the persistent flag (bit 24 of flag field in
> struct perf_event_attr) needs to be set which is expressed in sysfs
> with "attr5:24". E.g. the mce_record event is described in sysfs as
> follows:
>
> /sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
> /sys/bus/event_source/devices/persistent/format/persistent:attr5:24
>
> Note that perf tools need to support the 'attr<num>' syntax that is
> added in a separate patch set. With it we are able to run perf tool
> commands to read persistent events, e.g.:
>
> # perf record -e persistent/mce_record/ sleep 10
> # perf top -e persistent/mce_record/
>
> In general the new syntax is flexible to describe with sysfs any event
> to be setup by perf tools.
>
> There are ioctl functions to control persistent events that can be
> used to detach or attach an event to or from a process. The
> PERF_EVENT_IOC_UNCLAIM ioctl call makes an event persistent. The
> perf_event_open() syscall can be used to re-open the event by any
> process. The PERF_EVENT_IOC_CLAIM ioctl attaches the event again so
> that it is removed after closing the event's fd.
>
> The patches base on the originally work from Borislav Petkov.
>
> This version 3 of the patch set is a complete rework of the code.
> There are the following major changes:
>
> * new event type PERF_TYPE_PERSISTENT introduced,
>
> * support for all type of events,
>
> * unique event ids,
>
> * improvements in reference counting and locking,
>
> * ioctl functions are added to control persistency,
>
> * the sysfs implementation now uses variable list size.
>
> This should address most issues discussed during last review of
> version 2. The following is unresolved yet and can be added later on
> top of this patches, if necessary:
>
> * support for per-task events (also allowing non-root access),
>
> * creation of persistent events for disabled cpus,
>
> * make event persistent with already open (mmap'ed) buffers,
>
> * make event persistent while creating it.
>
> First patches contain some rework of the perf mmap code to reuse it
> for persistent events.
>
> Also note that patch 12 (ioctl functions to control persistency) is
> RFC and untested. A perf tools implementation for this is missing and
> some ideas are needed how this could be integrated, esp. in something
> like perf trace or so.
>
> All v3 patches can be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v3
>
> Note: I will resent the perf tools patch necessary to use persistent
> events.
>
> -Robert
>
>
> Borislav Petkov (1):
> mce, x86: Enable persistent events
>
> Robert Richter (15):
> perf, mmap: Factor out ring_buffer_detach_all()
> perf, mmap: Factor out try_get_event()/put_event()
> perf, mmap: Factor out perf_alloc/free_rb()
> perf, mmap: Factor out perf_get_fd()
> perf: Add persistent events
> perf, persistent: Implementing a persistent pmu
> perf, persistent: Exposing persistent events using sysfs
> perf, persistent: Use unique event ids
> perf, persistent: Implement reference counter for events
> perf, persistent: Dynamically resize list of sysfs entries
> perf, persistent: ioctl functions to control persistency
> perf tools: Rename flex conditions to avoid name conflicts
> perf tools: Modify event parser to update event attribute by index
> perf tools: Add attr<num> syntax to event parser
> perf tools: Retry mapping buffers readonly on EACCES
>
> .../testing/sysfs-bus-event_source-devices-format | 43 +-
> arch/x86/kernel/cpu/mcheck/mce.c | 19 +
> include/linux/perf_event.h | 12 +-
> include/uapi/linux/perf_event.h | 6 +-
> kernel/events/Makefile | 2 +-
> kernel/events/core.c | 214 +++++---
> kernel/events/internal.h | 20 +
> kernel/events/persistent.c | 563 +++++++++++++++++++++
> tools/perf/builtin-record.c | 11 +-
> tools/perf/builtin-top.c | 8 +-
> tools/perf/perf.h | 1 +
> tools/perf/tests/parse-events.c | 12 +-
> tools/perf/util/parse-events.c | 59 +--
> tools/perf/util/parse-events.h | 12 +-
> tools/perf/util/parse-events.l | 58 ++-
> tools/perf/util/parse-events.y | 24 +-
> tools/perf/util/pmu.c | 32 +-
> tools/perf/util/pmu.h | 9 +-
> tools/perf/util/pmu.l | 1 +
> tools/perf/util/pmu.y | 18 +-
> 20 files changed, 911 insertions(+), 213 deletions(-)
> create mode 100644 kernel/events/persistent.c
>
> --
> 1.7.11.7
>

2014-04-17 12:50:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

On Thu, Apr 17, 2014 at 02:44:26PM +0200, Jean Pihet wrote:
> Hi,
>
> Ping on this series. Are there some opinions or interest in this?

Just checking whether you've read my mail from 2-ish weeks ago:

https://lkml.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-04-17 13:17:04

by Jean Pihet

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

Hi Borislav,

On 17 April 2014 14:50, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 17, 2014 at 02:44:26PM +0200, Jean Pihet wrote:
>> Hi,
>>
>> Ping on this series. Are there some opinions or interest in this?
>
> Just checking whether you've read my mail from 2-ish weeks ago:
>
> https://lkml.kernel.org/r/[email protected]
Yes indeed!

Ingo,
Is the perf support for persistent event needed, or can this series be
reviewed as is?

In the meantime I am now working on the perf tool:
- add persistent events,
- factor out the code and provide a library (libperf) for other tool
to use it (e.g. monitoring daemon).

I surely will let you know!

Thx,
Jean

>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-04-17 13:22:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

Hi Jean,

On Thu, Apr 17, 2014 at 03:17:01PM +0200, Jean Pihet wrote:
> Yes indeed!
>
> Ingo,
> Is the perf support for persistent event needed, or can this series be
> reviewed as is?
>
> In the meantime I am now working on the perf tool:
> - add persistent events,
> - factor out the code and provide a library (libperf) for other tool
> to use it (e.g. monitoring daemon).

Ok, good. I'm doing a bit too:

https://lkml.org/lkml/2014/4/13/20

so let's synchronize and get this thing going. I'll add you to my CC
list on future submissions.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-04-22 08:20:23

by Jean Pihet

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

Hi Borislav,

On 17 April 2014 15:21, Borislav Petkov <[email protected]> wrote:
> Hi Jean,
>
> On Thu, Apr 17, 2014 at 03:17:01PM +0200, Jean Pihet wrote:
>> Yes indeed!
>>
>> Ingo,
>> Is the perf support for persistent event needed, or can this series be
>> reviewed as is?
>>
>> In the meantime I am now working on the perf tool:
>> - add persistent events,
>> - factor out the code and provide a library (libperf) for other tool
>> to use it (e.g. monitoring daemon).
>
> Ok, good. I'm doing a bit too:
>
> https://lkml.org/lkml/2014/4/13/20
>
> so let's synchronize and get this thing going. I'll add you to my CC
> list on future submissions.
Ok, great!

About libperf and the persistent events, do you know more details
about the status and the proposed API? I have no info about the work
that was started previously.

Regards,
Jean

>
> Thanks.
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-04-22 10:07:24

by Jean Pihet

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

Hi,

-- adding Fu Wei in the loop--

Fu Wei is working on the RAS daemon, which is a user of libperf. It is
crucial to agree on the libperf API in order to start with the correct
implementation.

Jean

On 22 April 2014 10:20, Jean Pihet <[email protected]> wrote:
> Hi Borislav,
>
> On 17 April 2014 15:21, Borislav Petkov <[email protected]> wrote:
>> Hi Jean,
>>
>> On Thu, Apr 17, 2014 at 03:17:01PM +0200, Jean Pihet wrote:
>>> Yes indeed!
>>>
>>> Ingo,
>>> Is the perf support for persistent event needed, or can this series be
>>> reviewed as is?
>>>
>>> In the meantime I am now working on the perf tool:
>>> - add persistent events,
>>> - factor out the code and provide a library (libperf) for other tool
>>> to use it (e.g. monitoring daemon).
>>
>> Ok, good. I'm doing a bit too:
>>
>> https://lkml.org/lkml/2014/4/13/20
>>
>> so let's synchronize and get this thing going. I'll add you to my CC
>> list on future submissions.
> Ok, great!
>
> About libperf and the persistent events, do you know more details
> about the status and the proposed API? I have no info about the work
> that was started previously.
>
> Regards,
> Jean
>
>>
>> Thanks.
>>
>> --
>> Regards/Gruss,
>> Boris.
>>
>> Sent from a fat crate under my desk. Formatting is fine.
>> --

2014-04-22 10:20:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] perf, persistent: Add persistent events

On Tue, Apr 22, 2014 at 12:07:07PM +0200, Jean Pihet wrote:
> Hi,
>
> -- adding Fu Wei in the loop--
>
> Fu Wei is working on the RAS daemon, which is a user of libperf. It is

Good. :)

> crucial to agree on the libperf API in order to start with the correct
> implementation.

Again, it should be a libperf library which contains only the
perf-specific bits. I've done a rough carve out, see the mail below.

However, then we started a conversation in a smaller thread which lead
to

https://lkml.kernel.org/r/[email protected]

In any case, the basic perf tool functionality which we need right
now is in the last 4 patches of these 16 - we want to be able to
open/enable/close/... persistent events in the ras daemon without the
perf tool, i.e. link to as many libraries as needed in tools/.

Also, libperf should not be exposed to external users... for now.

And yes, carving out perf tool functionality is very tedious work -
that's why I've dropped to sending only a couple of patches at once so
that merge conflicts can be avoided.

What else,... hmm, yeah, that should be it. Just take a look at those
mails and let me know. Also, this whole thing is not the final design -
we're just hashing out ideas and changing things as we go.

Thanks!

--
>From [email protected] Fri Nov 15 21:29:18 2013
Date: Fri, 15 Nov 2013 21:28:48 +0100
From: Borislav Petkov <[email protected]>
To: Ingo Molnar <[email protected]>, Arnaldo Carvalho de Melo
<[email protected]>
Cc: Robert Richter <[email protected]>, Peter Zijlstra <[email protected]>,
Jiri Olsa <[email protected]>
Subject: Re: persistent events: Tooling support; perf tool splitup
Message-ID: <[email protected]>
User-Agent: Mutt/1.5.21 (2010-09-15)
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8
Status: RO

Hi guys,

I'm finally ready with the next attempt at a carve out of the facilities
in tools/perf/ into libraries for external use. I'm adding the diffstat
of the patches below with comments so that you can see how it would look
like. Once we all agree on the structure, I'll start feeding acme 1-2
patches at a time so that there are no big merge issues.

Comments and suggestions are, as always, very welcome and needed. :-)

* The first group are the kernel headers any tools in tools/
could/are/would be using. I've put them into tools/include/ because it
leaves the #include <linux...> directives unchanged.

But since we're making a liblk kernel library, I can imagine putting
them into tools/lib/lk/ and making all part of that generic library.

tools/{perf/util => }/include/asm/bug.h | 6 +-
tools/{perf/util => }/include/linux/bitmap.h | 7 +-
tools/{perf/util => }/include/linux/bitops.h | 12 +-
tools/{perf/util => }/include/linux/compiler.h | 8 +-
tools/{perf/util => }/include/linux/kernel.h | 6 +-
tools/{perf/util => }/include/linux/list.h | 9 +-
tools/{perf/util => }/include/linux/types.h | 6 +-
tools/include/asm/byteorder.h | 2 +
tools/include/linux/export.h | 6 +
tools/include/linux/hash.h | 8 +
tools/include/linux/poison.h | 1 +
tools/include/linux/prefetch.h | 6 +
tools/include/linux/rbtree.h | 3 +
tools/include/linux/rbtree_augmented.h | 2 +

This is tools/lib/lk/liblk.a which contains only compilation units which
contain generic code. The idea behind this library is to be used by all
tools in tools/.

tools/{perf/util => lib/lk}/color.c | 69 +-
tools/{perf/util => lib/lk}/color.h | 34 +-
tools/{perf/util => lib/lk}/cpumap.c | 9 +-
tools/{perf/util => lib/lk}/cpumap.h | 4 +
tools/{perf/util => lib/lk}/ctype.c | 2 +-
tools/{perf/util => lib/lk}/hweight.c | 1 +
tools/{perf/util => lib/lk}/include/linux/magic.h | 0
tools/{perf/util => lib/lk}/parse-options.c | 6 +-
tools/{perf/util => lib/lk}/parse-options.h | 13 +-
tools/{perf/util => lib/lk}/rblist.c | 0
tools/{perf/util => lib/lk}/rblist.h | 6 +-
tools/{perf/util => lib/lk}/strbuf.c | 3 +-
tools/{perf/util => lib/lk}/strbuf.h | 15 +-
tools/{perf/util => lib/lk}/string.c | 42 +-
tools/{perf/util => lib/lk}/strlist.c | 0
tools/{perf/util => lib/lk}/strlist.h | 6 +-
tools/{perf/util => lib/lk}/sysfs.c | 4 +
tools/{perf/util => lib/lk}/thread_map.c | 0
tools/{perf/util => lib/lk}/thread_map.h | 6 +-
tools/{perf/util => lib/lk}/types.h | 6 +-
tools/{perf/util => lib/lk}/usage.c | 2 +-
tools/{perf/util => lib/lk}/wrapper.c | 3 +-
tools/{perf/util => lib/lk}/xyarray.c | 0
tools/{perf/util => lib/lk}/xyarray.h | 6 +-
tools/lib/lk/Makefile | 62 +-
tools/lib/lk/config.c | 370 ++++++
tools/lib/lk/config.h | 13 +
tools/lib/lk/ctype.h | 37 +
tools/lib/lk/hweight.h | 13 +
tools/lib/lk/string.h | 21 +
tools/lib/lk/sysfs.h | 12 +
tools/lib/lk/usage.h | 34 +
tools/lib/lk/util.c | 18 +
tools/lib/lk/util.h | 38 +
tools/lib/lk/wrapper.h | 27 +

This is lib/perf/libperfevent.a which got initiated from exporting
evsel.* and evlist.* which I'm going to need for other tools to use.
Diffstat shows what moves in there, I've tried to keep it as small as
possible.

tools/{perf/util => lib/perf}/callchain.c | 4 +-
tools/{perf/util => lib/perf}/callchain.h | 12 +-
tools/{perf/util => lib/perf}/cgroup.c | 8 +-
tools/{perf/util => lib/perf}/cgroup.h | 7 +-
tools/{perf/util => lib/perf}/debug.c | 29 +-
tools/{perf/util => lib/perf}/debug.h | 19 +-
tools/{perf/util => lib/perf}/dso.c | 8 +-
tools/{perf/util => lib/perf}/dso.h | 11 +-
tools/{perf/util => lib/perf}/event.c | 109 +-
tools/{perf/util => lib/perf}/event.h | 12 +-
tools/{perf/util => lib/perf}/evlist.c | 37 +-
tools/{perf/util => lib/perf}/evlist.h | 15 +-
tools/{perf/util => lib/perf}/evsel.c | 38 +-
tools/{perf/util => lib/perf}/evsel.h | 33 +-
tools/{perf/util => lib/perf}/map.c | 10 +-
tools/{perf/util => lib/perf}/map.h | 9 +-
tools/{perf/util => lib/perf}/path.c | 28 +-
tools/{perf/util => lib/perf}/perf_regs.h | 6 +-
tools/{perf/util => lib/perf}/symbol.c | 6 +-
tools/{perf/util => lib/perf}/symbol.h | 10 +-
tools/{perf/util => lib/perf}/target.c | 3 -
tools/{perf/util => lib/perf}/target.h | 12 +-
tools/{perf/util => lib/perf}/thread.h | 7 +-
tools/{perf/util => lib/perf}/tool.h | 6 +-
tools/{perf/util => lib/perf}/vdso.c | 4 +-
tools/{perf/util => lib/perf}/vdso.h | 6 +-
tools/lib/perf/Makefile | 83 ++
tools/lib/perf/arch.h | 112 ++
tools/lib/perf/build-id.c | 102 ++
tools/lib/perf/build-id.h | 32 +
tools/lib/perf/hist.h | 69 ++
tools/lib/perf/machine.c | 1268 ++++++++++++++++++++
tools/lib/perf/machine.h | 168 +++
tools/lib/perf/path.h | 29 +
tools/lib/perf/perf.c | 29 +
tools/lib/perf/perf.h | 105 ++
tools/lib/perf/thread.c | 187 +++

Rest is changes to tools/perf/

tools/perf/MANIFEST | 3 -
tools/perf/Makefile | 136 +--
tools/perf/arch/common.c | 4 +-
tools/perf/arch/x86/include/perf_regs.h | 2 +-
tools/perf/arch/x86/util/header.c | 2 +-
tools/perf/arch/x86/util/tsc.c | 4 +-
tools/perf/arch/x86/util/tsc.h | 2 +-
tools/perf/bench/mem-memcpy.c | 8 +-
tools/perf/bench/mem-memset.c | 8 +-
tools/perf/bench/numa.c | 6 +-
tools/perf/bench/sched-messaging.c | 4 +-
tools/perf/bench/sched-pipe.c | 4 +-
tools/perf/builtin-annotate.c | 21 +-
tools/perf/builtin-bench.c | 4 +-
tools/perf/builtin-buildid-cache.c | 12 +-
tools/perf/builtin-buildid-list.c | 19 +-
tools/perf/builtin-diff.c | 14 +-
tools/perf/builtin-evlist.c | 8 +-
tools/perf/builtin-help.c | 15 +-
tools/perf/builtin-inject.c | 18 +-
tools/perf/builtin-kmem.c | 16 +-
tools/perf/builtin-kvm.c | 20 +-
tools/perf/builtin-lock.c | 20 +-
tools/perf/builtin-mem.c | 6 +-
tools/perf/builtin-probe.c | 10 +-
tools/perf/builtin-record.c | 162 ++-
tools/perf/builtin-report.c | 26 +-
tools/perf/builtin-sched.c | 21 +-
tools/perf/builtin-script.c | 18 +-
tools/perf/builtin-stat.c | 29 +-
tools/perf/builtin-timechart.c | 21 +-
tools/perf/builtin-top.c | 27 +-
tools/perf/builtin-trace.c | 41 +-
tools/perf/builtin.h | 2 +-
tools/perf/config/Makefile | 4 +-
tools/perf/perf.c | 18 +-
tools/perf/perf.h | 187 +--
tools/perf/tests/attr.c | 5 +-
tools/perf/tests/bp_signal.c | 2 +-
tools/perf/tests/bp_signal_overflow.c | 2 +-
tools/perf/tests/builtin-test.c | 15 +-
tools/perf/tests/code-reading.c | 14 +-
tools/perf/tests/dso-data.c | 6 +-
tools/perf/tests/evsel-roundtrip-name.c | 6 +-
tools/perf/tests/evsel-tp-sched.c | 2 +-
tools/perf/tests/hists_link.c | 12 +-
tools/perf/tests/keep-tracking.c | 8 +-
tools/perf/tests/mmap-basic.c | 9 +-
tools/perf/tests/open-syscall-all-cpus.c | 8 +-
tools/perf/tests/open-syscall-tp-fields.c | 6 +-
tools/perf/tests/open-syscall.c | 6 +-
tools/perf/tests/parse-events.c | 9 +-
tools/perf/tests/parse-no-sample-id-all.c | 4 +-
tools/perf/tests/perf-record.c | 6 +-
tools/perf/tests/perf-time-to-tsc.c | 8 +-
tools/perf/tests/python-use.c | 2 -
tools/perf/tests/rdpmc.c | 4 +-
tools/perf/tests/sample-parsing.c | 5 +-
tools/perf/tests/sw-clock.c | 8 +-
tools/perf/tests/task-exit.c | 8 +-
tools/perf/tests/tests.h | 4 +
tools/perf/tests/vmlinux-kallsyms.c | 7 +-
tools/perf/ui/browser.c | 2 +-
tools/perf/ui/browser.h | 2 +-
tools/perf/ui/browsers/annotate.c | 8 +-
tools/perf/ui/browsers/hists.c | 6 +-
tools/perf/ui/browsers/map.c | 4 +-
tools/perf/ui/browsers/scripts.c | 7 +-
tools/perf/ui/gtk/annotate.c | 4 +-
tools/perf/ui/gtk/browser.c | 6 +-
tools/perf/ui/gtk/helpline.c | 2 +-
tools/perf/ui/gtk/hists.c | 6 +-
tools/perf/ui/gtk/setup.c | 2 +-
tools/perf/ui/gtk/util.c | 2 +-
tools/perf/ui/helpline.c | 2 +-
tools/perf/ui/helpline.h | 1 -
tools/perf/ui/hist.c | 4 +-
tools/perf/ui/progress.h | 2 +-
tools/perf/ui/setup.c | 4 +-
tools/perf/ui/stdio/hist.c | 4 +-
tools/perf/ui/tui/helpline.c | 2 +-
tools/perf/ui/tui/setup.c | 3 +-
tools/perf/ui/tui/util.c | 2 +-
tools/perf/ui/ui.h | 4 +-
tools/perf/ui/util.c | 2 +-
tools/perf/util/abspath.c | 2 +
tools/perf/util/alias.c | 2 +-
tools/perf/util/annotate.c | 14 +-
tools/perf/util/annotate.h | 4 +-
tools/perf/util/build-id.c | 129 +-
tools/perf/util/build-id.h | 24 +-
tools/perf/util/cache.h | 45 +-
tools/perf/util/config.c | 463 +------
tools/perf/util/dwarf-aux.c | 2 +-
tools/perf/util/exec_cmd.c | 10 +-
tools/perf/util/exec_cmd.h | 6 +-
tools/perf/util/header.c | 30 +-
tools/perf/util/header.h | 16 +-
tools/perf/util/help.c | 2 +-
tools/perf/util/hist.c | 70 +-
tools/perf/util/hist.h | 195 ++-
tools/perf/util/include/asm/byteorder.h | 2 -
tools/perf/util/include/asm/hweight.h | 8 -
tools/perf/util/include/linux/export.h | 6 -
tools/perf/util/include/linux/hash.h | 5 -
tools/perf/util/include/linux/poison.h | 1 -
tools/perf/util/include/linux/prefetch.h | 6 -
tools/perf/util/include/linux/rbtree.h | 2 -
tools/perf/util/include/linux/rbtree_augmented.h | 2 -
tools/perf/util/include/linux/string.h | 4 -
tools/perf/util/intlist.h | 2 +-
tools/perf/util/machine.c | 1258 +------------------
tools/perf/util/machine.h | 163 +--
tools/perf/util/pager.c | 15 +-
tools/perf/util/parse-events.c | 12 +-
tools/perf/util/parse-events.h | 3 +-
tools/perf/util/parse-events.y | 2 +-
tools/perf/util/pmu.c | 4 +-
tools/perf/util/probe-event.c | 14 +-
tools/perf/util/probe-event.h | 2 +-
tools/perf/util/probe-finder.c | 7 +-
tools/perf/util/probe-finder.h | 2 +
tools/perf/util/python-ext-sources | 22 +-
tools/perf/util/python.c | 12 +-
tools/perf/util/record.c | 8 +-
.../perf/util/scripting-engines/trace-event-perl.c | 6 +-
.../util/scripting-engines/trace-event-python.c | 6 +-
tools/perf/util/session.c | 17 +-
tools/perf/util/session.h | 8 +-
tools/perf/util/setup.py | 3 +-
tools/perf/util/sort.c | 8 +-
tools/perf/util/sort.h | 83 +-
tools/perf/util/stat.h | 2 +-
tools/perf/util/strfilter.c | 2 +
tools/perf/util/svghelper.h | 2 +-
tools/perf/util/symbol-elf.c | 4 +-
tools/perf/util/symbol-minimal.c | 2 +-
tools/perf/util/sysfs.h | 6 -
tools/perf/util/thread.c | 91 --
tools/perf/util/top.c | 10 +-
tools/perf/util/top.h | 6 +-
tools/perf/util/trace-event-info.c | 2 +-
tools/perf/util/trace-event-read.c | 2 -
tools/perf/util/unwind.c | 4 +-
tools/perf/util/unwind.h | 6 +-
tools/perf/util/util.c | 98 +-
tools/perf/util/util.h | 193 ++-
tools/perf/util/values.h | 2 +-
238 files changed, 4299 insertions(+), 3722 deletions(-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--




>
> Jean
>
> On 22 April 2014 10:20, Jean Pihet <[email protected]> wrote:
> > Hi Borislav,
> >
> > On 17 April 2014 15:21, Borislav Petkov <[email protected]> wrote:
> >> Hi Jean,
> >>
> >> On Thu, Apr 17, 2014 at 03:17:01PM +0200, Jean Pihet wrote:
> >>> Yes indeed!
> >>>
> >>> Ingo,
> >>> Is the perf support for persistent event needed, or can this series be
> >>> reviewed as is?
> >>>
> >>> In the meantime I am now working on the perf tool:
> >>> - add persistent events,
> >>> - factor out the code and provide a library (libperf) for other tool
> >>> to use it (e.g. monitoring daemon).
> >>
> >> Ok, good. I'm doing a bit too:
> >>
> >> https://lkml.org/lkml/2014/4/13/20
> >>
> >> so let's synchronize and get this thing going. I'll add you to my CC
> >> list on future submissions.
> > Ok, great!
> >
> > About libperf and the persistent events, do you know more details
> > about the status and the proposed API? I have no info about the work
> > that was started previously.
> >
> > Regards,
> > Jean
> >
> >>
> >> Thanks.
> >>
> >> --
> >> Regards/Gruss,
> >> Boris.
> >>
> >> Sent from a fat crate under my desk. Formatting is fine.
> >> --
>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-04-22 19:58:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

On Mon, Apr 07, 2014 at 05:04:26PM +0200, Jean Pihet wrote:
> From: Robert Richter <[email protected]>
>
> This new function creates a new fd for an event. We need this later to
> get a fd from a persistent event.
>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Jean Pihet <[email protected]>
> ---
> kernel/events/core.c | 14 ++++++++------
> kernel/events/internal.h | 1 +
> 2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 22ec8f0..9857475 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4206,6 +4206,11 @@ static const struct file_operations perf_fops = {
> .fasync = perf_fasync,
> };
>
> +int perf_get_fd(struct perf_event *event, int f_flags)
> +{
> + return anon_inode_getfd("[perf_event]", &perf_fops, event, f_flags);
> +}
> +
> /*
> * Perf event wakeup
> *
> @@ -7028,7 +7033,6 @@ SYSCALL_DEFINE5(perf_event_open,
> struct perf_event *event, *sibling;
> struct perf_event_attr attr;
> struct perf_event_context *ctx;
> - struct file *event_file = NULL;
> struct fd group = {NULL, 0};
> struct task_struct *task = NULL;
> struct pmu *pmu;
> @@ -7189,10 +7193,9 @@ SYSCALL_DEFINE5(perf_event_open,
> goto err_context;
> }
>
> - event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
> - f_flags);
> - if (IS_ERR(event_file)) {
> - err = PTR_ERR(event_file);
> + event_fd = perf_get_fd(event, f_flags);
> + if (event_fd < 0) {
> + err = event_fd;
> goto err_context;
> }
>
> @@ -7257,7 +7260,6 @@ SYSCALL_DEFINE5(perf_event_open,
> * perf_group_detach().
> */
> fdput(group);
> - fd_install(event_fd, event_file);
> return event_fd;
>
> err_context:

ISTR there being a good reason we only install the FD at the very last
moment. Of course I cannot quite recall it :/

I suspect it had something to do with not being able to fudge the state
while its being set-up or somesuch.

2014-04-22 19:58:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/16] perf, mmap: Factor out perf_alloc/free_rb()

On Mon, Apr 07, 2014 at 05:04:25PM +0200, Jean Pihet wrote:
> From: Robert Richter <[email protected]>
>
> Factor out code to allocate and deallocate ringbuffers. We need this
> later to setup the sampling buffer for persistent events.
>
> While at this, replacing get_current_user() with get_uid(user).
>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Jean Pihet <[email protected]>
> ---
> kernel/events/core.c | 77 +++++++++++++++++++++++++++++-------------------
> kernel/events/internal.h | 3 ++
> 2 files changed, 50 insertions(+), 30 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5eaba42..22ec8f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3193,7 +3193,45 @@ static void free_event_rcu(struct rcu_head *head)
> }
>
> static void ring_buffer_put(struct ring_buffer *rb);
> +static void ring_buffer_attach(struct perf_event *event, struct ring_buffer *rb);
> static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
> +static void perf_event_init_userpage(struct perf_event *event);
> +
> +/*
> + * Must be called with &event->mmap_mutex held. event->rb must be
> + * NULL. perf_alloc_rb() requires &event->mmap_count to be incremented
> + * on success which corresponds to &rb->mmap_count that is initialized
> + * with 1.
> + */
> +int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags)
> +{
> + struct ring_buffer *rb;

lockdep_assert_held(&event->mmap_mutex);
WARN_ON(event->rb);

> +
> + rb = rb_alloc(nr_pages,
> + event->attr.watermark ? event->attr.wakeup_watermark : 0,
> + event->cpu, flags);
> + if (!rb)
> + return -ENOMEM;
> +
> + atomic_set(&rb->mmap_count, 1);
> + ring_buffer_attach(event, rb);
> + rcu_assign_pointer(event->rb, rb);
> +
> + perf_event_init_userpage(event);
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +/* Must be called with &event->mmap_mutex held. event->rb must be set. */
> +void perf_free_rb(struct perf_event *event)
> +{
> + struct ring_buffer *rb = event->rb;

lockdep_assert_held(&event->mmap_mutex);

> + rcu_assign_pointer(event->rb, NULL);
> + ring_buffer_detach(event, rb);
> + ring_buffer_put(rb);
> +}

2014-04-25 13:54:20

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

On 22.04.14 16:27:59, Peter Zijlstra wrote:
> On Mon, Apr 07, 2014 at 05:04:26PM +0200, Jean Pihet wrote:
> > From: Robert Richter <[email protected]>
> >
> > This new function creates a new fd for an event. We need this later to
> > get a fd from a persistent event.
> >
> > Signed-off-by: Robert Richter <[email protected]>
> > Signed-off-by: Robert Richter <[email protected]>
> > Signed-off-by: Jean Pihet <[email protected]>
> > ---
> > kernel/events/core.c | 14 ++++++++------
> > kernel/events/internal.h | 1 +
> > 2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 22ec8f0..9857475 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -4206,6 +4206,11 @@ static const struct file_operations perf_fops = {
> > .fasync = perf_fasync,
> > };
> >
> > +int perf_get_fd(struct perf_event *event, int f_flags)
> > +{
> > + return anon_inode_getfd("[perf_event]", &perf_fops, event, f_flags);
> > +}
> > +
> > /*
> > * Perf event wakeup
> > *
> > @@ -7028,7 +7033,6 @@ SYSCALL_DEFINE5(perf_event_open,
> > struct perf_event *event, *sibling;
> > struct perf_event_attr attr;
> > struct perf_event_context *ctx;
> > - struct file *event_file = NULL;
> > struct fd group = {NULL, 0};
> > struct task_struct *task = NULL;
> > struct pmu *pmu;
> > @@ -7189,10 +7193,9 @@ SYSCALL_DEFINE5(perf_event_open,
> > goto err_context;
> > }
> >
> > - event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
> > - f_flags);
> > - if (IS_ERR(event_file)) {
> > - err = PTR_ERR(event_file);
> > + event_fd = perf_get_fd(event, f_flags);
> > + if (event_fd < 0) {
> > + err = event_fd;
> > goto err_context;
> > }
> >
> > @@ -7257,7 +7260,6 @@ SYSCALL_DEFINE5(perf_event_open,
> > * perf_group_detach().
> > */
> > fdput(group);
> > - fd_install(event_fd, event_file);
> > return event_fd;
> >
> > err_context:
>
> ISTR there being a good reason we only install the FD at the very last
> moment. Of course I cannot quite recall it :/
>
> I suspect it had something to do with not being able to fudge the state
> while its being set-up or somesuch.

I guess its this patch from Al you referring to:

ea635c6 Fix racy use of anon_inode_getfd() in perf_event.c

I think we do not introduce it back due to:

a6fa941 perf_event: Switch to internal refcount, fix race with close()

which removes the user of event_file.

-Robert

2014-04-25 14:43:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

On Fri, Apr 25, 2014 at 03:54:13PM +0200, Robert Richter wrote:
> I guess its this patch from Al you referring to:
>
> ea635c6 Fix racy use of anon_inode_getfd() in perf_event.c
>
> I think we do not introduce it back due to:
>
> a6fa941 perf_event: Switch to internal refcount, fix race with close()
>
> which removes the user of event_file.

linux-2.6# git show a6fa941
error: short SHA1 a6fa941 is ambiguous.
error: short SHA1 a6fa941 is ambiguous.
fatal: ambiguous argument 'a6fa941': unknown revision or path not in the working tree.

Which is I think why Linus makes us use 12 chars instead of the git
default of 8.

Of course its also entirely useless of git to not list the full IDs for
those it did find to be ambiguous.

2014-04-25 14:52:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

On Fri, Apr 25, 2014 at 04:43:01PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 03:54:13PM +0200, Robert Richter wrote:
> > I guess its this patch from Al you referring to:
> >
> > ea635c6 Fix racy use of anon_inode_getfd() in perf_event.c
> >
> > I think we do not introduce it back due to:
> >
> > a6fa941 perf_event: Switch to internal refcount, fix race with close()
> >
> > which removes the user of event_file.
>
> linux-2.6# git show a6fa941
> error: short SHA1 a6fa941 is ambiguous.
> error: short SHA1 a6fa941 is ambiguous.
> fatal: ambiguous argument 'a6fa941': unknown revision or path not in the working tree.
>
> Which is I think why Linus makes us use 12 chars instead of the git
> default of 8.
>
> Of course its also entirely useless of git to not list the full IDs for
> those it did find to be ambiguous.

ok, so its a6fa941d94b4.

But no, I don't think that helps, its still true that the moment you get
a fd another thread can immediately close(). That would drop the last
ref and free it, meanwhile perf_event_open() is happily poking at it.

Now I think you could cure this by adding an extra ref before calling
your perf_get_fd() and dropping that extra ref at the end, where we used
to have fd_install().

2014-04-29 15:34:09

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH 04/16] perf, mmap: Factor out perf_get_fd()

On 25.04.14 16:52:05, Peter Zijlstra wrote:
> But no, I don't think that helps, its still true that the moment you get
> a fd another thread can immediately close(). That would drop the last
> ref and free it, meanwhile perf_event_open() is happily poking at it.
>
> Now I think you could cure this by adding an extra ref before calling
> your perf_get_fd() and dropping that extra ref at the end, where we used
> to have fd_install().

Yes, right. I have a solution now which increments the event's ref
count before creating the file descriptor using try_get_event()/
put_event().

The patch also does not remove get_unused_fd_flags() and the err_fd
error handler.

Have an update already of a rebase version but still need to test it.

Would it be ok to split the patch set and send in a first step only
the first 4 patches that refactor the perf mmap code?

Thanks,

-Robert