2013-06-28 13:16:28

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 0/5] perf: add two new features

Hi

Please consider these two new perf features:
x86: add ability to calculate TSC from perf sample timestamps
perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE

This is also a minor fix:
perf: fix broken union in perf_event_mmap_page

And tests in perf tools:
perf tools: add test for converting perf time to/from TSC
perf tools: add 'keep tracking' test


Adrian Hunter (5):
perf: fix broken union in perf_event_mmap_page
x86: add ability to calculate TSC from perf sample timestamps
perf tools: add test for converting perf time to/from TSC
perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE
perf tools: add 'keep tracking' test

arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/cpu/perf_event.c | 6 ++
arch/x86/kernel/tsc.c | 6 ++
include/linux/perf_event.h | 1 +
include/uapi/linux/perf_event.h | 29 +++++-
kernel/events/core.c | 21 ++++-
tools/perf/Makefile | 4 +
tools/perf/arch/x86/Makefile | 2 +
tools/perf/arch/x86/util/tsc.c | 59 ++++++++++++
tools/perf/arch/x86/util/tsc.h | 20 ++++
tools/perf/tests/builtin-test.c | 10 ++
tools/perf/tests/keep-tracking.c | 168 ++++++++++++++++++++++++++++++++++
tools/perf/tests/perf-time-to-tsc.c | 177 ++++++++++++++++++++++++++++++++++++
tools/perf/tests/tests.h | 2 +
tools/perf/util/evlist.c | 32 +++++++
tools/perf/util/evlist.h | 5 +
16 files changed, 537 insertions(+), 6 deletions(-)
create mode 100644 tools/perf/arch/x86/util/tsc.c
create mode 100644 tools/perf/arch/x86/util/tsc.h
create mode 100644 tools/perf/tests/keep-tracking.c
create mode 100644 tools/perf/tests/perf-time-to-tsc.c


Regards
Adrian


2013-06-28 13:16:55

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 1/5] perf: fix broken union in perf_event_mmap_page

The capabilities bits must not be "union'ed" together.
Put them in a separate struct.

Signed-off-by: Adrian Hunter <[email protected]>
---
include/uapi/linux/perf_event.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..19f6ee5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,9 +375,11 @@ struct perf_event_mmap_page {
__u64 time_running; /* time event on cpu */
union {
__u64 capabilities;
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_____res : 62;
+ struct {
+ __u64 cap_usr_time : 1,
+ cap_usr_rdpmc : 1,
+ cap_____res : 62;
+ };
};

/*
--
1.7.11.7

2013-06-28 13:17:00

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 2/5] x86: add ability to calculate TSC from perf sample timestamps

For modern CPUs, perf clock is directly related to TSC. TSC
can be calculated from perf clock and vice versa using a simple
calculation. Two of the three componenets of that calculation
are already exported in struct perf_event_mmap_page. This patch
exports the third.

Signed-off-by: Adrian Hunter <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/cpu/perf_event.c | 6 ++++++
arch/x86/kernel/tsc.c | 6 ++++++
include/uapi/linux/perf_event.h | 22 ++++++++++++++++++++--
4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c91e8b9..235be70 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -49,6 +49,7 @@ extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
+extern int check_tsc_disabled(void);
extern unsigned long native_calibrate_tsc(void);

extern int tsc_clocksource_reliable;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index afc2413..3f74034 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1860,6 +1860,7 @@ static struct pmu pmu = {
void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
userpg->cap_usr_time = 0;
+ userpg->cap_usr_time_zero = 0;
userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

@@ -1873,6 +1874,11 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
+
+ if (sched_clock_stable && !check_tsc_disabled()) {
+ userpg->cap_usr_time_zero = 1;
+ userpg->time_zero = this_cpu_read(cyc2ns_offset);
+ }
}

/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 098b3cf..c810283 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -89,6 +89,12 @@ int check_tsc_unstable(void)
}
EXPORT_SYMBOL_GPL(check_tsc_unstable);

+int check_tsc_disabled(void)
+{
+ return tsc_disabled;
+}
+EXPORT_SYMBOL_GPL(check_tsc_disabled);
+
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 19f6ee5..663be3f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -378,7 +378,8 @@ struct perf_event_mmap_page {
struct {
__u64 cap_usr_time : 1,
cap_usr_rdpmc : 1,
- cap_____res : 62;
+ cap_usr_time_zero : 1,
+ cap_____res : 61;
};
};

@@ -420,12 +421,29 @@ struct perf_event_mmap_page {
__u16 time_shift;
__u32 time_mult;
__u64 time_offset;
+ /*
+ * If cap_usr_time_zero, the hardware clock (e.g. TSC) can be calculated
+ * from sample timestamps.
+ *
+ * time = timestamp - time_zero;
+ * quot = time / time_mult;
+ * rem = time % time_mult;
+ * cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
+ *
+ * And vice versa:
+ *
+ * quot = cyc >> time_shift;
+ * rem = cyc & ((1 << time_shift) - 1);
+ * timestamp = time_zero + quot * time_mult +
+ * ((rem * time_mult) >> time_shift);
+ */
+ __u64 time_zero;

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[120]; /* align to 1k */
+ __u64 __reserved[119]; /* align to 1k */

/*
* Control data for the mmap() data buffer.
--
1.7.11.7

2013-06-28 13:17:04

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 5/5] perf tools: add 'keep tracking' test

Add a test for the newly added 'keep tracking' flag for
the PERF_EVENT_IOC_DISABLE IOCTL. The test checks that
comm events are present when the event is disabled with
the 'keep tracking' flag set. The test also checks that
comm events are not present when the event is disabled
without the flag set.

Signed-off-by: Adrian Hunter <[email protected]>
---
tools/perf/Makefile | 1 +
tools/perf/tests/builtin-test.c | 4 +
tools/perf/tests/keep-tracking.c | 168 +++++++++++++++++++++++++++++++++++++++
tools/perf/tests/tests.h | 1 +
tools/perf/util/evlist.c | 32 ++++++++
tools/perf/util/evlist.h | 5 ++
6 files changed, 211 insertions(+)
create mode 100644 tools/perf/tests/keep-tracking.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 7a8dc89..9c3642f 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -393,6 +393,7 @@ LIB_OBJS += $(OUTPUT)tests/sw-clock.o
ifeq ($(ARCH),x86)
LIB_OBJS += $(OUTPUT)tests/perf-time-to-tsc.o
endif
+LIB_OBJS += $(OUTPUT)tests/keep-tracking.o

BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index b7b4049..b0b2b88 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -100,6 +100,10 @@ static struct test {
},
#endif
{
+ .desc = "Test event disable IOCTL 'keep tracking' flag",
+ .func = test__keep_tracking,
+ },
+ {
.func = NULL,
},
};
diff --git a/tools/perf/tests/keep-tracking.c b/tools/perf/tests/keep-tracking.c
new file mode 100644
index 0000000..ce6a1a1
--- /dev/null
+++ b/tools/perf/tests/keep-tracking.c
@@ -0,0 +1,168 @@
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+
+#include "parse-events.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "thread_map.h"
+#include "cpumap.h"
+#include "tests.h"
+
+#define CHECK__(x) { \
+ while ((x) < 0) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+#define CHECK_NOT_NULL__(x) { \
+ while ((x) == NULL) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+static int find_comm(struct perf_evlist *evlist, const char *comm)
+{
+ union perf_event *event;
+ int i, found;
+
+ found = 0;
+ for (i = 0; i < evlist->nr_mmaps; i++) {
+ while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
+ if (event->header.type == PERF_RECORD_COMM &&
+ (pid_t)event->comm.pid == getpid() &&
+ (pid_t)event->comm.tid == getpid() &&
+ strcmp(event->comm.comm, comm) == 0)
+ found += 1;
+ }
+ }
+ return found;
+}
+
+/**
+ * test__keep_tracking - test event disable IOCTL with 'keep tracking' flag.
+ *
+ * This function implements a test that checks that tracking events continue
+ * when an event is disabled with the 'keep tracking flag'. If the test passes
+ * %0 is returned, otherwise %-1 is returned.
+ */
+int test__keep_tracking(void)
+{
+ struct perf_record_opts opts = {
+ .mmap_pages = UINT_MAX,
+ .user_freq = UINT_MAX,
+ .user_interval = ULLONG_MAX,
+ .freq = 4000,
+ .target = {
+ .uses_mmap = true,
+ },
+ };
+ struct thread_map *threads = NULL;
+ struct cpu_map *cpus = NULL;
+ struct perf_evlist *evlist = NULL;
+ struct perf_evsel *evsel = NULL;
+ int found, err = -1;
+ const char *comm;
+
+ threads = thread_map__new(-1, getpid(), UINT_MAX);
+ CHECK_NOT_NULL__(threads);
+
+ cpus = cpu_map__new(NULL);
+ CHECK_NOT_NULL__(cpus);
+
+ evlist = perf_evlist__new();
+ CHECK_NOT_NULL__(evlist);
+
+ perf_evlist__set_maps(evlist, cpus, threads);
+
+ CHECK__(parse_events(evlist, "cycles:u"));
+
+ perf_evlist__config(evlist, &opts);
+
+ evsel = perf_evlist__first(evlist);
+
+ evsel->attr.comm = 1;
+ evsel->attr.disabled = 1;
+ evsel->attr.enable_on_exec = 0;
+
+ CHECK__(perf_evlist__open(evlist));
+
+ CHECK__(perf_evlist__mmap(evlist, UINT_MAX, false));
+
+ /*
+ * First, test that a 'comm' event can be found when the event is
+ * enabled.
+ */
+
+ perf_evlist__enable(evlist);
+
+ comm = "Test COMM 1";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm, 0, 0, 0));
+
+ perf_evlist__disable(evlist);
+
+ found = find_comm(evlist, comm);
+ if (found != 1) {
+ pr_debug("Failed to find tracking event.\n");
+ goto out_err;
+ }
+
+ /*
+ * Secondly, test that a 'comm' event cannot be found when the event is
+ * disabled without the 'keep tracking' flag.
+ */
+
+ perf_evlist__enable(evlist);
+
+ CHECK__(perf_evlist__disable_event(evlist, evsel, 0));
+
+ comm = "Test COMM 2";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm, 0, 0, 0));
+
+ perf_evlist__disable(evlist);
+
+ found = find_comm(evlist, comm);
+ if (found) {
+ pr_debug("Found tracking event with the event disabled.\n");
+ goto out_err;
+ }
+
+ /*
+ * Finally, test that a 'comm' event can be found when the event is
+ * disabled with the 'keep tracking' flag.
+ */
+
+ perf_evlist__enable(evlist);
+
+ CHECK__(perf_evlist__disable_event(evlist, evsel,
+ PERF_IOC_KEEP_TRACKING));
+
+ comm = "Test COMM 3";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm, 0, 0, 0));
+
+ perf_evlist__disable(evlist);
+
+ found = find_comm(evlist, comm);
+ if (found != 1) {
+ pr_debug("PERF_IOC_KEEP_TRACKING flag had no effect.\n");
+ goto out_err;
+ }
+
+ err = 0;
+
+out_err:
+ if (evlist) {
+ perf_evlist__disable(evlist);
+ perf_evlist__munmap(evlist);
+ perf_evlist__close(evlist);
+ perf_evlist__delete(evlist);
+ }
+ if (cpus)
+ cpu_map__delete(cpus);
+ if (threads)
+ thread_map__delete(threads);
+
+ return err;
+}
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 107696e..971172a 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -28,5 +28,6 @@ int test__bp_signal_overflow(void);
int test__task_exit(void);
int test__sw_clock_freq(void);
int test__perf_time_to_tsc(void);
+int test__keep_tracking(void);

#endif /* TESTS_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 78331da..abaf63b 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -303,6 +303,38 @@ void perf_evlist__enable(struct perf_evlist *evlist)
}
}

+int perf_evlist__disable_event(struct perf_evlist *evlist,
+ struct perf_evsel *evsel, int flags)
+{
+ int cpu, thread, err;
+
+ for (cpu = 0; cpu < evlist->cpus->nr; cpu++) {
+ for (thread = 0; thread < evlist->threads->nr; thread++) {
+ err = ioctl(FD(evsel, cpu, thread),
+ PERF_EVENT_IOC_DISABLE, flags);
+ if (err)
+ return err;
+ }
+ }
+ return 0;
+}
+
+int perf_evlist__enable_event(struct perf_evlist *evlist,
+ struct perf_evsel *evsel, int flags)
+{
+ int cpu, thread, err;
+
+ for (cpu = 0; cpu < evlist->cpus->nr; cpu++) {
+ for (thread = 0; thread < evlist->threads->nr; thread++) {
+ err = ioctl(FD(evsel, cpu, thread),
+ PERF_EVENT_IOC_ENABLE, flags);
+ if (err)
+ return err;
+ }
+ }
+ return 0;
+}
+
static int perf_evlist__alloc_pollfd(struct perf_evlist *evlist)
{
int nr_cpus = cpu_map__nr(evlist->cpus);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index b1be475..3531516 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -104,6 +104,11 @@ void perf_evlist__munmap(struct perf_evlist *evlist);
void perf_evlist__disable(struct perf_evlist *evlist);
void perf_evlist__enable(struct perf_evlist *evlist);

+int perf_evlist__disable_event(struct perf_evlist *evlist,
+ struct perf_evsel *evsel, int flags);
+int perf_evlist__enable_event(struct perf_evlist *evlist,
+ struct perf_evsel *evsel, int flags);
+
void perf_evlist__set_selected(struct perf_evlist *evlist,
struct perf_evsel *evsel);

--
1.7.11.7

2013-06-28 13:17:27

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 4/5] perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE

Make it possible to disable an event but continue to receive
"tracking" events i.e. continue to receive mmap, comm, task events.
The flag is updated by both PERF_EVENT_IOC_DISABLE and
PERF_EVENT_IOC_ENABLE. The flag is cleared by prctl
PR_TASK_PERF_EVENTS_DISABLE.

Signed-off-by: Adrian Hunter <[email protected]>
---
include/linux/perf_event.h | 1 +
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 21 +++++++++++++++++++--
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 50b3efd..789eeeb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -436,6 +436,7 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ int keep_tracking;

#endif /* CONFIG_PERF_EVENTS */
};
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 663be3f..7b4ecfb 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -324,6 +324,7 @@ struct perf_event_attr {

enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
+ PERF_IOC_KEEP_TRACKING = 1U << 1,
};

/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1db3af9..0c1fbe9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3495,6 +3495,16 @@ static inline int perf_fget_light(int fd, struct fd *p)
return 0;
}

+static void perf_event_keep_tracking(struct perf_event *event, u32 flags)
+{
+ int keep_tracking = !!(flags & PERF_IOC_KEEP_TRACKING);
+
+ if (flags & PERF_IOC_FLAG_GROUP)
+ event->group_leader->keep_tracking = keep_tracking;
+ else
+ event->keep_tracking = keep_tracking;
+}
+
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
@@ -3510,6 +3520,7 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
func = perf_event_enable;
break;
case PERF_EVENT_IOC_DISABLE:
+ perf_event_keep_tracking(event, flags);
func = perf_event_disable;
break;
case PERF_EVENT_IOC_RESET:
@@ -3552,6 +3563,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
else
perf_event_for_each_child(event, func);

+ if (cmd == PERF_EVENT_IOC_ENABLE)
+ perf_event_keep_tracking(event, flags);
+
return 0;
}

@@ -3572,8 +3586,10 @@ int perf_event_task_disable(void)
struct perf_event *event;

mutex_lock(&current->perf_event_mutex);
- list_for_each_entry(event, &current->perf_event_list, owner_entry)
+ list_for_each_entry(event, &current->perf_event_list, owner_entry) {
perf_event_for_each_child(event, perf_event_disable);
+ event->keep_tracking = 0;
+ }
mutex_unlock(&current->perf_event_mutex);

return 0;
@@ -4670,7 +4686,8 @@ perf_event_aux_ctx(struct perf_event_context *ctx,
struct perf_event *event;

list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
- if (event->state < PERF_EVENT_STATE_INACTIVE)
+ if (event->state < PERF_EVENT_STATE_INACTIVE &&
+ !event->keep_tracking)
continue;
if (!event_filter_match(event))
continue;
--
1.7.11.7

2013-06-28 13:17:45

by Adrian Hunter

[permalink] [raw]
Subject: [PATCH 3/5] perf tools: add test for converting perf time to/from TSC

The test uses the newly added cap_usr_time_zero and time_zero of
perf_event_mmap_page. TSC from rdtsc is compared with the time
from 2 perf events. The test passes if the calculated times are
all in the correct order.

Signed-off-by: Adrian Hunter <[email protected]>
---
tools/perf/Makefile | 3 +
tools/perf/arch/x86/Makefile | 2 +
tools/perf/arch/x86/util/tsc.c | 59 ++++++++++++
tools/perf/arch/x86/util/tsc.h | 20 ++++
tools/perf/tests/builtin-test.c | 6 ++
tools/perf/tests/perf-time-to-tsc.c | 177 ++++++++++++++++++++++++++++++++++++
tools/perf/tests/tests.h | 1 +
7 files changed, 268 insertions(+)
create mode 100644 tools/perf/arch/x86/util/tsc.c
create mode 100644 tools/perf/arch/x86/util/tsc.h
create mode 100644 tools/perf/tests/perf-time-to-tsc.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 203cb0e..7a8dc89 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -390,6 +390,9 @@ LIB_OBJS += $(OUTPUT)tests/bp_signal.o
LIB_OBJS += $(OUTPUT)tests/bp_signal_overflow.o
LIB_OBJS += $(OUTPUT)tests/task-exit.o
LIB_OBJS += $(OUTPUT)tests/sw-clock.o
+ifeq ($(ARCH),x86)
+LIB_OBJS += $(OUTPUT)tests/perf-time-to-tsc.o
+endif

BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 815841c..8801fe0 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -6,3 +6,5 @@ ifndef NO_LIBUNWIND
LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/unwind.o
endif
LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/tsc.o
+LIB_H += arch/$(ARCH)/util/tsc.h
diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
new file mode 100644
index 0000000..f111744
--- /dev/null
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -0,0 +1,59 @@
+#include <stdbool.h>
+#include <errno.h>
+
+#include <linux/perf_event.h>
+
+#include "../../perf.h"
+#include "../../util/types.h"
+#include "../../util/debug.h"
+#include "tsc.h"
+
+u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc)
+{
+ u64 time, quot, rem;
+
+ time = ns - tc->time_zero;
+ quot = time / tc->time_mult;
+ rem = time % tc->time_mult;
+ return (quot << tc->time_shift) +
+ (rem << tc->time_shift) / tc->time_mult;
+}
+
+u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc)
+{
+ u64 quot, rem;
+
+ quot = cyc >> tc->time_shift;
+ rem = cyc & ((1 << tc->time_shift) - 1);
+ return tc->time_zero + quot * tc->time_mult +
+ ((rem * tc->time_mult) >> tc->time_shift);
+}
+
+int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
+ struct perf_tsc_conversion *tc)
+{
+ bool cap_usr_time_zero;
+ u32 seq;
+ int i = 0;
+
+ while (1) {
+ seq = pc->lock;
+ rmb();
+ tc->time_mult = pc->time_mult;
+ tc->time_shift = pc->time_shift;
+ tc->time_zero = pc->time_zero;
+ cap_usr_time_zero = pc->cap_usr_time_zero;
+ rmb();
+ if (pc->lock == seq && !(seq & 1))
+ break;
+ if (++i > 10000) {
+ pr_debug("failed to get perf_event_mmap_page lock\n");
+ return -EINVAL;
+ }
+ }
+
+ if (!cap_usr_time_zero)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
diff --git a/tools/perf/arch/x86/util/tsc.h b/tools/perf/arch/x86/util/tsc.h
new file mode 100644
index 0000000..a24dec8
--- /dev/null
+++ b/tools/perf/arch/x86/util/tsc.h
@@ -0,0 +1,20 @@
+#ifndef TOOLS_PERF_ARCH_X86_UTIL_TSC_H__
+#define TOOLS_PERF_ARCH_X86_UTIL_TSC_H__
+
+#include "../../util/types.h"
+
+struct perf_tsc_conversion {
+ u16 time_shift;
+ u32 time_mult;
+ u64 time_zero;
+};
+
+struct perf_event_mmap_page;
+
+int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
+ struct perf_tsc_conversion *tc);
+
+u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc);
+u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc);
+
+#endif /* TOOLS_PERF_ARCH_X86_UTIL_TSC_H__ */
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 35b45f1466..b7b4049 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -93,6 +93,12 @@ static struct test {
.desc = "Test software clock events have valid period values",
.func = test__sw_clock_freq,
},
+#if defined(__x86_64__) || defined(__i386__)
+ {
+ .desc = "Test converting perf time to TSC",
+ .func = test__perf_time_to_tsc,
+ },
+#endif
{
.func = NULL,
},
diff --git a/tools/perf/tests/perf-time-to-tsc.c b/tools/perf/tests/perf-time-to-tsc.c
new file mode 100644
index 0000000..0ab61b1
--- /dev/null
+++ b/tools/perf/tests/perf-time-to-tsc.c
@@ -0,0 +1,177 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <inttypes.h>
+#include <sys/prctl.h>
+
+#include "parse-events.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "thread_map.h"
+#include "cpumap.h"
+#include "tests.h"
+
+#include "../arch/x86/util/tsc.h"
+
+#define CHECK__(x) { \
+ while ((x) < 0) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+#define CHECK_NOT_NULL__(x) { \
+ while ((x) == NULL) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+static u64 rdtsc(void)
+{
+ unsigned int low, high;
+
+ asm volatile("rdtsc" : "=a" (low), "=d" (high));
+
+ return low | ((u64)high) << 32;
+}
+
+/**
+ * test__perf_time_to_tsc - test converting perf time to TSC.
+ *
+ * This function implements a test that checks that the conversion of perf time
+ * to and from TSC is consistent with the order of events. If the test passes
+ * %0 is returned, otherwise %-1 is returned. If TSC conversion is not
+ * supported then then the test passes but " (not supported)" is printed.
+ */
+int test__perf_time_to_tsc(void)
+{
+ struct perf_record_opts opts = {
+ .mmap_pages = UINT_MAX,
+ .user_freq = UINT_MAX,
+ .user_interval = ULLONG_MAX,
+ .freq = 4000,
+ .target = {
+ .uses_mmap = true,
+ },
+ .sample_time = true,
+ };
+ struct thread_map *threads = NULL;
+ struct cpu_map *cpus = NULL;
+ struct perf_evlist *evlist = NULL;
+ struct perf_evsel *evsel = NULL;
+ int err = -1, ret, i;
+ const char *comm1, *comm2;
+ struct perf_tsc_conversion tc;
+ struct perf_event_mmap_page *pc;
+ union perf_event *event;
+ u64 test_tsc, comm1_tsc, comm2_tsc;
+ u64 test_time, comm1_time = 0, comm2_time = 0;
+
+ threads = thread_map__new(-1, getpid(), UINT_MAX);
+ CHECK_NOT_NULL__(threads);
+
+ cpus = cpu_map__new(NULL);
+ CHECK_NOT_NULL__(cpus);
+
+ evlist = perf_evlist__new();
+ CHECK_NOT_NULL__(evlist);
+
+ perf_evlist__set_maps(evlist, cpus, threads);
+
+ CHECK__(parse_events(evlist, "cycles:u"));
+
+ perf_evlist__config(evlist, &opts);
+
+ evsel = perf_evlist__first(evlist);
+
+ evsel->attr.comm = 1;
+ evsel->attr.disabled = 1;
+ evsel->attr.enable_on_exec = 0;
+
+ CHECK__(perf_evlist__open(evlist));
+
+ CHECK__(perf_evlist__mmap(evlist, UINT_MAX, false));
+
+ pc = evlist->mmap[0].base;
+ ret = perf_read_tsc_conversion(pc, &tc);
+ if (ret) {
+ if (ret == -EOPNOTSUPP) {
+ fprintf(stderr, " (not supported)");
+ return 0;
+ }
+ goto out_err;
+ }
+
+ perf_evlist__enable(evlist);
+
+ comm1 = "Test COMM 1";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm1, 0, 0, 0));
+
+ test_tsc = rdtsc();
+
+ comm2 = "Test COMM 2";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm2, 0, 0, 0));
+
+ perf_evlist__disable(evlist);
+
+ for (i = 0; i < evlist->nr_mmaps; i++) {
+ while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
+ struct perf_sample sample;
+
+ if (event->header.type != PERF_RECORD_COMM ||
+ (pid_t)event->comm.pid != getpid() ||
+ (pid_t)event->comm.tid != getpid())
+ continue;
+
+ if (strcmp(event->comm.comm, comm1) == 0) {
+ CHECK__(perf_evsel__parse_sample(evsel, event,
+ &sample));
+ comm1_time = sample.time;
+ }
+ if (strcmp(event->comm.comm, comm2) == 0) {
+ CHECK__(perf_evsel__parse_sample(evsel, event,
+ &sample));
+ comm2_time = sample.time;
+ }
+ }
+ }
+
+ if (!comm1_time || !comm2_time)
+ goto out_err;
+
+ test_time = tsc_to_perf_time(test_tsc, &tc);
+ comm1_tsc = perf_time_to_tsc(comm1_time, &tc);
+ comm2_tsc = perf_time_to_tsc(comm2_time, &tc);
+
+ pr_debug("1st event perf time %"PRIu64" tsc %"PRIu64"\n",
+ comm1_time, comm1_tsc);
+ pr_debug("rdtsc time %"PRIu64" tsc %"PRIu64"\n",
+ test_time, test_tsc);
+ pr_debug("2nd event perf time %"PRIu64" tsc %"PRIu64"\n",
+ comm2_time, comm2_tsc);
+
+ if (test_time <= comm1_time ||
+ test_time >= comm2_time)
+ goto out_err;
+
+ if (test_tsc <= comm1_tsc ||
+ test_tsc >= comm2_tsc)
+ goto out_err;
+
+ err = 0;
+
+out_err:
+ if (evlist) {
+ perf_evlist__disable(evlist);
+ perf_evlist__munmap(evlist);
+ perf_evlist__close(evlist);
+ perf_evlist__delete(evlist);
+ }
+ if (cpus)
+ cpu_map__delete(cpus);
+ if (threads)
+ thread_map__delete(threads);
+
+ return err;
+}
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index dd7feae..107696e 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -27,5 +27,6 @@ int test__bp_signal(void);
int test__bp_signal_overflow(void);
int test__task_exit(void);
int test__sw_clock_freq(void);
+int test__perf_time_to_tsc(void);

#endif /* TESTS_H */
--
1.7.11.7

2013-06-28 15:22:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf: fix broken union in perf_event_mmap_page

On Fri, Jun 28, 2013 at 04:22:17PM +0300, Adrian Hunter wrote:
> The capabilities bits must not be "union'ed" together.
> Put them in a separate struct.
>
> Signed-off-by: Adrian Hunter <[email protected]>
> ---
> include/uapi/linux/perf_event.h | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 0b1df41..19f6ee5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -375,9 +375,11 @@ struct perf_event_mmap_page {
> __u64 time_running; /* time event on cpu */
> union {
> __u64 capabilities;
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> - cap_____res : 62;
> + struct {
> + __u64 cap_usr_time : 1,
> + cap_usr_rdpmc : 1,
> + cap_____res : 62;
> + };
> };

Ick, it did that!? and here I thought there was a difference between:

int foo:1,
bar:1;

and

int foo:1;
int bar:1;

That made all the difference in this particular case. I guess I should
go read the language spec more carefully next time.

2013-06-28 15:27:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/5] perf: add two new features

On Fri, Jun 28, 2013 at 04:22:16PM +0300, Adrian Hunter wrote:
> Hi
>
> Please consider these two new perf features:
> x86: add ability to calculate TSC from perf sample timestamps
> perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE

Please explain to us why you'd like to do this..

2013-06-28 19:22:38

by Adrian Hunter

[permalink] [raw]
Subject: Re: [PATCH 0/5] perf: add two new features

On 28/06/2013 6:27 p.m., Peter Zijlstra wrote:
> On Fri, Jun 28, 2013 at 04:22:16PM +0300, Adrian Hunter wrote:
>> Hi
>>
>> Please consider these two new perf features:
>> x86: add ability to calculate TSC from perf sample timestamps
>> perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE
>
> Please explain to us why you'd like to do this..

I will see what information I can dig up. The short answer is that I
need to disable and re-enable a perf event but still be able to map IPs
to their DSOs and symbols - which means not losing mmap events.

2013-07-16 06:15:29

by Adrian Hunter

[permalink] [raw]
Subject: Re: [PATCH 0/5] perf: add two new features

On 28/06/13 22:22, Adrian Hunter wrote:
> On 28/06/2013 6:27 p.m., Peter Zijlstra wrote:
>> On Fri, Jun 28, 2013 at 04:22:16PM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Please consider these two new perf features:
>>> x86: add ability to calculate TSC from perf sample timestamps
>>> perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE
>>
>> Please explain to us why you'd like to do this..
>
> I will see what information I can dig up. The short answer is that I need
> to disable and re-enable a perf event but still be able to map IPs to their
> DSOs and symbols - which means not losing mmap events.

Any other comments?

2013-07-16 11:52:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf: fix broken union in perf_event_mmap_page

On 06/28/2013 08:22 AM, Peter Zijlstra wrote:
>
> Ick, it did that!? and here I thought there was a difference between:
>
> int foo:1,
> bar:1;
>
> and
>
> int foo:1;
> int bar:1;
>
> That made all the difference in this particular case. I guess I should
> go read the language spec more carefully next time.
>

There is absolutely no difference there...

-hpa

2013-07-16 14:35:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/5] perf: add two new features

On Tue, Jul 16, 2013 at 09:22:00AM +0300, Adrian Hunter wrote:
> On 28/06/13 22:22, Adrian Hunter wrote:
> > On 28/06/2013 6:27 p.m., Peter Zijlstra wrote:
> >> On Fri, Jun 28, 2013 at 04:22:16PM +0300, Adrian Hunter wrote:
> >>> Hi
> >>>
> >>> Please consider these two new perf features:
> >>> x86: add ability to calculate TSC from perf sample timestamps
> >>> perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE
> >>
> >> Please explain to us why you'd like to do this..
> >
> > I will see what information I can dig up. The short answer is that I need
> > to disable and re-enable a perf event but still be able to map IPs to their
> > DSOs and symbols - which means not losing mmap events.
>
> Any other comments?

Ah, thanks for the reminder.. well, I've applied patches 1-3 as those seem
useful on their own. I'm not entirely convinced about the 'keep tracking' thing
though.

It seems to me you could get the same by opening a second event into the same
buffer and keeping that enabled.

2013-07-17 11:22:12

by Adrian Hunter

[permalink] [raw]
Subject: Re: [PATCH 0/5] perf: add two new features

On 16/07/13 17:34, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 09:22:00AM +0300, Adrian Hunter wrote:
>> On 28/06/13 22:22, Adrian Hunter wrote:
>>> On 28/06/2013 6:27 p.m., Peter Zijlstra wrote:
>>>> On Fri, Jun 28, 2013 at 04:22:16PM +0300, Adrian Hunter wrote:
>>>>> Hi
>>>>>
>>>>> Please consider these two new perf features:
>>>>> x86: add ability to calculate TSC from perf sample timestamps
>>>>> perf: add 'keep tracking' flag to PERF_EVENT_IOC_DISABLE
>>>>
>>>> Please explain to us why you'd like to do this..
>>>
>>> I will see what information I can dig up. The short answer is that I need
>>> to disable and re-enable a perf event but still be able to map IPs to their
>>> DSOs and symbols - which means not losing mmap events.
>>
>> Any other comments?
>
> Ah, thanks for the reminder.. well, I've applied patches 1-3 as those seem
> useful on their own. I'm not entirely convinced about the 'keep tracking' thing
> though.
>
> It seems to me you could get the same by opening a second event into the same
> buffer and keeping that enabled.

In that case I would like a dummy event to use for that purpose. A software
event could be set aside e.g.

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 627bbcf..7ea4ccd 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -109,6 +109,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6,
PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
PERF_COUNT_SW_EMULATION_FAULTS = 8,
+ PERF_COUNT_SW_DUMMY = 9,

PERF_COUNT_SW_MAX, /* non-ABI */
};

Subject: [tip:perf/core] perf/x86: Add ability to calculate TSC from perf sample timestamps

Commit-ID: c73deb6aecda2955716f31572516f09d930ef450
Gitweb: http://git.kernel.org/tip/c73deb6aecda2955716f31572516f09d930ef450
Author: Adrian Hunter <[email protected]>
AuthorDate: Fri, 28 Jun 2013 16:22:18 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 23 Jul 2013 12:17:45 +0200

perf/x86: Add ability to calculate TSC from perf sample timestamps

For modern CPUs, perf clock is directly related to TSC. TSC
can be calculated from perf clock and vice versa using a simple
calculation. Two of the three componenets of that calculation
are already exported in struct perf_event_mmap_page. This patch
exports the third.

Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/cpu/perf_event.c | 6 ++++++
arch/x86/kernel/tsc.c | 6 ++++++
include/uapi/linux/perf_event.h | 22 ++++++++++++++++++++--
4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c91e8b9..235be70 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -49,6 +49,7 @@ extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
+extern int check_tsc_disabled(void);
extern unsigned long native_calibrate_tsc(void);

extern int tsc_clocksource_reliable;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a7c7305..8355c84 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1884,6 +1884,7 @@ static struct pmu pmu = {
void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
userpg->cap_usr_time = 0;
+ userpg->cap_usr_time_zero = 0;
userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

@@ -1897,6 +1898,11 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
+
+ if (sched_clock_stable && !check_tsc_disabled()) {
+ userpg->cap_usr_time_zero = 1;
+ userpg->time_zero = this_cpu_read(cyc2ns_offset);
+ }
}

/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6ff4924..930e5d4 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -89,6 +89,12 @@ int check_tsc_unstable(void)
}
EXPORT_SYMBOL_GPL(check_tsc_unstable);

+int check_tsc_disabled(void)
+{
+ return tsc_disabled;
+}
+EXPORT_SYMBOL_GPL(check_tsc_disabled);
+
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0041aed..efef1d3 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -378,7 +378,8 @@ struct perf_event_mmap_page {
struct {
__u64 cap_usr_time : 1,
cap_usr_rdpmc : 1,
- cap_____res : 62;
+ cap_usr_time_zero : 1,
+ cap_____res : 61;
};
};

@@ -420,12 +421,29 @@ struct perf_event_mmap_page {
__u16 time_shift;
__u32 time_mult;
__u64 time_offset;
+ /*
+ * If cap_usr_time_zero, the hardware clock (e.g. TSC) can be calculated
+ * from sample timestamps.
+ *
+ * time = timestamp - time_zero;
+ * quot = time / time_mult;
+ * rem = time % time_mult;
+ * cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
+ *
+ * And vice versa:
+ *
+ * quot = cyc >> time_shift;
+ * rem = cyc & ((1 << time_shift) - 1);
+ * timestamp = time_zero + quot * time_mult +
+ * ((rem * time_mult) >> time_shift);
+ */
+ __u64 time_zero;

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[120]; /* align to 1k */
+ __u64 __reserved[119]; /* align to 1k */

/*
* Control data for the mmap() data buffer.

Subject: [tip:perf/core] perf tools: Add test for converting perf time to/ from TSC

Commit-ID: 3bd5a5fc8c6b9fe769777abf74b0ab5fbd7930b4
Gitweb: http://git.kernel.org/tip/3bd5a5fc8c6b9fe769777abf74b0ab5fbd7930b4
Author: Adrian Hunter <[email protected]>
AuthorDate: Fri, 28 Jun 2013 16:22:19 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 23 Jul 2013 12:17:59 +0200

perf tools: Add test for converting perf time to/from TSC

The test uses the newly added cap_usr_time_zero and time_zero of
perf_event_mmap_page. TSC from rdtsc is compared with the time
from 2 perf events. The test passes if the calculated times are
all in the correct order.

Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/perf/Makefile | 3 +
tools/perf/arch/x86/Makefile | 2 +
tools/perf/arch/x86/util/tsc.c | 59 ++++++++++++
tools/perf/arch/x86/util/tsc.h | 20 ++++
tools/perf/tests/builtin-test.c | 6 ++
tools/perf/tests/perf-time-to-tsc.c | 177 ++++++++++++++++++++++++++++++++++++
tools/perf/tests/tests.h | 1 +
7 files changed, 268 insertions(+)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 024680b..bfd12d0 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -389,6 +389,9 @@ LIB_OBJS += $(OUTPUT)tests/bp_signal.o
LIB_OBJS += $(OUTPUT)tests/bp_signal_overflow.o
LIB_OBJS += $(OUTPUT)tests/task-exit.o
LIB_OBJS += $(OUTPUT)tests/sw-clock.o
+ifeq ($(ARCH),x86)
+LIB_OBJS += $(OUTPUT)tests/perf-time-to-tsc.o
+endif

BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 815841c..8801fe0 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -6,3 +6,5 @@ ifndef NO_LIBUNWIND
LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/unwind.o
endif
LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/tsc.o
+LIB_H += arch/$(ARCH)/util/tsc.h
diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
new file mode 100644
index 0000000..f111744
--- /dev/null
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -0,0 +1,59 @@
+#include <stdbool.h>
+#include <errno.h>
+
+#include <linux/perf_event.h>
+
+#include "../../perf.h"
+#include "../../util/types.h"
+#include "../../util/debug.h"
+#include "tsc.h"
+
+u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc)
+{
+ u64 time, quot, rem;
+
+ time = ns - tc->time_zero;
+ quot = time / tc->time_mult;
+ rem = time % tc->time_mult;
+ return (quot << tc->time_shift) +
+ (rem << tc->time_shift) / tc->time_mult;
+}
+
+u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc)
+{
+ u64 quot, rem;
+
+ quot = cyc >> tc->time_shift;
+ rem = cyc & ((1 << tc->time_shift) - 1);
+ return tc->time_zero + quot * tc->time_mult +
+ ((rem * tc->time_mult) >> tc->time_shift);
+}
+
+int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
+ struct perf_tsc_conversion *tc)
+{
+ bool cap_usr_time_zero;
+ u32 seq;
+ int i = 0;
+
+ while (1) {
+ seq = pc->lock;
+ rmb();
+ tc->time_mult = pc->time_mult;
+ tc->time_shift = pc->time_shift;
+ tc->time_zero = pc->time_zero;
+ cap_usr_time_zero = pc->cap_usr_time_zero;
+ rmb();
+ if (pc->lock == seq && !(seq & 1))
+ break;
+ if (++i > 10000) {
+ pr_debug("failed to get perf_event_mmap_page lock\n");
+ return -EINVAL;
+ }
+ }
+
+ if (!cap_usr_time_zero)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
diff --git a/tools/perf/arch/x86/util/tsc.h b/tools/perf/arch/x86/util/tsc.h
new file mode 100644
index 0000000..a24dec8
--- /dev/null
+++ b/tools/perf/arch/x86/util/tsc.h
@@ -0,0 +1,20 @@
+#ifndef TOOLS_PERF_ARCH_X86_UTIL_TSC_H__
+#define TOOLS_PERF_ARCH_X86_UTIL_TSC_H__
+
+#include "../../util/types.h"
+
+struct perf_tsc_conversion {
+ u16 time_shift;
+ u32 time_mult;
+ u64 time_zero;
+};
+
+struct perf_event_mmap_page;
+
+int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
+ struct perf_tsc_conversion *tc);
+
+u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc);
+u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc);
+
+#endif /* TOOLS_PERF_ARCH_X86_UTIL_TSC_H__ */
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 35b45f1466..b7b4049 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -93,6 +93,12 @@ static struct test {
.desc = "Test software clock events have valid period values",
.func = test__sw_clock_freq,
},
+#if defined(__x86_64__) || defined(__i386__)
+ {
+ .desc = "Test converting perf time to TSC",
+ .func = test__perf_time_to_tsc,
+ },
+#endif
{
.func = NULL,
},
diff --git a/tools/perf/tests/perf-time-to-tsc.c b/tools/perf/tests/perf-time-to-tsc.c
new file mode 100644
index 0000000..0ab61b1
--- /dev/null
+++ b/tools/perf/tests/perf-time-to-tsc.c
@@ -0,0 +1,177 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <inttypes.h>
+#include <sys/prctl.h>
+
+#include "parse-events.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "thread_map.h"
+#include "cpumap.h"
+#include "tests.h"
+
+#include "../arch/x86/util/tsc.h"
+
+#define CHECK__(x) { \
+ while ((x) < 0) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+#define CHECK_NOT_NULL__(x) { \
+ while ((x) == NULL) { \
+ pr_debug(#x " failed!\n"); \
+ goto out_err; \
+ } \
+}
+
+static u64 rdtsc(void)
+{
+ unsigned int low, high;
+
+ asm volatile("rdtsc" : "=a" (low), "=d" (high));
+
+ return low | ((u64)high) << 32;
+}
+
+/**
+ * test__perf_time_to_tsc - test converting perf time to TSC.
+ *
+ * This function implements a test that checks that the conversion of perf time
+ * to and from TSC is consistent with the order of events. If the test passes
+ * %0 is returned, otherwise %-1 is returned. If TSC conversion is not
+ * supported then then the test passes but " (not supported)" is printed.
+ */
+int test__perf_time_to_tsc(void)
+{
+ struct perf_record_opts opts = {
+ .mmap_pages = UINT_MAX,
+ .user_freq = UINT_MAX,
+ .user_interval = ULLONG_MAX,
+ .freq = 4000,
+ .target = {
+ .uses_mmap = true,
+ },
+ .sample_time = true,
+ };
+ struct thread_map *threads = NULL;
+ struct cpu_map *cpus = NULL;
+ struct perf_evlist *evlist = NULL;
+ struct perf_evsel *evsel = NULL;
+ int err = -1, ret, i;
+ const char *comm1, *comm2;
+ struct perf_tsc_conversion tc;
+ struct perf_event_mmap_page *pc;
+ union perf_event *event;
+ u64 test_tsc, comm1_tsc, comm2_tsc;
+ u64 test_time, comm1_time = 0, comm2_time = 0;
+
+ threads = thread_map__new(-1, getpid(), UINT_MAX);
+ CHECK_NOT_NULL__(threads);
+
+ cpus = cpu_map__new(NULL);
+ CHECK_NOT_NULL__(cpus);
+
+ evlist = perf_evlist__new();
+ CHECK_NOT_NULL__(evlist);
+
+ perf_evlist__set_maps(evlist, cpus, threads);
+
+ CHECK__(parse_events(evlist, "cycles:u"));
+
+ perf_evlist__config(evlist, &opts);
+
+ evsel = perf_evlist__first(evlist);
+
+ evsel->attr.comm = 1;
+ evsel->attr.disabled = 1;
+ evsel->attr.enable_on_exec = 0;
+
+ CHECK__(perf_evlist__open(evlist));
+
+ CHECK__(perf_evlist__mmap(evlist, UINT_MAX, false));
+
+ pc = evlist->mmap[0].base;
+ ret = perf_read_tsc_conversion(pc, &tc);
+ if (ret) {
+ if (ret == -EOPNOTSUPP) {
+ fprintf(stderr, " (not supported)");
+ return 0;
+ }
+ goto out_err;
+ }
+
+ perf_evlist__enable(evlist);
+
+ comm1 = "Test COMM 1";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm1, 0, 0, 0));
+
+ test_tsc = rdtsc();
+
+ comm2 = "Test COMM 2";
+ CHECK__(prctl(PR_SET_NAME, (unsigned long)comm2, 0, 0, 0));
+
+ perf_evlist__disable(evlist);
+
+ for (i = 0; i < evlist->nr_mmaps; i++) {
+ while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
+ struct perf_sample sample;
+
+ if (event->header.type != PERF_RECORD_COMM ||
+ (pid_t)event->comm.pid != getpid() ||
+ (pid_t)event->comm.tid != getpid())
+ continue;
+
+ if (strcmp(event->comm.comm, comm1) == 0) {
+ CHECK__(perf_evsel__parse_sample(evsel, event,
+ &sample));
+ comm1_time = sample.time;
+ }
+ if (strcmp(event->comm.comm, comm2) == 0) {
+ CHECK__(perf_evsel__parse_sample(evsel, event,
+ &sample));
+ comm2_time = sample.time;
+ }
+ }
+ }
+
+ if (!comm1_time || !comm2_time)
+ goto out_err;
+
+ test_time = tsc_to_perf_time(test_tsc, &tc);
+ comm1_tsc = perf_time_to_tsc(comm1_time, &tc);
+ comm2_tsc = perf_time_to_tsc(comm2_time, &tc);
+
+ pr_debug("1st event perf time %"PRIu64" tsc %"PRIu64"\n",
+ comm1_time, comm1_tsc);
+ pr_debug("rdtsc time %"PRIu64" tsc %"PRIu64"\n",
+ test_time, test_tsc);
+ pr_debug("2nd event perf time %"PRIu64" tsc %"PRIu64"\n",
+ comm2_time, comm2_tsc);
+
+ if (test_time <= comm1_time ||
+ test_time >= comm2_time)
+ goto out_err;
+
+ if (test_tsc <= comm1_tsc ||
+ test_tsc >= comm2_tsc)
+ goto out_err;
+
+ err = 0;
+
+out_err:
+ if (evlist) {
+ perf_evlist__disable(evlist);
+ perf_evlist__munmap(evlist);
+ perf_evlist__close(evlist);
+ perf_evlist__delete(evlist);
+ }
+ if (cpus)
+ cpu_map__delete(cpus);
+ if (threads)
+ thread_map__delete(threads);
+
+ return err;
+}
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 07a92f9..d22202a 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -35,5 +35,6 @@ int test__bp_signal(void);
int test__bp_signal_overflow(void);
int test__task_exit(void);
int test__sw_clock_freq(void);
+int test__perf_time_to_tsc(void);

#endif /* TESTS_H */

Subject: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

Commit-ID: 860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
Gitweb: http://git.kernel.org/tip/860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
Author: Adrian Hunter <[email protected]>
AuthorDate: Fri, 28 Jun 2013 16:22:17 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 23 Jul 2013 12:17:10 +0200

perf: Fix broken union in 'struct perf_event_mmap_page'

The capabilities bits must not be "union'ed" together.
Put them in a separate struct.

Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/perf_event.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 00d8274..0041aed 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,9 +375,11 @@ struct perf_event_mmap_page {
__u64 time_running; /* time event on cpu */
union {
__u64 capabilities;
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_____res : 62;
+ struct {
+ __u64 cap_usr_time : 1,
+ cap_usr_rdpmc : 1,
+ cap_____res : 62;
+ };
};

/*

2013-09-17 20:21:53

by Vince Weaver

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'


This patch somehow breaks the perf-ABI.

If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
against the new header with this change (say from 3.12-rc1)
and then run it on an old kernel (say 3.11) then I get "0" for
cap_usr_rdpmc.

If I take the same program and recompile against the old (without this
patch) header and run it on 3.11, I get the expected "1" value.

So something about this changed the bit pattern in an incompatible
fashion.

Vince




On Tue, 23 Jul 2013, tip-bot for Adrian Hunter wrote:

> Commit-ID: 860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
> Gitweb: http://git.kernel.org/tip/860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
> Author: Adrian Hunter <[email protected]>
> AuthorDate: Fri, 28 Jun 2013 16:22:17 +0300
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 23 Jul 2013 12:17:10 +0200
>
> perf: Fix broken union in 'struct perf_event_mmap_page'
>
> The capabilities bits must not be "union'ed" together.
> Put them in a separate struct.
>
> Signed-off-by: Adrian Hunter <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> include/uapi/linux/perf_event.h | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 00d8274..0041aed 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -375,9 +375,11 @@ struct perf_event_mmap_page {
> __u64 time_running; /* time event on cpu */
> union {
> __u64 capabilities;
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> - cap_____res : 62;
> + struct {
> + __u64 cap_usr_time : 1,
> + cap_usr_rdpmc : 1,
> + cap_____res : 62;
> + };
> };
>
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
k

2013-09-17 20:33:59

by Vince Weaver

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Tue, 17 Sep 2013, Vince Weaver wrote:

>
> This patch somehow breaks the perf-ABI.
>
> If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
> against the new header with this change (say from 3.12-rc1)
> and then run it on an old kernel (say 3.11) then I get "0" for
> cap_usr_rdpmc.
>
> If I take the same program and recompile against the old (without this
> patch) header and run it on 3.11, I get the expected "1" value.

To follow up, the original case:

union {
__u64 capabilities;
__u64 cap_usr_time : 1,
cap_usr_rdpmc : 1,
cap_____res : 62;
};

Then mmap->usr_rdpmc=1; gets assembled as

400420: 80 0d 11 05 20 00 02 orb $0x2,0x200511(%rip)

The new version

union {
__u64 capabilities;
struct {
__u64 cap_usr_time : 1,
cap_usr_rdpmc : 1,
cap_usr_time_zero : 1,
cap_____res : 61;
};
};

mmap->usr_rdpmc=1; gets assembled as


400427: 80 0d 02 05 20 00 01 orb $0x1,0x200502(%rip)

note the difference in the or value.

Vince

2013-09-18 08:57:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Tue, Sep 17, 2013 at 04:23:25PM -0400, Vince Weaver wrote:
> > include/uapi/linux/perf_event.h | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index 00d8274..0041aed 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -375,9 +375,11 @@ struct perf_event_mmap_page {
> > __u64 time_running; /* time event on cpu */
> > union {
> > __u64 capabilities;
> > - __u64 cap_usr_time : 1,
> > - cap_usr_rdpmc : 1,
> > - cap_____res : 62;
> > + struct {
> > + __u64 cap_usr_time : 1,
> > + cap_usr_rdpmc : 1,
> > + cap_____res : 62;
> > + };
> > };

> This patch somehow breaks the perf-ABI.

Difficult call that..

> If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
> against the new header with this change (say from 3.12-rc1)
> and then run it on an old kernel (say 3.11) then I get "0" for
> cap_usr_rdpmc.
>
> If I take the same program and recompile against the old (without this
> patch) header and run it on 3.11, I get the expected "1" value.
>
> So something about this changed the bit pattern in an incompatible
> fashion.

Which was the entire point of the change. Previously cap_usr_time and
cap_usr_rdpmc were the same bit which clearly cannot be right.

With the change they're consecutive bits in the capabilities word.

Should we preserve completely broken things that don't work?

2013-09-18 09:07:09

by Adrian Hunter

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On 17/09/13 23:23, Vince Weaver wrote:
>
> This patch somehow breaks the perf-ABI.
>
> If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
> against the new header with this change (say from 3.12-rc1)
> and then run it on an old kernel (say 3.11) then I get "0" for
> cap_usr_rdpmc.
>
> If I take the same program and recompile against the old (without this
> patch) header and run it on 3.11, I get the expected "1" value.
>
> So something about this changed the bit pattern in an incompatible
> fashion.


cap_usr_time and cap_usr_rdpmc were occupying the same bit position i.e. bit 0

That means that cap_usr_time and cap_usr_rdpmc were both unreliable.

If you look at the logic:

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
userpg->cap_usr_time = 0;
userpg->cap_usr_time_zero = 0;
userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
return;

if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
return;

userpg->cap_usr_time = 1;
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;

if (sched_clock_stable && !check_tsc_disabled()) {
userpg->cap_usr_time_zero = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
}
}

The incorrect union caused 2 bugs:

1. On hardware with constant, non-stop TSC cap_usr_rdpmc was always 1.

2. On hardware without constant, non-stop TSC cap_usr_time was still 1 if
rdpmc was allowed in userspace.


Possible improvements are one or both of:
1. Add cap_usr_fixed to identify kernels that have the capabilities bits fixed
2. Swap the positions of cap_usr_time and cap_usr_rdpmc so that
cap_usr_rdpmc remains in bit 0






>
> Vince
>
>
>
>
> On Tue, 23 Jul 2013, tip-bot for Adrian Hunter wrote:
>
>> Commit-ID: 860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
>> Gitweb: http://git.kernel.org/tip/860f085b74e9f0075de8140ed3a1e5b5e3e39aa8
>> Author: Adrian Hunter <[email protected]>
>> AuthorDate: Fri, 28 Jun 2013 16:22:17 +0300
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 23 Jul 2013 12:17:10 +0200
>>
>> perf: Fix broken union in 'struct perf_event_mmap_page'
>>
>> The capabilities bits must not be "union'ed" together.
>> Put them in a separate struct.
>>
>> Signed-off-by: Adrian Hunter <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Link: http://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> include/uapi/linux/perf_event.h | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 00d8274..0041aed 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -375,9 +375,11 @@ struct perf_event_mmap_page {
>> __u64 time_running; /* time event on cpu */
>> union {
>> __u64 capabilities;
>> - __u64 cap_usr_time : 1,
>> - cap_usr_rdpmc : 1,
>> - cap_____res : 62;
>> + struct {
>> + __u64 cap_usr_time : 1,
>> + cap_usr_rdpmc : 1,
>> + cap_____res : 62;
>> + };
>> };
>>
>> /*
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
> k
>
>

2013-09-18 14:08:49

by Vince Weaver

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, 18 Sep 2013, Adrian Hunter wrote:

> On 17/09/13 23:23, Vince Weaver wrote:
> >
> > This patch somehow breaks the perf-ABI.
> >
> > If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
> > against the new header with this change (say from 3.12-rc1)
> > and then run it on an old kernel (say 3.11) then I get "0" for
> > cap_usr_rdpmc.
> >
> > If I take the same program and recompile against the old (without this
> > patch) header and run it on 3.11, I get the expected "1" value.
> >
> > So something about this changed the bit pattern in an incompatible
> > fashion.
>
>
> cap_usr_time and cap_usr_rdpmc were occupying the same bit position i.e. bit 0
>
> That means that cap_usr_time and cap_usr_rdpmc were both unreliable.

well then I have to say this patch wins today's award for "least
useful commit message". Why wasn't this important info there?
>From what it originally sounded like this was just some shuffling of
low-level C minutia, not a hard-to-resolve break in the perf ABI.

> Possible improvements are one or both of:
> 1. Add cap_usr_fixed to identify kernels that have the capabilities bits fixed
> 2. Swap the positions of cap_usr_time and cap_usr_rdpmc so that
> cap_usr_rdpmc remains in bit 0

Or just create two new fields and mark the old ones as obsolete somehow.

The problem with just making this change silently is now tools like PAPI
that care about cap_usr_rdpmc will behave in hard-to-debug ways.

For example, code compiled against new headers but run with old kernels
won't detect rdpmc properly

Code compiled against old headers but run with new kernels won't detect
rdpmc properly.

It's true existing code could have problems if you ran on a machine with
conflicting rdpmc/usr_time, but my tests never noticed that, presumably
because rdpmc and usr_time were added at the same time (in 3.4), x86
always enables rdpmc, and none of my code uses usr_time so I never see
wrong time results.

Vince

2013-09-18 14:17:57

by Vince Weaver

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, 18 Sep 2013, Peter Zijlstra wrote:

> > This patch somehow breaks the perf-ABI.
>
> Difficult call that..

OK, let me rephrase.

This change broke existing working code.

Can you point to any code that is fixed by the commit?

If not, I think the rule is you revert the changeset.

Or you can just mark the bug "Won't fix, perf tool doesn't use it" I guess

You guys told me to start running my tests in -rc1 so I can catch bugs
early enough to fix things. So here I am, finding a problem in -rc1.
If no one cares I'll stop bothering and go back to just documenting the
problems after the fact in complaining webpages.

Vince

2013-09-18 15:42:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
> Can you point to any code that is fixed by the commit?

I have some, but I don't think a lot of people use it.

Would you be ok with something like the below? It should preserve
functionality for code that only cares about cap_usr_rdpmc (PAPI).

Stephane, does libpfm use any of these?

--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,8 +380,8 @@ struct perf_event_mmap_page {
union {
__u64 capabilities;
struct {
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
+ __u64 cap_usr_rdpmc : 1,
+ cap_usr_time : 1,
cap_usr_time_zero : 1,
cap_____res : 61;
};

2013-09-18 18:33:55

by Stephane Eranian

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, Sep 18, 2013 at 5:42 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
>> Can you point to any code that is fixed by the commit?
>
> I have some, but I don't think a lot of people use it.
>
> Would you be ok with something like the below? It should preserve
> functionality for code that only cares about cap_usr_rdpmc (PAPI).
>
> Stephane, does libpfm use any of these?
>
Yes, there is an example using this. Need to verify it is not broken
currently (self_count.c).

> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -380,8 +380,8 @@ struct perf_event_mmap_page {
> union {
> __u64 capabilities;
> struct {
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> + __u64 cap_usr_rdpmc : 1,
> + cap_usr_time : 1,
> cap_usr_time_zero : 1,
> cap_____res : 61;
> };

2013-09-18 20:06:21

by Vince Weaver

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, 18 Sep 2013, Peter Zijlstra wrote:

> On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
> > Can you point to any code that is fixed by the commit?
>
> I have some, but I don't think a lot of people use it.
>
> Would you be ok with something like the below? It should preserve
> functionality for code that only cares about cap_usr_rdpmc (PAPI).
>
> Stephane, does libpfm use any of these?
>
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -380,8 +380,8 @@ struct perf_event_mmap_page {
> union {
> __u64 capabilities;
> struct {
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> + __u64 cap_usr_rdpmc : 1,
> + cap_usr_time : 1,
> cap_usr_time_zero : 1,
> cap_____res : 61;
> };
>

It would be nice if there was some way to detect this change; I liked the
idea of a "cap_usr_fixed" bit.

Even with your change you can't have code that can reliably detect both
cap_usr_time and cap_usr_rdpmc unless you can guarantee that both
perf_event.h and the kernel are 3.12 or newer, and it gets more
complicated if distros backport this patch.

Tools like PAPI often carry around their own copy of perf_event.h and
people like to move around binaries between machines with different kernel
versions so things get complicated quickly.

Vince

2013-09-19 08:17:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, Sep 18, 2013 at 04:07:52PM -0400, Vince Weaver wrote:
> It would be nice if there was some way to detect this change; I liked the
> idea of a "cap_usr_fixed" bit.

How about we start using the version field for this? Arguably we should
have incremented that value every time we changed the thing but we might
as well start now.

---

diff --git a/kernel/events/core.c b/kernel/events/core.c
index dd236b6..c44d55c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3660,6 +3660,25 @@ static void calc_timer_values(struct perf_event *event,
*running = ctx_time - event->tstamp_running;
}

+static void perf_event_init_userpage(struct perf_event *event)
+{
+ struct perf_event_mmap_page *userpg;
+ struct ring_buffer *rb;
+
+ rcu_read_lock();
+ rb = rcu_dereference(event->rb);
+ if (!rb)
+ goto unlock;
+
+ userpg = rb->user_page;
+
+ userpg->version = 1;
+ userpg->compat_version = 0;
+
+unlock:
+ rcu_read_unlock();
+}
+
void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
}
@@ -4044,6 +4063,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

+ perf_event_init_userpage(event);
perf_event_update_userpage(event);

unlock:

2013-09-19 08:42:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'


* Vince Weaver <[email protected]> wrote:

> On Tue, 17 Sep 2013, Vince Weaver wrote:
>
> >
> > This patch somehow breaks the perf-ABI.
> >
> > If I take a program that reads "mmap->cap_usr_rdpmc" and compile it
> > against the new header with this change (say from 3.12-rc1)
> > and then run it on an old kernel (say 3.11) then I get "0" for
> > cap_usr_rdpmc.
> >
> > If I take the same program and recompile against the old (without this
> > patch) header and run it on 3.11, I get the expected "1" value.
>
> To follow up, the original case:
>
> union {
> __u64 capabilities;
> __u64 cap_usr_time : 1,
> cap_usr_rdpmc : 1,
> cap_____res : 62;
> };
>
> Then mmap->usr_rdpmc=1; gets assembled as
>
> 400420: 80 0d 11 05 20 00 02 orb $0x2,0x200511(%rip)

Hm, how can it be 0x2? Didn't you mix up the two assemblies accidentally?

>
> The new version
>
> union {
> __u64 capabilities;
> struct {
> __u64 cap_usr_time : 1,
> cap_usr_rdpmc : 1,
> cap_usr_time_zero : 1,
> cap_____res : 61;
> };
> };
>
> mmap->usr_rdpmc=1; gets assembled as
>
>
> 400427: 80 0d 02 05 20 00 01 orb $0x1,0x200502(%rip)
>
> note the difference in the or value.

Here the value should be 0x2, as cap_usr_rdpmc move to the second bit.

Right?

Ingo

2013-09-19 08:44:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Wed, Sep 18, 2013 at 08:33:53PM +0200, Stephane Eranian wrote:
> On Wed, Sep 18, 2013 at 5:42 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
> >> Can you point to any code that is fixed by the commit?
> >
> > I have some, but I don't think a lot of people use it.
> >
> > Would you be ok with something like the below? It should preserve
> > functionality for code that only cares about cap_usr_rdpmc (PAPI).
> >
> > Stephane, does libpfm use any of these?
> >
> Yes, there is an example using this. Need to verify it is not broken
> currently (self_count.c).

So if that only uses cap_usr_rdpmc, you have the same issue as Vince and
the proposed solutions would work for you too.

If you also use cap_usr_time we've a bit of a problem.

2013-09-19 08:55:36

by Stephane Eranian

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'

On Thu, Sep 19, 2013 at 10:43 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Sep 18, 2013 at 08:33:53PM +0200, Stephane Eranian wrote:
>> On Wed, Sep 18, 2013 at 5:42 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
>> >> Can you point to any code that is fixed by the commit?
>> >
>> > I have some, but I don't think a lot of people use it.
>> >
>> > Would you be ok with something like the below? It should preserve
>> > functionality for code that only cares about cap_usr_rdpmc (PAPI).
>> >
>> > Stephane, does libpfm use any of these?
>> >
>> Yes, there is an example using this. Need to verify it is not broken
>> currently (self_count.c).
>
> So if that only uses cap_usr_rdpmc, you have the same issue as Vince and
> the proposed solutions would work for you too.
>
> If you also use cap_usr_time we've a bit of a problem.

I need to look at this program again, was written a long time ago.
It does not use cap_usr_rdpmc nor cap_use_time for sure.
So which one(s) are okay to use?

2013-09-19 09:14:59

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI


* Peter Zijlstra <[email protected]> wrote:

> On Wed, Sep 18, 2013 at 04:07:52PM -0400, Vince Weaver wrote:
>
> > It would be nice if there was some way to detect this change; I liked
> > the idea of a "cap_usr_fixed" bit.
>
> How about we start using the version field for this? Arguably we should
> have incremented that value every time we changed the thing but we might
> as well start now.

But version fields are really fragile, the way we usually iterate ABIs is
a self-maintaining size field - which is missing here.

So I think the best solution would be to make it all explicit and
self-contained:

- always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
confuse old user-space binaries. RDPMC will be marked as unavailable
to old binaries but that's within the ABI.

- rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
libraries can reliably detect that bit 0 is deprecated and perma-zero
without having to check the kernel version.

- use bits 2, 3, 4 for the newly defined, correct functionality.

- rename all the bitfield names in perf_event.h to be different from the
old names, to make sure it's not possible to mis-compile it
accidentally with old assumptions.

I.e. something like the patch below. (untested)

The 'size' field can then be used in the future to add new fields and it
will act as a natural ABI version indicator as well.

Thanks,

Ingo

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8355c84..3ab624c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,9 +1883,9 @@ static struct pmu pmu = {

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
- userpg->cap_usr_time = 0;
- userpg->cap_usr_time_zero = 0;
- userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
+ userpg->cap_usr_time_used = 0;
+ userpg->cap_usr_time_zero_used = 0;
+ userpg->cap_usr_rdpmc_available = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
@@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
return;

- userpg->cap_usr_time = 1;
+ userpg->cap_usr_time_used = 1;
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;

if (sched_clock_stable && !check_tsc_disabled()) {
- userpg->cap_usr_time_zero = 1;
+ userpg->cap_usr_time_zero_used = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
}
}
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 40a1fb8..515d7d2 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,10 +380,13 @@ struct perf_event_mmap_page {
union {
__u64 capabilities;
struct {
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_usr_time_zero : 1,
- cap_____res : 61;
+ __u64 cap_bit0 : 1, /* Deprecated, always zero, see commit 860f085b74e9 */
+ cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
+
+ cap_usr_rdpmc_available : 1, /* The RDPMC instruction can be used to read counts */
+ cap_usr_time_used : 1, /* The time_* fields are uses */
+ cap_usr_time_zero_used : 1, /* The time_zero field is used */
+ cap_____res : 59;
};
};

@@ -442,12 +445,14 @@ struct perf_event_mmap_page {
* ((rem * time_mult) >> time_shift);
*/
__u64 time_zero;
+ __u32 size; /* Header size up to this point */
+ __u32 __reserved0; /* 4 byte hole */

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[119]; /* align to 1k */
+ __u64 __reserved[118]; /* align to 1k */

/*
* Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dd236b6..27d339f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
*running = ctx_time - event->tstamp_running;
}

+static void perf_event_init_userpage(struct perf_event *event)
+{
+ struct perf_event_mmap_page *userpg;
+ struct ring_buffer *rb;
+
+ rcu_read_lock();
+ rb = rcu_dereference(event->rb);
+ if (!rb)
+ goto unlock;
+
+ userpg = rb->user_page;
+
+ /* Allow new userspace to detect that bit 0 is deprecated */
+ userpg->cap_bit0_is_deprecated = 1;
+ userpg->size = offsetof(struct perf_event_mmap_page, size);
+
+unlock:
+ rcu_read_unlock();
+}
+
void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
}
@@ -4044,6 +4064,7 @@ again:
ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

+ perf_event_init_userpage(event);
perf_event_update_userpage(event);

unlock:

2013-09-19 09:16:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:perf/core] perf: Fix broken union in ' struct perf_event_mmap_page'


* Stephane Eranian <[email protected]> wrote:

> On Thu, Sep 19, 2013 at 10:43 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Sep 18, 2013 at 08:33:53PM +0200, Stephane Eranian wrote:
> >> On Wed, Sep 18, 2013 at 5:42 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Wed, Sep 18, 2013 at 10:19:32AM -0400, Vince Weaver wrote:
> >> >> Can you point to any code that is fixed by the commit?
> >> >
> >> > I have some, but I don't think a lot of people use it.
> >> >
> >> > Would you be ok with something like the below? It should preserve
> >> > functionality for code that only cares about cap_usr_rdpmc (PAPI).
> >> >
> >> > Stephane, does libpfm use any of these?
> >> >
> >> Yes, there is an example using this. Need to verify it is not broken
> >> currently (self_count.c).
> >
> > So if that only uses cap_usr_rdpmc, you have the same issue as Vince and
> > the proposed solutions would work for you too.
> >
> > If you also use cap_usr_time we've a bit of a problem.
>
> I need to look at this program again, was written a long time ago. It
> does not use cap_usr_rdpmc nor cap_use_time for sure.

If it does not use either flag (which in released kernels is really a
single flag ABI-wise) then it should be fine.

Thanks,

Ingo

2013-09-19 10:12:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI

On Thu, Sep 19, 2013 at 11:14:53AM +0200, Ingo Molnar wrote:
> @@ -442,12 +445,14 @@ struct perf_event_mmap_page {
> * ((rem * time_mult) >> time_shift);
> */
> __u64 time_zero;
> + __u32 size; /* Header size up to this point */
> + __u32 __reserved0; /* 4 byte hole */
>
> /*
> * Hole for extension of the self monitor capabilities
> */
>
> - __u64 __reserved[119]; /* align to 1k */
> + __u64 __reserved[118]; /* align to 1k */
>
> /*
> * Control data for the mmap() data buffer.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dd236b6..27d339f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
> *running = ctx_time - event->tstamp_running;
> }
>
> +static void perf_event_init_userpage(struct perf_event *event)
> +{
> + struct perf_event_mmap_page *userpg;
> + struct ring_buffer *rb;
> +
> + rcu_read_lock();
> + rb = rcu_dereference(event->rb);
> + if (!rb)
> + goto unlock;
> +
> + userpg = rb->user_page;
> +
> + /* Allow new userspace to detect that bit 0 is deprecated */
> + userpg->cap_bit0_is_deprecated = 1;
> + userpg->size = offsetof(struct perf_event_mmap_page, size);

This is fragile and I'm 100% sure we'll forget to update it.

userpg->size = offsetof(struct perf_event_mmap_page, __reserved);

Will auto update and mostly do the right thing.

Also, how will userspace know there's a valid size field way out there?
Shouldn't we bump the version field to indicate so? :-) After all,
running on old userspace this field will be 0.

> +
> +unlock:
> + rcu_read_unlock();
> +}

2013-09-19 10:28:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI


* Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 19, 2013 at 11:14:53AM +0200, Ingo Molnar wrote:
> > @@ -442,12 +445,14 @@ struct perf_event_mmap_page {
> > * ((rem * time_mult) >> time_shift);
> > */
> > __u64 time_zero;
> > + __u32 size; /* Header size up to this point */
> > + __u32 __reserved0; /* 4 byte hole */
> >
> > /*
> > * Hole for extension of the self monitor capabilities
> > */
> >
> > - __u64 __reserved[119]; /* align to 1k */
> > + __u64 __reserved[118]; /* align to 1k */
> >
> > /*
> > * Control data for the mmap() data buffer.
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index dd236b6..27d339f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
> > *running = ctx_time - event->tstamp_running;
> > }
> >
> > +static void perf_event_init_userpage(struct perf_event *event)
> > +{
> > + struct perf_event_mmap_page *userpg;
> > + struct ring_buffer *rb;
> > +
> > + rcu_read_lock();
> > + rb = rcu_dereference(event->rb);
> > + if (!rb)
> > + goto unlock;
> > +
> > + userpg = rb->user_page;
> > +
> > + /* Allow new userspace to detect that bit 0 is deprecated */
> > + userpg->cap_bit0_is_deprecated = 1;
> > + userpg->size = offsetof(struct perf_event_mmap_page, size);
>
> This is fragile and I'm 100% sure we'll forget to update it.
>
> userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
>
> Will auto update and mostly do the right thing.

Ah, yes, agreed 100% - that was my intention, just implemented it badly.

One detail: I think we want to track size with u8 granularity, to be able
to detect when a new u32 (or u16) field gets added, right? Updated patch
attached below.

Note that this way of writing the array size:

+ __u8 __reserved[118*8+4]; /* align to 1k. */

Makes it sure that we are aware of the current word alignment situation
and makes it harder to break alignment in the future.

> Also, how will userspace know there's a valid size field way out there?

The current value of the size field on old kernels is 0 so it easily
detected by being nonzero.

> Shouldn't we bump the version field to indicate so? :-) After all,
> running on old userspace this field will be 0.

Well, on old kernel this will be 0, right?

Thanks,

Ingo

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8355c84..3ab624c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,9 +1883,9 @@ static struct pmu pmu = {

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
- userpg->cap_usr_time = 0;
- userpg->cap_usr_time_zero = 0;
- userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
+ userpg->cap_usr_time_used = 0;
+ userpg->cap_usr_time_zero_used = 0;
+ userpg->cap_usr_rdpmc_available = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
@@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
return;

- userpg->cap_usr_time = 1;
+ userpg->cap_usr_time_used = 1;
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;

if (sched_clock_stable && !check_tsc_disabled()) {
- userpg->cap_usr_time_zero = 1;
+ userpg->cap_usr_time_zero_used = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
}
}
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 40a1fb8..e3514d1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,10 +380,13 @@ struct perf_event_mmap_page {
union {
__u64 capabilities;
struct {
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_usr_time_zero : 1,
- cap_____res : 61;
+ __u64 cap_bit0 : 1, /* Deprecated, always zero, see commit 860f085b74e9 */
+ cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
+
+ cap_usr_rdpmc_available : 1, /* The RDPMC instruction can be used to read counts */
+ cap_usr_time_used : 1, /* The time_* fields are uses */
+ cap_usr_time_zero_used : 1, /* The time_zero field is used */
+ cap_____res : 59;
};
};

@@ -442,12 +445,13 @@ struct perf_event_mmap_page {
* ((rem * time_mult) >> time_shift);
*/
__u64 time_zero;
+ __u32 size; /* Header size up to __reserved[] fields. */

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[119]; /* align to 1k */
+ __u8 __reserved[118*8+4]; /* align to 1k. */

/*
* Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dd236b6..cb4238e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
*running = ctx_time - event->tstamp_running;
}

+static void perf_event_init_userpage(struct perf_event *event)
+{
+ struct perf_event_mmap_page *userpg;
+ struct ring_buffer *rb;
+
+ rcu_read_lock();
+ rb = rcu_dereference(event->rb);
+ if (!rb)
+ goto unlock;
+
+ userpg = rb->user_page;
+
+ /* Allow new userspace to detect that bit 0 is deprecated */
+ userpg->cap_bit0_is_deprecated = 1;
+ userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+
+unlock:
+ rcu_read_unlock();
+}
+
void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
}
@@ -4044,6 +4064,7 @@ again:
ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

+ perf_event_init_userpage(event);
perf_event_update_userpage(event);

unlock:

2013-09-19 10:36:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI

On Thu, Sep 19, 2013 at 12:28:18PM +0200, Ingo Molnar wrote:

You really don't like version fields do you ;-)

> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 8355c84..3ab624c 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1883,9 +1883,9 @@ static struct pmu pmu = {
>
> void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> {
> - userpg->cap_usr_time = 0;
> - userpg->cap_usr_time_zero = 0;
> - userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
> + userpg->cap_usr_time_used = 0;
> + userpg->cap_usr_time_zero_used = 0;
> + userpg->cap_usr_rdpmc_available = x86_pmu.attr_rdpmc;
> userpg->pmc_width = x86_pmu.cntval_bits;
>
> if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> @@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
> return;
>
> - userpg->cap_usr_time = 1;
> + userpg->cap_usr_time_used = 1;
> userpg->time_mult = this_cpu_read(cyc2ns);
> userpg->time_shift = CYC2NS_SCALE_FACTOR;
> userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
>
> if (sched_clock_stable && !check_tsc_disabled()) {
> - userpg->cap_usr_time_zero = 1;
> + userpg->cap_usr_time_zero_used = 1;
> userpg->time_zero = this_cpu_read(cyc2ns_offset);
> }
> }
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 40a1fb8..e3514d1 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -380,10 +380,13 @@ struct perf_event_mmap_page {
> union {
> __u64 capabilities;
> struct {
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> - cap_usr_time_zero : 1,
> - cap_____res : 61;
> + __u64 cap_bit0 : 1, /* Deprecated, always zero, see commit 860f085b74e9 */
> + cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
> +
> + cap_usr_rdpmc_available : 1, /* The RDPMC instruction can be used to read counts */
> + cap_usr_time_used : 1, /* The time_* fields are uses */
> + cap_usr_time_zero_used : 1, /* The time_zero field is used */
> + cap_____res : 59;
> };
> };

Would it make sense to do something like s/cap_usr/cap_user/ and drop
the _available, _used postfixes? It results in different names but
avoids these terribly long ones.

2013-09-19 10:41:03

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH, v3] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI


* Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 19, 2013 at 12:28:18PM +0200, Ingo Molnar wrote:
>
> You really don't like version fields do you ;-)

Indeed they are a horrible concept.

> Would it make sense to do something like s/cap_usr/cap_user/ and drop
> the _available, _used postfixes? It results in different names but
> avoids these terribly long ones.

Absolutely! Find updated patch below.

Thanks,

Ingo

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8355c84..a9c606b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,9 +1883,9 @@ static struct pmu pmu = {

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
- userpg->cap_usr_time = 0;
- userpg->cap_usr_time_zero = 0;
- userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
+ userpg->cap_user_time = 0;
+ userpg->cap_user_time_zero = 0;
+ userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
@@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
return;

- userpg->cap_usr_time = 1;
+ userpg->cap_user_time = 1;
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;

if (sched_clock_stable && !check_tsc_disabled()) {
- userpg->cap_usr_time_zero = 1;
+ userpg->cap_user_time_zero = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
}
}
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 40a1fb8..dd4c903 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -380,10 +380,13 @@ struct perf_event_mmap_page {
union {
__u64 capabilities;
struct {
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_usr_time_zero : 1,
- cap_____res : 61;
+ __u64 cap_bit0 : 1, /* Always 0, deprecated, see commit 860f085b74e9 */
+ cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
+
+ cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
+ cap_user_time : 1, /* The time_* fields are used */
+ cap_user_time_zero : 1, /* The time_zero field is used */
+ cap_____res : 59;
};
};

@@ -442,12 +445,13 @@ struct perf_event_mmap_page {
* ((rem * time_mult) >> time_shift);
*/
__u64 time_zero;
+ __u32 size; /* Header size up to __reserved[] fields. */

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[119]; /* align to 1k */
+ __u8 __reserved[118*8+4]; /* align to 1k. */

/*
* Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dd236b6..cb4238e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
*running = ctx_time - event->tstamp_running;
}

+static void perf_event_init_userpage(struct perf_event *event)
+{
+ struct perf_event_mmap_page *userpg;
+ struct ring_buffer *rb;
+
+ rcu_read_lock();
+ rb = rcu_dereference(event->rb);
+ if (!rb)
+ goto unlock;
+
+ userpg = rb->user_page;
+
+ /* Allow new userspace to detect that bit 0 is deprecated */
+ userpg->cap_bit0_is_deprecated = 1;
+ userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+
+unlock:
+ rcu_read_unlock();
+}
+
void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
}
@@ -4044,6 +4064,7 @@ again:
ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

+ perf_event_init_userpage(event);
perf_event_update_userpage(event);

unlock:

2013-09-19 11:12:19

by Adrian Hunter

[permalink] [raw]
Subject: Re: [PATCH, v3] perf: Always set bit 0 in the capabilities field of 'struct perf_event_mmap_page' to 0, to maintain the ABI

On 19/09/13 13:40, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Thu, Sep 19, 2013 at 12:28:18PM +0200, Ingo Molnar wrote:
>>
>> You really don't like version fields do you ;-)
>
> Indeed they are a horrible concept.
>
>> Would it make sense to do something like s/cap_usr/cap_user/ and drop
>> the _available, _used postfixes? It results in different names but
>> avoids these terribly long ones.
>
> Absolutely! Find updated patch below.
>
> Thanks,
>
> Ingo
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 8355c84..a9c606b 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1883,9 +1883,9 @@ static struct pmu pmu = {
>
> void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> {
> - userpg->cap_usr_time = 0;
> - userpg->cap_usr_time_zero = 0;
> - userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
> + userpg->cap_user_time = 0;
> + userpg->cap_user_time_zero = 0;
> + userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
> userpg->pmc_width = x86_pmu.cntval_bits;
>
> if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> @@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
> return;
>
> - userpg->cap_usr_time = 1;
> + userpg->cap_user_time = 1;
> userpg->time_mult = this_cpu_read(cyc2ns);
> userpg->time_shift = CYC2NS_SCALE_FACTOR;
> userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
>
> if (sched_clock_stable && !check_tsc_disabled()) {
> - userpg->cap_usr_time_zero = 1;
> + userpg->cap_user_time_zero = 1;
> userpg->time_zero = this_cpu_read(cyc2ns_offset);
> }
> }
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 40a1fb8..dd4c903 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -380,10 +380,13 @@ struct perf_event_mmap_page {
> union {
> __u64 capabilities;
> struct {
> - __u64 cap_usr_time : 1,
> - cap_usr_rdpmc : 1,
> - cap_usr_time_zero : 1,
> - cap_____res : 61;
> + __u64 cap_bit0 : 1, /* Always 0, deprecated, see commit 860f085b74e9 */
> + cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
> +
> + cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
> + cap_user_time : 1, /* The time_* fields are used */
> + cap_user_time_zero : 1, /* The time_zero field is used */
> + cap_____res : 59;
> };
> };
>
> @@ -442,12 +445,13 @@ struct perf_event_mmap_page {
> * ((rem * time_mult) >> time_shift);
> */
> __u64 time_zero;
> + __u32 size; /* Header size up to __reserved[] fields. */
>
> /*
> * Hole for extension of the self monitor capabilities
> */
>
> - __u64 __reserved[119]; /* align to 1k */
> + __u8 __reserved[118*8+4]; /* align to 1k. */
>
> /*
> * Control data for the mmap() data buffer.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dd236b6..cb4238e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3660,6 +3660,26 @@ static void calc_timer_values(struct perf_event *event,
> *running = ctx_time - event->tstamp_running;
> }
>
> +static void perf_event_init_userpage(struct perf_event *event)
> +{
> + struct perf_event_mmap_page *userpg;
> + struct ring_buffer *rb;
> +
> + rcu_read_lock();
> + rb = rcu_dereference(event->rb);
> + if (!rb)
> + goto unlock;
> +
> + userpg = rb->user_page;
> +
> + /* Allow new userspace to detect that bit 0 is deprecated */
> + userpg->cap_bit0_is_deprecated = 1;
> + userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
> +
> +unlock:
> + rcu_read_unlock();
> +}
> +
> void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> {
> }
> @@ -4044,6 +4064,7 @@ again:
> ring_buffer_attach(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> + perf_event_init_userpage(event);
> perf_event_update_userpage(event);
>
> unlock:
>
>

Please consider adding:


diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index 9570c2b..b2519e4 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -32,7 +32,7 @@ u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc)
int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
struct perf_tsc_conversion *tc)
{
- bool cap_usr_time_zero;
+ bool cap_user_time_zero;
u32 seq;
int i = 0;

@@ -42,7 +42,7 @@ int perf_read_tsc_conversion(const struct
perf_event_mmap_page *pc,
tc->time_mult = pc->time_mult;
tc->time_shift = pc->time_shift;
tc->time_zero = pc->time_zero;
- cap_usr_time_zero = pc->cap_usr_time_zero;
+ cap_user_time_zero = pc->cap_user_time_zero;
rmb();
if (pc->lock == seq && !(seq & 1))
break;
@@ -52,7 +52,7 @@ int perf_read_tsc_conversion(const struct
perf_event_mmap_page *pc,
}
}

- if (!cap_usr_time_zero)
+ if (!cap_user_time_zero)
return -EOPNOTSUPP;

return 0;

2013-09-19 11:42:59

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH, v4] perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'


* Adrian Hunter <[email protected]> wrote:

> > struct {
> > - __u64 cap_usr_time : 1,
> > - cap_usr_rdpmc : 1,
> > - cap_usr_time_zero : 1,
> > - cap_____res : 61;
> > + __u64 cap_bit0 : 1, /* Always 0, deprecated, see commit 860f085b74e9 */
> > + cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
> > +
> > + cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
> > + cap_user_time : 1, /* The time_* fields are used */
> > + cap_user_time_zero : 1, /* The time_zero field is used */
> > + cap_____res : 59;
> > };

> Please consider adding:

indeed - added. I also build-tested it all and added a changelog - see the
updated patch below.

Thanks,

Ingo

--------------------->
Subject: perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
From: Peter Zijlstra <[email protected]>
Date: Thu, 19 Sep 2013 10:16:42 +0200

Solve the problems around the broken definition of perf_event_mmap_page::
cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
fixed by:

860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

The problem with the fix (merged in v3.12-rc1 and not yet released
officially), noticed by Vince Weaver is that the new behavior is
not detectable by new user-space, and that due to the reuse of the
field names it's easy to mis-compile a binary if old headers are used
on a new kernel or new headers are used on an old kernel.

To solve all that make this change explicit, detectable and self-contained,
by iterating the ABI the following way:

- Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
confuse old user-space binaries. RDPMC will be marked as unavailable
to old binaries but that's within the ABI, this is a capability bit.

- Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
libraries can reliably detect that bit 0 is deprecated and perma-zero
without having to check the kernel version.

- Use bits 2, 3, 4 for the newly defined, correct functionality:

cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
cap_user_time : 1, /* The time_* fields are used */
cap_user_time_zero : 1, /* The time_zero field is used */

- Rename all the bitfield names in perf_event.h to be different from the
old names, to make sure it's not possible to mis-compile it
accidentally with old assumptions.

The 'size' field can then be used in the future to add new fields and it
will act as a natural ABI version indicator as well.

Also adjust tools/perf/ userspace for the new definitions, noticed by
Adrian Hunter.

Reported-by: Vince Weaver <[email protected]>
Also-Fixed-by: Adrian Hunter <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Restructured it, using Peter's original patches. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 10 +++++-----
include/uapi/linux/perf_event.h | 14 +++++++++-----
kernel/events/core.c | 21 +++++++++++++++++++++
tools/perf/arch/x86/util/tsc.c | 6 +++---
4 files changed, 38 insertions(+), 13 deletions(-)

Index: tip/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- tip.orig/arch/x86/kernel/cpu/perf_event.c
+++ tip/arch/x86/kernel/cpu/perf_event.c
@@ -1883,9 +1883,9 @@ static struct pmu pmu = {

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
- userpg->cap_usr_time = 0;
- userpg->cap_usr_time_zero = 0;
- userpg->cap_usr_rdpmc = x86_pmu.attr_rdpmc;
+ userpg->cap_user_time = 0;
+ userpg->cap_user_time_zero = 0;
+ userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;

if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
@@ -1894,13 +1894,13 @@ void arch_perf_update_userpage(struct pe
if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
return;

- userpg->cap_usr_time = 1;
+ userpg->cap_user_time = 1;
userpg->time_mult = this_cpu_read(cyc2ns);
userpg->time_shift = CYC2NS_SCALE_FACTOR;
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;

if (sched_clock_stable && !check_tsc_disabled()) {
- userpg->cap_usr_time_zero = 1;
+ userpg->cap_user_time_zero = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
}
}
Index: tip/include/uapi/linux/perf_event.h
===================================================================
--- tip.orig/include/uapi/linux/perf_event.h
+++ tip/include/uapi/linux/perf_event.h
@@ -380,10 +380,13 @@ struct perf_event_mmap_page {
union {
__u64 capabilities;
struct {
- __u64 cap_usr_time : 1,
- cap_usr_rdpmc : 1,
- cap_usr_time_zero : 1,
- cap_____res : 61;
+ __u64 cap_bit0 : 1, /* Always 0, deprecated, see commit 860f085b74e9 */
+ cap_bit0_is_deprecated : 1, /* Always 1, signals that bit 0 is zero */
+
+ cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
+ cap_user_time : 1, /* The time_* fields are used */
+ cap_user_time_zero : 1, /* The time_zero field is used */
+ cap_____res : 59;
};
};

@@ -442,12 +445,13 @@ struct perf_event_mmap_page {
* ((rem * time_mult) >> time_shift);
*/
__u64 time_zero;
+ __u32 size; /* Header size up to __reserved[] fields. */

/*
* Hole for extension of the self monitor capabilities
*/

- __u64 __reserved[119]; /* align to 1k */
+ __u8 __reserved[118*8+4]; /* align to 1k. */

/*
* Control data for the mmap() data buffer.
Index: tip/kernel/events/core.c
===================================================================
--- tip.orig/kernel/events/core.c
+++ tip/kernel/events/core.c
@@ -3660,6 +3660,26 @@ static void calc_timer_values(struct per
*running = ctx_time - event->tstamp_running;
}

+static void perf_event_init_userpage(struct perf_event *event)
+{
+ struct perf_event_mmap_page *userpg;
+ struct ring_buffer *rb;
+
+ rcu_read_lock();
+ rb = rcu_dereference(event->rb);
+ if (!rb)
+ goto unlock;
+
+ userpg = rb->user_page;
+
+ /* Allow new userspace to detect that bit 0 is deprecated */
+ userpg->cap_bit0_is_deprecated = 1;
+ userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+
+unlock:
+ rcu_read_unlock();
+}
+
void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
{
}
@@ -4044,6 +4064,7 @@ again:
ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

+ perf_event_init_userpage(event);
perf_event_update_userpage(event);

unlock:
Index: tip/tools/perf/arch/x86/util/tsc.c
===================================================================
--- tip.orig/tools/perf/arch/x86/util/tsc.c
+++ tip/tools/perf/arch/x86/util/tsc.c
@@ -32,7 +32,7 @@ u64 tsc_to_perf_time(u64 cyc, struct per
int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
struct perf_tsc_conversion *tc)
{
- bool cap_usr_time_zero;
+ bool cap_user_time_zero;
u32 seq;
int i = 0;

@@ -42,7 +42,7 @@ int perf_read_tsc_conversion(const struc
tc->time_mult = pc->time_mult;
tc->time_shift = pc->time_shift;
tc->time_zero = pc->time_zero;
- cap_usr_time_zero = pc->cap_usr_time_zero;
+ cap_user_time_zero = pc->cap_user_time_zero;
rmb();
if (pc->lock == seq && !(seq & 1))
break;
@@ -52,7 +52,7 @@ int perf_read_tsc_conversion(const struc
}
}

- if (!cap_usr_time_zero)
+ if (!cap_user_time_zero)
return -EOPNOTSUPP;

return 0;

2013-09-19 17:39:11

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH, v4] perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'

On Thu, 19 Sep 2013, Ingo Molnar wrote:

> To solve all that make this change explicit, detectable and self-contained,
> by iterating the ABI the following way:
>
> - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
> confuse old user-space binaries. RDPMC will be marked as unavailable
> to old binaries but that's within the ABI, this is a capability bit.
>
> - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
> libraries can reliably detect that bit 0 is deprecated and perma-zero
> without having to check the kernel version.
>
> - Use bits 2, 3, 4 for the newly defined, correct functionality:
>
> cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
> cap_user_time : 1, /* The time_* fields are used */
> cap_user_time_zero : 1, /* The time_zero field is used */
>

So let me think through the possible combinations:

OLD-KERNEL OLD-HEADER
+ cap_usr_rdpmc and cap_usr_time map to the same bit
+ In general for kernels from 3.4 to 3.11 the bit will be set
for all x86

NEW-KERNEL OLD-HEADER
+ cap_usr_rdpmc (cap_bit0) and cap_usr_time (cap_bit0) both are 0
+ code detecting cap_usr_rdpmc will probably fall back to non-rdpmc
even through rdpmc code is probably there and functioning

OLD-KERNEL NEW-HEADER
+ cap_user_rdpmc and cap_user_time both set to 0
+ cap_bit0_is_deprecated can be read; if it is 0 you can
read cap_bit0, and if it is 1, assume rdpmc is available
with the same likelyhood you could with the old header

NEW-KERNEL NEW-HEADER
+ Use cap_user_rdpmc and cap_user_time and everything is OK


> - Rename all the bitfield names in perf_event.h to be different from the
> old names, to make sure it's not possible to mis-compile it
> accidentally with old assumptions.

Well this breaks that API, though I guess there are no guarantees there.
I guess that's intentional since it will force
users to fix their code, but a pain if you aren't expecting it and
suddenly your project doesn't compile anymore after a kernel-headers
update. Most of my code carries around its own perf_event.h so I guess
I'll be unaffected.

In any case this seems to be about as reasonable a solution to this
problem as we can get.

Vince

2013-09-20 07:44:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH, v4] perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'


* Vince Weaver <[email protected]> wrote:

> On Thu, 19 Sep 2013, Ingo Molnar wrote:
>
> > To solve all that make this change explicit, detectable and self-contained,
> > by iterating the ABI the following way:
> >
> > - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
> > confuse old user-space binaries. RDPMC will be marked as unavailable
> > to old binaries but that's within the ABI, this is a capability bit.
> >
> > - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
> > libraries can reliably detect that bit 0 is deprecated and perma-zero
> > without having to check the kernel version.
> >
> > - Use bits 2, 3, 4 for the newly defined, correct functionality:
> >
> > cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
> > cap_user_time : 1, /* The time_* fields are used */
> > cap_user_time_zero : 1, /* The time_zero field is used */
> >
>
> So let me think through the possible combinations:
>
> OLD-KERNEL OLD-HEADER
> + cap_usr_rdpmc and cap_usr_time map to the same bit
> + In general for kernels from 3.4 to 3.11 the bit will be set
> for all x86
>
> NEW-KERNEL OLD-HEADER
> + cap_usr_rdpmc (cap_bit0) and cap_usr_time (cap_bit0) both are 0
> + code detecting cap_usr_rdpmc will probably fall back to non-rdpmc
> even through rdpmc code is probably there and functioning
>
> OLD-KERNEL NEW-HEADER
> + cap_user_rdpmc and cap_user_time both set to 0
> + cap_bit0_is_deprecated can be read; if it is 0 you can
> read cap_bit0, and if it is 1, assume rdpmc is available
> with the same likelyhood you could with the old header

Yes - this is the enabling vector to support old kernels (to the extent
it's possible), in newly built libraries - without having to test for some
explicit kernel version in the quirk code. (which is really a fragile
concept when features/fixes get backported.)

> NEW-KERNEL NEW-HEADER
> + Use cap_user_rdpmc and cap_user_time and everything is OK

Yes.

> > - Rename all the bitfield names in perf_event.h to be different from the
> > old names, to make sure it's not possible to mis-compile it
> > accidentally with old assumptions.
>
> Well this breaks that API, though I guess there are no guarantees there.

Not just that there are no guarantees, but without the rename old code
with new headers could also result in hard to debug "it seems to work but
is really broken" code.

> I guess that's intentional since it will force users to fix their code,
> but a pain if you aren't expecting it and suddenly your project doesn't
> compile anymore after a kernel-headers update. Most of my code carries
> around its own perf_event.h so I guess I'll be unaffected.

I'd expect mainly libraries to ever run into the build failure, not end
users.

> In any case this seems to be about as reasonable a solution to this
> problem as we can get.

Great, thanks for the review!

Ingo