hi,
adding a sysfs attribute to export the maximum allowed value
for the perf_event_attr::precise_ip field.
Adding functionality for the simple 'p' modifier and the 'precise' term
to get the maximum allowed value for the perf_event_attr::precise_ip
field.
The following events:
-e 'cycles:p'
-e 'cpu/cycles,precise/'
get the maximum allowed value for the perf_event_attr::precise_ip field.
The following events:
-e 'cycles:pp'
-e 'cpu/cycles,precise=2/'
get perf_event_attr::precise_ip == 2.
If precise is disabled completely (sysfs precise == 0), we display a
warning and continue with a standard non-PEBS event.
If the precise attribute is not supported, '1' is used as the current default.
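In short, the resolution logic ends up as below (a simplified sketch of the
precise_default() helper added later in this series, not the literal code):

    static int precise_default(void)
    {
            /* /sys/devices/cpu/precise, or -1 if the attribute is missing */
            int precise = perf_precise__get();

            if (precise == -1)      /* no sysfs support, keep the old default */
                    precise = 1;

            if (precise == 0)       /* PEBS disabled, warn and stay non-precise */
                    pr_warning("warning: no precise support, using non-precise event(s)\n");

            return precise;
    }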
Adding an automated test to verify that a precise event can be properly
created (if sysfs precise is supported), and customizing the parsing tests
with the precise modifier.
Updating Documentation/ABI with the 'precise' attribute and
also the previously forgotten 'rdpmc' attribute.
thanks,
jirka
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
Andi Kleen (1):
perf tools: Add a precise event qualifier
Jiri Olsa (8):
perf x86: Add precise sysfs cpu pmu attribute
perf tools: Add precise object to interface sysfs precise
perf tests: Add precise event automated test
perf tools: Set maximum precise value for event 'p' modifier
perf tools: Set maximum precise value for 'precise' term
perf tests: Add automated precise term test
perf: Document the ABI for 'precise' sysfs attribute
perf: Document the ABI for 'rdpmc' sysfs attribute
Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise | 10 ++++
Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc | 8 +++
arch/x86/kernel/cpu/perf_event.c | 34 +++++++++----
tools/perf/Makefile | 2 +
tools/perf/tests/builtin-test.c | 4 ++
tools/perf/tests/parse-events.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
tools/perf/tests/precise.c | 49 ++++++++++++++++++
tools/perf/tests/tests.h | 1 +
tools/perf/util/parse-events.c | 69 ++++++++++++++++++++++++-
tools/perf/util/parse-events.h | 3 ++
tools/perf/util/parse-events.l | 1 +
tools/perf/util/parse-events.y | 2 +-
tools/perf/util/precise.c | 44 ++++++++++++++++
tools/perf/util/util.h | 3 ++
14 files changed, 379 insertions(+), 21 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc
create mode 100644 tools/perf/tests/precise.c
create mode 100644 tools/perf/util/precise.c
Adding a precise util object to get the maximum value for
perf_event_attr::precise_ip. The value is exported
via the sysfs file '/sys/devices/cpu/precise'.
The interface is:
int perf_precise__get(void)
Returns:
the maximum value allowed for perf_event_attr::precise_ip, or
-1 if we failed to read the sysfs attribute or the
sysfs attribute is not found (i.e. not supported)
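A minimal usage sketch (a hypothetical caller, not part of this patch, just to
illustrate the calling convention; it assumes the usual perf tool headers):

    #include <linux/perf_event.h>
    #include "util.h"

    static void setup_precise(struct perf_event_attr *attr)
    {
            int max = perf_precise__get();

            /* Fall back to a non-precise event if sysfs reports 0 or is missing. */
            attr->precise_ip = (max > 0) ? max : 0;
    }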
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/Makefile | 1 +
tools/perf/util/precise.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/util.h | 3 +++
3 files changed, 48 insertions(+)
create mode 100644 tools/perf/util/precise.c
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 55b42b2..477e7bf 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -361,6 +361,7 @@ LIB_OBJS += $(OUTPUT)util/rblist.o
LIB_OBJS += $(OUTPUT)util/intlist.o
LIB_OBJS += $(OUTPUT)util/vdso.o
LIB_OBJS += $(OUTPUT)util/stat.o
+LIB_OBJS += $(OUTPUT)util/precise.o
LIB_OBJS += $(OUTPUT)ui/setup.o
LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/util/precise.c b/tools/perf/util/precise.c
new file mode 100644
index 0000000..70fcba0
--- /dev/null
+++ b/tools/perf/util/precise.c
@@ -0,0 +1,44 @@
+
+#include <linux/kernel.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <stdio.h>
+#include "sysfs.h"
+#include "util.h"
+
+/*
+ * Returns maximum value allowed for
+ * perf_event_attr::precise_ip, and:
+ * -1 if we failed to read the sysfs attribute or
+ * sysfs attribute is not found (supported)
+ */
+int perf_precise__get(void)
+{
+ static int precise = -1;
+ struct stat st;
+ char path[PATH_MAX];
+
+ if (precise != -1)
+ return precise;
+
+ scnprintf(path, PATH_MAX, "%s/devices/cpu/precise",
+ sysfs_find_mountpoint());
+
+ if (!lstat(path, &st)) {
+ FILE *file;
+
+ file = fopen(path, "r");
+ if (!file)
+ return -1;
+
+ if (1 != fscanf(file, "%d", &precise))
+ pr_debug("failed to read precise info\n");
+
+ fclose(file);
+ } else
+ /* Return -1 if there's no sysfs precise support. */
+ return -1;
+
+ return precise;
+}
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 7a484c9..deeeb54 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -276,4 +276,7 @@ extern unsigned int page_size;
struct winsize;
void get_term_dimensions(struct winsize *ws);
+
+int perf_precise__get(void);
+
#endif /* GIT_COMPAT_UTIL_H */
--
1.7.11.7
If a single 'p' modifier is specified for an event, set the
system precise value for perf_event_attr::precise_ip.
If more than a single 'p' is specified, keep the intended
value.
If precise is not supported by the system, a warning is displayed.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/util/parse-events.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 2e3ef10..6be4599 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -533,6 +533,26 @@ int parse_events_add_breakpoint(struct list_head **list, int *idx,
return add_event(list, idx, &attr, NULL);
}
+static int precise_default(void)
+{
+ int precise = perf_precise__get();
+ static int warned;
+
+ /*
+ * Precise info not supported by by this kernel,
+ * set 1 as the precise value.
+ */
+ if (precise == -1)
+ precise = 1;
+
+ /* PEBS is not supported here, display warning. */
+ if (precise == 0 && !warned++)
+ pr_warning("warning: no precise support, "
+ "using non-precise event(s)\n");
+
+ return precise;
+}
+
static int config_term(struct perf_event_attr *attr,
struct parse_events_term *term)
{
@@ -701,7 +721,12 @@ static int get_event_modifier(struct event_modifier *mod, char *str,
int eh = evsel ? evsel->attr.exclude_hv : 0;
int eH = evsel ? evsel->attr.exclude_host : 0;
int eG = evsel ? evsel->attr.exclude_guest : 0;
- int precise = evsel ? evsel->attr.precise_ip : 0;
+ /*
+ * No previous state setup for precise,
+ * let's start from 0 and get incremented
+ * for both (evsel?) cases.
+ */
+ int precise = 0;
int sample_read = 0;
int exclude = eu | ek | eh;
@@ -810,7 +835,18 @@ int parse_events__modifier_event(struct list_head *list, char *str, bool add)
evsel->attr.exclude_user = mod.eu;
evsel->attr.exclude_kernel = mod.ek;
evsel->attr.exclude_hv = mod.eh;
- evsel->attr.precise_ip = mod.precise;
+
+ /*
+ * Let the group precise modifier (if any) overwrite the
+ * event modifier.
+ */
+ if (mod.precise) {
+ if (mod.precise > 1)
+ evsel->attr.precise_ip = mod.precise;
+ else
+ evsel->attr.precise_ip = precise_default();
+ }
+
evsel->attr.exclude_host = mod.eH;
evsel->attr.exclude_guest = mod.eG;
evsel->exclude_GH = mod.exclude_GH;
--
1.7.11.7
Adding ABI documentation for the newly added 'precise' sysfs
attribute. It's added under the testing section.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
.../ABI/testing/sysfs-bus-event_source-devices-cpu-precise | 10 ++++++++++
1 file changed, 10 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
new file mode 100644
index 0000000..f3ed378
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
@@ -0,0 +1,10 @@
+Where: /sys/bus/event_source/devices/cpu/precise
+Date: January 2013
+Kernel Version: 3.7
+Contact: Jiri Olsa <[email protected]>
+ Linux kernel mailing list <[email protected]>
+Description:
+ X86 specific attribute
+
+ Attribute to specify the maximum allowed value
+ for perf_event_attr:precise_ip field.
--
1.7.11.7
Adding a sysfs 'precise' attribute for the cpu device
(/sys/devices/cpu/precise) to show the maximum
value for the perf event precise_ip attribute.
This will be used to set a proper precise_ip value when
configuring an event and to automatically test
precise support to some extent.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5ed7a4c..d38c9ea 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -377,19 +377,26 @@ static inline int precise_br_compat(struct perf_event *event)
return m == b;
}
-int x86_pmu_hw_config(struct perf_event *event)
+static int x86_get_precise(void)
{
- if (event->attr.precise_ip) {
- int precise = 0;
+ int precise = 0;
- /* Support for constant skid */
- if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) {
+ /* Support for constant skid */
+ if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) {
+ precise++;
+
+ /* Support for IP fixup */
+ if (x86_pmu.lbr_nr)
precise++;
+ }
- /* Support for IP fixup */
- if (x86_pmu.lbr_nr)
- precise++;
- }
+ return precise;
+}
+
+int x86_pmu_hw_config(struct perf_event *event)
+{
+ if (event->attr.precise_ip) {
+ int precise = x86_get_precise();
if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
@@ -1734,6 +1741,13 @@ static int x86_pmu_event_init(struct perf_event *event)
return err;
}
+static ssize_t get_attr_precise(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, 10, "%d\n", x86_get_precise());
+}
+
static int x86_pmu_event_idx(struct perf_event *event)
{
int idx = event->hw.idx;
@@ -1786,9 +1800,11 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
}
static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(precise, S_IRUGO, get_attr_precise, NULL);
static struct attribute *x86_pmu_attrs[] = {
&dev_attr_rdpmc.attr,
+ &dev_attr_precise.attr,
NULL,
};
--
1.7.11.7
Adding an automated test for the 'precise' term in the following events:
'cpu/config=1,precise/'
'cpu/config=2,precise/p'
'cpu/config=3,precise/u'
'instructions:p'
'instructions:pp'
to check proper values of precise_ip driven by the sysfs
precise attribute.
Also changing other precise-related test cases to follow
the new precise sysfs rules.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/tests/parse-events.c | 170 +++++++++++++++++++++++++++++++++++++---
1 file changed, 161 insertions(+), 9 deletions(-)
diff --git a/tools/perf/tests/parse-events.c b/tools/perf/tests/parse-events.c
index f75328c..3b2e21c 100644
--- a/tools/perf/tests/parse-events.c
+++ b/tools/perf/tests/parse-events.c
@@ -5,6 +5,7 @@
#include "sysfs.h"
#include <lk/debugfs.h>
#include "tests.h"
+#include "util.h"
#include <linux/hw_breakpoint.h>
#define TEST_ASSERT_VAL(text, cond) \
@@ -18,6 +19,21 @@ do { \
#define PERF_TP_SAMPLE_TYPE (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME | \
PERF_SAMPLE_CPU | PERF_SAMPLE_PERIOD)
+static int get_precise(void)
+{
+ int precise = perf_precise__get();
+
+ /*
+ * The sysfs precise attribute is not supported,
+ * fill in the 1 as default, check
+ * util/parse-events.c::precise_default function
+ */
+ if (precise == -1)
+ precise = 1;
+
+ return precise;
+}
+
static int test__checkevent_tracepoint(struct perf_evlist *evlist)
{
struct perf_evsel *evsel = perf_evlist__first(evlist);
@@ -228,7 +244,8 @@ static int test__checkevent_raw_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
return test__checkevent_raw(evlist);
}
@@ -240,7 +257,8 @@ static int test__checkevent_numeric_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
return test__checkevent_numeric(evlist);
}
@@ -296,7 +314,8 @@ static int test__checkevent_genhw_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
return test__checkevent_genhw(evlist);
}
@@ -337,7 +356,8 @@ static int test__checkevent_breakpoint_r_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
TEST_ASSERT_VAL("wrong name",
!strcmp(perf_evsel__name(evsel), "mem:0:r:hp"));
@@ -351,7 +371,8 @@ static int test__checkevent_breakpoint_w_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
TEST_ASSERT_VAL("wrong name",
!strcmp(perf_evsel__name(evsel), "mem:0:w:up"));
@@ -365,7 +386,8 @@ static int test__checkevent_breakpoint_rw_modifier(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
TEST_ASSERT_VAL("wrong name",
!strcmp(perf_evsel__name(evsel), "mem:0:rw:kp"));
@@ -421,7 +443,8 @@ static int test__checkevent_list(struct perf_evlist *evlist)
TEST_ASSERT_VAL("wrong exclude_user", evsel->attr.exclude_user);
TEST_ASSERT_VAL("wrong exclude_kernel", evsel->attr.exclude_kernel);
TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
return 0;
}
@@ -447,6 +470,69 @@ static int test__checkevent_pmu_name(struct perf_evlist *evlist)
return 0;
}
+static int test__checkevent_pmu_precise(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel = perf_evlist__first(evlist);
+
+ /* cpu/cycles,precise/ */
+ TEST_ASSERT_VAL("wrong number of entries", 1 == evlist->nr_entries);
+ TEST_ASSERT_VAL("wrong config", 1 == evsel->attr.config);
+ TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
+ TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
+ TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
+ TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
+ TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
+ /* precise is driven by sysfs precise attribute */
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
+ TEST_ASSERT_VAL("wrong group name", !evsel->group_name);
+ TEST_ASSERT_VAL("wrong leader", perf_evsel__is_group_leader(evsel));
+
+ return 0;
+}
+
+static int test__checkevent_pmu_precise1(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel = perf_evlist__first(evlist);
+
+ /* cpu/cycles,precise/p */
+ TEST_ASSERT_VAL("wrong number of entries", 1 == evlist->nr_entries);
+ TEST_ASSERT_VAL("wrong config", 2 == evsel->attr.config);
+ TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
+ TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
+ TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
+ TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
+ TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
+ /* precise is driven by 'p' modifier */
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
+ TEST_ASSERT_VAL("wrong group name", !evsel->group_name);
+ TEST_ASSERT_VAL("wrong leader", perf_evsel__is_group_leader(evsel));
+
+ return 0;
+}
+
+static int test__checkevent_pmu_precise2(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel = perf_evlist__first(evlist);
+
+ /* cpu/cycles,precise/u */
+ TEST_ASSERT_VAL("wrong number of entries", 1 == evlist->nr_entries);
+ TEST_ASSERT_VAL("wrong config", 3 == evsel->attr.config);
+ TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
+ TEST_ASSERT_VAL("wrong exclude_kernel", evsel->attr.exclude_kernel);
+ TEST_ASSERT_VAL("wrong exclude_hv", evsel->attr.exclude_hv);
+ TEST_ASSERT_VAL("wrong exclude guest", !evsel->attr.exclude_guest);
+ TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
+ /* precise is driven by sysfs precise attribute */
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
+ TEST_ASSERT_VAL("wrong group name", !evsel->group_name);
+ TEST_ASSERT_VAL("wrong leader", perf_evsel__is_group_leader(evsel));
+
+ return 0;
+}
+
static int test__checkevent_pmu_events(struct perf_evlist *evlist)
{
struct perf_evsel *evsel;
@@ -714,7 +800,8 @@ static int test__group4(struct perf_evlist *evlist __maybe_unused)
/* use of precise requires exclude_guest */
TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip == 1);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
TEST_ASSERT_VAL("wrong group name", !evsel->group_name);
TEST_ASSERT_VAL("wrong leader", perf_evsel__is_group_leader(evsel));
TEST_ASSERT_VAL("wrong nr_members", evsel->nr_members == 2);
@@ -732,7 +819,8 @@ static int test__group4(struct perf_evlist *evlist __maybe_unused)
/* use of precise requires exclude_guest */
TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
- TEST_ASSERT_VAL("wrong precise_ip", evsel->attr.precise_ip == 2);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
TEST_ASSERT_VAL("wrong leader", evsel->leader == leader);
TEST_ASSERT_VAL("wrong group_idx", perf_evsel__group_idx(evsel) == 1);
TEST_ASSERT_VAL("wrong sample_read", !evsel->sample_read);
@@ -1078,6 +1166,50 @@ static int test__leader_sample2(struct perf_evlist *evlist __maybe_unused)
return 0;
}
+static int test__precise_default1(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel, *leader;
+
+ TEST_ASSERT_VAL("wrong number of entries", 1 == evlist->nr_entries);
+
+ /* instructions:p - default precise_ip value */
+ evsel = leader = perf_evlist__first(evlist);
+ TEST_ASSERT_VAL("wrong type", PERF_TYPE_HARDWARE == evsel->attr.type);
+ TEST_ASSERT_VAL("wrong config",
+ PERF_COUNT_HW_INSTRUCTIONS == evsel->attr.config);
+ TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
+ TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
+ TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
+ TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
+ TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == get_precise());
+
+ return 0;
+}
+
+static int test__precise_default2(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel, *leader;
+
+ TEST_ASSERT_VAL("wrong number of entries", 1 == evlist->nr_entries);
+
+ /* instructions:pp - precise_ip == 2 */
+ evsel = leader = perf_evlist__first(evlist);
+ TEST_ASSERT_VAL("wrong type", PERF_TYPE_HARDWARE == evsel->attr.type);
+ TEST_ASSERT_VAL("wrong config",
+ PERF_COUNT_HW_INSTRUCTIONS == evsel->attr.config);
+ TEST_ASSERT_VAL("wrong exclude_user", !evsel->attr.exclude_user);
+ TEST_ASSERT_VAL("wrong exclude_kernel", !evsel->attr.exclude_kernel);
+ TEST_ASSERT_VAL("wrong exclude_hv", !evsel->attr.exclude_hv);
+ TEST_ASSERT_VAL("wrong exclude guest", evsel->attr.exclude_guest);
+ TEST_ASSERT_VAL("wrong exclude host", !evsel->attr.exclude_host);
+ TEST_ASSERT_VAL("wrong precise_ip",
+ evsel->attr.precise_ip == 2);
+
+ return 0;
+}
+
static int count_tracepoints(void)
{
char events_path[PATH_MAX];
@@ -1302,6 +1434,14 @@ static struct evlist_test test__events[] = {
.name = "{instructions,branch-misses}:Su",
.check = test__leader_sample2,
},
+ [40] = {
+ .name = "instructions:p",
+ .check = test__precise_default1,
+ },
+ [41] = {
+ .name = "instructions:pp",
+ .check = test__precise_default2,
+ },
};
static struct evlist_test test__events_pmu[] = {
@@ -1313,6 +1453,18 @@ static struct evlist_test test__events_pmu[] = {
.name = "cpu/config=1,name=krava/u,cpu/config=2/u",
.check = test__checkevent_pmu_name,
},
+ [2] = {
+ .name = "cpu/config=1,precise/",
+ .check = test__checkevent_pmu_precise,
+ },
+ [3] = {
+ .name = "cpu/config=2,precise/p",
+ .check = test__checkevent_pmu_precise1,
+ },
+ [4] = {
+ .name = "cpu/config=3,precise/u",
+ .check = test__checkevent_pmu_precise2,
+ },
};
struct terms_test {
--
1.7.11.7
Currently, if the term is specified without any value, like
-e 'cpu/...,precise,../', the number '1' is assigned as
its default value.
Adding special treatment for the 'precise' term to use the
maximum allowed precise value in such a case, using the
perf_precise__get function.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/util/parse-events.c | 29 ++++++++++++++++++++++++++---
tools/perf/util/parse-events.h | 2 ++
tools/perf/util/parse-events.y | 2 +-
3 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 6be4599..5297c44 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -590,9 +590,14 @@ do { \
break;
case PARSE_EVENTS__TERM_TYPE_PRECISE:
CHECK_TYPE_VAL(NUM);
- if ((unsigned)term->val.num > 2)
- return -EINVAL;
- attr->precise_ip = term->val.num;
+ /* No value specified, try to get it from sysfs. */
+ if (term->val.num == (u64) -1)
+ attr->precise_ip = precise_default();
+ else {
+ if ((unsigned)term->val.num > 2)
+ return -EINVAL;
+ attr->precise_ip = term->val.num;
+ }
break;
default:
return -EINVAL;
@@ -1245,6 +1250,24 @@ int parse_events_term__num(struct parse_events_term **term,
config, NULL, num);
}
+int parse_events_term__num_default(struct parse_events_term **term,
+ int type_term, char *config)
+{
+ /*
+ * If no value is specified for term, we use 1 as default.
+ * The PRECISE term is an exception, because we force special
+ * functionality when there's no value specified for it,
+ * so we need to recognize it.
+ */
+ u64 num = 1;
+
+ if (type_term == PARSE_EVENTS__TERM_TYPE_PRECISE)
+ num = (u64) -1;
+
+ return new_term(term, PARSE_EVENTS__TERM_TYPE_NUM, type_term,
+ config, NULL, num);
+}
+
int parse_events_term__str(struct parse_events_term **term,
int type_term, char *config, char *str)
{
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 13d7c66..6810397 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -75,6 +75,8 @@ struct parse_events_terms {
int parse_events__is_hardcoded_term(struct parse_events_term *term);
int parse_events_term__num(struct parse_events_term **_term,
int type_term, char *config, u64 num);
+int parse_events_term__num_default(struct parse_events_term **term,
+ int type_term, char *config);
int parse_events_term__str(struct parse_events_term **_term,
int type_term, char *config, char *str);
int parse_events_term__sym_hw(struct parse_events_term **term,
diff --git a/tools/perf/util/parse-events.y b/tools/perf/util/parse-events.y
index afc44c1..da77510 100644
--- a/tools/perf/util/parse-events.y
+++ b/tools/perf/util/parse-events.y
@@ -408,7 +408,7 @@ PE_TERM
{
struct parse_events_term *term;
- ABORT_ON(parse_events_term__num(&term, (int)$1, NULL, 1));
+ ABORT_ON(parse_events_term__num_default(&term, (int)$1, NULL));
$$ = term;
}
--
1.7.11.7
Adding ABI documentation for the 'rdpmc' sysfs
attribute. It's added under the testing section.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
.../ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc | 8 ++++++++
1 file changed, 8 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc
new file mode 100644
index 0000000..b4f5cbf
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-rdpmc
@@ -0,0 +1,8 @@
+Where: /sys/bus/event_source/devices/cpu/rdpmc
+Date: January 2013
+Kernel Version: 3.7
+Contact: Linux kernel mailing list <[email protected]>
+Description:
+ X86 specific attribute
+
+ Attribute to enable/disable user space RDPMC instruction.
--
1.7.11.7
From: Andi Kleen <[email protected]>
Add a precise qualifier, like cpu/event=0x3c,precise=1/
This is needed so that the kernel can request enabling PEBS
for TSX events. The parser bails out on any sysfs parse errors,
so this is needed in any case to handle any event on the TSX
perf kernel.
Signed-off-by: Andi Kleen <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
[ nit: changed subject line a bit]
[ nit: fixed util/parse-events.l rebase fuzz ]
Signed-off-by: Jiri Olsa <[email protected]>
---
tools/perf/util/parse-events.c | 6 ++++++
tools/perf/util/parse-events.h | 1 +
tools/perf/util/parse-events.l | 1 +
3 files changed, 8 insertions(+)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 8b712761..2e3ef10 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -568,6 +568,12 @@ do { \
case PARSE_EVENTS__TERM_TYPE_NAME:
CHECK_TYPE_VAL(STR);
break;
+ case PARSE_EVENTS__TERM_TYPE_PRECISE:
+ CHECK_TYPE_VAL(NUM);
+ if ((unsigned)term->val.num > 2)
+ return -EINVAL;
+ attr->precise_ip = term->val.num;
+ break;
default:
return -EINVAL;
}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 8a48593..13d7c66 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -48,6 +48,7 @@ enum {
PARSE_EVENTS__TERM_TYPE_NAME,
PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD,
PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE,
+ PARSE_EVENTS__TERM_TYPE_PRECISE,
};
struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index b36115f..11d4df3 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -167,6 +167,7 @@ config2 { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_CONFIG2); }
name { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_NAME); }
period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD); }
branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE); }
+precise { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_PRECISE); }
, { return ','; }
"/" { BEGIN(INITIAL); return '/'; }
{name_minus} { return str(yyscanner, PE_NAME); }
--
1.7.11.7
The test detects the precise attribute availability and tries
to open a perf event with each allowed precise attribute value.
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Corey Ashford <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/Makefile | 1 +
tools/perf/tests/builtin-test.c | 4 ++++
tools/perf/tests/precise.c | 49 +++++++++++++++++++++++++++++++++++++++++
tools/perf/tests/tests.h | 1 +
4 files changed, 55 insertions(+)
create mode 100644 tools/perf/tests/precise.c
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 477e7bf..f96de83 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -391,6 +391,7 @@ LIB_OBJS += $(OUTPUT)tests/bp_signal.o
LIB_OBJS += $(OUTPUT)tests/bp_signal_overflow.o
LIB_OBJS += $(OUTPUT)tests/task-exit.o
LIB_OBJS += $(OUTPUT)tests/sw-clock.o
+LIB_OBJS += $(OUTPUT)tests/precise.o
BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 0918ada..eda716b 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -94,6 +94,10 @@ static struct test {
.func = test__sw_clock_freq,
},
{
+ .desc = "Test precise event attribute",
+ .func = test__precise,
+ },
+ {
.func = NULL,
},
};
diff --git a/tools/perf/tests/precise.c b/tools/perf/tests/precise.c
new file mode 100644
index 0000000..c548520
--- /dev/null
+++ b/tools/perf/tests/precise.c
@@ -0,0 +1,49 @@
+#include <linux/kernel.h>
+#include <linux/perf_event.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include "perf.h"
+#include "tests.h"
+#include "util.h"
+#include "sysfs.h"
+
+static int event_open_precise(int precise)
+{
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .precise_ip = precise,
+ };
+ int fd;
+
+ pr_debug("open cycles event with precise %d\n", precise);
+
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ if (fd < 0) {
+ pr_debug("failed to open event, syscall returned "
+ "with %d (%s)\n", fd, strerror(errno));
+ return -1;
+ }
+
+ close(fd);
+ return 0;
+
+}
+
+int test__precise(void)
+{
+ int precise = perf_precise__get();
+ int i;
+
+ if (precise <= 0) {
+ pr_debug("no precise info or support\n");
+ return TEST_SKIP;
+ }
+
+ for (i = 1; i <= precise; i++)
+ if (event_open_precise(i))
+ return TEST_FAIL;
+
+ return TEST_OK;
+}
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index dd7feae..ff37f54 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -27,5 +27,6 @@ int test__bp_signal(void);
int test__bp_signal_overflow(void);
int test__task_exit(void);
int test__sw_clock_freq(void);
+int test__precise(void);
#endif /* TESTS_H */
--
1.7.11.7
On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> hi,
> adding sysfs attribute to specify the maximum allowed value
> for perf_event_attr::precise_ip field.
>
> Adding functionality for simple 'p' modifier and 'precise' term
> to get the maximum allowed value for perf_event_attr::precise_ip
> field.
>
You seem to have lost the part explaining why we want this.. :-)
On Thu, May 09, 2013 at 05:07:44PM +0200, Peter Zijlstra wrote:
> On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> > hi,
> > adding sysfs attribute to specify the maximum allowed value
> > for perf_event_attr::precise_ip field.
> >
> > Adding functionality for simple 'p' modifier and 'precise' term
> > to get the maximum allowed value for perf_event_attr::precise_ip
> > field.
> >
>
> You've seem to lost the part explaining why we want this.. :-)
well, initially it was an answer to when we broke precise event
monitoring in the kernel, so I wrote an automated test for it (patches 1,2,3)
https://lkml.org/lkml/2012/12/12/561
having maximum precise enabled with just a single 'p' seemed
like a good idea
the next step would be to enable precise automatically for 'cycles'
(when PEBS is working), as asked for by Ingo
http://marc.info/?l=linux-kernel&m=135929050803963&w=2
jirka
On 5/9/13 7:32 AM, Jiri Olsa wrote:
> +static int precise_default(void)
> +{
> + int precise = perf_precise__get();
> + static int warned;
> +
> + /*
> + * Precise info not supported by by this kernel,
> + * set 1 as the precise value.
> + */
> + if (precise == -1)
> + precise = 1;
> +
> + /* PEBS is not supported here, display warning. */
> + if (precise == 0 && !warned++)
> + pr_warning("warning: no precise support, "
> + "using non-precise event(s)\n");
WARN_ONCE().
David
Hi Jiri,
On Thu, 9 May 2013 15:32:17 +0200, Jiri Olsa wrote:
> Adding precise util object to get maximum value for
> perf_event_attr::precise_ip. The value is exported
> via sysfs file '/sys/devices/cpu/precise'.
>
> The interface is:
> int perf_precise__get(void)
I think Arnaldo wants to use double underscores to separate the name of
the data struct and the method name. So wouldn't it be more appropriate to use
one of perf_precise_get() or perf_get_precise()?
Thanks,
Namhyung
On Thu, 9 May 2013 15:32:19 +0200, Jiri Olsa wrote:
> From: Andi Kleen <[email protected]>
>
> Add a precise qualifier, like cpu/event=0x3c,precise=1/
>
> This is needed so that the kernel can request enabling PEBS
> for TSX events. The parser bails out on any sysfs parse errors,
> so this is needed in any case to handle any event on the TSX
> perf kernel.
[SNIP]
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -568,6 +568,12 @@ do { \
> case PARSE_EVENTS__TERM_TYPE_NAME:
> CHECK_TYPE_VAL(STR);
> break;
> + case PARSE_EVENTS__TERM_TYPE_PRECISE:
> + CHECK_TYPE_VAL(NUM);
> + if ((unsigned)term->val.num > 2)
Shouldn't it be 3 as we technically allow the 'ppp' modifier? Although
there's no cpu that supports precise=3 currently, things can be changed. :)
Thanks,
Namhyung
On Thu, 9 May 2013 15:32:20 +0200, Jiri Olsa wrote:
> If single 'p' modifier is specified for event, set the
> system precise value for perf_events_attr::precise_ip.
>
> If more than a single 'p' is specified keep the intended
> value.
So there's no way to set precise=1 on a system that supports precise=2 using
this syntax, right? (I don't know whether it makes any sense though.)
>
> If precise is not supported by system, warning is disaplyed.
s/disaplyed/displayed/
>
> Signed-off-by: Jiri Olsa <[email protected]>
> Cc: Corey Ashford <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> tools/perf/util/parse-events.c | 40 ++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 38 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 2e3ef10..6be4599 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -533,6 +533,26 @@ int parse_events_add_breakpoint(struct list_head **list, int *idx,
> return add_event(list, idx, &attr, NULL);
> }
>
> +static int precise_default(void)
> +{
> + int precise = perf_precise__get();
> + static int warned;
> +
> + /*
> + * Precise info not supported by by this kernel,
s/by by/by/ :)
> + * set 1 as the precise value.
> + */
> + if (precise == -1)
> + precise = 1;
> +
> + /* PEBS is not supported here, display warning. */
> + if (precise == 0 && !warned++)
> + pr_warning("warning: no precise support, "
> + "using non-precise event(s)\n");
Please put the message on a single line.
Thanks,
Namhyung
On Thu, 9 May 2013 15:32:23 +0200, Jiri Olsa wrote:
> Adding ABI documentation for newly added 'precise' sysfs
> attribute. It's added under the testing section.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> Cc: Corey Ashford <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> .../ABI/testing/sysfs-bus-event_source-devices-cpu-precise | 10 ++++++++++
> 1 file changed, 10 insertions(+)
> create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
> new file mode 100644
> index 0000000..f3ed378
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-cpu-precise
> @@ -0,0 +1,10 @@
> +Where: /sys/bus/event_source/devices/cpu/precise
> +Date: January 2013
> +Kernel Version: 3.7
Looks like a copy-and-paste problem. :)
Thanks,
Namhyung
> +Contact: Jiri Olsa <[email protected]>
> + Linux kernel mailing list <[email protected]>
> +Description:
> + X86 specific attribute
> +
> + Attribute to specify the maximum allowed value
> + for perf_event_attr:precise_ip field.
On Fri, May 10, 2013 at 10:34:49AM +0900, Namhyung Kim wrote:
> Hi Jiri,
>
> On Thu, 9 May 2013 15:32:17 +0200, Jiri Olsa wrote:
> > Adding precise util object to get maximum value for
> > perf_event_attr::precise_ip. The value is exported
> > via sysfs file '/sys/devices/cpu/precise'.
> >
> > The interface is:
> > int perf_precise__get(void)
>
> I think Arnaldo wants to use double underscores to separate the name of
> data struct and method name. So wouldn't it be more appropriate using
> one of perf_precise_get() or perf_get_precise()?
hum right, '__' to separate struct methods.. ok, perf_precise_get it is
thanks,
jirka
On Fri, May 10, 2013 at 10:43:28AM +0900, Namhyung Kim wrote:
> On Thu, 9 May 2013 15:32:19 +0200, Jiri Olsa wrote:
> > From: Andi Kleen <[email protected]>
> >
> > Add a precise qualifier, like cpu/event=0x3c,precise=1/
> >
> > This is needed so that the kernel can request enabling PEBS
> > for TSX events. The parser bails out on any sysfs parse errors,
> > so this is needed in any case to handle any event on the TSX
> > perf kernel.
> [SNIP]
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -568,6 +568,12 @@ do { \
> > case PARSE_EVENTS__TERM_TYPE_NAME:
> > CHECK_TYPE_VAL(STR);
> > break;
> > + case PARSE_EVENTS__TERM_TYPE_PRECISE:
> > + CHECK_TYPE_VAL(NUM);
> > + if ((unsigned)term->val.num > 2)
>
> Shouldn't it be 3 as we technically allow the 'ppp' modifier? Although
> there's no cpu suppports precise=3 currently, things can be changed. :)
ok
jirka
>
> Thanks,
> Namhyung
>
On Thu, May 09, 2013 at 01:43:52PM -0600, David Ahern wrote:
> On 5/9/13 7:32 AM, Jiri Olsa wrote:
> >+static int precise_default(void)
> >+{
> >+ int precise = perf_precise__get();
> >+ static int warned;
> >+
> >+ /*
> >+ * Precise info not supported by by this kernel,
> >+ * set 1 as the precise value.
> >+ */
> >+ if (precise == -1)
> >+ precise = 1;
> >+
> >+ /* PEBS is not supported here, display warning. */
> >+ if (precise == 0 && !warned++)
> >+ pr_warning("warning: no precise support, "
> >+ "using non-precise event(s)\n");
>
> WARN_ONCE().
ah, forgot we have this one ;) thanks
jirka
On Fri, May 10, 2013 at 10:53:41AM +0900, Namhyung Kim wrote:
> On Thu, 9 May 2013 15:32:20 +0200, Jiri Olsa wrote:
> > If single 'p' modifier is specified for event, set the
> > system precise value for perf_events_attr::precise_ip.
> >
> > If more than a single 'p' is specified keep the intended
> > value.
>
> So there's no way to set precise=1 on a system suppports precise=2 using
> this syntax, right? (I don't know whether it makes any sense though.)
right, I abused the single 'p' to get whatever is in sysfs,
which AFAICS should always be the intention..
>
> >
> > If precise is not supported by system, warning is disaplyed.
>
> s/disaplyed/displayed/
ok
>
> >
> > Signed-off-by: Jiri Olsa <[email protected]>
> > Cc: Corey Ashford <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Namhyung Kim <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Andi Kleen <[email protected]>
> > Cc: David Ahern <[email protected]>
> > Cc: Stephane Eranian <[email protected]>
> > ---
> > tools/perf/util/parse-events.c | 40 ++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 38 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > index 2e3ef10..6be4599 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -533,6 +533,26 @@ int parse_events_add_breakpoint(struct list_head **list, int *idx,
> > return add_event(list, idx, &attr, NULL);
> > }
> >
> > +static int precise_default(void)
> > +{
> > + int precise = perf_precise__get();
> > + static int warned;
> > +
> > + /*
> > + * Precise info not supported by by this kernel,
>
> s/by by/by/ :)
ok
>
>
> > + * set 1 as the precise value.
> > + */
> > + if (precise == -1)
> > + precise = 1;
> > +
> > + /* PEBS is not supported here, display warning. */
> > + if (precise == 0 && !warned++)
> > + pr_warning("warning: no precise support, "
> > + "using non-precise event(s)\n");
>
> Please put the message on a single line.
ok
thanks,
jirka
On Thu, May 09, 2013 at 05:20:22PM +0200, Jiri Olsa wrote:
> On Thu, May 09, 2013 at 05:07:44PM +0200, Peter Zijlstra wrote:
> > On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> > > hi,
> > > adding sysfs attribute to specify the maximum allowed value
> > > for perf_event_attr::precise_ip field.
> > >
> > > Adding functionality for simple 'p' modifier and 'precise' term
> > > to get the maximum allowed value for perf_event_attr::precise_ip
> > > field.
> > >
> >
> > You've seem to lost the part explaining why we want this.. :-)
>
> well, initially it was an answer when we broke precise event
> monitoring in kernel so I wrote automated test for it (patches 1,2,3)
> https://lkml.org/lkml/2012/12/12/561
But those don't rely on the max thing right?
> having maximum precise enabled with just single 'p' seemed
> like good idea
Doesn't seem like to me; that takes away the possibility to use less.
> next step would be to enable precise automatically for 'cycles'
> (when PEBS is working) asked for by Ingo
> http://marc.info/?l=linux-kernel&m=135929050803963&w=2
Hurm.. I'm of two minds there. As Stephane has been pointing out for ages,
cycles behaves significantly differently between regular and PEBS events in some
cases.
Also, you really don't need the max_precise for that either. At worst you'll
have a number of unsuccessful event creations.
On Thu, May 09, 2013 at 05:20:22PM +0200, Jiri Olsa wrote:
> On Thu, May 09, 2013 at 05:07:44PM +0200, Peter Zijlstra wrote:
> > On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> > > hi,
> > > adding sysfs attribute to specify the maximum allowed value
> > > for perf_event_attr::precise_ip field.
> > >
> > > Adding functionality for simple 'p' modifier and 'precise' term
> > > to get the maximum allowed value for perf_event_attr::precise_ip
> > > field.
> > >
> >
> > You've seem to lost the part explaining why we want this.. :-)
>
> well, initially it was an answer when we broke precise event
> monitoring in kernel so I wrote automated test for it (patches 1,2,3)
> https://lkml.org/lkml/2012/12/12/561
Oh urgh 1-3 aren't just tests; they introduce all this nonsense already :/
On Fri, May 10, 2013 at 11:27:41AM +0200, Peter Zijlstra wrote:
> On Thu, May 09, 2013 at 05:20:22PM +0200, Jiri Olsa wrote:
> > On Thu, May 09, 2013 at 05:07:44PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> > > > hi,
> > > > adding sysfs attribute to specify the maximum allowed value
> > > > for perf_event_attr::precise_ip field.
> > > >
> > > > Adding functionality for simple 'p' modifier and 'precise' term
> > > > to get the maximum allowed value for perf_event_attr::precise_ip
> > > > field.
> > > >
> > >
> > > You've seem to lost the part explaining why we want this.. :-)
> >
> > well, initially it was an answer when we broke precise event
> > monitoring in kernel so I wrote automated test for it (patches 1,2,3)
> > https://lkml.org/lkml/2012/12/12/561
>
> But those don't rely on the max thing right?
nope, but I need an automated way to find out if PEBS
is supported in the system
>
> > having maximum precise enabled with just single 'p' seemed
> > like good idea
>
> Doesn't seem like to me; that takes away the possibility to use less.
hm, we could have another modifier to get system precise value
'P' maybe.. and keep the 'p' logic
>
> > next step would be to enable precise automatically for 'cycles'
> > (when PEBS is working) asked for by Ingo
> > http://marc.info/?l=linux-kernel&m=135929050803963&w=2
>
> Hurm.. I'm of two minds there. As Stephane has been pointing out for ages,
> cycles behaves significantly different between regular and PEBS events for some
> cases.
>
> Also, you really don't need the max_precise for that either. At worst you'll
> have a number of unsuccessful event creations.
so you mean just detect that by opening events with increasing
precise and seeing how far we could get.. could be, I guess, though
the 'precise' sysfs attribute seems a better fit to me
thanks,
jirka
On Fri, May 10, 2013 at 11:29:30AM +0200, Peter Zijlstra wrote:
> On Thu, May 09, 2013 at 05:20:22PM +0200, Jiri Olsa wrote:
> > On Thu, May 09, 2013 at 05:07:44PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 09, 2013 at 03:32:15PM +0200, Jiri Olsa wrote:
> > > > hi,
> > > > adding sysfs attribute to specify the maximum allowed value
> > > > for perf_event_attr::precise_ip field.
> > > >
> > > > Adding functionality for simple 'p' modifier and 'precise' term
> > > > to get the maximum allowed value for perf_event_attr::precise_ip
> > > > field.
> > > >
> > >
> > > You've seem to lost the part explaining why we want this.. :-)
> >
> > well, initially it was an answer when we broke precise event
> > monitoring in kernel so I wrote automated test for it (patches 1,2,3)
> > https://lkml.org/lkml/2012/12/12/561
>
> Oh urgh 1-3 aren't just tests; they introduce all this nonsense already :/
yep
On Fri, May 10, 2013 at 11:40:53AM +0200, Jiri Olsa wrote:
> nope, but I need a automated way to find out if PEBS
> is supported in system
Create an event, see what happens ;-) But I suppose you want to know if it
should have worked vs does it actually work? Yeah, I suppose what you did
is the right thing there.
I was just completely missing the rationale earlier.
> > > having maximum precise enabled with just single 'p' seemed
> > > like good idea
> >
> > Doesn't seem like to me; that takes away the possibility to use less.
>
> hm, we could have another modifier to get system precise value
> 'P' maybe.. and keep the 'p' logic
That might be a possibility indeed.
> > Also, you really don't need the max_precise for that either. At worst you'll
> > have a number of unsuccessful event creations.
>
> so you mean just detect that by opening events with increasing
> precise and see how far we could get.. could be I guess, though
> the 'precise' sysfs attribute seems more fit to me
The other way around, start at ppp end at !p, then use the one that worked.
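Such a downward probe would look roughly like this (a sketch only, not part of
the series; it assumes perf's sys_perf_event_open() wrapper and a plain cycles
event):

    static int max_working_precise(void)
    {
            struct perf_event_attr attr = {
                    .type   = PERF_TYPE_HARDWARE,
                    .config = PERF_COUNT_HW_CPU_CYCLES,
            };
            int precise;

            /* Start at 'ppp' (3) and walk down until the kernel accepts the event. */
            for (precise = 3; precise > 0; precise--) {
                    int fd;

                    attr.precise_ip = precise;
                    fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
                    if (fd >= 0) {
                            close(fd);
                            return precise;
                    }
            }

            return 0;       /* no precise support */
    }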
* Peter Zijlstra <[email protected]> wrote:
> > so you mean just detect that by opening events with increasing precise
> > and see how far we could get.. could be I guess, though the 'precise'
> > sysfs attribute seems more fit to me
>
> The other way around, start at ppp end at !p, then use the one that
> worked.
Really, instead of this silly 'probing until it works' notion, how about
the radical idea that we pass to the kernel our request, and the kernel
fulfills our wish to the best of its ability?
This could be done as a new PERF_COUNT_HW_CPU_CYCLES_PRECISE event, to
which tooling could migrate, without changing existing semantics.
The problem with the complex probing is that it's totally unnecessary
complexity that results from lack of passing the right information to the
kernel. Forcing that will only hinder user-space adoption of our precise
profiling facilities.
Thanks,
Ingo
On Fri, May 10, 2013 at 12:18:23PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > > so you mean just detect that by opening events with increasing precise
> > > and see how far we could get.. could be I guess, though the 'precise'
> > > sysfs attribute seems more fit to me
> >
> > The other way around, start at ppp end at !p, then use the one that
> > worked.
>
> Really, instead of this silly 'probing until it works' notion, how about
> the radical idea that we pass to the kernel our request, and the kernel
> fulfills our wish to the best of its ability?
>
> This could be done as a new PERF_COUNT_HW_CPU_CYCLES_PRECISE event, to
> which tooling could migrate, without changing existing semantics.
>
> The problem with the complex probing is that it's totally unnecessary
> complexity that results from lack of passing the right information to the
> kernel. Forcing that will only hinder user-space adoption of our precise
> profiling facilities.
The part I have trouble with is that it's a vague request and you'll get a vague
answer.
It's very similar to: 'do something' and getting a string of random bytes back.
Yes it is 'something', request fulfilled.
By doing the probing, userspace knows exactly what it asked for and what it'll get.
* Peter Zijlstra <[email protected]> wrote:
> On Fri, May 10, 2013 at 12:18:23PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > so you mean just detect that by opening events with increasing precise
> > > > and see how far we could get.. could be I guess, though the 'precise'
> > > > sysfs attribute seems more fit to me
> > >
> > > The other way around, start at ppp end at !p, then use the one that
> > > worked.
> >
> > Really, instead of this silly 'probing until it works' notion, how about
> > the radical idea that we pass to the kernel our request, and the kernel
> > fulfills our wish to the best of its ability?
> >
> > This could be done as a new PERF_COUNT_HW_CPU_CYCLES_PRECISE event, to
> > which tooling could migrate, without changing existing semantics.
> >
> > The problem with the complex probing is that it's totally unnecessary
> > complexity that results from lack of passing the right information to the
> > kernel. Forcing that will only hinder user-space adoption of our precise
> > profiling facilities.
>
> The part I have trouble with is that its a vague request and you'll get
> a vague answer.
PERF_COUNT_HW_CPU_CYCLES_PRECISE is not a vague request at all: it means
'get me the most precise cycles profiling available on this system'.
Thanks,
Ingo
On Fri, May 10, 2013 at 12:31:12PM +0200, Ingo Molnar wrote:
> PERF_COUNT_HW_CPU_CYCLES_PRECISE is not a vague request at all: it means
> 'get me the most precise cycles profiling available on this system'.
And how will you interpret the results? Do you know to manually adjust for skid,
or will you assume the results are 'correct'?
I see such a feature only causing confusion; I told it to be precise, therefore
this register op after the memory load really is the more expensive thing.
People generally don't volunteer to think, you have to force them to -- even if
that makes them complain.
* Peter Zijlstra <[email protected]> wrote:
> On Fri, May 10, 2013 at 12:31:12PM +0200, Ingo Molnar wrote:
> > PERF_COUNT_HW_CPU_CYCLES_PRECISE is not a vague request at all: it means
> > 'get me the most precise cycles profiling available on this system'.
>
> And how will you interpret the results? Do you know to manually adjust
> for skid or will you assume the results 'correct'?
Look at the tools/perf/ patches, they don't actually need or use that
information to adjust for skid!
If user-space wants _that_ level of control because it wants to correct
for skid (if there's skid), or if it wants to display to the user how
precise the profiling is, then they can do the (much) more complex probing
dance.
What is absolutely indefensible is to not give a good shortcut for the
most common case of 'give me the most precise cycles event you got'...
> I see such a feature only causing confusion; I told it to be precise,
> therefore this register op after the memory load really is the more
> expensive thing.
You are creating confusion where there's none: "give me the best profiling
you've got" is a pretty reasonable thing to ask.
The thing is, there's variations in the quality of profiling between CPUs,
sometimes even between CPU models. 99.999% of the people don't care about
that, because 99.9% of the time the profile is unambiguous: functions are
typically big enough, with the overhead somewhere in the middle, so skid
just doesn't matter.
So to 'solve' this corner-case information you worry about extracting
(which cannot be solved - there are more profiling artifacts and imprecision
than skid), you sacrifice the thing that _DOES_ matter: for the kernel to
offer our best profiling feature as easily as possible...
What explanation do you have for that failure?
> People generally don't volunteer to think, you have to force them to --
> even if that makes them complain.
It is a mistake to 'force' people to consider stuff _they don't care about
in 99% of the cases_.
Thanks,
Ingo
On Fri, May 10, 2013 at 12:55:36PM +0200, Ingo Molnar wrote:
> Look at the tools/perf/ patches, they don't actually need or use that
> information to adjust for skid!
>
> If user-space wants _that_ level of control because it wants to correct
> for skid (if there's skid), or if it wants to display to the user how
> precise the profiling is, then they can do the (much) more complex probing
> dance.
>
> What is absolutely indefensible is to not give a good shortcut for the
> most common case of 'give me the most precise cycles event you got'...
That's not what I'm saying... the user (not userspace, but you and me), when
staring at perf output, needs to interpret the result.
If you don't know WTF the thing actually measured, how are you going to do
that?
> > I see such a feature only causing confusion; I told it to be precise,
> > therefore this register op after the memory load really is the more
> > expensive thing.
>
> You are creating confusion where there's none: "give me the best profiling
> you've got" is a pretty reasonable thing to ask.
Only if it then tells you what you got. It doesn't do that.
> The thing is, there's variations in the quality of profiling between CPUs,
> sometimes even between CPU models. 99.999% of the people don't care about
> that, because 99.9% of the time the profile is unambiguous: functions are
> typically big enough, with the overhead somewhere in the middle, so skid
> just doesn't matter.
Sure, at the function level it doesn't matter, but once you've found your most
expensive function, very often the next question is _why_ it is expensive.
At that point you're going to stare at asm output. The moment you do that you
need to know the type of output you're staring at.
Also, if you think function level output is the most relevant one, you
shouldn't use PEBS at all. PEBS has an issue with REP prefixes: it severely
under-accounts the cycles spent on them. And since exact placement doesn't
matter (as you just argued), the little skid you have is irrelevant.
So either skid matters and you need to know what type of output you've got, or
it doesn't and the whole precise thing is irrelevant at best.
* Peter Zijlstra <[email protected]> wrote:
> On Fri, May 10, 2013 at 12:55:36PM +0200, Ingo Molnar wrote:
> > Look at the tools/perf/ patches, they don't actually need or use that
> > information to adjust for skid!
> >
> > If user-space wants _that_ level of control because it wants to correct
> > for skid (if there's skid), or if it wants to display to the user how
> > precise the profiling is, then they can do the (much) more complex probing
> > dance.
> >
> > What is absolutely indefensible is to not give a good shortcut for the
> > most common case of 'give me the most precise cycles event you got'...
>
> That's not what I'm saying... the user (not userspace, but you and me),
> when staring at perf output, needs to interpret the result.
>
> If you don't know WTF the thing actually measured, how are you going to
> do that?
That's really a red herring: there's absolutely no reason why the
kernel could not pass back the level of precision it provided.
You are also over-rating the importance of such details - most developers
will assume when looking at profiler output that it's a statistical result
- and be happy when it happens to be "absolutely accurate" instead of
just "very accurate"...
> > > I see such a feature only causing confusion; I told it to be
> > > precise, therefore this register op after the memory load really is
> > > the more expensive thing.
> >
> > You are creating confusion where there's none: "give me the best
> > profiling you've got" is a pretty reasonable thing to ask.
>
> Only if it then tells you what you got. It doesn't do that.
I'm not against the kernel telling what precision it gave us, at all. That
could be solved by the kernel setting the precision field in the
PERF_COUNT_CYCLES_PRECISE case or so.
I'm against you apparently recommending a complex probing method to get
something the kernel ought to get us straight away via much simpler
ways...
Complexity does not primarily result in people doing things 'smarter'. It
primarily results in people _not using the feature at all_.
> > The thing is, there's variations in the quality of profiling between
> > CPUs, sometimes even between CPU models. 99.999% of the people don't
> > care about that, because 99.9% of the time the profile is unambiguous:
> > functions are typically big enough, with the overhead somewhere in the
> > middle, so skid just doesn't matter.
>
> Sure, at the function level it doesn't matter, but once you've found your
> most expensive function, very often the next question is _why_ it is
> expensive.
>
> At that point you're going to stare at asm output. The moment you do
> that you need to know the type of output you're staring at.
FYI, very few developers are actually looking at the assembly output
because very few developers _know_ assembly to begin with.
They are looking at things like sysprof or perf report output, maybe at
the annotated _source_ code and that's it.
The mapping to source code is fuzzy to begin with, with inlining, loop
unrolling and other compiler optimizations being a far bigger effect than
skid.
So the fuzz created by skid is relatively small - but it's nice when it's
gone and obviously it's helpful when you are looking at assembly output.
> Also, if you think function level output is the most relevant one, you
> shouldn't use PEBS at all. PEBS has an issue with REP prefixes: it
> severely under-accounts the cycles spent on them. And since exact
> placement doesn't matter (as you just argued), the little skid you have
> is irrelevant.
>
> So either skid matters and you need to know what type of output you've
> got, or it doesn't and the whole precise thing is irrelevant at best.
That's just another plain silly argument: having more precise results is
obviously useful even if you don't use a magnifying lens. Sometimes
functions are small and skid results in the wrong function being credited
with overhead.
It's also immaterial: there's no reason why the kernel couldn't feed back
the level of precision it offers, to user-space, via a small, simple
variation to the existing syscall interface.
Thanks,
Ingo
On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
> That's really a red herring: there's absolutely no reason why the
> kernel could not pass back the level of precision it provided.
All I've been saying is that doing random precision without feedback is
confusing.
We also don't really have a good feedback channel for this kind of thing. The
best I can come up with is tagging each and every sample with the quality it
represents. I think we can do with only one extra PERF_RECORD_MISC bit, but it
looks like we're quickly running out of those things.
But I think the biggest problem is PEBS's inability to deal with REP prefixes;
see this email from Stephane: https://lkml.org/lkml/2011/2/1/177
It is really unfortunate for PEBS to have such a side-effect; but it makes all
memset/memcpy/memmove things appear like they have no cost. I'm very sure that
will surprise a number of people.
---
arch/x86/include/asm/perf_event.h | 4 +++-
arch/x86/kernel/cpu/perf_event.c | 2 ++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 4 +++-
include/uapi/linux/perf_event.h | 1 +
4 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 57cb634..6908838 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -205,14 +205,16 @@ static inline u32 get_ibs_caps(void) { return 0; }
extern void perf_events_lapic_init(void);
/*
- * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
+ * Abuse bits {3,5,15} of the cpu eflags register. These flags are otherwise
* unused and ABI specified to be 0, so nobody should care what we do with
* them.
*
+ * CONSTANT - the IP has constant skid.
* EXACT - the IP points to the exact instruction that triggered the
* event (HW bugs exempt).
* VM - original X86_VM_MASK; see set_linear_ip().
*/
+#define PERF_EFLAGS_CONSTANT (1UL << 15)
#define PERF_EFLAGS_EXACT (1UL << 3)
#define PERF_EFLAGS_VM (1UL << 5)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1025f3c..ca1f7dc 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2078,6 +2078,8 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
if (regs->flags & PERF_EFLAGS_EXACT)
misc |= PERF_RECORD_MISC_EXACT_IP;
+ else if (regs->flags & PERF_EFLAGS_CONSTANT)
+ misc |= PERF_RECORD_MISC_CONSTANT_SKID;
return misc;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 60250f6..757ecd4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -753,10 +753,12 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
regs.bp = pebs->bp;
regs.sp = pebs->sp;
+ regs.flags &= ~(PERF_EFLAGS_CONSTANT | PERF_EFLAGS_EXACT);
+
if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
regs.flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs.flags |= PERF_EFLAGS_CONSTANT;
if (has_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fb104e5..cb1f70f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -451,6 +451,7 @@ struct perf_event_mmap_page {
* the actual instruction that triggered the event. See also
* perf_event_attr::precise_ip.
*/
+#define PERF_RECORD_MISC_CONSTANT_SKID (1 << 12)
#define PERF_RECORD_MISC_EXACT_IP (1 << 14)
/*
* Reserve the last bit to indicate some extended misc field
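On the consumer side, reading the new bit out of the sample headers would look
roughly like this; PERF_RECORD_MISC_EXACT_IP already exists in the ABI, while
PERF_RECORD_MISC_CONSTANT_SKID is only introduced by the patch above:
#include <linux/perf_event.h>
#include <stdio.h>
/* Sketch: classify one record header by the quality of its sample IP. */
static void report_ip_quality(struct perf_event_header *hdr)
{
	if (hdr->type != PERF_RECORD_SAMPLE)
		return;
	if (hdr->misc & PERF_RECORD_MISC_EXACT_IP)
		printf("sample IP is exact\n");
#ifdef PERF_RECORD_MISC_CONSTANT_SKID	/* from the patch above */
	else if (hdr->misc & PERF_RECORD_MISC_CONSTANT_SKID)
		printf("sample IP has constant skid\n");
#endif
	else
		printf("sample IP quality unknown\n");
}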
* Peter Zijlstra <[email protected]> wrote:
> On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
> > That's really a red herring: there's absolutely no reason why the
> > kernel could not pass back the level of precision it provided.
>
> All I've been saying is that doing random precision without feedback is
> confusing.
I agree with that.
> We also don't really have a good feedback channel for this kind of
> thing. The best I can come up with is tagging each and every sample with
> the quality it represents. I think we can do with only one extra
> PERF_RECORD_MISC bit, but it looks like we're quickly running out of
> those things.
Hm, how about passing precision back to user-space at creation time, in
the perf_attr data structure? There's no need to pass it back in every
sample, precision will not really change during the life-time of an event.
> But I think the biggest problem is PEBS's inability to deal with REP
> prefixes; see this email from Stephane:
> https://lkml.org/lkml/2011/2/1/177
>
> It is really unfortunate for PEBS to have such a side-effect; but it
> makes all memset/memcpy/memmove things appear like they have no cost.
> I'm very sure that will surprise a number of people.
I'd expect PEBS to get gradually better.
Note that at least for user-space, REP MOVS is getting rarer. libc uses
SSE based memcpy/memset variants - which is not miscounted by PEBS. The
kernel still uses REP MOVS - but it's a special case because it cannot
cheaply use vector registers.
The vast majority of code gets measured by cycles:pp more accurately than
cycles.
We could try and see how many people complain. It's not like it's hard to
undo such a change of the default event?
Thanks,
Ingo
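A rough sketch of how that could look from the tooling side, assuming the
kernel were changed to write the granted precision back into the attr at
creation time (it does not do this today, so the read-back below is purely
hypothetical):
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
static int open_cycles_report_precision(void)
{
	struct perf_event_attr attr = {
		.type       = PERF_TYPE_HARDWARE,
		.config     = PERF_COUNT_HW_CPU_CYCLES,
		.size       = sizeof(attr),
		.precise_ip = 3,	/* ask for the maximum */
	};
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	/* assumed behaviour, not current ABI: the kernel clamps
	 * attr.precise_ip to the precision it actually granted */
	if (fd >= 0)
		printf("granted precise_ip level: %d\n", (int)attr.precise_ip);
	return fd;
}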
On Mon, May 13, 2013 at 09:43:13PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
> > > That's really a red herring: there's absolutely no reason why the
> > > kernel could not pass back the level of precision it provided.
> >
> > All I've been saying is that doing random precision without feedback is
> > confusing.
>
> I agree with that.
>
> > We also don't really have a good feedback channel for this kind of
> > thing. The best I can come up with is tagging each and every sample with
> > the quality it represents. I think we can do with only one extra
> > PERF_RECORD_MISC bit, but it looks like we're quickly running out of
> > those things.
>
> Hm, how about passing precision back to user-space at creation time, in
> the perf_attr data structure? There's no need to pass it back in every
> sample, precision will not really change during the life-time of an event.
Ah indeed, we talked about modifying the attr structure before (error details
or so). Did something like that ever make it in, or would this be the first
use now?
> > But I think the biggest problem is PEBS's inability to deal with REP
> > prefixes; see this email from Stephane:
> > https://lkml.org/lkml/2011/2/1/177
> >
> > It is really unfortunate for PEBS to have such a side-effect; but it
> > makes all memset/memcpy/memmove things appear like they have no cost.
> > I'm very sure that will surprise a number of people.
>
> I'd expect PEBS to get gradually better.
>
> Note that at least for user-space, REP MOVS is getting rarer. libc uses
> SSE based memcpy/memset variants - which is not miscounted by PEBS. The
> kernel still uses REP MOVS - but it's a special case because it cannot
> cheaply use vector registers.
What's the rep_good cpu feature flag for? I thought Intel was putting more
effort into making REP MOVS do the right thing again. No need to worry your
pretty head about the best way to move bytes around any longer.
> The vast majority of code gets measured by cycles:pp more accurately than
> cycles.
>
> We could try and see how many people complain. It's not like it's hard to
> undo such a change of the default event?
I suppose so.. Alternatively we can have the PEBS event read a 'real' cycles
counter and weight the sample based on that. Bit cumbersome, esp if you want to
implement it kernel side, but it could possibly work around this issue.
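One way to prototype that weighting from userspace, as a sketch under the
assumption that event grouping is good enough for a first try: make the
precise event the group leader, add a plain cycles event as a group member,
let every sample carry the group counts, and weight each sample by the delta
of the plain counter rather than counting it as one hit (attribute setup
only, ring buffer parsing omitted):
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
static int open_weighted_precise_group(int *memberp)
{
	/* leader: precise (PEBS) cycles; samples carry the group counts */
	struct perf_event_attr pebs = {
		.type          = PERF_TYPE_HARDWARE,
		.config        = PERF_COUNT_HW_CPU_CYCLES,
		.size          = sizeof(pebs),
		.precise_ip    = 2,
		.sample_period = 100003,
		.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_READ,
		.read_format   = PERF_FORMAT_GROUP,
	};
	/* member: plain cycles, used only as the weighting reference */
	struct perf_event_attr ref = {
		.type   = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.size   = sizeof(ref),
	};
	int leader = syscall(__NR_perf_event_open, &pebs, 0, -1, -1, 0);
	*memberp = syscall(__NR_perf_event_open, &ref, 0, -1, leader, 0);
	/* post-processing: weight sample N by ref_count(N) - ref_count(N-1) */
	return leader;
}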
On Mon, May 13, 2013 at 09:43:13PM +0200, Ingo Molnar wrote:
> Note that at least for user-space, REP MOVS is getting rarer. libc uses
> SSE based memcpy/memset variants - which is not miscounted by PEBS. The
> kernel still uses REP MOVS - but it's a special case because it cannot
> cheaply use vector registers.
>
> The vast majority of code gets measured by cycles:pp more accurately than
> cycles.
>
> We could try and see how many people complain. It's not like it's hard to
> undo such a change of the default event?
>
People may optimize for the wrong case instead of complaining. There is
nothing that is obviously broken; the brokenness can be seen only if you
know what to look for.
--
Gleb.
* Peter Zijlstra <[email protected]> wrote:
> On Mon, May 13, 2013 at 09:43:13PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
> > > > That's really a red herring: there's absolutely no reason why the
> > > > kernel could not pass back the level of precision it provided.
> > >
> > > All I've been saying is that doing random precision without feedback is
> > > confusing.
> >
> > I agree with that.
> >
> > > We also don't really have a good feedback channel for this kind of
> > > thing. The best I can come up with is tagging each and every sample with
> > > the quality it represents. I think we can do with only one extra
> > > PERF_RECORD_MISC bit, but it looks like we're quickly running out of
> > > those things.
> >
> > Hm, how about passing precision back to user-space at creation time, in
> > the perf_attr data structure? There's no need to pass it back in every
> > sample, precision will not really change during the life-time of an event.
>
> Ah indeed, we talked about modifying the attr structure before (error details
> or so). Did something like that ever make it in, or would this be the first
> use now?
That remained on the level of talk AFAICT.
> > The vast majority of code gets measured by cycles:pp more accurately
> > than cycles.
> >
> > We could try and see how many people complain. It's not like it's hard
> > to undo such a change of the default event?
>
> I suppose so.. Alternatively we can have the PEBS event read a 'real'
> cycles counter and weight the sample based on that. Bit cumbersome, esp
> if you want to implement it kernel side, but it could possibly work
> around this issue.
Looks a bit cumbersome indeed. Let's try the simpler approach and see?
Thanks,
Ingo
On Mon, May 13, 2013 at 9:43 PM, Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
>> > That's really a red herring: there's absolutely no reason why the
>> > kernel could not pass back the level of precision it provided.
>>
>> All I've been saying is that doing random precision without feedback is
>> confusing.
>
> I agree with that.
>
>> We also don't really have a good feedback channel for this kind of
>> thing. The best I can come up with is tagging each and every sample with
>> the quality it represents. I think we can do with only one extra
>> PERF_RECORD_MISC bit, but it looks like we're quickly running out of
>> those things.
>
> Hm, how about passing precision back to user-space at creation time, in
> the perf_attr data structure? There's no need to pass it back in every
> sample, precision will not really change during the life-time of an event.
>
>> But I think the biggest problem is PEBS's inability to deal with REP
>> prefixes; see this email from Stephane:
>> https://lkml.org/lkml/2011/2/1/177
>>
>> It is really unfortunate for PEBS to have such a side-effect; but it
>> makes all memset/memcpy/memmove things appear like they have no cost.
>> I'm very sure that will surprise a number of people.
>
> I'd expect PEBS to get gradually better.
>
> Note that at least for user-space, REP MOVS is getting rarer. libc uses
> SSE based memcpy/memset variants - which is not miscounted by PEBS. The
> kernel still uses REP MOVS - but it's a special case because it cannot
> cheaply use vector registers.
>
> The vast majority of code gets measured by cycles:pp more accurately than
> cycles.
>
I don't understand how you come to that conclusion. I can show you simple
examples where this is not true at all (even without rep mov).
I will repeat once again what PEBS provides. The only guarantee of PEBS
is that it captures the next dynamic address after an instruction that caused
the event to occur.
The address is the 'next' one because PEBS captures at retirement of the
sampled instruction. The caveat is that the sampled instruction is not the
one at the end of the sampling period for this event. It may be N cycles
later. Therefore there is a shadow during which qualifying instructions may
be executed but never sampled. This is what INST_RETIRED:PREC_DIST is
trying to compensate for.
Furthermore, as pointed out recently, the filters are ignored for the
sampled instruction. They are honored up to the sampling period. As such,
the sampled instruction does not qualify for the filters. Filters can
vastly change the meaning of an event: for instance, cmask=1 changes
LLC_MISS to cycles with pending LLC misses.
I would add that using uops_retired:cmask=16:invert to gain precise cycles
does not behave the same way across processors. It turns out that Westmere
and SandyBridge handle this differently during halted cycles. So in the
end, I think it is pretty hard to understand what's being measured
uniformly. And therefore I think it is a VERY bad idea to default cycles to
cycles:pp when PEBS is present.
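For reference, a sketch of the raw encoding behind that
uops_retired:cmask=16:invert trick, as a tool could pass it via
PERF_TYPE_RAW; to my understanding this is essentially what the
precise-cycles alias in the x86 code boils down to, but treat the exact
values as an assumption rather than a reference:
#include <linux/perf_event.h>
/*
 * UOPS_RETIRED.ALL (event 0xc2, umask 0x01) with cmask=16 and the invert
 * bit set, i.e. "cycles in which fewer than 16 uops retired". Bit
 * positions follow the x86 ARCH_PERFMON eventsel layout: event 0-7,
 * umask 8-15, inv 23, cmask 24-31.
 */
static struct perf_event_attr precise_cycles_raw = {
	.type       = PERF_TYPE_RAW,
	.size       = sizeof(struct perf_event_attr),
	.config     = 0xc2 |		/* event select */
		      (0x01UL << 8) |	/* umask        */
		      (1UL    << 23) |	/* invert       */
		      (16UL   << 24),	/* counter mask */
	.precise_ip = 2,
};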
* Stephane Eranian <[email protected]> wrote:
> On Mon, May 13, 2013 at 9:43 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> >> On Sat, May 11, 2013 at 09:50:08AM +0200, Ingo Molnar wrote:
> >> > That's really a red herring: there's absolutely no reason why the
> >> > kernel could not pass back the level of precision it provided.
> >>
> >> All I've been saying is that doing random precision without feedback is
> >> confusing.
> >
> > I agree with that.
> >
> >> We also don't really have a good feedback channel for this kind of
> >> thing. The best I can come up with is tagging each and every sample with
> >> the quality it represents. I think we can do with only one extra
> >> PERF_RECORD_MISC bit, but it looks like we're quickly running out of
> >> those things.
> >
> > Hm, how about passing precision back to user-space at creation time, in
> > the perf_attr data structure? There's no need to pass it back in every
> > sample, precision will not really change during the life-time of an event.
> >
> >> But I think the biggest problem is PEBS's inability to deal with REP
> >> prefixes; see this email from Stephane:
> >> https://lkml.org/lkml/2011/2/1/177
> >>
> >> It is really unfortunate for PEBS to have such a side-effect; but it
> >> makes all memset/memcpy/memmove things appear like they have no cost.
> >> I'm very sure that will surprise a number of people.
> >
> > I'd expect PEBS to get gradually better.
> >
> > Note that at least for user-space, REP MOVS is getting rarer. libc uses
> > SSE based memcpy/memset variants - which is not miscounted by PEBS. The
> > kernel still uses REP MOVS - but it's a special case because it cannot
> > cheaply use vector registers.
> >
> > The vast majority of code gets measured by cycles:pp more accurately than
> > cycles.
> >
> I don't understand how you come to that conclusion. [...]
By frequently looking at cycles:pp output.
> [...] I can show you simple examples where this is not true at all (even
> without rep mov).
That would be useful if there's any practical problem with cycles:pp. In
terms of profiling typical kernel and user space functions it does appear
to work very well.
Thanks,
Ingo