2013-04-20 19:23:28

by Andi Kleen

Subject: perf PMU support for Haswell v8

This is a heavily updated version of the Haswell PMU patchkit for TSX and
other enhancements, applied on top of the separate "basic Haswell" patchkit.

This adds perf PMU support for the upcoming Haswell core. The patchkit
is fairly large, mainly due to the various enhancements for TSX. TSX tuning
relies heavily on the PMU, so I tried hard to make all of its facilities
easily available. It also contains some other enhancements.

Overview:
- perf stat -T to get high level transaction statistics
- A generic set of transaction events
- Support for transaction abort flags to distinguish abort types
- Support for transaction abort weight
- Full support for transaction LBR flags
- KVM support to run perf stat -T in a guest
- Full width counters and checkpointed counter support
- Address support for all PEBS events

This includes changes to the core perf code, to the x86-specific part,
to the perf userland tools, and to KVM.

Available at
git://git.kernel.org/pub/scm/linux/kernel/ak/linux-misc.git hsw/pmu6

For more details on the Haswell PMU please see the SDM. For more details on TSX
please see http://halobates.de/adding-lock-elision-to-linux.pdf
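
For illustration, with the full patchkit applied, typical usage could look
like this (a sketch showing the command forms only):

  # high level transaction statistics
  perf stat -T -- ./workload

  # sample aborts precisely, with abort flags and weight
  perf record -e cpu/tx-aborts/ -W --transaction -- ./workload
  perf report --sort comm,dso,transaction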

[dropped old changelog]

-Andi


2013-04-20 19:19:30

by Andi Kleen

Subject: [PATCH 01/15] perf, x86: Suppress duplicated abort LBR records

From: Andi Kleen <[email protected]>

Haswell always gives an extra LBR record after every TSX abort. This can
confuse some clients. Suppress the extra record.

This only works when the abort is visible in the window, that is, when the
extra record is the last entry in the LBR. If the abort record has already
left the window, the extra record will stay; see the sketch below.
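
A sketch of the effect, with the newest LBR entry first and the abort
still visible in the window:

  raw LBR:                        reported stack:
  0: extra record  (abort=0)      0: abort branch (abort=1)
  1: abort branch  (abort=1)      1: ...
  2: ...

The duplicated top entry is overwritten by the abort record when the
stack is copied out.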

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 1 +
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 29 ++++++++++++++++++++-------
3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d75d0ff..563d6e8 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -427,6 +427,7 @@ struct x86_pmu {
int lbr_nr; /* hardware stack size */
u64 lbr_sel_mask; /* LBR_SELECT valid bits */
const int *lbr_sel_map; /* lbr_select mappings */
+ bool lbr_double_abort; /* duplicated lbr aborts */

/*
* Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4a78745..5a0d73c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2247,6 +2247,7 @@ __init int intel_pmu_init(void)

x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 6f9b794..33b6b5f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -284,6 +284,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
int lbr_format = x86_pmu.intel_cap.lbr_format;
u64 tos = intel_pmu_lbr_tos();
int i;
+ int out = 0;

for (i = 0; i < x86_pmu.lbr_nr; i++) {
unsigned long lbr_idx = (tos - i) & mask;
@@ -306,15 +307,27 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
}
from = (u64)((((s64)from) << skip) >> skip);

- cpuc->lbr_entries[i].from = from;
- cpuc->lbr_entries[i].to = to;
- cpuc->lbr_entries[i].mispred = mis;
- cpuc->lbr_entries[i].predicted = pred;
- cpuc->lbr_entries[i].in_tx = in_tx;
- cpuc->lbr_entries[i].abort = abort;
- cpuc->lbr_entries[i].reserved = 0;
+ /*
+ * Some CPUs report duplicated abort records,
+ * with the second entry not having an abort bit set.
+ * Skip them here. This loop runs backwards,
+ * so we need to undo the previous record.
+ * If the abort just happened outside the window
+ * the extra entry cannot be removed.
+ */
+ if (abort && x86_pmu.lbr_double_abort && out > 0)
+ out--;
+
+ cpuc->lbr_entries[out].from = from;
+ cpuc->lbr_entries[out].to = to;
+ cpuc->lbr_entries[out].mispred = mis;
+ cpuc->lbr_entries[out].predicted = pred;
+ cpuc->lbr_entries[out].in_tx = in_tx;
+ cpuc->lbr_entries[out].abort = abort;
+ cpuc->lbr_entries[out].reserved = 0;
+ out++;
}
- cpuc->lbr_stack.nr = i;
+ cpuc->lbr_stack.nr = out;
}

void intel_pmu_lbr_read(void)
--
1.7.7.6

2013-04-20 19:19:56

by Andi Kleen

Subject: [PATCH 15/15] perf, tools: Add perf stat --transaction v3

From: Andi Kleen <[email protected]>

Add support to perf stat to print the basic transactional execution
statistics: total cycles, cycles in transactions, and cycles in aborted
transactions, using the intx and intx_cp qualifiers, plus transaction
starts and elision starts to compute the average transaction length.

This gives a reasonable overview of how successful the transactions are.

Enable with a new --transaction / -T option.

This requires measuring these events in a group, since they depend on each
other.
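
For example (a sketch; the group mirrors the sysfs aliases set up earlier
in this series):

  perf stat -T -- ./workload

This measures task-clock, instructions, cycles, cycles-t, cycles-ct,
tx-start and el-start, and derives the transactional/aborted cycle
percentages and the cycles per transaction/elision region from them.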

This is implemented using the TM sysfs events exported by the kernel.

v2: Only print the extended statistics when the option is enabled.
This avoids negative output when the user specifies the -T events
in separate groups.
v3: Port to latest tree
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-stat.txt | 5 ++
tools/perf/builtin-stat.c | 103 +++++++++++++++++++++++++++++++-
tools/perf/util/evsel.h | 6 ++
3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 2fe87fb..40bc65a 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -132,6 +132,11 @@ is a useful mode to detect imbalance between physical cores. To enable this mod
use --per-core in addition to -a. (system-wide). The output includes the
core number and the number of online logical processors on that physical processor.

+-T::
+--transaction::
+
+Print statistics of transactional execution if supported.
+
EXAMPLES
--------

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7e910ba..5053c1a 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -70,6 +70,30 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix);
static void print_counter(struct perf_evsel *counter, char *prefix);
static void print_aggr(char *prefix);

+/* Default events used for perf stat -T */
+static const char * const transaction_attrs[] = {
+ "task-clock",
+ "{"
+ "instructions,"
+ "cycles,"
+ "cpu/cycles-t/,"
+ "cpu/cycles-ct/,"
+ "cpu/tx-start/,"
+ "cpu/el-start/"
+ "}"
+};
+
+/* must match the transaction_attrs above */
+enum {
+ T_TASK_CLOCK,
+ T_INSTRUCTIONS,
+ T_CYCLES,
+ T_CYCLES_INTX,
+ T_CYCLES_INTX_CP,
+ T_TRANSACTION_START,
+ T_ELISION_START
+};
+
static struct perf_evlist *evsel_list;

static struct perf_target target = {
@@ -90,6 +114,7 @@ static enum aggr_mode aggr_mode = AGGR_GLOBAL;
static pid_t child_pid = -1;
static bool null_run = false;
static int detailed_run = 0;
+static bool transaction_run;
static bool big_num = true;
static int big_num_opt = -1;
static const char *csv_sep = NULL;
@@ -213,7 +238,11 @@ static struct stats runtime_l1_icache_stats[MAX_NR_CPUS];
static struct stats runtime_ll_cache_stats[MAX_NR_CPUS];
static struct stats runtime_itlb_cache_stats[MAX_NR_CPUS];
static struct stats runtime_dtlb_cache_stats[MAX_NR_CPUS];
+static struct stats runtime_cycles_intx_stats[MAX_NR_CPUS];
+static struct stats runtime_cycles_intxcp_stats[MAX_NR_CPUS];
static struct stats walltime_nsecs_stats;
+static struct stats runtime_transaction_stats[MAX_NR_CPUS];
+static struct stats runtime_elision_stats[MAX_NR_CPUS];

static void perf_stat__reset_stats(struct perf_evlist *evlist)
{
@@ -272,6 +301,18 @@ static inline int nsec_counter(struct perf_evsel *evsel)
return 0;
}

+static struct perf_evsel *nth_evsel(int n)
+{
+ struct perf_evsel *ev;
+ int j;
+
+ j = 0;
+ list_for_each_entry(ev, &evsel_list->entries, node)
+ if (j++ == n)
+ return ev;
+ return NULL;
+}
+
/*
* Update various tracking values we maintain to print
* more semantic information such as miss/hit ratios,
@@ -283,8 +324,14 @@ static void update_shadow_stats(struct perf_evsel *counter, u64 *count)
update_stats(&runtime_nsecs_stats[0], count[0]);
else if (perf_evsel__match(counter, HARDWARE, HW_CPU_CYCLES))
update_stats(&runtime_cycles_stats[0], count[0]);
- else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
- update_stats(&runtime_stalled_cycles_front_stats[0], count[0]);
+ else if (perf_evsel__cmp(counter, nth_evsel(T_CYCLES_INTX)))
+ update_stats(&runtime_cycles_intx_stats[0], count[0]);
+ else if (perf_evsel__cmp(counter, nth_evsel(T_CYCLES_INTX_CP)))
+ update_stats(&runtime_cycles_intxcp_stats[0], count[0]);
+ else if (perf_evsel__cmp(counter, nth_evsel(T_TRANSACTION_START)))
+ update_stats(&runtime_transaction_stats[0], count[0]);
+ else if (perf_evsel__cmp(counter, nth_evsel(T_ELISION_START)))
+ update_stats(&runtime_elision_stats[0], count[0]);
else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
update_stats(&runtime_stalled_cycles_back_stats[0], count[0]);
else if (perf_evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS))
@@ -807,7 +854,7 @@ static void print_ll_cache_misses(int cpu,

static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
{
- double total, ratio = 0.0;
+ double total, ratio = 0.0, total2;
const char *fmt;

if (csv_output)
@@ -903,6 +950,41 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
ratio = 1.0 * avg / total;

fprintf(output, " # %8.3f GHz ", ratio);
+ } else if (perf_evsel__cmp(evsel, nth_evsel(T_CYCLES_INTX)) &&
+ transaction_run) {
+ total = avg_stats(&runtime_cycles_stats[cpu]);
+ if (total)
+ fprintf(output,
+ " # %5.2f%% transactional cycles ",
+ 100.0 * (avg / total));
+ } else if (perf_evsel__cmp(evsel, nth_evsel(T_CYCLES_INTX_CP)) &&
+ transaction_run) {
+ total = avg_stats(&runtime_cycles_stats[cpu]);
+ total2 = avg_stats(&runtime_cycles_intx_stats[cpu]);
+ if (total)
+ fprintf(output,
+ " # %5.2f%% aborted cycles ",
+ 100.0 * ((total2-avg) / total));
+ } else if (perf_evsel__cmp(evsel, nth_evsel(T_TRANSACTION_START)) &&
+ avg > 0 &&
+ runtime_cycles_intx_stats[cpu].n != 0 &&
+ transaction_run) {
+ total = avg_stats(&runtime_cycles_intx_stats[cpu]);
+
+ if (total)
+ ratio = total / avg;
+
+ fprintf(output, " # %8.0f cycles / transaction ", ratio);
+ } else if (perf_evsel__cmp(evsel, nth_evsel(T_ELISION_START)) &&
+ avg > 0 &&
+ runtime_cycles_intx_stats[cpu].n != 0 &&
+ transaction_run) {
+ total = avg_stats(&runtime_cycles_intx_stats[cpu]);
+
+ if (total)
+ ratio = total / avg;
+
+ fprintf(output, " # %8.0f cycles / elision ", ratio);
} else if (runtime_nsecs_stats[cpu].n != 0) {
char unit = 'M';

@@ -1312,6 +1394,19 @@ static int add_default_attributes(void)
if (null_run)
return 0;

+ if (transaction_run) {
+ unsigned i;
+
+ for (i = 0; i < ARRAY_SIZE(transaction_attrs); i++) {
+ if (parse_events(evsel_list, transaction_attrs[i])) {
+ fprintf(stderr,
+ "Cannot set up transaction events\n");
+ return -1;
+ }
+ }
+ return 0;
+ }
+
if (!evsel_list->nr_entries) {
if (perf_evlist__add_default_attrs(evsel_list, default_attrs) < 0)
return -1;
@@ -1397,6 +1492,8 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
"aggregate counts per processor socket", AGGR_SOCKET),
OPT_SET_UINT(0, "per-core", &aggr_mode,
"aggregate counts per physical processor core", AGGR_CORE),
+ OPT_BOOLEAN('T', "transaction", &transaction_run,
+ "hardware transaction statistics"),
OPT_END()
};
const char * const stat_usage[] = {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 3f156cc..2f3dc86 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -180,6 +180,12 @@ static inline bool perf_evsel__match2(struct perf_evsel *e1,
(e1->attr.config == e2->attr.config);
}

+#define perf_evsel__cmp(a, b) \
+ ((a) && \
+ (b) && \
+ (a)->attr.type == (b)->attr.type && \
+ (a)->attr.config == (b)->attr.config)
+
int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
int cpu, int thread, bool scale);

--
1.7.7.6

2013-04-20 19:19:31

by Andi Kleen

Subject: [PATCH 06/15] perf, x86: Support the TSX intx/intx_cp qualifiers v4

From: Andi Kleen <[email protected]>

Export the TSX transaction and checkpointed qualifiers in sysfs,
so that they can be used like this:

cpu/...,intx=1/
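
For example, to count only cycles spent inside transactions (a sketch;
event 0x3c is cycles):

  perf stat -e cpu/event=0x3c,intx=1/ -- ./workload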

v2: Moved bad hunk. Forbid some bad combinations.
v3: Use EOPNOTSUPP. White space fixes (Stephane Eranian)
v4: Only sysfs code for now
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3f2afb2..8aa1326 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1697,6 +1697,8 @@ PMU_FORMAT_ATTR(pc, "config:19" );
PMU_FORMAT_ATTR(any, "config:21" ); /* v3 + */
PMU_FORMAT_ATTR(inv, "config:23" );
PMU_FORMAT_ATTR(cmask, "config:24-31" );
+PMU_FORMAT_ATTR(intx, "config:32" );
+PMU_FORMAT_ATTR(intx_cp, "config:33" );

static struct attribute *intel_arch_formats_attr[] = {
&format_attr_event.attr,
@@ -1857,6 +1859,24 @@ static struct attribute *intel_arch3_formats_attr[] = {
NULL,
};

+/* Arch3 + TSX support */
+static struct attribute *intel_hsw_formats_attr[] __read_mostly = {
+ &format_attr_event.attr,
+ &format_attr_umask.attr,
+ &format_attr_edge.attr,
+ &format_attr_pc.attr,
+ &format_attr_any.attr,
+ &format_attr_inv.attr,
+ &format_attr_cmask.attr,
+ &format_attr_intx.attr,
+ &format_attr_intx_cp.attr,
+
+ &format_attr_offcore_rsp.attr, /* XXX do NHM/WSM + SNB breakout */
+ &format_attr_ldlat.attr, /* PEBS load latency */
+ NULL,
+};
+
+
static __initconst const struct x86_pmu intel_pmu = {
.name = "Intel",
.handle_irq = intel_pmu_handle_irq,
@@ -2247,6 +2267,7 @@ __init int intel_pmu_init(void)

x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.format_attrs = intel_hsw_formats_attr;
x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;
--
1.7.7.6

2013-04-20 19:19:58

by Andi Kleen

Subject: [PATCH 13/15] tools, perf: Add a precise event qualifier v2

From: Andi Kleen <[email protected]>

Add a precise qualifier, like cpu/event=0x3c,precise=1/

This is needed so that the kernel's sysfs event aliases can request
enabling PEBS for TSX events. The parser bails out on any sysfs parse
error, so this is needed in any case to handle any event on a TSX-enabled
perf kernel.
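
The value is written to perf_event_attr.precise_ip, so 0-3 correspond to
the usual skid levels. For example, to sample TSX aborts with zero
requested skid (a sketch; event 0xc9/umask 0x4 is the RTM abort event):

  perf record -e cpu/event=0xc9,umask=0x4,precise=2/ -- ./workload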

v2: Allow 3 as value
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/util/parse-events.c | 6 ++++++
tools/perf/util/parse-events.h | 1 +
tools/perf/util/parse-events.l | 1 +
3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 6c8bb0f..68d8476 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -568,6 +568,12 @@ do { \
case PARSE_EVENTS__TERM_TYPE_NAME:
CHECK_TYPE_VAL(STR);
break;
+ case PARSE_EVENTS__TERM_TYPE_PRECISE:
+ CHECK_TYPE_VAL(NUM);
+ if ((unsigned)term->val.num > 3)
+ return -EINVAL;
+ attr->precise_ip = term->val.num;
+ break;
default:
return -EINVAL;
}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 8a48593..13d7c66 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -48,6 +48,7 @@ enum {
PARSE_EVENTS__TERM_TYPE_NAME,
PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD,
PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE,
+ PARSE_EVENTS__TERM_TYPE_PRECISE,
};

struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index e9d1134..32a9000 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -169,6 +169,7 @@ period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD); }
branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE); }
, { return ','; }
"/" { BEGIN(INITIAL); return '/'; }
+precise { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_PRECISE); }
{name_minus} { return str(yyscanner, PE_NAME); }
}

--
1.7.7.6

2013-04-20 19:20:05

by Andi Kleen

Subject: [PATCH 12/15] perf, tools: Add support for record transaction flags v3

From: Andi Kleen <[email protected]>

Add the glue in the user tools to record transaction flags with
--transaction (-T was already taken) and dump them.

Follow-on patches will use them.
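
For example (a sketch using event aliases added later in this series):

  perf record -e cpu/tx-aborts/ --transaction -- ./workload
  perf report --sort comm,symbol,transaction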

v2: Fix manpage
v3: Move transaction to the end
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 4 +-
tools/perf/Documentation/perf-report.txt | 4 ++
tools/perf/Documentation/perf-top.txt | 2 +-
tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-diff.c | 8 ++-
tools/perf/builtin-record.c | 2 +
tools/perf/builtin-report.c | 4 +-
tools/perf/builtin-top.c | 4 +-
tools/perf/perf.h | 1 +
tools/perf/tests/hists_link.c | 6 ++-
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 9 ++++
tools/perf/util/hist.c | 7 ++-
tools/perf/util/hist.h | 4 +-
tools/perf/util/session.c | 3 +
tools/perf/util/sort.c | 74 ++++++++++++++++++++++++++++++
tools/perf/util/sort.h | 2 +
17 files changed, 123 insertions(+), 14 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 6f3405e..c73dd25 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -185,12 +185,14 @@ is enabled for all the sampling events. The sampled branch type is the same for
The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k
Note that this feature may not be available on all processors.

--W::
--weight::
Enable weightened sampling. An additional weight is recorded per sample and can be
displayed with the weight and local_weight sort keys. This currently works for TSX
abort events and some memory events in precise mode on modern Intel CPUs.

+--transaction::
+Record transaction flags for transaction related events.
+
SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 06d5d9b..e59d7af 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -72,6 +72,10 @@ OPTIONS
- cpu: cpu number the task ran at the time of sample
- srcline: filename and line number executed at the time of sample. The
DWARF debugging info must be provided.
+ - weight: Event specific weight, e.g. memory latency or transaction
+ abort cost. This is the global weight.
+ - local_weight: Local weight version of the weight above.
+ - transaction: Transaction abort flags.

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 1f5192a..1d8278b 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -113,7 +113,7 @@ Default is to monitor all CPUS.
-s::
--sort::
Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
- local_weight, abort, in_tx
+ local_weight, abort, in_tx, transaction

-n::
--show-nr-samples::
diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index db491e9..6d0f8a4 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -63,7 +63,7 @@ static int perf_evsel__add_sample(struct perf_evsel *evsel,
return 0;
}

- he = __hists__add_entry(&evsel->hists, al, NULL, 1, 1);
+ he = __hists__add_entry(&evsel->hists, al, NULL, 1, 1, 0);
if (he == NULL)
return -ENOMEM;

diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index 2d0462d..c6a0a86 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -232,9 +232,10 @@ int perf_diff__formula(struct hist_entry *he, struct hist_entry *pair,

static int hists__add_entry(struct hists *self,
struct addr_location *al, u64 period,
- u64 weight)
+ u64 weight, u64 transaction)
{
- if (__hists__add_entry(self, al, NULL, period, weight) != NULL)
+ if (__hists__add_entry(self, al, NULL, period, weight, transaction)
+ != NULL)
return 0;
return -ENOMEM;
}
@@ -256,7 +257,8 @@ static int diff__process_sample_event(struct perf_tool *tool __maybe_unused,
if (al.filtered)
return 0;

- if (hists__add_entry(&evsel->hists, &al, sample->period, sample->weight)) {
+ if (hists__add_entry(&evsel->hists, &al, sample->period,
+ sample->weight, sample->transaction)) {
pr_warning("problem incrementing symbol period, skipping event\n");
return -1;
}
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1c46dd0..870010d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -958,6 +958,8 @@ const struct option record_options[] = {
parse_branch_stack),
OPT_BOOLEAN('W', "weight", &record.opts.sample_weight,
"sample by weight (on special events only)"),
+ OPT_BOOLEAN(0, "transaction", &record.opts.sample_transaction,
+ "sample transaction flags (special events only)"),
OPT_END()
};

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 9a85d66..9d7b17c 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -251,7 +251,7 @@ static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
}

he = __hists__add_entry(&evsel->hists, al, parent, sample->period,
- sample->weight);
+ sample->weight, sample->transaction);
if (he == NULL)
return -ENOMEM;

@@ -752,7 +752,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline,"
" dso_to, dso_from, symbol_to, symbol_from, mispredict,"
" weight, local_weight, mem, symbol_daddr, dso_daddr, tlb, "
- "snoop, locked, abort, in_tx"),
+ "snoop, locked, abort, in_tx, transaction"),
OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
"Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index c83b1fd..ee9df3d 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -252,7 +252,7 @@ static struct hist_entry *perf_evsel__add_hist_entry(struct perf_evsel *evsel,
struct hist_entry *he;

he = __hists__add_entry(&evsel->hists, al, NULL, sample->period,
- sample->weight);
+ sample->weight, sample->transaction);
if (he == NULL)
return NULL;

@@ -1090,7 +1090,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
"be more verbose (show counter open errors, etc)"),
OPT_STRING('s', "sort", &sort_order, "key[,key2...]",
"sort by key(s): pid, comm, dso, symbol, parent, weight, local_weight,"
- " abort, in_tx"),
+ " abort, in_tx, transaction"),
OPT_BOOLEAN('n', "show-nr-samples", &symbol_conf.show_nr_samples,
"Show a column with the number of samples"),
OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 4fb573b..5bf680c 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -230,6 +230,7 @@ struct perf_record_opts {
u64 default_interval;
u64 user_interval;
u16 stack_dump_size;
+ bool sample_transaction;
};

#endif
diff --git a/tools/perf/tests/hists_link.c b/tools/perf/tests/hists_link.c
index 89085a9..c1d3ac3 100644
--- a/tools/perf/tests/hists_link.c
+++ b/tools/perf/tests/hists_link.c
@@ -223,7 +223,8 @@ static int add_hist_entries(struct perf_evlist *evlist, struct machine *machine)
&sample, 0) < 0)
goto out;

- he = __hists__add_entry(&evsel->hists, &al, NULL, 1, 1);
+ he = __hists__add_entry(&evsel->hists, &al, NULL,
+ 1, 1, 0);
if (he == NULL)
goto out;

@@ -247,7 +248,8 @@ static int add_hist_entries(struct perf_evlist *evlist, struct machine *machine)
&sample, 0) < 0)
goto out;

- he = __hists__add_entry(&evsel->hists, &al, NULL, 1, 1);
+ he = __hists__add_entry(&evsel->hists, &al, NULL, 1, 1,
+ 0);
if (he == NULL)
goto out;

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 1813895..536a00a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -89,6 +89,7 @@ struct perf_sample {
u64 stream_id;
u64 period;
u64 weight;
+ u64 transaction;
u32 cpu;
u32 raw_size;
u64 data_src;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 07b1a3a..17f2d2a 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -569,6 +569,9 @@ void perf_evsel__config(struct perf_evsel *evsel,
if (opts->sample_weight)
attr->sample_type |= PERF_SAMPLE_WEIGHT;

+ if (opts->sample_transaction)
+ attr->sample_type |= PERF_SAMPLE_TRANSACTION;
+
attr->mmap = track;
attr->comm = track;

@@ -1186,6 +1189,12 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
array++;
}

+ data->transaction = 0;
+ if (type & PERF_SAMPLE_TRANSACTION) {
+ data->transaction = *array;
+ array++;
+ }
+
return 0;
}

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6b32721..9611f15 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -155,6 +155,10 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__new_col_len(hists, HISTC_MEM_LVL, 21 + 3);
hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
+
+ if (h->transaction)
+ hists__new_col_len(hists, HISTC_TRANSACTION,
+ hist_entry__transaction_len());
}

void hists__output_recalc_col_len(struct hists *hists, int max_rows)
@@ -457,7 +461,7 @@ struct hist_entry *__hists__add_branch_entry(struct hists *self,
struct hist_entry *__hists__add_entry(struct hists *self,
struct addr_location *al,
struct symbol *sym_parent, u64 period,
- u64 weight)
+ u64 weight, u64 transaction)
{
struct hist_entry entry = {
.thread = al->thread,
@@ -478,6 +482,7 @@ struct hist_entry *__hists__add_entry(struct hists *self,
.hists = self,
.branch_info = NULL,
.mem_info = NULL,
+ .transaction = transaction,
};

return add_hist_entry(self, &entry, al, period, weight);
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 284a748..63bb98c 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -59,6 +59,7 @@ enum hist_column {
HISTC_MEM_TLB,
HISTC_MEM_LVL,
HISTC_MEM_SNOOP,
+ HISTC_TRANSACTION,
HISTC_NR_COLS, /* Last entry */
};

@@ -84,9 +85,10 @@ struct hists {
struct hist_entry *__hists__add_entry(struct hists *self,
struct addr_location *al,
struct symbol *parent, u64 period,
- u64 weight);
+ u64 weight, u64 transaction);
int64_t hist_entry__cmp(struct hist_entry *left, struct hist_entry *right);
int64_t hist_entry__collapse(struct hist_entry *left, struct hist_entry *right);
+int hist_entry__transaction_len(void);
int hist_entry__sort_snprintf(struct hist_entry *self, char *bf, size_t size,
struct hists *hists);
void hist_entry__free(struct hist_entry *);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index cf1fe01..9428c1f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -804,6 +804,9 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,

if (sample_type & PERF_SAMPLE_DATA_SRC)
printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION)
+ printf("... transaction: %" PRIx64 "\n", sample->transaction);
}

static struct machine *
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 480c2da..36f00a4 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -904,6 +904,79 @@ struct sort_entry sort_intx = {
.se_width_idx = HISTC_INTX,
};

+static int64_t
+sort__transaction_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->transaction - right->transaction;
+}
+
+static inline char *add_str(char *p, const char *str)
+{
+ strcpy(p, str);
+ return p + strlen(str);
+}
+
+static struct txbit {
+ unsigned flag;
+ const char *name;
+ int skip_for_len;
+} txbits[] = {
+ { PERF_SAMPLE_TXN_ELISION, "EL ", 0 },
+ { PERF_SAMPLE_TXN_TRANSACTION, "TX ", 1 },
+ { PERF_SAMPLE_TXN_SYNC, "SYNC ", 1 },
+ { PERF_SAMPLE_TXN_ASYNC, "ASYNC ", 0 },
+ { PERF_SAMPLE_TXN_RETRY, "RETRY ", 0 },
+ { PERF_SAMPLE_TXN_CONFLICT, "CON ", 0 },
+ { PERF_SAMPLE_TXN_CAPACITY, "CAP ", 1 },
+ { PERF_SAMPLE_TXN_MEMORY, "MEM ", 0 },
+ { PERF_SAMPLE_TXN_MISC, "MISC ", 0 },
+ { 0, NULL, 0 }
+};
+
+int hist_entry__transaction_len(void)
+{
+ int i;
+ int len = 0;
+
+ for (i = 0; txbits[i].name; i++) {
+ if (!txbits[i].skip_for_len)
+ len += strlen(txbits[i].name);
+ }
+ len += 4; /* :XX<space> */
+ return len;
+}
+
+static int hist_entry__transaction_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ u64 t = self->transaction;
+ char buf[128];
+ char *p = buf;
+ int i;
+
+ buf[0] = 0;
+ for (i = 0; txbits[i].name; i++)
+ if (txbits[i].flag & t)
+ p = add_str(p, txbits[i].name);
+ if (t && !(t & (PERF_SAMPLE_TXN_SYNC|PERF_SAMPLE_TXN_ASYNC)))
+ p = add_str(p, "NEITHER ");
+ if (t & PERF_SAMPLE_TXN_ABORT_MASK) {
+ sprintf(p, ":%" PRIx64,
+ (t & PERF_SAMPLE_TXN_ABORT_MASK) >>
+ PERF_SAMPLE_TXN_ABORT_SHIFT);
+ p += strlen(p);
+ }
+
+ return repsep_snprintf(bf, size, "%-*s", width, buf);
+}
+
+struct sort_entry sort_transaction = {
+ .se_header = "Transaction ",
+ .se_cmp = sort__transaction_cmp,
+ .se_snprintf = hist_entry__transaction_snprintf,
+ .se_width_idx = HISTC_TRANSACTION,
+};
+
struct sort_dimension {
const char *name;
struct sort_entry *entry;
@@ -928,6 +1001,7 @@ static struct sort_dimension common_sort_dimensions[] = {
DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
DIM(SORT_MEM_LVL, "mem", sort_mem_lvl),
DIM(SORT_MEM_SNOOP, "snoop", sort_mem_snoop),
+ DIM(SORT_TRANSACTION, "transaction", sort_transaction),
};

#undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index e053d70..14a5564 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -83,6 +83,7 @@ struct hist_entry {
struct map_symbol ms;
struct thread *thread;
u64 ip;
+ u64 transaction;
s32 cpu;

struct hist_entry_diff diff;
@@ -140,6 +141,7 @@ enum sort_type {
SORT_MEM_TLB,
SORT_MEM_LVL,
SORT_MEM_SNOOP,
+ SORT_TRANSACTION,

/* branch stack specific sort keys */
__SORT_BRANCH_STACK,
--
1.7.7.6

2013-04-20 19:20:02

by Andi Kleen

Subject: [PATCH 02/15] perf, x86: Disable software LBR filter for Sandy Bridge/Haswell

From: Stephane Eranian <[email protected]>

Sandy Bridge and Haswell support all required LBR filters natively,
so there is no need to do instruction decoding in branch_type.
This lowers the overhead of LBR sampling with filters.

We enable far calls for the call filter, so calls include exceptions, but
that seems like an acceptable trade-off for much faster LBR sampling.

[Description and changes from AK]

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 12 ++++++++----
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 563d6e8..2341d9f 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -428,6 +428,7 @@ struct x86_pmu {
u64 lbr_sel_mask; /* LBR_SELECT valid bits */
const int *lbr_sel_map; /* lbr_select mappings */
bool lbr_double_abort; /* duplicated lbr aborts */
+ bool lbr_no_sw_filter; /* HW does all filters */

/*
* Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 33b6b5f..18f5a08 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -608,6 +608,9 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
int i, j, type;
bool compress = false;

+ if (x86_pmu.lbr_no_sw_filter)
+ return;
+
/* if sampling all branches, then nothing to filter */
if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
return;
@@ -727,12 +730,13 @@ void intel_pmu_lbr_init_snb(void)

x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
x86_pmu.lbr_sel_map = snb_lbr_sel_map;
+ x86_pmu.lbr_no_sw_filter = true;

/*
- * SW branch filter usage:
- * - support syscall, sysret capture.
- * That requires LBR_FAR but that means far
- * jmp need to be filtered out
+ * We include interrupts/exceptions
+ * with calls. While technically they are not,
+ * it's not worth extra filtering just to
+ * get rid of them.
*/
pr_cont("16-deep LBR, ");
}
--
1.7.7.6

2013-04-20 19:20:00

by Andi Kleen

Subject: [PATCH 10/15] perf, core: Add generic transaction flags v3

From: Andi Kleen <[email protected]>

Add a generic qualifier for transaction events, as a new sample
type that returns a flag word. This is particularly useful
for qualifying aborts: to distinguish aborts which happen
due to asynchronous events (like conflicts caused by another
CPU) versus instructions that lead to an abort.

The tuning strategies are very different for those cases,
so it's important to distinguish them easily and early.

Since it's inconvenient and inflexible to filter for this
in the kernel, we report all the events out and allow some
post-processing in user space.

The flags are based on the Intel TSX events, but should be fairly
generic and mostly applicable to other architectures too. In addition
to various flag words there's also reserved space to report a
program-supplied abort code. For TSX this is used to distinguish specific
classes of aborts, like a lock-busy abort when doing lock elision.

Flags:

Elision and generic transactions (ELISION vs TRANSACTION)
Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
Retryable transaction (RETRY)
Conflicts with other threads (CONFLICT)
Transaction capacity overflow (CAPACITY)
Memory related abort (MEMORY)
Other unknown aborts (MISC)

Transactions implicitly aborted can also return an abort code.
This can be used to signal specific events to the profiler. A common
case is abort on lock busy in an RTM-eliding library (code 0xff).
To handle this case we include the TSX abort code.

Common example aborts in TSX would be:

- Conflict with another thread on memory read.
Flags: TRANSACTION|ASYNC|CONFLICT|MEMORY
- executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC|MISC
- aborting on a MMIO in a driver. Flags: TRANSACTION|MEMORY|SYNC
- HLE transaction in user space is too large
Flags: ELISION|SYNC|MEMORY|CAPACITY

The only flag that is somewhat TSX specific is ELISION.

This adds the perf core glue needed for reporting the new flag word out.
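
A minimal user-space sketch of decoding the flag word delivered with
PERF_SAMPLE_TRANSACTION (assumes the updated uapi header from this patch;
print_txn_flags is a made-up helper name for illustration):

#include <stdio.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Print the transaction flags and the optional abort code. */
static void print_txn_flags(uint64_t txn)
{
	static const struct { uint64_t flag; const char *name; } bits[] = {
		{ PERF_SAMPLE_TXN_ELISION,     "EL" },
		{ PERF_SAMPLE_TXN_TRANSACTION, "TX" },
		{ PERF_SAMPLE_TXN_SYNC,        "SYNC" },
		{ PERF_SAMPLE_TXN_ASYNC,       "ASYNC" },
		{ PERF_SAMPLE_TXN_RETRY,       "RETRY" },
		{ PERF_SAMPLE_TXN_CONFLICT,    "CON" },
		{ PERF_SAMPLE_TXN_CAPACITY,    "CAP" },
		{ PERF_SAMPLE_TXN_MEMORY,      "MEM" },
		{ PERF_SAMPLE_TXN_MISC,        "MISC" },
	};
	unsigned int i;

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
		if (txn & bits[i].flag)
			printf("%s ", bits[i].name);

	/* Bits 24..31 carry the program supplied abort code, if any. */
	if (txn & PERF_SAMPLE_TXN_ABORT_MASK)
		printf(":%x", (unsigned int)((txn & PERF_SAMPLE_TXN_ABORT_MASK)
					     >> PERF_SAMPLE_TXN_ABORT_SHIFT));
	printf("\n");
}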

v2: Add MEM/MISC
v3: Move transaction to the end
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/perf_event.h | 5 +++++
include/uapi/linux/perf_event.h | 25 ++++++++++++++++++++++++-
kernel/events/core.c | 6 ++++++
3 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 466e378..29f3420 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -580,6 +580,10 @@ struct perf_sample_data {
struct perf_regs_user regs_user;
u64 stack_user_size;
u64 weight;
+ /*
+ * Transaction flags for abort events:
+ */
+ u64 transaction;
};

static inline void perf_sample_data_init(struct perf_sample_data *data,
@@ -595,6 +599,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->stack_user_size = 0;
data->weight = 0;
data->data_src.val = 0;
+ data->transaction = 0;
}

extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..44be18d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -134,8 +134,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_STACK_USER = 1U << 13,
PERF_SAMPLE_WEIGHT = 1U << 14,
PERF_SAMPLE_DATA_SRC = 1U << 15,
+ PERF_SAMPLE_TRANSACTION = 1U << 16,

- PERF_SAMPLE_MAX = 1U << 16, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 17, /* non-ABI */
};

/*
@@ -179,6 +180,28 @@ enum perf_sample_regs_abi {
};

/*
+ * Values for the transaction event qualifier, mostly for abort events.
+ */
+enum {
+ PERF_SAMPLE_TXN_ELISION = (1 << 0), /* From elision */
+ PERF_SAMPLE_TXN_TRANSACTION = (1 << 1), /* From transaction */
+ PERF_SAMPLE_TXN_SYNC = (1 << 2), /* Instruction is related */
+ PERF_SAMPLE_TXN_ASYNC = (1 << 3), /* Instruction not related */
+ PERF_SAMPLE_TXN_RETRY = (1 << 4), /* Retry possible */
+ PERF_SAMPLE_TXN_CONFLICT = (1 << 5), /* Conflict abort */
+ PERF_SAMPLE_TXN_CAPACITY = (1 << 6), /* Capacity abort */
+ PERF_SAMPLE_TXN_MEMORY = (1 << 7), /* Memory related abort */
+ PERF_SAMPLE_TXN_MISC = (1 << 8), /* Misc aborts */
+
+ PERF_SAMPLE_TXN_MAX = (1 << 9), /* non-ABI */
+
+ /* bits 24..31 are reserved for the abort code */
+
+ PERF_SAMPLE_TXN_ABORT_MASK = 0xff000000,
+ PERF_SAMPLE_TXN_ABORT_SHIFT = 24,
+};
+
+/*
* The format of the data returned by read() on a perf event fd,
* as specified by attr.read_format:
*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 98c0845..658760b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -979,6 +979,9 @@ static void perf_event__header_size(struct perf_event *event)
if (sample_type & PERF_SAMPLE_WEIGHT)
size += sizeof(data->weight);

+ if (sample_type & PERF_SAMPLE_TRANSACTION)
+ size += sizeof(data->transaction);
+
if (sample_type & PERF_SAMPLE_READ)
size += event->read_size;

@@ -4205,6 +4208,9 @@ void perf_output_sample(struct perf_output_handle *handle,

if (sample_type & PERF_SAMPLE_DATA_SRC)
perf_output_put(handle, data->data_src.val);
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION)
+ perf_output_put(handle, data->transaction);
}

void perf_prepare_sample(struct perf_event_header *header,
--
1.7.7.6

2013-04-20 19:19:54

by Andi Kleen

Subject: [PATCH 11/15] perf, x86: Add Haswell specific transaction flag reporting

From: Andi Kleen <[email protected]>

In the PEBS handler report the transaction flags using the new
generic transaction flags facility. Most of them come from
the "tsx_tuning" field in PEBSv2, but the abort code is derived
from the RAX register reported in the PEBS record.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 60683c4..f9bb903 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -812,6 +812,19 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
x86_pmu.intel_cap.pebs_format >= 1)
data.addr = pebs->dla;

+ if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) &&
+ (x86_pmu.intel_cap.pebs_format >= 2) &&
+ pebs_hsw->tsx_tuning)
+ data.weight = pebs_hsw->tsx_tuning & 0xffffffff;
+
+ if ((event->attr.sample_type & PERF_SAMPLE_TRANSACTION) &&
+ x86_pmu.intel_cap.pebs_format >= 2) {
+ data.transaction = pebs_hsw->tsx_tuning >> 32;
+ if ((data.transaction & PERF_SAMPLE_TXN_TRANSACTION) &&
+ (pebs->ax & 1))
+ data.transaction |= pebs->ax & 0xff000000;
+ }
+
if (has_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;

--
1.7.7.6

2013-04-20 19:21:53

by Andi Kleen

Subject: [PATCH 08/15] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v5

From: Andi Kleen <[email protected]>

This is not arch perfmon, but older CPUs will just ignore it. This makes
it possible to do at least some TSX measurements from a KVM guest.

Cc: [email protected]
v2: Various fixes to address review feedback
v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
v4: Use reserved bits for #GP
v5: Remove obsolete argument
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.c | 25 ++++++++++++++++++++-----
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..7c7e207 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -315,6 +315,7 @@ struct kvm_pmu {
u64 global_ovf_ctrl;
u64 counter_bitmask[2];
u64 global_ctrl_mask;
+ u64 reserved_bits;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cfc258a..9317c43 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)

static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
unsigned config, bool exclude_user, bool exclude_kernel,
- bool intr)
+ bool intr, bool intx, bool intx_cp)
{
struct perf_event *event;
struct perf_event_attr attr = {
@@ -173,6 +173,10 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+ if (intx)
+ attr.config |= HSW_INTX;
+ if (intx_cp)
+ attr.config |= HSW_INTX_CHECKPOINTED;

attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);

@@ -226,7 +230,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)

if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
ARCH_PERFMON_EVENTSEL_INV |
- ARCH_PERFMON_EVENTSEL_CMASK))) {
+ ARCH_PERFMON_EVENTSEL_CMASK |
+ HSW_INTX |
+ HSW_INTX_CHECKPOINTED))) {
config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
unit_mask);
if (config != PERF_COUNT_HW_MAX)
@@ -239,7 +245,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
reprogram_counter(pmc, type, config,
!(eventsel & ARCH_PERFMON_EVENTSEL_USR),
!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
- eventsel & ARCH_PERFMON_EVENTSEL_INT);
+ eventsel & ARCH_PERFMON_EVENTSEL_INT,
+ (eventsel & HSW_INTX),
+ (eventsel & HSW_INTX_CHECKPOINTED));
}

static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
@@ -256,7 +264,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
arch_events[fixed_pmc_events[idx]].event_type,
!(en & 0x2), /* exclude user */
!(en & 0x1), /* exclude kernel */
- pmi);
+ pmi, false, false);
}

static inline u8 fixed_en_pmi(u64 ctrl, int idx)
@@ -400,7 +408,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
if (data == pmc->eventsel)
return 0;
- if (!(data & 0xffffffff00200000ull)) {
+ if (!(data & pmu->reserved_bits)) {
reprogram_gp_counter(pmc, data);
return 0;
}
@@ -442,6 +450,7 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_GP] = 0;
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->version = 0;
+ pmu->reserved_bits = 0xffffffff00200000ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -470,6 +479,12 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
+
+ entry = kvm_find_cpuid_entry(vcpu, 7, 0);
+ if (entry &&
+ (boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
+ (entry->ebx & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
+ pmu->reserved_bits ^= HSW_INTX|HSW_INTX_CHECKPOINTED;
}

void kvm_pmu_init(struct kvm_vcpu *vcpu)
--
1.7.7.6

2013-04-20 19:21:52

by Andi Kleen

Subject: [PATCH 04/15] perf, tools: Support sorting by in_tx, abort branch flags v3

From: Andi Kleen <[email protected]>

Extend the perf branch sorting code to support sorting by in_tx
or abort qualifiers. Also print out those qualifiers.

This also fixes up some of the existing sort key documentation.

We do not support notx here, because that would simply be the inverse
of the in_tx flag.
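
For example (a sketch; -b enables branch stack sampling):

  perf record -b -- ./workload
  perf report --sort symbol_from,symbol_to,mispredict,in_tx,abort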

v2: Readd flags to man pages
v3: Rename intx
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 4 ++-
tools/perf/Documentation/perf-top.txt | 3 +-
tools/perf/builtin-report.c | 2 +-
tools/perf/builtin-top.c | 3 +-
tools/perf/perf.h | 4 ++-
tools/perf/util/hist.h | 2 +
tools/perf/util/sort.c | 51 ++++++++++++++++++++++++++++++
tools/perf/util/sort.h | 2 +
8 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 7d5f4f3..06d5d9b 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -71,7 +71,7 @@ OPTIONS
entries are displayed as "[other]".
- cpu: cpu number the task ran at the time of sample
- srcline: filename and line number executed at the time of sample. The
- DWARF debuggin info must be provided.
+ DWARF debugging info must be provided.

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
@@ -85,6 +85,8 @@ OPTIONS
- symbol_from: name of function branched from
- symbol_to: name of function branched to
- mispredict: "N" for predicted branch, "Y" for mispredicted branch
+ - in_tx: branch in TSX transaction
+ - abort: TSX transaction abort.

And default sort keys are changed to comm, dso_from, symbol_from, dso_to
and symbol_to, see '--branch-stack'.
diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 9f1a2fe..1f5192a 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -112,7 +112,8 @@ Default is to monitor all CPUS.

-s::
--sort::
- Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight, local_weight.
+ Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
+ local_weight, abort, in_tx

-n::
--show-nr-samples::
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index bd0ca81..9a85d66 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -752,7 +752,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline,"
" dso_to, dso_from, symbol_to, symbol_from, mispredict,"
" weight, local_weight, mem, symbol_daddr, dso_daddr, tlb, "
- "snoop, locked"),
+ "snoop, locked, abort, in_tx"),
OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
"Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 67bdb9f..c83b1fd 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1089,7 +1089,8 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
OPT_INCR('v', "verbose", &verbose,
"be more verbose (show counter open errors, etc)"),
OPT_STRING('s', "sort", &sort_order, "key[,key2...]",
- "sort by key(s): pid, comm, dso, symbol, parent, weight, local_weight"),
+ "sort by key(s): pid, comm, dso, symbol, parent, weight, local_weight,"
+ " abort, in_tx"),
OPT_BOOLEAN('n', "show-nr-samples", &symbol_conf.show_nr_samples,
"Show a column with the number of samples"),
OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 32bd102..4fb573b 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -179,7 +179,9 @@ struct ip_callchain {
struct branch_flags {
u64 mispred:1;
u64 predicted:1;
- u64 reserved:62;
+ u64 intx:1;
+ u64 abort:1;
+ u64 reserved:60;
};

struct branch_entry {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 14c2fe2..284a748 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -44,6 +44,8 @@ enum hist_column {
HISTC_PARENT,
HISTC_CPU,
HISTC_MISPREDICT,
+ HISTC_INTX,
+ HISTC_ABORT,
HISTC_SYMBOL_FROM,
HISTC_SYMBOL_TO,
HISTC_DSO_FROM,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 5f52d49..480c2da 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -855,6 +855,55 @@ struct sort_entry sort_mem_snoop = {
.se_width_idx = HISTC_MEM_SNOOP,
};

+static int64_t
+sort__abort_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->branch_info->flags.abort !=
+ right->branch_info->flags.abort;
+}
+
+static int hist_entry__abort_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ static const char *out = ".";
+
+ if (self->branch_info->flags.abort)
+ out = "A";
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_abort = {
+ .se_header = "Transaction abort",
+ .se_cmp = sort__abort_cmp,
+ .se_snprintf = hist_entry__abort_snprintf,
+ .se_width_idx = HISTC_ABORT,
+};
+
+static int64_t
+sort__intx_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->branch_info->flags.intx !=
+ right->branch_info->flags.intx;
+}
+
+static int hist_entry__intx_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ static const char *out = ".";
+
+ if (self->branch_info->flags.intx)
+ out = "T";
+
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_intx = {
+ .se_header = "Branch in transaction",
+ .se_cmp = sort__intx_cmp,
+ .se_snprintf = hist_entry__intx_snprintf,
+ .se_width_idx = HISTC_INTX,
+};
+
struct sort_dimension {
const char *name;
struct sort_entry *entry;
@@ -891,6 +940,8 @@ static struct sort_dimension bstack_sort_dimensions[] = {
DIM(SORT_SYM_FROM, "symbol_from", sort_sym_from),
DIM(SORT_SYM_TO, "symbol_to", sort_sym_to),
DIM(SORT_MISPREDICT, "mispredict", sort_mispredict),
+ DIM(SORT_INTX, "in_tx", sort_intx),
+ DIM(SORT_ABORT, "abort", sort_abort),
};

#undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index f24bdf6..e053d70 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -148,6 +148,8 @@ enum sort_type {
SORT_SYM_FROM,
SORT_SYM_TO,
SORT_MISPREDICT,
+ SORT_ABORT,
+ SORT_INTX,
};

/*
--
1.7.7.6

2013-04-20 19:21:50

by Andi Kleen

Subject: [PATCH 14/15] perf, x86: Add Haswell TSX event aliases v4

From: Andi Kleen <[email protected]>

Add infrastructure to generate event aliases in /sys/devices/cpu/events/

And use this to set up user friendly aliases for the common TSX events.
TSX tuning relies heavily on the PMU, so it's important to be user friendly.

This replaces the generic transaction events in an earlier version
of this patchkit.

tx-start/commit/abort to count RTM transactions
el-start/commit/abort to count HLE ("elision") transactions
tx-conflict/overflow to count conflict/overflow for both combined.

The general abort events exist in precise, non-precise, and precise-return
variants. Since sampling is the common case, plain "tx-aborts" is precise.

This is very important because abort sampling only really works
with PEBS enabled; otherwise it would report the IP after the abort,
not the abort point. But counting with PEBS has more overhead,
so we also have tx/el-abort-count aliases that do not enable PEBS,
for use with perf stat.

In many cases sampling the return address with PEBS is still useful, so
we also have tx/el-abort-return. These are mainly for sampling
asynchronous conflicts, where it can be more beneficial to look at the
complete critical section than at the exact abort point. We still
want PEBS for those so that the transaction weight and flags can
be examined.

There is a tx-abort<->tx-aborts alias too, because I found myself
using both variants.

Also add friendly aliases for cpu/cycles,intx=1/ and
cpu/cycles,intx=1,intx_cp=1/, and the same for instructions.
These will be used by perf stat -T, and are also useful for users directly.
So for example, to count transactional cycles you can use "perf stat -e cycles-t".

This gives a clean set of generalized events to examine transaction
success and aborts. Haswell has additional events for TSX, but those are more
specialized for very specific situations.
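
Illustrative usage of the aliases (a sketch):

  # count transaction success without PEBS overhead
  perf stat -e tx-start,tx-commit,tx-abort-count -- ./workload

  # sample aborts precisely at the abort point
  perf record -e cpu/tx-aborts/ -W -- ./workload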

v2: Move to new sysfs infrastructure
v3: Use own sysfs functions now
v4: Add tx/el-abort-return for better conflict sampling
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 55 ++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f24fb6f..d3fa8cd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2069,6 +2069,60 @@ static __init void intel_nehalem_quirk(void)
}
}

+/* Haswell special events */
+EVENT_ATTR_STR(tx-start, tx_start, "event=0xc9,umask=0x1");
+EVENT_ATTR_STR(tx-commit, tx_commit, "event=0xc9,umask=0x2");
+EVENT_ATTR_STR(tx-abort, tx_abort, "event=0xc9,umask=0x4,precise=2");
+EVENT_ATTR_STR(tx-abort-count, tx_abort_count, "event=0xc9,umask=0x4");
+EVENT_ATTR_STR(tx-abort-return, tx_abort_return, "event=0xc9,umask=0x4,precise=1");
+/* alias */
+EVENT_ATTR_STR(tx-aborts, tx_aborts, "event=0xc9,umask=0x4,precise=2");
+EVENT_ATTR_STR(tx-capacity, tx_capacity, "event=0x54,umask=0x2");
+EVENT_ATTR_STR(tx-conflict, tx_conflict, "event=0x54,umask=0x1");
+EVENT_ATTR_STR(el-start, el_start, "event=0xc8,umask=0x1");
+EVENT_ATTR_STR(el-commit, el_commit, "event=0xc8,umask=0x2");
+EVENT_ATTR_STR(el-abort, el_abort, "event=0xc8,umask=0x4,precise=2");
+EVENT_ATTR_STR(el-abort-return, el_abort_return, "event=0xc8,umask=0x4,precise=1");
+EVENT_ATTR_STR(el-abort-count, el_abort_count, "event=0xc8,umask=0x4");
+/* alias */
+EVENT_ATTR_STR(el-aborts, el_aborts, "event=0xc8,umask=0x4,precise=2");
+/* shared with tx-* */
+EVENT_ATTR_STR(el-capacity, el_capacity, "event=0x54,umask=0x2");
+/* shared with tx-* */
+EVENT_ATTR_STR(el-conflict, el_conflict, "event=0x54,umask=0x1");
+EVENT_ATTR_STR(cycles-t, cycles_t, "event=0x3c,intx=1");
+EVENT_ATTR_STR(cycles-ct, cycles_ct, "event=0x3c,intx=1,intx_cp=1");
+EVENT_ATTR_STR(instructions-t, instructions_t, "event=0xc0,umask=0x01,intx=1");
+EVENT_ATTR_STR(instructions-ct, instructions_ct, "event=0xc0,umask=0x01,intx=1,intx_cp=1");
+EVENT_ATTR_STR(instructions-p, instructions_p, "event=0xc0,umask=0x01,precise=2");
+
+static struct attribute *hsw_events_attrs[] = {
+ EVENT_PTR(tx_start),
+ EVENT_PTR(tx_commit),
+ EVENT_PTR(tx_abort),
+ EVENT_PTR(tx_aborts),
+ EVENT_PTR(tx_abort_count),
+ EVENT_PTR(tx_abort_return),
+ EVENT_PTR(tx_capacity),
+ EVENT_PTR(tx_conflict),
+ EVENT_PTR(el_start),
+ EVENT_PTR(el_commit),
+ EVENT_PTR(el_abort),
+ EVENT_PTR(el_aborts),
+ EVENT_PTR(el_abort_count),
+ EVENT_PTR(el_abort_return),
+ EVENT_PTR(el_capacity),
+ EVENT_PTR(el_conflict),
+ EVENT_PTR(cycles_t),
+ EVENT_PTR(cycles_ct),
+ EVENT_PTR(instructions_t),
+ EVENT_PTR(instructions_ct),
+ EVENT_PTR(instructions_p),
+ EVENT_PTR(mem_ld_nhm),
+ /* TBD add a mem-stores event */
+ NULL
+};
+
__init int intel_pmu_init(void)
{
union cpuid10_edx edx;
@@ -2307,6 +2361,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
x86_pmu.format_attrs = intel_hsw_formats_attr;
+ x86_pmu.cpu_events = hsw_events_attrs;
x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;
--
1.7.7.6

2013-04-20 19:19:28

by Andi Kleen

Subject: [PATCH 03/15] perf, x86: Support full width counting v3

From: Andi Kleen <[email protected]>

Recent Intel CPUs like Haswell and IvyBridge have a new alternative MSR
range for perfctrs that allows writing the full counter width. Enable this
range if the hardware reports it using a new capability bit.

This lowers the overhead of perf stat slightly because it has to take fewer
interrupts to accumulate the counter value. On Haswell it also avoids some
problems with TSX aborting when the end of the counter range is reached.
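
A minimal sketch of the effect in the counter reprogramming path
(assumption: the legacy perfctr MSRs only accept 32-bit sign-extended
writes, which is what previously limited the usable period; names as in
x86_perf_event_set_period):

	/* with full_width_write the whole counter width can be written */
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

where hwc->event_base now points into the MSR_IA32_PMC0 range and
x86_pmu.max_period has been raised to the full counter mask.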

v2: Print the feature at boot
v3: Rename field. Add comment.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/uapi/asm/msr-index.h | 3 +++
arch/x86/kernel/cpu/perf_event.h | 5 +++++
arch/x86/kernel/cpu/perf_event_intel.c | 7 +++++++
3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index bf7bb68..dbe5b52 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -167,6 +167,9 @@
#define MSR_KNC_EVNTSEL0 0x00000028
#define MSR_KNC_EVNTSEL1 0x00000029

+/* Alternative perfctr range with full access. */
+#define MSR_IA32_PMC0 0x000004c1
+
/* AMD64 MSRs. Not complete. See the architecture manual for a more
complete list. */

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 2341d9f..0da5713 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -304,6 +304,11 @@ union perf_capabilities {
u64 pebs_arch_reg:1;
u64 pebs_format:4;
u64 smm_freeze:1;
+ /*
+ * PMU supports separate counter range for writing
+ * values > 32bit.
+ */
+ u64 full_width_write:1;
};
u64 capabilities;
};
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 5a0d73c..3f2afb2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2299,5 +2299,12 @@ __init int intel_pmu_init(void)
}
}

+ /* Support full width counters using alternative MSR range */
+ if (x86_pmu.intel_cap.full_width_write) {
+ x86_pmu.max_period = x86_pmu.cntval_mask;
+ x86_pmu.perfctr = MSR_IA32_PMC0;
+ pr_cont("full-width counters, ");
+ }
+
return 0;
}
--
1.7.7.6

2013-04-20 19:22:42

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 07/15] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4

From: Andi Kleen <[email protected]>

With checkpointed counters there can be a situation where the counter
is overflowing, aborts the transaction, is set back to a non-overflowing
checkpoint, and then causes an interrupt. The interrupt doesn't see the
overflow because it has been checkpointed. This is then a spurious PMI,
typically with an ugly NMI message. It can also lead to excessive aborts.

Avoid this problem by:
- Using the full counter width for counting counters (earlier patch)
- Forbid sampling for checkpointed counters. It's not too useful anyway;
checkpointing is mainly for counting. The check is approximate
(to still handle KVM), but should catch the majority of cases.
- On a PMI always set back checkpointed counters to zero.
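
For reference, the HSW_INTX* qualifier bits used here are defined in the
prerequisite "basic haswell" patchkit; per the SDM they are the two extra
IA32_PERFEVTSELx bits on Haswell (sketched here for convenience):

	#define HSW_INTX		(1ULL << 32)	/* IN_TX: count only inside a transaction */
	#define HSW_INTX_CHECKPOINTED	(1ULL << 33)	/* IN_TXCP: roll the count back on abort */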

v2: Add unlikely. Add comment
v3: Allow large sampling periods with CP for KVM
v4: Use event_is_checkpointed. Use EOPNOTSUPP. (Stephane Eranian)
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 39 ++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 8aa1326..f24fb6f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1122,6 +1122,11 @@ static void intel_pmu_enable_event(struct perf_event *event)
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
}

+static inline bool event_is_checkpointed(struct perf_event *event)
+{
+ return (event->hw.config & HSW_INTX_CHECKPOINTED) != 0;
+}
+
/*
* Save and restart an expired event. Called by NMI contexts,
* so it has to be careful about preempting normal event ops:
@@ -1129,6 +1134,17 @@ static void intel_pmu_enable_event(struct perf_event *event)
int intel_pmu_save_and_restart(struct perf_event *event)
{
x86_perf_event_update(event);
+ /*
+ * For a checkpointed counter always reset back to 0. This
+ * avoids a situation where the counter overflows, aborts the
+ * transaction and is then set back to shortly before the
+ * overflow, and overflows and aborts again.
+ */
+ if (unlikely(event_is_checkpointed(event))) {
+ /* No race with NMIs because the counter should not be armed */
+ wrmsrl(event->hw.event_base, 0);
+ local64_set(&event->hw.prev_count, 0);
+ }
return x86_perf_event_set_period(event);
}

@@ -1202,6 +1218,15 @@ again:
x86_pmu.drain_pebs(regs);
}

+ /*
+ * To avoid spurious interrupts with perf stat always reset checkpointed
+ * counters.
+ *
+ * XXX move somewhere else.
+ */
+ if (cpuc->events[2] && event_is_checkpointed(cpuc->events[2]))
+ status |= (1ULL << 2);
+
for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
struct perf_event *event = cpuc->events[bit];

@@ -1669,6 +1694,20 @@ static int hsw_hw_config(struct perf_event *event)
event->attr.precise_ip > 0))
return -EOPNOTSUPP;

+ if (event_is_checkpointed(event)) {
+ /*
+ * Sampling of checkpointed events can cause situations where
+ * the CPU constantly aborts because of a overflow, which is
+ * then checkpointed back and ignored. Forbid checkpointing
+ * for sampling.
+ *
+ * But still allow a long sampling period, so that perf stat
+ * from KVM works.
+ */
+ if (event->attr.sample_period > 0 &&
+ event->attr.sample_period < 0x7fffffff)
+ return -EOPNOTSUPP;
+ }
return 0;
}

--
1.7.7.6

2013-04-20 19:22:40

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 09/15] perf, x86: Support PERF_SAMPLE_ADDR for all PEBS events v3

From: Andi Kleen <[email protected]>

Haswell supplies the address for every PEBS memory event, so always fill it in
when the user requests it. It will be 0 when not useful (no memory access).
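
A minimal user-space sketch of how this is consumed (illustrative only;
the raw event number is a placeholder):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/*
	 * Open a precise (PEBS) event and request the data address.
	 * With this patch the kernel fills in addr on any pebs_format >= 1
	 * CPU; it reads as 0 when the instruction made no memory access.
	 */
	static int open_addr_sampling_event(void)
	{
		struct perf_event_attr attr = {
			.type		= PERF_TYPE_RAW,
			.size		= sizeof(attr),
			.config		= 0x1cd,	/* placeholder raw event */
			.sample_period	= 100003,
			.sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_ADDR,
			.precise_ip	= 2,		/* ask for PEBS */
		};

		return syscall(__NR_perf_event_open, &attr,
			       0, -1, -1, 0);	/* this task, any CPU */
	}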

v2: Now include fmt1 too, so it works on Nehalem and later.
v3: Remove extra code inside st|ld if.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index e0a66f80..60683c4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -767,9 +767,6 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* if PEBS-LL or PreciseStore
*/
if (fll || fst) {
- if (sample_type & PERF_SAMPLE_ADDR)
- data.addr = pebs->dla;
-
/*
* Use latency for weight (only avail with PEBS-LL)
*/
@@ -811,6 +808,10 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
else
regs.flags &= ~PERF_EFLAGS_EXACT;

+ if ((event->attr.sample_type & PERF_SAMPLE_ADDR) &&
+ x86_pmu.intel_cap.pebs_format >= 1)
+ data.addr = pebs->dla;
+
if (has_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;

--
1.7.7.6

2013-04-20 19:23:06

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 05/15] perf, tools: Add abort_tx,no_tx,in_tx branch filter options to perf record -j v3

From: Andi Kleen <[email protected]>

Make perf record -j aware of the new in_tx,no_tx,abort_tx branch qualifiers.
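
For example (illustrative invocations; the workload is a placeholder):

  perf record -j any,in_tx -e cycles ./workload

records only branches whose target executes inside a hardware
transaction, while

  perf record -j any,abort_tx -e cycles ./workload

records only transaction aborts.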

v2: ABORT -> ABORTTX
v3: Add more _
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 3 +++
tools/perf/builtin-record.c | 3 +++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d4da111..6f3405e 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -172,6 +172,9 @@ following filters are defined:
- u: only when the branch target is at the user level
- k: only when the branch target is in the kernel
- hv: only when the target is at the hypervisor level
+ - in_tx: only when the target is in a hardware transaction
+ - no_tx: only when the target is not in a hardware transaction
+ - abort_tx: only when the target is a hardware transaction abort

+
The option requires at least one branch type among any, any_call, any_ret, ind_call.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index cdf58ec..1c46dd0 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -676,6 +676,9 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+ BRANCH_OPT("abort_tx", PERF_SAMPLE_BRANCH_ABORT_TX),
+ BRANCH_OPT("in_tx", PERF_SAMPLE_BRANCH_IN_TX),
+ BRANCH_OPT("no_tx", PERF_SAMPLE_BRANCH_NO_TX),
BRANCH_END
};

--
1.7.7.6

2013-04-23 08:48:48

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 08/15] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v5

On Sat, Apr 20, 2013 at 12:19:16PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> This is not arch perfmon, but older CPUs will just ignore it. This makes
> it possible to do at least some TSX measurements from a KVM guest
>
> Cc: [email protected]
> v2: Various fixes to address review feedback
> v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
> v4: Use reserved bits for #GP
> v5: Remove obsolete argument
> Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Gleb Natapov <[email protected]>

> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/pmu.c | 25 ++++++++++++++++++++-----
> 2 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4979778..7c7e207 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -315,6 +315,7 @@ struct kvm_pmu {
> u64 global_ovf_ctrl;
> u64 counter_bitmask[2];
> u64 global_ctrl_mask;
> + u64 reserved_bits;
> u8 version;
> struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
> struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index cfc258a..9317c43 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>
> static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
> unsigned config, bool exclude_user, bool exclude_kernel,
> - bool intr)
> + bool intr, bool intx, bool intx_cp)
> {
> struct perf_event *event;
> struct perf_event_attr attr = {
> @@ -173,6 +173,10 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
> .exclude_kernel = exclude_kernel,
> .config = config,
> };
> + if (intx)
> + attr.config |= HSW_INTX;
> + if (intx_cp)
> + attr.config |= HSW_INTX_CHECKPOINTED;
>
> attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>
> @@ -226,7 +230,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
>
> if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
> ARCH_PERFMON_EVENTSEL_INV |
> - ARCH_PERFMON_EVENTSEL_CMASK))) {
> + ARCH_PERFMON_EVENTSEL_CMASK |
> + HSW_INTX |
> + HSW_INTX_CHECKPOINTED))) {
> config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
> unit_mask);
> if (config != PERF_COUNT_HW_MAX)
> @@ -239,7 +245,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> reprogram_counter(pmc, type, config,
> !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
> !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> - eventsel & ARCH_PERFMON_EVENTSEL_INT);
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + (eventsel & HSW_INTX),
> + (eventsel & HSW_INTX_CHECKPOINTED));
> }
>
> static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -256,7 +264,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> arch_events[fixed_pmc_events[idx]].event_type,
> !(en & 0x2), /* exclude user */
> !(en & 0x1), /* exclude kernel */
> - pmi);
> + pmi, false, false);
> }
>
> static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -400,7 +408,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
> } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
> if (data == pmc->eventsel)
> return 0;
> - if (!(data & 0xffffffff00200000ull)) {
> + if (!(data & pmu->reserved_bits)) {
> reprogram_gp_counter(pmc, data);
> return 0;
> }
> @@ -442,6 +450,7 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
> pmu->counter_bitmask[KVM_PMC_GP] = 0;
> pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
> pmu->version = 0;
> + pmu->reserved_bits = 0xffffffff00200000ull;
>
> entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
> if (!entry)
> @@ -470,6 +479,12 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
> pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
> (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
> pmu->global_ctrl_mask = ~pmu->global_ctrl;
> +
> + entry = kvm_find_cpuid_entry(vcpu, 7, 0);
> + if (entry &&
> + (boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
> + (entry->ebx & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
> + pmu->reserved_bits ^= HSW_INTX|HSW_INTX_CHECKPOINTED;
> }
>
> void kvm_pmu_init(struct kvm_vcpu *vcpu)
> --
> 1.7.7.6

--
Gleb.

2013-06-19 08:52:13

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 15/15] perf, tools: Add perf stat --transaction v3

On Sat, Apr 20, 2013 at 12:19:23PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> Add support to perf stat to print the basic transactional execution statistics:
> Total cycles, Cycles in Transaction, Cycles in aborted transactions
> using the intx and intx_checkpoint qualifiers.
> Transaction Starts and Elision Starts, to compute the average transaction length.
>
> This is a reasonable overview of the success of the transactions.
>
> Enable with a new --transaction / -T option.
>
> This requires measuring these events in a group, since they depend on each
> other.
>
> This is implemented by using the TM sysfs events exported by the kernel.

Hi Andi,

I think this still hasn't gone upstream, so I thought I'd just jump in
and comment for powerpc ...

> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 7e910ba..5053c1a 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -70,6 +70,30 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix);
> static void print_counter(struct perf_evsel *counter, char *prefix);
> static void print_aggr(char *prefix);
>
> +/* Default events used for perf stat -T */
> +static const char * const transaction_attrs[] = {
> + "task-clock",
> + "{"
> + "instructions,"
> + "cycles,"
> + "cpu/cycles-t/,"
> + "cpu/cycles-ct/,"
> + "cpu/tx-start/,"
> + "cpu/el-start/"
> + "}"
> +};

This hard coded list isn't going to work for us on powerpc.

We don't have HLE, so we won't ever have an event for el-start.

I don't quite grok what the cycles-ct is about, checkpointed cycles? But
I don't think we have anything equivalent.

I guess the simplest option is to make it a per-arch list inside the
perf tool?

cheers

2013-06-19 14:46:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 15/15] perf, tools: Add perf stat --transaction v3

> This hard coded list isn't going to work for us on powerpc.
>
> We don't have HLE, so we won't ever have an event for el-start.
>
> I don't quite grok what the cycles-ct is about, checkpointed cycles?

The counter is checkpointed on a transaction start, and set back to
the checkpoint when an abort happens. This makes it possible to count the
cycles wasted in aborts by subtracting it from cycles-t.
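
Concretely (numbers invented for illustration): with cycles = 10,000M,
cycles-t = 1,000M and cycles-ct = 800M, 10% of all cycles were
transactional, and cycles-t - cycles-ct = 200M cycles were spent in
transactions that later aborted.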

> But I don't think we have anything equivalent.

But you have cycles-t and tx-start?

I can make el-start, cycles-ct optional

> I guess the simplest option is to make it a per-arch list inside the
> perf tool?

I'm not sure that would be acceptable to the perf maintainers.
Although I'm just guessing, I haven't heard any comments on
this patch recently.

-Andi

--
[email protected] -- Speaking for myself only.

2013-06-27 03:18:59

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 15/15] perf, tools: Add perf stat --transaction v3

On Wed, Jun 19, 2013 at 04:46:21PM +0200, Andi Kleen wrote:
> > This hard coded list isn't going to work for us on powerpc.
> >
> > We don't have HLE, so we won't ever have an event for el-start.
> >
> > I don't quite grok what the cycles-ct is about, checkpointed cycles?
>
> The counter is checkpointed on a transaction start, and set back to
> the checkpoint when an abort happens. This makes it possible to count the
> cycles wasted in aborts by subtracting it from cycles-t.

OK. I'm still confused by that one, sorry. In the patch you do:

+ else if (perf_evsel__cmp(counter, nth_evsel(T_CYCLES_IN_TX_CP)))
+ update_stats(&runtime_cycles_in_txcp_stats[0], count[0]);

But then I don't see where you use runtime_cycles_in_txcp_stats ?

> > But I don't think we have anything equivalent.
>
> But you have cycles-t and tx-start?

We have:
- cycles
- cycles in transactional state
- cycles spent in successful transactions

So your cycles-t is "cycles in transactional state".

We would calculate cycles wasted in aborts with:

"cycles in transactional" - "cycles in successful transactions"

Which I think is what you're describing above with cycles-ct.


Does "tx-start" just count the number of transactions begun? Does it
count nested transactions?

We have one counter for non-nested transactions and one for nested, but
I think we could just count the non-nested ones as "tx-start"; that's
probably of most interest.

> > I guess the simplest option is to make it a per-arch list inside the
> > perf tool?
>
> I'm not sure that would be acceptable to the perf maintainers.
> Although I'm just guessing, I haven't heard any comments on
> this patch recently.

Yeah sure. Although I agree with the desire to make the perf tool work
similarly across architectures, I think as we add more of these detailed
analysis tools we are eventually going to come across something that
can't be handled generically. But I guess we'll see.

cheers

2013-06-27 03:49:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 15/15] perf, tools: Add perf stat --transaction v3

> OK. I'm still confused by that one sorry. In the patch you do:
>
> + else if (perf_evsel__cmp(counter, nth_evsel(T_CYCLES_IN_TX_CP)))
> + update_stats(&runtime_cycles_in_txcp_stats[0], count[0]);
>
> But then I don't see where you use runtime_cycles_in_txcp_stats ?

You're right, that variable is not needed. I'll remove it.
Only the in_tx stat is actually used.

intx-cp is still output, but directly by abs_printout,
without going through a variable.

>
> > > But I don't think we have anything equivalent.
> >
> > But you have cycles-t and tx-start?
>
> We have:
> - cycles
> - cycles in transactional state
> - cycles spent in successful transactions
>
> So your cycles-t is "cycles in transactional state".
>
> We would calculate cycles wasted in aborts with:
>
> "cycles in transactional" - "cycles in successful transactions"
>
> Which I think is what you're describing above with cycles-ct.

Yes, that should be equivalent.
It should be easy to check for and handle: check for that
event and switch the formula around.
I'll leave that to you, as I don't have any way to test it.

> Does "tx-start" just count the number of transactions begun? Does it
> count nested transactions?

Just begun, without nesting (TSX flattens all transactions).

-Andi

--
[email protected] -- Speaking for myself only.