2015-12-16 00:54:42

by Andi Kleen

Subject: Add top down metrics to perf stat v2

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the end.

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit and the
--metric-only patchkit (both posted separately).

TopDown is intended to replace the frontend cycles idle /
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one of them a fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound describe where the pipeline is
limited, while BadSpeculation accounts for slots wasted on
misspeculated execution and Retiring for slots doing useful work.

The full TopDown methodology has many hierarchical metrics.
This implementation only supports level 1, which can be
collected without multiplexing. A full implementation
of TopDown on top of perf is available in pmu-tools' toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out-of-order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots Available slots in the pipeline
topdown-slots-issued Slots issued into the pipeline
topdown-slots-retired Slots successfully retired
topdown-fetch-bubbles Pipeline gaps in the frontend
topdown-recovery-bubbles Pipeline gaps during recovery
from misspeculation

These events then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring and BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.

The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.
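
To make the formulas concrete, here is a minimal standalone sketch of
the level 1 computation (the formulas are the ones documented in the
stat-shadow.c comment of patch 06; the helper and its names are made
up for illustration). Feeding it the S0-C0 counts from the example
below reproduces the 22.62% frontend bound and 63.90% backend bound
annotations:

#include <stdio.h>

/* Illustrative only: TopDown level 1 formulas from patch 06 */
static void topdown_level1(double total, double issued, double retired,
                           double fetch_bub, double recovery_bub)
{
        double fe  = fetch_bub / total;
        double ret = retired / total;
        double bad = (issued - retired + recovery_bub) / total;
        double be  = 1.0 - fe - ret - bad;

        printf("frontend %.2f%% retiring %.2f%% bad spec %.2f%% backend %.2f%%\n",
               fe * 100., ret * 100., bad * 100., be * 100.);
}

int main(void)
{
        /* S0-C0 counts from the example output below */
        topdown_level1(19650790, 2025498, 1743552, 4445680, 622954);
        return 0;
}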


Example output:

$ ./perf stat --topdown -a ./BC1s

Performance counter stats for 'system wide':

S0-C0 2 19650790 topdown-total-slots (100.00%)
S0-C0 2 4445680.00 topdown-fetch-bubbles # 22.62% frontend bound (100.00%)
S0-C0 2 1743552.00 topdown-slots-retired (100.00%)
S0-C0 2 622954 topdown-recovery-bubbles (100.00%)
S0-C0 2 2025498.00 topdown-slots-issued # 63.90% backend bound
S0-C1 2 16685216540 topdown-total-slots (100.00%)
S0-C1 2 962557931.00 topdown-fetch-bubbles (100.00%)
S0-C1 2 4175583320.00 topdown-slots-retired (100.00%)
S0-C1 2 1743329246 topdown-recovery-bubbles # 22.22% bad speculation (100.00%)
S0-C1 2 6138901193.50 topdown-slots-issued # 46.99% backend bound

1.535832673 seconds time elapsed

$ perf stat --topdown --topdown --metric-only -I 100 ./BC1s
0.100576098 frontend bound retiring bad speculation backend bound
0.100576098 8.83% 48.93% 35.24% 7.00%
0.200800845 8.84% 48.49% 35.53% 7.13%
0.300905983 8.73% 48.64% 35.58% 7.05%
...



On HyperThreaded CPUs TopDown computes metrics per core instead of per
logical CPU, because the CPU cannot measure the thread siblings
independently. In this case perf stat automatically enables --per-core
mode, and also requires system-wide mode (-a) and no other filters
(no cgroup mode).

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.

On systems without Hyper Threading it can be used per process.

Full tree available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-11

No changelog against previous version. There were lots of changes.


2015-12-16 00:57:23

by Andi Kleen

Subject: [PATCH 01/10] perf, tools: Dont stop PMU parsing on alias parse error

From: Andi Kleen <[email protected]>

When an error happens during alias parsing, currently the complete
parsing of all attributes of the PMU is stopped. This breaks old
perf binaries on newer kernels that may have not-yet-known
alias attributes (such as .scale or .per-pkg).

Instead, continue parsing when some attribute is unparseable.

This is IMHO a stable candidate and should be backported
to older versions to avoid problems with newer kernels.

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/util/pmu.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index e4b173d..8a520e9 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -283,13 +283,12 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
{
struct dirent *evt_ent;
DIR *event_dir;
- int ret = 0;

event_dir = opendir(dir);
if (!event_dir)
return -EINVAL;

- while (!ret && (evt_ent = readdir(event_dir))) {
+ while ((evt_ent = readdir(event_dir))) {
char path[PATH_MAX];
char *name = evt_ent->d_name;
FILE *file;
@@ -305,17 +304,16 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)

snprintf(path, PATH_MAX, "%s/%s", dir, name);

- ret = -EINVAL;
file = fopen(path, "r");
if (!file)
- break;
+ continue;

- ret = perf_pmu__new_alias(head, dir, name, file);
+ perf_pmu__new_alias(head, dir, name, file);
fclose(file);
}

closedir(event_dir);
- return ret;
+ return 0;
}

/*
--
2.4.3

2015-12-16 00:56:11

by Andi Kleen

Subject: [PATCH 02/10] perf, tools, stat: Force --per-core mode for .agg-per-core aliases

From: Andi Kleen <[email protected]>

When an event alias is used that the kernel marked as .agg-per-core, force
--per-core mode (and also require -a and forbid cgroup or per-thread mode).
This in turn means --topdown forces --per-core mode.

This is needed for TopDown in SMT mode, because it needs to measure
all threads in a core together and merge the values to compute the correct
percentages of how the pipeline is limited.

We do this if any alias is agg-per-core.

Add the code to parse the .agg-per-core attributes and propagate
the information to the evsel. Then the main stat code does
the necessary checks and forces per core mode.
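
For example (illustrative path; the attribute file itself is added by
the kernel patches later in this series), an alias needing core-wide
aggregation is marked by a sysfs file such as
/sys/devices/cpu/events/topdown-total-slots.agg-per-core containing 1,
which the parsing code below picks up.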

Open issue: in combination with -C ... we get wrong values. I think that's
an existing bug that needs to be debugged/fixed separately.

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-stat.c | 18 ++++++++++++++++++
tools/perf/util/evsel.h | 1 +
tools/perf/util/parse-events.c | 1 +
tools/perf/util/pmu.c | 23 +++++++++++++++++++++++
tools/perf/util/pmu.h | 2 ++
5 files changed, 45 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 4777355..0c7cdda 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1540,6 +1540,7 @@ static int add_default_attributes(void)

int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
{
+ struct perf_evsel *counter;
const char * const stat_usage[] = {
"perf stat [<options>] [<command>]",
NULL
@@ -1674,6 +1675,23 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
if (add_default_attributes())
goto out;

+ evlist__for_each (evsel_list, counter) {
+ /* Enable per core mode if only a single event requires it. */
+ if (counter->agg_per_core) {
+ if (stat_config.aggr_mode != AGGR_GLOBAL &&
+ stat_config.aggr_mode != AGGR_CORE) {
+ pr_err("per core event configuration requires per core mode\n");
+ goto out;
+ }
+ stat_config.aggr_mode = AGGR_CORE;
+ if (nr_cgroups || !target__has_cpu(&target)) {
+ pr_err("per core event configuration requires system-wide mode (-a)\n");
+ goto out;
+ }
+ break;
+ }
+ }
+
target__validate(&target);

if (perf_evlist__create_maps(evsel_list, &target) < 0) {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 5ded1fc..efc7f7c 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -114,6 +114,7 @@ struct perf_evsel {
bool tracking;
bool per_pkg;
bool precise_max;
+ bool agg_per_core;
/* parse modifier helper */
int exclude_GH;
int nr_members;
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 6fc8cd7..66d8ebd 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1027,6 +1027,7 @@ int parse_events_add_pmu(struct parse_events_evlist *data,
evsel->unit = info.unit;
evsel->scale = info.scale;
evsel->per_pkg = info.per_pkg;
+ evsel->agg_per_core = info.agg_per_core;
evsel->snapshot = info.snapshot;
}

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 8a520e9..5a52dac 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -189,6 +189,23 @@ perf_pmu__parse_per_pkg(struct perf_pmu_alias *alias, char *dir, char *name)
return 0;
}

+static void
+perf_pmu__parse_agg_per_core(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+ char path[PATH_MAX];
+ FILE *f;
+ int flag;
+
+ snprintf(path, PATH_MAX, "%s/%s.agg-per-core", dir, name);
+
+ f = fopen(path, "r");
+ if (f && fscanf(f, "%d", &flag) == 1)
+ alias->agg_per_core = flag != 0;
+ if (f)
+ fclose(f);
+}
+
+
static int perf_pmu__parse_snapshot(struct perf_pmu_alias *alias,
char *dir, char *name)
{
@@ -237,6 +254,7 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name,
perf_pmu__parse_scale(alias, dir, name);
perf_pmu__parse_per_pkg(alias, dir, name);
perf_pmu__parse_snapshot(alias, dir, name);
+ perf_pmu__parse_agg_per_core(alias, dir, name);
}

list_add_tail(&alias->list, list);
@@ -271,6 +289,8 @@ static inline bool pmu_alias_info_file(char *name)
return true;
if (len > 9 && !strcmp(name + len - 9, ".snapshot"))
return true;
+ if (len > 13 && !strcmp(name + len - 13, ".agg-per-core"))
+ return true;

return false;
}
@@ -847,6 +867,7 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
int ret;

info->per_pkg = false;
+ info->agg_per_core = false;

/*
* Mark unit and scale as not set
@@ -870,6 +891,8 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,

if (alias->per_pkg)
info->per_pkg = true;
+ if (alias->agg_per_core)
+ info->agg_per_core = true;

list_del(&term->list);
free(term);
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 5d7e844..5a43719 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -32,6 +32,7 @@ struct perf_pmu_info {
double scale;
bool per_pkg;
bool snapshot;
+ bool agg_per_core;
};

#define UNIT_MAX_LEN 31 /* max length for event unit name */
@@ -44,6 +45,7 @@ struct perf_pmu_alias {
double scale;
bool per_pkg;
bool snapshot;
+ bool agg_per_core;
};

struct perf_pmu *perf_pmu__find(const char *name);
--
2.4.3

2015-12-16 00:55:04

by Andi Kleen

Subject: [PATCH 03/10] perf, tools, stat: Avoid fractional digits for integer scales

From: Andi Kleen <[email protected]>

When the scaling factor is a full integer, don't display fractional
digits. This avoids unnecessary .00 output for topdown metrics
with integer scale factors.
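
A quick standalone illustration of the new format selection (made-up
values; 4 is the topdown-total-slots scale from the kernel patches):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double scales[] = { 1.0, 4.0, 0.5 };
        int i;

        for (i = 0; i < 3; i++) {
                double sc = scales[i];
                /* same test as the patch: integer scales drop the digits */
                if (sc != 1.0 && floor(sc) != sc)
                        printf("%18.2f\n", sc * 1000000);
                else
                        printf("%18.0f\n", sc * 1000000);
        }
        return 0;
}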

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-stat.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 0c7cdda..1faa6fc 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -63,6 +63,7 @@
#include <stdlib.h>
#include <sys/prctl.h>
#include <locale.h>
+#include <math.h>

#define DEFAULT_SEPARATOR " "
#define CNTR_NOT_SUPPORTED "<not supported>"
@@ -739,12 +740,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
const char *fmt;

if (csv_output) {
- fmt = sc != 1.0 ? "%.2f%s" : "%.0f%s";
+ fmt = sc != 1.0 && floor(sc) != sc ? "%.2f%s" : "%.0f%s";
} else {
if (big_num)
- fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+ fmt = sc != 1.0 && floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
else
- fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+ fmt = sc != 1.0 && floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
}

aggr_printout(evsel, id, nr);
--
2.4.3

2015-12-16 00:54:41

by Andi Kleen

Subject: [PATCH 04/10] perf, tools, stat: Scale values by unit before metrics

From: Andi Kleen <[email protected]>

Scale values by unit before passing them to the metrics printing functions.
This is needed for TopDown, because it needs to scale the slots correctly
by pipeline width / SMTness.

For existing metrics it shouldn't make any difference, as those generally
use events that don't have any units.
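
For example (illustrative numbers): topdown-total-slots carries a
.scale of 4 on Core, so a raw count of 1000000 cycles reaches
perf_stat__update_shadow_stats() as 4000000 slots, and the TopDown
formulas can work directly in slot units.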

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/util/stat.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 2d9d830..913d236 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -307,6 +307,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
struct perf_counts_values *aggr = &counter->counts->aggr;
struct perf_stat_evsel *ps = counter->priv;
u64 *count = counter->counts->aggr.values;
+ u64 val;
int i, ret;

aggr->val = aggr->ena = aggr->run = 0;
@@ -337,7 +338,8 @@ int perf_stat_process_counter(struct perf_stat_config *config,
/*
* Save the full runtime - to allow normalization during printout:
*/
- perf_stat__update_shadow_stats(counter, count, 0);
+ val = counter->scale * *count;
+ perf_stat__update_shadow_stats(counter, &val, 0);

return 0;
}
--
2.4.3

2015-12-16 00:56:39

by Andi Kleen

Subject: [PATCH 05/10] perf, tools, stat: Basic support for TopDown in perf stat

From: Andi Kleen <[email protected]>

Add basic plumbing for TopDown in perf stat.

Add a new --topdown option to enable the events.
When --topdown is specified, set up events for all topdown
events supported by the kernel.
Add topdown-* as a special case to the event parser, as is
needed for all events containing a dash.

The actual code to compute the metrics is in follow-on patches.
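
For illustration: with all five events available and the NMI watchdog
disabled, the event string built by the new filter_events() below is a
single group,

{topdown-total-slots,topdown-fetch-bubbles,topdown-slots-retired,topdown-recovery-bubbles,topdown-slots-issued}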

v2:
Use standard sysctl read function.
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-stat.txt | 8 +++
tools/perf/builtin-stat.c | 100 ++++++++++++++++++++++++++++++++-
tools/perf/util/parse-events.l | 1 +
3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 58296e7..9be9ac8 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -163,6 +163,14 @@ filter out the startup phase of the program, which is often very different.

Print statistics of transactional execution if supported.

+--topdown::
+
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU bound workloads,
+by breaking it down into frontend bound, backend bound, bad speculation
+and retiring. Specifying the option multiple times shows metrics even
+if they don't cross a threshold.
+
EXAMPLES
--------

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1faa6fc..e10198c 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -60,6 +60,7 @@
#include "util/thread_map.h"
#include "util/counts.h"

+#include <api/fs/fs.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <locale.h>
@@ -95,6 +96,15 @@ static const char * transaction_limited_attrs = {
"}"
};

+static const char * topdown_attrs[] = {
+ "topdown-total-slots",
+ "topdown-fetch-bubbles",
+ "topdown-slots-retired",
+ "topdown-recovery-bubbles",
+ "topdown-slots-issued",
+ NULL,
+};
+
static struct perf_evlist *evsel_list;

static struct target target = {
@@ -109,6 +119,7 @@ static volatile pid_t child_pid = -1;
static bool null_run = false;
static int detailed_run = 0;
static bool transaction_run;
+static int topdown_run = 0;
static bool big_num = true;
static int big_num_opt = -1;
static const char *csv_sep = NULL;
@@ -1283,6 +1294,8 @@ static const struct option stat_options[] = {
"ms to wait before starting measurement after program start"),
OPT_BOOLEAN(0, "metric-only", &metric_only,
"Only print computed metrics. No raw values"),
+ OPT_INCR(0, "topdown", &topdown_run,
+ "measure topdown level 1 statistics"),
OPT_END()
};

@@ -1380,12 +1393,68 @@ static void perf_stat__exit_aggr_mode(void)
cpus_aggr_map = NULL;
}

+static void filter_events(const char **attr, char **str, bool use_group)
+{
+ int off = 0;
+ int i;
+ int len = 0;
+ char *s;
+
+ for (i = 0; attr[i]; i++) {
+ if (pmu_have_event("cpu", attr[i])) {
+ len += strlen(attr[i]) + 1;
+ attr[i - off] = attr[i];
+ } else
+ off++;
+ }
+ attr[i - off] = NULL;
+
+ *str = malloc(len + 1 + 2);
+ if (!*str)
+ return;
+ s = *str;
+ if (i - off == 0) {
+ *s = 0;
+ return;
+ }
+ if (use_group)
+ *s++ = '{';
+ for (i = 0; attr[i]; i++) {
+ strcpy(s, attr[i]);
+ s += strlen(s);
+ *s++ = ',';
+ }
+ if (use_group) {
+ s[-1] = '}';
+ *s = 0;
+ } else
+ s[-1] = 0;
+}
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group we may get bad results due to multiplexing.
+ */
+static bool check_group(bool *warn)
+{
+ int n;
+
+ if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+ return false;
+ if (n > 0) {
+ *warn = true;
+ return false;
+ }
+ return true;
+}
+
/*
* Add default attributes, if there were no attributes specified or
* if -d/--detailed, -d -d or -d -d -d is used:
*/
static int add_default_attributes(void)
{
+ int err;
struct perf_event_attr default_attrs[] = {

{ .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK },
@@ -1498,7 +1567,6 @@ static int add_default_attributes(void)
return 0;

if (transaction_run) {
- int err;
if (pmu_have_event("cpu", "cycles-ct") &&
pmu_have_event("cpu", "el-start"))
err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1511,6 +1579,36 @@ static int add_default_attributes(void)
return 0;
}

+ if (topdown_run) {
+ char *str = NULL;
+ bool warn = false;
+
+ filter_events(topdown_attrs, &str, check_group(&warn));
+ if (topdown_attrs[0] && str) {
+ if (warn)
+ fprintf(stderr,
+ "nmi_watchdog enabled with topdown. May give wrong results.\n"
+ "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+ err = parse_events(evsel_list, str, NULL);
+ if (err) {
+ fprintf(stderr,
+ "Cannot set up top down events %s: %d\n",
+ str, err);
+ free(str);
+ return -1;
+ }
+ } else {
+ fprintf(stderr, "System does not support topdown\n");
+ return -1;
+ }
+ free(str);
+ /*
+ * Right now combining with the other attributes breaks group
+ * semantics.
+ */
+ return 0;
+ }
+
if (!evsel_list->nr_entries) {
if (perf_evlist__add_default_attrs(evsel_list, default_attrs) < 0)
return -1;
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 58c5831..3e65d61 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -248,6 +248,7 @@ cycles-ct { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
cycles-t { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
mem-loads { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
mem-stores { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+ { return str(yyscanner, PE_KERNEL_PMU_EVENT); }

L1-dcache|l1-d|l1d|L1-data |
L1-icache|l1-i|l1i|L1-instruction |
--
2.4.3

2015-12-16 00:56:37

by Andi Kleen

Subject: [PATCH 06/10] perf, tools, stat: Add computation of TopDown formulas

From: Andi Kleen <[email protected]>

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.
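
As with other shadow metrics the annotations are thresholded: the code
below only prints frontend bound above 20%, retiring above 70%, bad
speculation above 10% and backend bound above 20%, unless --topdown is
given more than once, which prints them unconditionally.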

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-stat.c | 6 +-
tools/perf/util/stat-shadow.c | 164 +++++++++++++++++++++++++++++++++++++++++-
tools/perf/util/stat.c | 5 ++
tools/perf/util/stat.h | 8 ++-
4 files changed, 179 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index e10198c..6c44aae 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -859,7 +859,8 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
perf_stat__print_shadow_stats(counter, uval,
stat_config.aggr_mode == AGGR_GLOBAL ? 0 :
first_shadow_cpu(counter, id),
- &out);
+ &out,
+ topdown_run);

if (!metric_only) {
print_noise(counter, noise);
@@ -1048,7 +1049,8 @@ static void print_metric_headers(char *prefix)
os.evsel = counter;
perf_stat__print_shadow_stats(counter, 0,
0,
- &out);
+ &out,
+ topdown_run);
}
fputc('\n', stat_config.output);
}
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 4d8f185..e977992 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -2,6 +2,7 @@
#include "evsel.h"
#include "stat.h"
#include "color.h"
+#include "pmu.h"

enum {
CTX_BIT_USER = 1 << 0,
@@ -28,6 +29,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];

struct stats walltime_nsecs_stats;

@@ -68,6 +74,11 @@ void perf_stat__reset_shadow_stats(void)
sizeof(runtime_transaction_stats));
memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+ memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+ memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+ memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+ memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+ memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
}

/*
@@ -90,6 +101,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
else if (perf_stat_evsel__is(counter, ELISION_START))
update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+ update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+ update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+ update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+ update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu],count[0]);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+ update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -289,9 +310,108 @@ static void print_ll_cache_misses(int cpu,
out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
}

+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top Down Method for Performance analysis and Counter architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition means out of order execution that is thrown away
+ * (for example branch mispredictions)
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory
+ * Retiring is good execution that is not directly bottlenecked
+ *
+ * The formulas are computed in slots.
+ * A slot is one entry in the pipeline for each unit of the pipeline width
+ * (for example a 4-wide pipeline has 4 slots for each cycle)
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ * TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined where possible.
+ *
+ * With SMT the slots of thread siblings need to be combined to get meaningful
+ * results. This is implemented by the kernel forcing per-core mode with
+ * the .agg-per-core sysfs attribute.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double td_total_slots(int ctx, int cpu)
+{
+ return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+ double bad_spec = 0;
+ double total_slots;
+ double total;
+
+ total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+ avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+ avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+ total_slots = td_total_slots(ctx, cpu);
+ if (total_slots)
+ bad_spec = total / total_slots;
+ return bad_spec;
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+ double retiring = 0;
+ double total_slots = td_total_slots(ctx, cpu);
+ double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+ if (total_slots)
+ retiring = ret_slots / total_slots;
+ return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+ double fe_bound = 0;
+ double total_slots = td_total_slots(ctx, cpu);
+ double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+ if (total_slots)
+ fe_bound = fetch_bub / total_slots;
+ return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+ double sum = (td_fe_bound(ctx, cpu) +
+ td_bad_spec(ctx, cpu) +
+ td_retiring(ctx, cpu));
+ if (sum == 0)
+ return 0;
+ return 1.0 - sum;
+}
+
void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
double avg, int cpu,
- struct perf_stat_output_ctx *out)
+ struct perf_stat_output_ctx *out,
+ int topdown_run)
{
void *ctxp = out->ctx;
print_metric_t print_metric = out->print_metric;
@@ -438,6 +558,48 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
avg / ratio);
else
print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+ double fe_bound = td_fe_bound(ctx, cpu);
+
+ if (fe_bound > 0.2 || topdown_run > 1)
+ print_metric(ctxp, NULL, "%8.2f%%", "frontend bound",
+ fe_bound * 100.);
+ else
+ print_metric(ctxp, NULL, NULL, "frontend bound", 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+ double retiring = td_retiring(ctx, cpu);
+
+ if (retiring > 0.7 || topdown_run > 1)
+ print_metric(ctxp, NULL, "%8.2f%%", "retiring",
+ retiring * 100.);
+ else
+ print_metric(ctxp, NULL, NULL, "retiring", 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+ double bad_spec = td_bad_spec(ctx, cpu);
+
+ if (bad_spec > 0.1 || topdown_run > 1)
+ print_metric(ctxp, NULL, "%8.2f%%", "bad speculation",
+ bad_spec * 100.);
+ else
+ print_metric(ctxp, NULL, NULL, "bad speculation", 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+ double be_bound = td_be_bound(ctx, cpu);
+ const char *name = "backend bound";
+ static int have_recovery_bubbles = -1;
+
+ /* In case the CPU does not support topdown-recovery-bubbles */
+ if (have_recovery_bubbles < 0)
+ have_recovery_bubbles = pmu_have_event("cpu",
+ "topdown-recovery-bubbles");
+ if (!have_recovery_bubbles)
+ name = "backend bound/bad spec";
+
+ if (td_total_slots(ctx, cpu) > 0 &&
+ (be_bound > 0.2 || topdown_run > 1))
+ print_metric(ctxp, NULL, "%8.2f%%", name,
+ be_bound * 100.);
+ else
+ print_metric(ctxp, NULL, NULL, name, 0);
} else if (runtime_nsecs_stats[cpu].n != 0) {
char unit = 'M';
char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 913d236..fdb34e3 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
ID(TRANSACTION_START, cpu/tx-start/),
ID(ELISION_START, cpu/el-start/),
ID(CYCLES_IN_TX_CP, cpu/cycles-ct/),
+ ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+ ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+ ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+ ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+ ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
};
#undef ID

diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index f51d94e..0c26633 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
PERF_STAT_EVSEL_ID__TRANSACTION_START,
PERF_STAT_EVSEL_ID__ELISION_START,
PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+ PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+ PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+ PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+ PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+ PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
PERF_STAT_EVSEL_ID__MAX,
};

@@ -83,7 +88,8 @@ struct perf_stat_output_ctx {

void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
double avg, int cpu,
- struct perf_stat_output_ctx *out);
+ struct perf_stat_output_ctx *out,
+ int topdown_run);

void perf_evsel__reset_stat_priv(struct perf_evsel *evsel);
int perf_evsel__alloc_stat_priv(struct perf_evsel *evsel);
--
2.4.3

2015-12-16 00:54:43

by Andi Kleen

Subject: [PATCH 07/10] perf, tools, stat: Add extra output of counter values with -v

From: Andi Kleen <[email protected]>

Add debug output of raw counter values per CPU when
perf stat -v is specified, together with their cpu numbers.
This is very useful to debug problems with per core counters,
where we can normally only see aggregated values.
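
The format is "<event>: <cpu>: <val> <ena> <run>", for example
(made-up values):

topdown-slots-retired: 2: 1743552 1535832673 1535832673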

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-stat.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 6c44aae..595292f 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -238,6 +238,13 @@ static int read_counter(struct perf_evsel *counter)
count = perf_counts(counter->counts, cpu, thread);
if (perf_evsel__read(counter, cpu, thread, count))
return -1;
+ if (verbose) {
+ fprintf(stat_config.output,
+ "%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
+ perf_evsel__name(counter),
+ cpu,
+ count->val, count->ena, count->run);
+ }
}
}

--
2.4.3

2015-12-16 00:54:49

by Andi Kleen

Subject: [PATCH 08/10] x86, perf: Support sysfs files depending on SMT status

From: Andi Kleen <[email protected]>

Add a way to show different sysfs event attributes depending on
whether HyperThreading is on or off. This is difficult to determine
early at boot, so we just do it dynamically when the sysfs
attribute is read.
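
For example, with the Core event strings added in the next patch,
reading /sys/devices/cpu/events/topdown-total-slots would return
event=0x3c,umask=0x0 with HyperThreading off and
event=0x3c,umask=0x0,any=1 with it on.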

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event.h | 10 ++++++++++
include/linux/perf_event.h | 7 +++++++
3 files changed, 51 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9dfbba5..976686f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1590,6 +1590,40 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
return x86_pmu.events_sysfs_show(page, config);
}

+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+ char *page)
+{
+ struct perf_pmu_events_ht_attr *pmu_attr =
+ container_of(attr, struct perf_pmu_events_ht_attr, attr);
+ bool ht_on = false;
+ int cpu;
+
+ /*
+ * Report conditional events depending on Hyper-Threading.
+ *
+ * Check all online CPUs if any have a thread sibling,
+ * as perf may measure any of them.
+ *
+ * This is overly conservative as usually the HT special
+ * handling is not needed if the other CPU thread is idle.
+ *
+ * Note this does not (cannot) handle the case when thread
+ * siblings are invisible, for example with virtualization
+ * if they are owned by some other guest. The user tool
+ * has to re-read when a thread sibling gets onlined later.
+ */
+ for_each_online_cpu (cpu) {
+ ht_on = cpumask_weight(topology_sibling_cpumask(cpu)) > 1;
+ if (ht_on)
+ break;
+ }
+
+ return sprintf(page, "%s",
+ ht_on ?
+ pmu_attr->event_str_ht :
+ pmu_attr->event_str_noht);
+}
+
EVENT_ATTR(cpu-cycles, CPU_CYCLES );
EVENT_ATTR(instructions, INSTRUCTIONS );
EVENT_ATTR(cache-references, CACHE_REFERENCES );
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 799e6bd..ce87cd6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -660,6 +660,14 @@ static struct perf_pmu_events_attr event_attr_##v = { \
.event_str = str, \
};

+#define EVENT_ATTR_STR_HT(_name, v, noht, ht) \
+static struct perf_pmu_events_ht_attr event_attr_##v = { \
+ .attr = __ATTR(_name, 0444, events_ht_sysfs_show, NULL),\
+ .id = 0, \
+ .event_str_noht = noht, \
+ .event_str_ht = ht, \
+}
+
extern struct x86_pmu x86_pmu __read_mostly;

static inline bool x86_pmu_has_lbr_callstack(void)
@@ -919,6 +927,8 @@ int knc_pmu_init(void);

ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);
+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+ char *page);

static inline int is_ht_workaround_enabled(void)
{
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a4..ea2d830 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1166,6 +1166,13 @@ struct perf_pmu_events_attr {
const char *event_str;
};

+struct perf_pmu_events_ht_attr {
+ struct device_attribute attr;
+ u64 id;
+ const char *event_str_ht;
+ const char *event_str_noht;
+};
+
ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);

--
2.4.3

2015-12-16 00:56:36

by Andi Kleen

Subject: [PATCH 09/10] x86, perf: Add Top Down events to Intel Core

From: Andi Kleen <[email protected]>

Add declarations for the events needed for TopDown to the
Intel big core CPUs starting with Sandy Bridge. We need
to report different values depending on whether HyperThreading
is on or off.

The only thing this patch does is to export some events
in sysfs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots Available slots in the pipeline
topdown-slots-issued Slots issued into the pipeline
topdown-slots-retired Slots successfully retired
topdown-fetch-bubbles Pipeline gaps in the frontend
topdown-recovery-bubbles Pipeline gaps during recovery
from misspeculation

A slot is a single operation in the CPU pipe line.

These events then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring and BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.

The kernel declares the events supported by the current
CPU and their scaling factors (such as the pipeline width)
and perf stat then computes the formulas based on the
available metrics. This is similar to how existing
perf metrics, such as TSC metrics or IPC, are implemented.

This abstracts all CPU pipeline specific knowledge in the
kernel driver, but still avoids the need for larger scale perf
interface changes.

For HyperThreading the any bit is needed to get accurate
values when both threads are executing. This implies that
the events can only be collected as root or with
perf_event_paranoid=-1 for now.

Hyper Threading also requires averaging events from both
threads together (the CPU cannot measure them independently).

In perf stat this is already done by the per core mode. The
new .agg-per-core attribute is added to the events, which
then forces perf stat to enable --per-core.
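
Worked example of the scaling (see the comment in the patch below):
with HT off, TotalSlots = cycles * 4, hence a .scale of 4. With HT on
the counts of the sibling threads are averaged and then scaled to the
pipeline width, (count_t0 + count_t1) / 2 * 4, which folds into a
.scale of 2 applied to the summed per-core count.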

The basic scheme is based on the following paper:
Yasin,
A Top Down Method for Performance analysis and Counter architecture
ISPASS14
(pdf available via google)

v2: Rework scaling. Fix formulas for HyperThreading.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 73 ++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 33b4b67..2eb50f7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -222,9 +222,64 @@ struct attribute *nhm_events_attrs[] = {
NULL,
};

+/*
+ * TopDown events for Core.
+ *
+ * The events all count in slots; a slot is an entry in the 4-wide
+ * pipeline. Some events are already reported in slots; for cycle
+ * events we multiply by the pipeline width (4).
+ *
+ * With Hyper Threading on, TopDown metrics are either summed or averaged
+ * between the threads of a core: (count_t0 + count_t1).
+ *
+ * For the average case the metric is always scaled to pipeline width,
+ * so we use factor 2 ((count_t0 + count_t1) / 2 * 4)
+ *
+ * We tell perf to aggregate per core by setting the .agg-per-core
+ * attribute for the alias to 1.
+ */
+
+EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
+ "event=0x3c,umask=0x0", /* cpu_clk_unhalted.thread */
+ "event=0x3c,umask=0x0,any=1"); /* cpu_clk_unhalted.thread_any */
+EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
+EVENT_ATTR_STR_HT(topdown-total-slots.agg-per-core, td_total_slots_pc,
+ "0", "1");
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
+ "event=0xe,umask=0x1"); /* uops_issued.any */
+EVENT_ATTR_STR_HT(topdown-slots-issued.agg-per-core, td_slots_issued_pc,
+ "0", "1");
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
+ "event=0xc2,umask=0x2"); /* uops_retired.retire_slots */
+EVENT_ATTR_STR_HT(topdown-slots-retired.agg-per-core, td_slots_retired_pc,
+ "0", "1");
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
+ "event=0x9c,umask=0x1"); /* idq_uops_not_delivered_core */
+EVENT_ATTR_STR_HT(topdown-fetch-bubbles.agg-per-core, td_fetch_bubbles_pc,
+ "0", "1");
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
+ "event=0xd,umask=0x3,cmask=1", /* int_misc.recovery_cycles */
+ "event=0xd,umask=0x3,cmask=1,any=1"); /* int_misc.recovery_cycles_any */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
+ "4", "2");
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.agg-per-core, td_recovery_bubbles_pc,
+ "0", "1");
+
struct attribute *snb_events_attrs[] = {
EVENT_PTR(mem_ld_snb),
EVENT_PTR(mem_st_snb),
+ EVENT_PTR(td_slots_issued),
+ EVENT_PTR(td_slots_issued_pc),
+ EVENT_PTR(td_slots_retired),
+ EVENT_PTR(td_slots_retired_pc),
+ EVENT_PTR(td_fetch_bubbles),
+ EVENT_PTR(td_fetch_bubbles_pc),
+ EVENT_PTR(td_total_slots),
+ EVENT_PTR(td_total_slots_scale),
+ EVENT_PTR(td_total_slots_pc),
+ EVENT_PTR(td_recovery_bubbles),
+ EVENT_PTR(td_recovery_bubbles_scale),
+ EVENT_PTR(td_recovery_bubbles_pc),
NULL,
};

@@ -3201,6 +3256,18 @@ static struct attribute *hsw_events_attrs[] = {
EVENT_PTR(cycles_ct),
EVENT_PTR(mem_ld_hsw),
EVENT_PTR(mem_st_hsw),
+ EVENT_PTR(td_slots_issued),
+ EVENT_PTR(td_slots_issued_pc),
+ EVENT_PTR(td_slots_retired),
+ EVENT_PTR(td_slots_retired_pc),
+ EVENT_PTR(td_fetch_bubbles),
+ EVENT_PTR(td_fetch_bubbles_pc),
+ EVENT_PTR(td_total_slots),
+ EVENT_PTR(td_total_slots_scale),
+ EVENT_PTR(td_total_slots_pc),
+ EVENT_PTR(td_recovery_bubbles),
+ EVENT_PTR(td_recovery_bubbles_scale),
+ EVENT_PTR(td_recovery_bubbles_pc),
NULL
};

@@ -3518,6 +3585,12 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
intel_pmu_lbr_init_skl();

+ /* INT_MISC.RECOVERY_CYCLES has umask 1 in Skylake */
+ event_attr_td_recovery_bubbles.event_str_noht =
+ "event=0xd,umask=0x1,cmask=1";
+ event_attr_td_recovery_bubbles.event_str_ht =
+ "event=0xd,umask=0x1,cmask=1,any=1";
+
x86_pmu.event_constraints = intel_skl_event_constraints;
x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
x86_pmu.extra_regs = intel_skl_extra_regs;
--
2.4.3

2015-12-16 00:55:50

by Andi Kleen

Subject: [PATCH 10/10] x86, perf: Add Top Down events to Intel Atom

From: Andi Kleen <[email protected]>

Add topdown event declarations to Silvermont / Airmont.
These cores do not support the full TopDown metrics, but a useful
subset (FrontendBound, Retiring, combined BackendBound/BadSpeculation).

The perf stat tool automatically handles the missing events
and combines the available metrics.
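
Concretely, with the formulas from patch 06: FrontendBound =
fetch-bubbles / total-slots and Retiring = slots-retired / total-slots
still work as on Core. Because topdown-recovery-bubbles is missing and
topdown-slots-issued is declared as the same event as
topdown-slots-retired, BadSpeculation evaluates to 0 and the remainder
1 - FrontendBound - Retiring is reported under the combined
"backend bound/bad spec" label.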

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 2eb50f7..432b9c9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1379,6 +1379,29 @@ static __initconst const u64 atom_hw_cache_event_ids
},
};

+EVENT_ATTR_STR(topdown-total-slots, td_total_slots_slm, "event=0x3c");
+EVENT_ATTR_STR(topdown-total-slots.scale, td_total_slots_scale_slm, "2");
+/* no_alloc_cycles.not_delivered */
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles_slm,
+ "event=0xca,umask=0x50");
+EVENT_ATTR_STR(topdown-fetch-bubbles.scale, td_fetch_bubbles_scale_slm, "2");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued_slm,
+ "event=0xc2,umask=0x10");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired_slm,
+ "event=0xc2,umask=0x10");
+
+static struct attribute *slm_events_attrs[] = {
+ EVENT_PTR(td_total_slots_slm),
+ EVENT_PTR(td_total_slots_scale_slm),
+ EVENT_PTR(td_fetch_bubbles_slm),
+ EVENT_PTR(td_fetch_bubbles_scale_slm),
+ EVENT_PTR(td_slots_issued_slm),
+ EVENT_PTR(td_slots_retired_slm),
+ NULL
+};
+
static struct extra_reg intel_slm_extra_regs[] __read_mostly =
{
/* must define OFFCORE_RSP_X first, see intel_fixup_er() */
@@ -3416,6 +3439,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
x86_pmu.extra_regs = intel_slm_extra_regs;
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.cpu_events = slm_events_attrs;
pr_cont("Silvermont events, ");
break;

--
2.4.3

2015-12-16 08:53:47

by Peter Zijlstra

Subject: Re: [PATCH 08/10] x86, perf: Support sysfs files depending on SMT status

On Tue, Dec 15, 2015 at 04:54:24PM -0800, Andi Kleen wrote:

> +ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
> + char *page)
> +{
> + struct perf_pmu_events_ht_attr *pmu_attr =
> + container_of(attr, struct perf_pmu_events_ht_attr, attr);
> + bool ht_on = false;
> + int cpu;
> +
> + /*
> + * Report conditional events depending on Hyper-Threading.
> + *
> + * Check all online CPUs if any have a thread sibling,
> + * as perf may measure any of them.
> + *
> + * This is overly conservative as usually the HT special
> + * handling is not needed if the other CPU thread is idle.
> + *
> + * Note this does not (cannot) handle the case when thread
> + * siblings are invisible, for example with virtualization
> + * if they are owned by some other guest. The user tool
> + * has to re-read when a thread sibling gets onlined later.
> + */
> + for_each_online_cpu (cpu) {
> + ht_on = cpumask_weight(topology_sibling_cpumask(cpu)) > 1;
> + if (ht_on)
> + break;
> + }

At the very least you want {get,put}_online_cpus() around that I
suspect.

But I would rather we generalize and save the state from fixup_ht_bug().

That is, at that subsys_initcall() time all our SMP init is done and
userspace isn't running yet to mess things up, so we can more or less
trust the topology masks to be complete.

So at that time detect if HT is enabled or not and store that
information in x86_pmu.flags for later use here.


> +
> + return sprintf(page, "%s",
> + ht_on ?
> + pmu_attr->event_str_ht :
> + pmu_attr->event_str_noht);
> +}

2015-12-16 12:48:57

by Stephane Eranian

Subject: Re: [PATCH 08/10] x86, perf: Support sysfs files depending on SMT status

Andi,

On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
>
> From: Andi Kleen <[email protected]>
>
> Add a way to show different sysfs events attributes depending on
> HyperThreading is on or off. This is difficult to determine
> early at boot, so we just do it dynamically when the sysfs
> attribute is read.
>
The only attribute I can think of is the any thread bit.
But I do not see this in this patch. Is this somewhere else?

>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.c | 34 ++++++++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/perf_event.h | 10 ++++++++++
> include/linux/perf_event.h | 7 +++++++
> 3 files changed, 51 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 9dfbba5..976686f 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1590,6 +1590,40 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
> return x86_pmu.events_sysfs_show(page, config);
> }
>
> +ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
> + char *page)
> +{
> + struct perf_pmu_events_ht_attr *pmu_attr =
> + container_of(attr, struct perf_pmu_events_ht_attr, attr);
> + bool ht_on = false;
> + int cpu;
> +
> + /*
> + * Report conditional events depending on Hyper-Threading.
> + *
> + * Check all online CPUs if any have a thread sibling,
> + * as perf may measure any of them.
> + *
> + * This is overly conservative as usually the HT special
> + * handling is not needed if the other CPU thread is idle.
> + *
> + * Note this does not (cannot) handle the case when thread
> + * siblings are invisible, for example with virtualization
> + * if they are owned by some other guest. The user tool
> + * has to re-read when a thread sibling gets onlined later.
> + */
> + for_each_online_cpu (cpu) {
> + ht_on = cpumask_weight(topology_sibling_cpumask(cpu)) > 1;
> + if (ht_on)
> + break;
> + }
> +
> + return sprintf(page, "%s",
> + ht_on ?
> + pmu_attr->event_str_ht :
> + pmu_attr->event_str_noht);
> +}
> +
> EVENT_ATTR(cpu-cycles, CPU_CYCLES );
> EVENT_ATTR(instructions, INSTRUCTIONS );
> EVENT_ATTR(cache-references, CACHE_REFERENCES );
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 799e6bd..ce87cd6 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -660,6 +660,14 @@ static struct perf_pmu_events_attr event_attr_##v = { \
> .event_str = str, \
> };
>
> +#define EVENT_ATTR_STR_HT(_name, v, noht, ht) \
> +static struct perf_pmu_events_ht_attr event_attr_##v = { \
> + .attr = __ATTR(_name, 0444, events_ht_sysfs_show, NULL),\
> + .id = 0, \
> + .event_str_noht = noht, \
> + .event_str_ht = ht, \
> +}
> +
> extern struct x86_pmu x86_pmu __read_mostly;
>
> static inline bool x86_pmu_has_lbr_callstack(void)
> @@ -919,6 +927,8 @@ int knc_pmu_init(void);
>
> ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
> char *page);
> +ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
> + char *page);
>
> static inline int is_ht_workaround_enabled(void)
> {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f9828a4..ea2d830 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1166,6 +1166,13 @@ struct perf_pmu_events_attr {
> const char *event_str;
> };
>
> +struct perf_pmu_events_ht_attr {
> + struct device_attribute attr;
> + u64 id;
> + const char *event_str_ht;
> + const char *event_str_noht;
> +};
> +
> ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
> char *page);
>
> --
> 2.4.3
>

2015-12-16 14:16:11

by Stephane Eranian

Subject: Re: [PATCH 02/10] perf, tools, stat: Force --per-core mode for .agg-per-core aliases

On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> When an event alias is used that the kernel marked as .agg-per-core, force
> --per-core mode (and also require -a and forbid cgroups or per thread mode).
> This in term means, --topdown forces --per-core mode.
>
> This is needed for TopDown in SMT mode, because it needs to measure
> all threads in a core together and merge the values to compute the correct
> percentages of how the pipeline is limited.
>
> We do this if any alias is agg-per-core.
>
I would rather call this attribute aggr-per-core. That would be more consistent
with how perf calls this.

> Add the code to parse the .agg-per-core attributes and propagate
> the information to the evsel. Then the main stat code does
> the necessary checks and forces per core mode.
>
> Open issue: in combination with -C ... we get wrong values. I think that's
> a existing bug that needs to be debugged/fixed separately.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> tools/perf/builtin-stat.c | 18 ++++++++++++++++++
> tools/perf/util/evsel.h | 1 +
> tools/perf/util/parse-events.c | 1 +
> tools/perf/util/pmu.c | 23 +++++++++++++++++++++++
> tools/perf/util/pmu.h | 2 ++
> 5 files changed, 45 insertions(+)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 4777355..0c7cdda 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -1540,6 +1540,7 @@ static int add_default_attributes(void)
>
> int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
> {
> + struct perf_evsel *counter;
> const char * const stat_usage[] = {
> "perf stat [<options>] [<command>]",
> NULL
> @@ -1674,6 +1675,23 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
> if (add_default_attributes())
> goto out;
>
> + evlist__for_each (evsel_list, counter) {
> + /* Enable per core mode if only a single event requires it. */
> + if (counter->agg_per_core) {
> + if (stat_config.aggr_mode != AGGR_GLOBAL &&
> + stat_config.aggr_mode != AGGR_CORE) {
> + pr_err("per core event configuration requires per core mode\n");
> + goto out;
> + }
> + stat_config.aggr_mode = AGGR_CORE;
> + if (nr_cgroups || !target__has_cpu(&target)) {
> + pr_err("per core event configuration requires system-wide mode (-a)\n");
> + goto out;
> + }
> + break;
> + }
> + }
> +
> target__validate(&target);
>
> if (perf_evlist__create_maps(evsel_list, &target) < 0) {
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 5ded1fc..efc7f7c 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -114,6 +114,7 @@ struct perf_evsel {
> bool tracking;
> bool per_pkg;
> bool precise_max;
> + bool agg_per_core;
> /* parse modifier helper */
> int exclude_GH;
> int nr_members;
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 6fc8cd7..66d8ebd 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -1027,6 +1027,7 @@ int parse_events_add_pmu(struct parse_events_evlist *data,
> evsel->unit = info.unit;
> evsel->scale = info.scale;
> evsel->per_pkg = info.per_pkg;
> + evsel->agg_per_core = info.agg_per_core;
> evsel->snapshot = info.snapshot;
> }
>
> diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
> index 8a520e9..5a52dac 100644
> --- a/tools/perf/util/pmu.c
> +++ b/tools/perf/util/pmu.c
> @@ -189,6 +189,23 @@ perf_pmu__parse_per_pkg(struct perf_pmu_alias *alias, char *dir, char *name)
> return 0;
> }
>
> +static void
> +perf_pmu__parse_agg_per_core(struct perf_pmu_alias *alias, char *dir, char *name)
> +{
> + char path[PATH_MAX];
> + FILE *f;
> + int flag;
> +
> + snprintf(path, PATH_MAX, "%s/%s.agg-per-core", dir, name);
> +
> + f = fopen(path, "r");
> + if (f && fscanf(f, "%d", &flag) == 1) {
> + alias->agg_per_core = flag != 0;
> + fclose(f);
> + }
> +}
> +
> +
> static int perf_pmu__parse_snapshot(struct perf_pmu_alias *alias,
> char *dir, char *name)
> {
> @@ -237,6 +254,7 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name,
> perf_pmu__parse_scale(alias, dir, name);
> perf_pmu__parse_per_pkg(alias, dir, name);
> perf_pmu__parse_snapshot(alias, dir, name);
> + perf_pmu__parse_agg_per_core(alias, dir, name);
> }
>
> list_add_tail(&alias->list, list);
> @@ -271,6 +289,8 @@ static inline bool pmu_alias_info_file(char *name)
> return true;
> if (len > 9 && !strcmp(name + len - 9, ".snapshot"))
> return true;
> + if (len > 13 && !strcmp(name + len - 13, ".agg-per-core"))
> + return true;
>
> return false;
> }
> @@ -847,6 +867,7 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
> int ret;
>
> info->per_pkg = false;
> + info->agg_per_core = false;
>
> /*
> * Mark unit and scale as not set
> @@ -870,6 +891,8 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
>
> if (alias->per_pkg)
> info->per_pkg = true;
> + if (alias->agg_per_core)
> + info->agg_per_core = true;
>
> list_del(&term->list);
> free(term);
> diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
> index 5d7e844..5a43719 100644
> --- a/tools/perf/util/pmu.h
> +++ b/tools/perf/util/pmu.h
> @@ -32,6 +32,7 @@ struct perf_pmu_info {
> double scale;
> bool per_pkg;
> bool snapshot;
> + bool agg_per_core;
> };
>
> #define UNIT_MAX_LEN 31 /* max length for event unit name */
> @@ -44,6 +45,7 @@ struct perf_pmu_alias {
> double scale;
> bool per_pkg;
> bool snapshot;
> + bool agg_per_core;
> };
>
> struct perf_pmu *perf_pmu__find(const char *name);
> --
> 2.4.3
>

2015-12-16 14:22:48

by Stephane Eranian

Subject: Re: [PATCH 03/10] perf, tools, stat: Avoid fractional digits for integer scales

On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> When the scaling factor is a full integer don't display fractional
> digits. This avoids unnecessary .00 output for topdown metrics
> with scale factors.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> tools/perf/builtin-stat.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 0c7cdda..1faa6fc 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -63,6 +63,7 @@
> #include <stdlib.h>
> #include <sys/prctl.h>
> #include <locale.h>
> +#include <math.h>
>
> #define DEFAULT_SEPARATOR " "
> #define CNTR_NOT_SUPPORTED "<not supported>"
> @@ -739,12 +740,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
> const char *fmt;
>
> if (csv_output) {
> - fmt = sc != 1.0 ? "%.2f%s" : "%.0f%s";
> + fmt = sc != 1.0 && floor(sc) != sc ? "%.2f%s" : "%.0f%s";
> } else {
> if (big_num)
> - fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> + fmt = sc != 1.0 && floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";

if sc is 1.0 then floor(sc) == sc and you'd get the else branch as well, so you
probably do not need the sc != 1.0 test anymore.

> else
> - fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> + fmt = sc != 1.0 && floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
> }
>
> aggr_printout(evsel, id, nr);
> --
> 2.4.3
>

2015-12-16 14:32:53

by Stephane Eranian

Subject: Re: [PATCH 05/10] perf, tools, stat: Basic support for TopDown in perf stat

On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> Add basic plumbing for TopDown in perf stat
>
> Add a new --topdown options to enable events.
> When --topdown is specified set up events for all topdown
> events supported by the kernel.
> Add topdown-* as a special case to the event parser, as is
> needed for all events containing -.
>
> The actual code to compute the metrics is in follow-on patches.
>
> v2:
> Use standard sysctl read function.
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> tools/perf/Documentation/perf-stat.txt | 8 +++
> tools/perf/builtin-stat.c | 100 ++++++++++++++++++++++++++++++++-
> tools/perf/util/parse-events.l | 1 +
> 3 files changed, 108 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
> index 58296e7..9be9ac8 100644
> --- a/tools/perf/Documentation/perf-stat.txt
> +++ b/tools/perf/Documentation/perf-stat.txt
> @@ -163,6 +163,14 @@ filter out the startup phase of the program, which is often very different.
>
> Print statistics of transactional execution if supported.
>
> +--topdown::
> +
> +Print top down level 1 metrics if supported by the CPU. This allows one
> +to determine bottlenecks in the CPU pipeline for CPU-bound workloads,
> +by breaking them down into frontend bound, backend bound, bad speculation
> +and retiring. Specifying the option multiple times shows metrics even
> +if they don't cross a threshold.
> +
> EXAMPLES
> --------
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 1faa6fc..e10198c 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -60,6 +60,7 @@
> #include "util/thread_map.h"
> #include "util/counts.h"
>
> +#include <api/fs/fs.h>
> #include <stdlib.h>
> #include <sys/prctl.h>
> #include <locale.h>
> @@ -95,6 +96,15 @@ static const char * transaction_limited_attrs = {
> "}"
> };
>
> +static const char * topdown_attrs[] = {
> + "topdown-total-slots",
> + "topdown-fetch-bubbles",
> + "topdown-slots-retired",
> + "topdown-recovery-bubbles",
> + "topdown-slots-issued",
> + NULL,
> +};
> +
> static struct perf_evlist *evsel_list;
>
> static struct target target = {
> @@ -109,6 +119,7 @@ static volatile pid_t child_pid = -1;
> static bool null_run = false;
> static int detailed_run = 0;
> static bool transaction_run;
> +static int topdown_run = 0;
> static bool big_num = true;
> static int big_num_opt = -1;
> static const char *csv_sep = NULL;
> @@ -1283,6 +1294,8 @@ static const struct option stat_options[] = {
> "ms to wait before starting measurement after program start"),
> OPT_BOOLEAN(0, "metric-only", &metric_only,
> "Only print computed metrics. No raw values"),
> + OPT_INCR(0, "topdown", &topdown_run,
> + "measure topdown level 1 statistics"),
> OPT_END()
> };
>
> @@ -1380,12 +1393,68 @@ static void perf_stat__exit_aggr_mode(void)
> cpus_aggr_map = NULL;
> }
>
> +static void filter_events(const char **attr, char **str, bool use_group)
> +{
> + int off = 0;
> + int i;
> + int len = 0;
> + char *s;
> +
> + for (i = 0; attr[i]; i++) {
> + if (pmu_have_event("cpu", attr[i])) {
> + len += strlen(attr[i]) + 1;
> + attr[i - off] = attr[i];
> + } else
> + off++;
> + }
> + attr[i - off] = NULL;
> +
> + *str = malloc(len + 1 + 2);
> + if (!*str)
> + return;
> + s = *str;
> + if (i - off == 0) {
> + *s = 0;
> + return;
> + }
> + if (use_group)
> + *s++ = '{';
> + for (i = 0; attr[i]; i++) {
> + strcpy(s, attr[i]);
> + s += strlen(s);
> + *s++ = ',';
> + }
> + if (use_group) {
> + s[-1] = '}';
> + *s = 0;
> + } else
> + s[-1] = 0;
> +}
> +
> +/*
> + * Check whether we can use a group for top down.
> + * Without a group we may get bad results due to multiplexing.
> + */

Just because the NMI watchdog uses up a counter does not mean you
cannot group. If HT is off you have plenty of counters to do this.

> +static bool check_group(bool *warn)
> +{
> + int n;
> +
> + if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
> + return false;
> + if (n > 0) {
> + *warn = true;
> + return false;
> + }
> + return true;
> +}
> +

I do not like this part very much. You are pulling x86-specific
knowledge into builtin-stat.c. Why not move this into an x86-specific
file?

> /*
> * Add default attributes, if there were no attributes specified or
> * if -d/--detailed, -d -d or -d -d -d is used:
> */
> static int add_default_attributes(void)
> {
> + int err;
> struct perf_event_attr default_attrs[] = {
>
> { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK },
> @@ -1498,7 +1567,6 @@ static int add_default_attributes(void)
> return 0;
>
> if (transaction_run) {
> - int err;
> if (pmu_have_event("cpu", "cycles-ct") &&
> pmu_have_event("cpu", "el-start"))
> err = parse_events(evsel_list, transaction_attrs, NULL);
> @@ -1511,6 +1579,36 @@ static int add_default_attributes(void)
> return 0;
> }
>
> + if (topdown_run) {
> + char *str = NULL;
> + bool warn = false;
> +
> + filter_events(topdown_attrs, &str, check_group(&warn));
> + if (topdown_attrs[0] && str) {
> + if (warn)
> + fprintf(stderr,
> + "nmi_watchdog enabled with topdown. May give wrong results.\n"
> + "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");

This is x86 specific. Why not just try it out and in case of error
suggest checking if pinned system-wide events exist (such as NMI
watchdog on x86). That would be more generic.

> + err = parse_events(evsel_list, str, NULL);
> + if (err) {
> + fprintf(stderr,
> + "Cannot set up top down events %s: %d\n",
> + str, err);
> + free(str);
> + return -1;
> + }
> + } else {
> + fprintf(stderr, "System does not support topdown\n");
> + return -1;
> + }
> + free(str);
> + /*
> + * Right now combining with the other attributes breaks group
> + * semantics.
> + */
> + return 0;
> + }
> +
> if (!evsel_list->nr_entries) {
> if (perf_evlist__add_default_attrs(evsel_list, default_attrs) < 0)
> return -1;
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 58c5831..3e65d61 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -248,6 +248,7 @@ cycles-ct { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
> cycles-t { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
> mem-loads { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
> mem-stores { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
> +topdown-[a-z-]+ { return str(yyscanner, PE_KERNEL_PMU_EVENT); }
>
> L1-dcache|l1-d|l1d|L1-data |
> L1-icache|l1-i|l1i|L1-instruction |
> --
> 2.4.3
>

2015-12-16 16:26:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 08/10] x86, perf: Support sysfs files depending on SMT status

On Wed, Dec 16, 2015 at 04:48:52AM -0800, Stephane Eranian wrote:
> Andi,
>
> On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
> >
> > From: Andi Kleen <[email protected]>
> >
> > Add a way to show different sysfs event attributes depending on
> > whether HyperThreading is on or off. This is difficult to determine
> > early at boot, so we just do it dynamically when the sysfs
> > attribute is read.
> >
> The only attribute I can think of is the any thread bit.
> But I do not see this in this patch. Is this somewhere else?

Yes, the "Intel Core" patch uses the Any bit when SMT is on.
We need different events, formulas and factors for SMT.
This patch just adds the infrastructure to support this.

-Andi

2015-12-16 21:21:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 05/10] perf, tools, stat: Basic support for TopDown in perf stat

> > +/*
> > + * Check whether we can use a group for top down.
> > + * Without a group we may get bad results due to multiplexing.
> > + */
>
> Just because the NMI watchdog uses up a counter does not mean you
> cannot group. If HT is off you have plenty of counters to do this.

Such a heuristic wouldn't work on Atom where there are no more
counters in this case.

> > +static bool check_group(bool *warn)
> > +{
> > + int n;
> > +
> > + if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
> > + return false;
> > + if (n > 0) {
> > + *warn = true;
> > + return false;
> > + }
> > + return true;
> > +}
> > +
>
> I do not like this part very much. You are pulling x86-specific
> knowledge into builtin-stat.c. Why not move this into an x86-specific
> file?

Done.

> > err = parse_events(evsel_list, transaction_attrs, NULL);
> > @@ -1511,6 +1579,36 @@ static int add_default_attributes(void)
> > return 0;
> > }
> >
> > + if (topdown_run) {
> > + char *str = NULL;
> > + bool warn = false;
> > +
> > + filter_events(topdown_attrs, &str, check_group(&warn));
> > + if (topdown_attrs[0] && str) {
> > + if (warn)
> > + fprintf(stderr,
> > + "nmi_watchdog enabled with topdown. May give wrong results.\n"
> > + "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
>
> This is x86 specific. Why not just try it out and in case of error
> suggest checking
> if pinned system-wide events exist (such as NMI watchdog on x86). That would
> be more generic.

That's really complicated, I would have to tear down all state and then
resubmit all the events. I think just checking the NMI watchdog is good
enough. I couldn't give a sensible error message for the generic case
anyway.

-Andi
--
[email protected] -- Speaking for myself only.

2015-12-17 09:26:38

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 05/10] perf, tools, stat: Basic support for TopDown in perf stat

Andi,

On Wed, Dec 16, 2015 at 1:21 PM, Andi Kleen <[email protected]> wrote:
>> > +/*
>> > + * Check whether we can use a group for top down.
>> > + * Without a group we may get bad results due to multiplexing.
>> > + */
>>
>> Just because the NMI watchdog uses up a counter does not mean you
>> cannot group. If HT is off you have plenty of counters to do this.
>
> Such a heuristic wouldn't work on Atom where there are no more
> counters in this case.
>
>> > +static bool check_group(bool *warn)
>> > +{
>> > + int n;
>> > +
>> > + if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
>> > + return false;
>> > + if (n > 0) {
>> > + *warn = true;
>> > + return false;
>> > + }
>> > + return true;
>> > +}
>> > +
>>
>> I do not like this part very much. You are pulling x86-specific
>> knowledge into builtin-stat.c. Why not move this into an x86-specific
>> file?
>
> Done.
>
>> > err = parse_events(evsel_list, transaction_attrs, NULL);
>> > @@ -1511,6 +1579,36 @@ static int add_default_attributes(void)
>> > return 0;
>> > }
>> >
>> > + if (topdown_run) {
>> > + char *str = NULL;
>> > + bool warn = false;
>> > +
>> > + filter_events(topdown_attrs, &str, check_group(&warn));
>> > + if (topdown_attrs[0] && str) {
>> > + if (warn)
>> > + fprintf(stderr,
>> > + "nmi_watchdog enabled with topdown. May give wrong results.\n"
>> > + "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
>>
>> This is x86 specific. Why not just try it out and in case of error
>> suggest checking
>> if pinned system-wide events exist (such as NMI watchdog on x86). That would
>> be more generic.
>
> That's really complicated, I would have to tear down all state and then
> resubmit all the events. I think just checking the NMI watchdog is good
> enough. I couldn't give a sensible error message for the generic case
> anyway.
>
What about that fancy error reporting extension that was proposed a few months
back? It could help print a sensible error msg. Did it ever make it in?

2015-12-17 10:28:05

by Stephane Eranian

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

Andi,

On Tue, Dec 15, 2015 at 4:54 PM, Andi Kleen <[email protected]> wrote:
> Note to reviewers: includes both tools and kernel patches.
> The kernel patches are at the end.
>
> This patchkit adds support for TopDown measurements to perf stat
> It applies on top of my earlier metrics patchkit, posted
> separately, and the --metric-only patchkit (also
> posted separately)
>
> TopDown is intended to replace the frontend cycles idle/
> backend cycles idle metrics in standard perf stat output.
> These metrics are not reliable in many workloads,
> due to out of order effects.
>
> This implements a new --topdown mode in perf stat
> (similar to --transaction) that measures the pipe line
> bottlenecks using standardized formulas. The measurement
> can be all done with 5 counters (one fixed counter)
>
> The result are four metrics:
> FrontendBound, BackendBound, BadSpeculation, Retiring
>
> that describe the CPU pipeline behavior on a high level.
>
> FrontendBound and BackendBound
> BadSpeculation is a higher
>
> The full top down methology has many hierarchical metrics.
> This implementation only supports level 1 which can be
> collected without multiplexing. A full implementation
> of top down on top of perf is available in pmu-tools toplev.
> (http://github.com/andikleen/pmu-tools)
>
> The current version works on Intel Core CPUs starting
> with Sandy Bridge, and Atom CPUs starting with Silvermont.
> In principle the generic metrics should be also implementable
> on other out of order CPUs.
>
> TopDown level 1 uses a set of abstracted metrics which
> are generic to out of order CPU cores (although some
> CPUs may not implement all of them):
>
> topdown-total-slots Available slots in the pipeline
> topdown-slots-issued Slots issued into the pipeline
> topdown-slots-retired Slots successfully retired
> topdown-fetch-bubbles Pipeline gaps in the frontend
> topdown-recovery-bubbles Pipeline gaps during recovery
> from misspeculation
>
> These metrics then allow to compute four useful metrics:
> FrontendBound, BackendBound, Retiring, BadSpeculation.
>
> The formulas to compute the metrics are generic, they
> only change based on the availability on the abstracted
> input values.
>
> The kernel declares the events supported by the current
> CPU and perf stat then computes the formulas based on the
> available metrics.
>
>
> Example output:
>
> $ ./perf stat --topdown -a ./BC1s
>
> Performance counter stats for 'system wide':
>
> S0-C0 2 19650790 topdown-total-slots (100.00%)
> S0-C0 2 4445680.00 topdown-fetch-bubbles # 22.62% frontend bound (100.00%)
> S0-C0 2 1743552.00 topdown-slots-retired (100.00%)
> S0-C0 2 622954 topdown-recovery-bubbles (100.00%)
> S0-C0 2 2025498.00 topdown-slots-issued # 63.90% backend bound
> S0-C1 2 16685216540 topdown-total-slots (100.00%)
> S0-C1 2 962557931.00 topdown-fetch-bubbles (100.00%)
> S0-C1 2 4175583320.00 topdown-slots-retired (100.00%)
> S0-C1 2 1743329246 topdown-recovery-bubbles # 22.22% bad speculation (100.00%)
> S0-C1 2 6138901193.50 topdown-slots-issued # 46.99% backend bound
>
I don't see how this output could be very useful. What matters is the
percentage in the comments
and not so much the raw counts because what is the unit? Same remark
holds for the percentage.
I think you need to explain or show that this is % of issue slots and
not cycles.

> 1.535832673 seconds time elapsed
>
> $ perf stat --topdown --topdown --metric-only -I 100 ./BC1s

When I tried from your git tree the --metric-only option was not recognized.

> 0.100576098 frontend bound retiring bad speculation backend bound
> 0.100576098 8.83% 48.93% 35.24% 7.00%
> 0.200800845 8.84% 48.49% 35.53% 7.13%
> 0.300905983 8.73% 48.64% 35.58% 7.05%
> ...
>
This kind of output is more meaningful and clearer for end-users based
on my experience
and you'd like it per-core possibly.

>
>
> On Hyper Threaded CPUs Top Down computes metrics per core instead of per logical CPU.
> In this case perf stat automatically enables --per-core mode and also requires
> global mode (-a) and avoiding other filters (no cgroup mode)
>
> One side effect is that this may require root rights or a
> kernel.perf_event_paranoid=-1 setting.
>
> On systems without Hyper Threading it can be used per process.
>
> Full tree available in
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-11

That is in the top-down-2 branch instead, I think.

>
> No changelog against previous version. There were lots of changes.

2015-12-17 14:01:28

by Andi Kleen

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

On Thu, Dec 17, 2015 at 02:27:58AM -0800, Stephane Eranian wrote:
> > S0-C1 2 4175583320.00 topdown-slots-retired (100.00%)
> > S0-C1 2 1743329246 topdown-recovery-bubbles # 22.22% bad speculation (100.00%)
> > S0-C1 2 6138901193.50 topdown-slots-issued # 46.99% backend bound
> >
> I don't see how this output could be very useful. What matters is the
> percentage in the comments
> and not so much the raw counts because what is the unit? Same remark
> holds for the percentage.
> I think you need to explain or show that this is % of issue slots and
> not cycles.

The events already say slots, not cycles. Except for recovery-bubbles. Could add
-slots there too if you think it's helpful, although it would make the
name very long and may not fit into the column anymore.
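
For reference, the level-1 formulas implied by these events are
roughly the following (a summary of the TopDown methodology; the
SMT-mode scaling factors are omitted here):

        FrontendBound  = topdown-fetch-bubbles / topdown-total-slots
        BadSpeculation = (topdown-slots-issued - topdown-slots-retired
                          + topdown-recovery-bubbles) / topdown-total-slots
        Retiring       = topdown-slots-retired / topdown-total-slots
        BackendBound   = 1 - FrontendBound - BadSpeculation - Retiring

All four are fractions of pipeline slots, not cycles.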

>
> > 1.535832673 seconds time elapsed
> >
> > $ perf stat --topdown --topdown --metric-only -I 100 ./BC1s
>
> When I tried from your git tree the --metric-only option was not recognized.

See below.
>
> > 0.100576098 frontend bound retiring bad speculation backend bound
> > 0.100576098 8.83% 48.93% 35.24% 7.00%
> > 0.200800845 8.84% 48.49% 35.53% 7.13%
> > 0.300905983 8.73% 48.64% 35.58% 7.05%
> > ...
> >
> This kind of output is more meaningful and clearer for end-users based
> on my experience
> and you'd like it per-core possibly.

Yes --metric-only is a lot clearer.

per-core is supported and automatically enabled with SMT on.

> > Full tree available in
> > git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-11
>
> That is in the top-down-2 branch instead, I think.

Sorry, typo

The correct branch is perf/top-down-10

I also updated it now with the latest review feedback changes.

top-down-2 is a really old branch that indeed didn't have metric-only.

-Andi

--
[email protected] -- Speaking for myself only.

2015-12-17 14:04:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 05/10] perf, tools, stat: Basic support for TopDown in perf stat

> >> This is x86 specific. Why not just try it out and in case of error
> >> suggest checking
> >> if pinned system-wide events exist (such as NMI watchdog on x86). That would
> >> be more generic.
> >
> > That's really complicated, I would have to tear down all state and then
> > resubmit all the events. I think just checking the NMI watchdog is good
> > enough. I couldn't give a sensible error message for the generic case
> > anyway.
> >
> What about that fancy error reporting extension that was proposed a few months
> back? It could help print a sensible error msg. Did it ever make it in?

AFAIK it's not in. Also even the extended error wouldn't know who
installed a pinned counter.

-Andi

--
[email protected] -- Speaking for myself only.

2015-12-17 23:31:33

by Stephane Eranian

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

On Thu, Dec 17, 2015 at 6:01 AM, Andi Kleen <[email protected]> wrote:
> On Thu, Dec 17, 2015 at 02:27:58AM -0800, Stephane Eranian wrote:
>> > S0-C1 2 4175583320.00 topdown-slots-retired (100.00%)
>> > S0-C1 2 1743329246 topdown-recovery-bubbles # 22.22% bad speculation (100.00%)
>> > S0-C1 2 6138901193.50 topdown-slots-issued # 46.99% backend bound
>> >
>> I don't see how this output could be very useful. What matters is the
>> percentage in the comments
>> and not so much the raw counts because what is the unit? Same remark
>> holds for the percentage.
>> I think you need to explain or show that this is % of issue slots and
>> not cycles.
>
> The events already say slots, not cycles. Except for recovery-bubbles. Could add
> -slots there too if you think it's helpful, although it would make the
> name very long and may not fit into the column anymore.
>
I would drop the default output, it is not useful.

I would not add a --topdown option but instead a --metric option with arguments
such that other metrics could be added later:

$ perf stat --metrics topdown -I 1000 -a sleep 100

If you do this, you do not need the --metric-only option

The double --topdown is confusing.

Why force --per-core when HT is on? I know you need to aggregate
per core, but you could still display globally. And then if the user
requests --per-core, then display per core. Same if the user specifies
--per-socket. I know this requires some more plumbing inside perf
but it would be clearer and simpler for users to interpret.

One bug I found when testing is that if you run this with HT on:

$ perf stat -a --topdown -I 1000 --metric-only sleep 100
Then you get data for frontend and backend but nothing for retiring or
bad speculation.

I suspect it is because you expect --metric-only to be used only when
you have the
double --topdown. That's why I think this double topdown is confusing. If you do
as I suggest, it will be much simpler.

>>
>> > 1.535832673 seconds time elapsed
>> >
>> > $ perf stat --topdown --topdown --metric-only -I 100 ./BC1s
>>
>> When I tried from your git tree the --metric-only option was not recognized.
>
> See below.
>>
>> > 0.100576098 frontend bound retiring bad speculation backend bound
>> > 0.100576098 8.83% 48.93% 35.24% 7.00%
>> > 0.200800845 8.84% 48.49% 35.53% 7.13%
>> > 0.300905983 8.73% 48.64% 35.58% 7.05%
>> > ...
>> >
>> This kind of output is more meaningful and clearer for end-users based
>> on my experience
>> and you'd like it per-core possibly.
>
> Yes --metric-only is a lot clearer.
>
> per-core is supported and automatically enabled with SMT on.
>
>> > Full tree available in
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-11
>>
>> That is in the top-down-2 branch instead, I think.
>
> Sorry, typo
>
> The correct branch is perf/top-down-10
>
> I also updated it now with the latest review feedback changes.
>
> top-down-2 is an really old branch that indeed didn't have metric-only.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

2015-12-18 01:55:42

by Andi Kleen

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

Thanks for testing.


On Thu, Dec 17, 2015 at 03:31:30PM -0800, Stephane Eranian wrote:
> I would not add a --topdown option but instead a --metric option with arguments
> such that other metrics could be added later:
>
> $ perf stat --metrics topdown -I 1000 -a sleep 100
>
> If you do this, you do not need the --metric-only option

The --metric-only option is useful with other metrics too. For example
to get concise (and plottable) IPC or TSX abort statistics. See the
examples in the original commit.

However could make --topdown default to --metric-only and add
an option to turn it off. Yes that's probably a better default
for more people, although some people could be annoyed by the
wide output.

> The double --topdown is confusing.

Ok. I was thinking of changing it and adding an extra argument for
the "ignore threshold" behavior. That would also make it more extensible
if we ever add Level 2.

>
> Why force --per-core when HT is on. I know you you need to aggregate
> per core, but
> you could still display globally. And then if user requests
> --per-core, then display per core.

Global TopDown doesn't make much sense. Suppose you have two programs
running on different cores, one frontend bound and one backend bound.
What would the union of the two mean? And you may well end up
with sums of ratios which are >100%.

The only exception where it's useful is for the single threaded
case (like the toplev --single-thread option). However it is something
ugly and difficult because the user would need to ensure that there is
nothing active on the sibling thread. So I left it out.

> Same if user specifies --per-socket. I know this requires some more
> plumbing inside perf
> but it would be clearer and simpler to interpret to users.

Same problem as above.

>
> One bug I found when testing is that if you run this with HT on:
>
> $ perf stat -a --topdown -I 1000 --metric-only sleep 100
> Then you get data for frontend and backend but nothing for retiring or
> bad speculation.

You see all the columns, but no data in some?

That's intended: the percentage is only printed when it crosses a
threshold. That's part of the top down specification.
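
A minimal sketch of such a threshold gate, with a hypothetical helper
name and threshold parameter (this is not the code from the patchkit):

        #include <stdio.h>

        /* sketch: print the ratio as a percentage only when it crosses
         * its threshold, otherwise leave the column blank */
        static void print_td_metric(double ratio, double threshold)
        {
                if (ratio >= threshold)
                        printf("%6.2f%%", 100.0 * ratio);
                else
                        printf("%7s", "");
        }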

> I suspect it is because you expect --metric-only to be used only when
> you have the
> double --topdown. That's why I think this double topdown is confusing. If you do
> as I suggest, it will be much simpler.

It works fine with single topdown as far as I can tell.


-Andi
--
[email protected] -- Speaking for myself only.

2015-12-18 09:33:24

by Stephane Eranian

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

Andi,

On Thu, Dec 17, 2015 at 5:55 PM, Andi Kleen <[email protected]> wrote:
> Thanks for testing.
>
>
> On Thu, Dec 17, 2015 at 03:31:30PM -0800, Stephane Eranian wrote:
>> I would not add a --topdown option but instead a --metric option with arguments
>> such that other metrics could be added later:
>>
>> $ perf stat --metrics topdown -I 1000 -a sleep 100
>>
>> If you do this, you do not need the --metric-only option
>
> The --metric-only option is useful with other metrics too. For example
> to get concise (and plottable) IPC or TSX abort statistics. See the
> examples in the original commit.
>
> However could make --topdown default to --metric-only and add
> an option to turn it off. Yes that's probably a better default
> for more people, although some people could be annoyed by the
> wide output.
>
>> The double --topdown is confusing.
>
> Ok. I was thinking of changing it and adding an extra argument for
> the "ignore threshold" behavior. That would also make it more extensible
> if we ever add Level 2.
>
I think you should drop that implicit threshold data suppression
feature altogether.
See what I write below.

>>
>> Why force --per-core when HT is on. I know you you need to aggregate
>> per core, but
>> you could still display globally. And then if user requests
>> --per-core, then display per core.
>
> Global TopDown doesn't make much sense. Suppose you have two programs
> running on different cores, one frontend bound and one backend bound.
> What would the union of the two mean? And you may well end up
> with sums of ratios which are >100%.
>
How could that be if you consider that the machine is N-wide and not just 4-wide
anymore?

How is what you are describing here different when HT is off?
If you force --per-core with HT-on, then you need to force it too when
HT is off so that you get a similar per core breakdown. In the HT on
case, each Sx-Cy represents 2 threads, compared to 1 in the non HT
case. Right now, you have non-HT reporting global, HT reporting per-core.
That does not make much sense to me.

> The only exception where it's useful is for the single threaded
> case (like the toplev --single-thread) option. However it is something
> ugly and difficult because the user would need to ensure that there is
> nothing active on the sibling thread. So I left it out.
>
>> Same if user specifies --per-socket. I know this requires some more
>> plumbing inside perf
>> but it would be clearer and simpler to interpret to users.
>
> Same problem as above.
>
>>
>> One bug I found when testing is that if you do with HT-on:
>>
>> $ perf stat -a --topdown -I 1000 --metric-only sleep 100
>> Then you get data for frontend and backend but nothing for retiring or
>> bad speculation.
>
> You see all the columns, but no data in some?
>
yes, and I don't like that. It is confusing especially when you do not
know the threshold.
Why are you suppressing the 'retiring' data when it is at 25% (1/4 of
the maximum possible)
when I am running a simple noploop? 25% is a sign of underutilization,
which could be useful too.

Furthermore, it makes it harder to parse, including with the -x option
because some fields may
not be there. I would rather see all the values. In non -x mode, you
could use color to indicate
high/low thresholds (similar to perf report).


> That's intended: the percentage is only printed when it crosses a
> threshold. That's part of the top down specification.
>
I don't like that. I would rather see all the percentages.
My remark applies to non-topdown metrics as well, such as IPC.
Clearly IPC is awkward to use: you need to know that you have to
measure cycles and instructions to get IPC with --metric-only. Again,
I would rather see:
$ perf stat --metrics ipc ....
$ perf stat --metrics topdown

It is more uniform and users do not have to worry about what events
to use to compute a metric. As an example, here is what you
could do (showing side by side metrics ipc and uops/cycles):

# perf stat -a --metric ipc,upc -I 1000 sleep 100
#======================================
# | ipc| upc
# | IPC %Peak| UPC
# | ^ ^ | ^
#======================================
1.006038929 0.21 5.16% 0.60
2.012169514 0.21 5.31% 0.60
3.018314389 0.20 5.08% 0.59
4.024430081 0.21 5.26% 0.60

>> I suspect it is because you expect --metric-only to be used only when
>> you have the
>> double --topdown. That's why I think this double topdown is confusing. If you do
>> as I suggest, it will be much simpler.
>
> It works fine with single topdown as far as I can tell.
>
>
> -Andi
> --
> [email protected] -- Speaking for myself only.

2015-12-18 21:38:59

by Andi Kleen

[permalink] [raw]
Subject: Re: Add top down metrics to perf stat v2

On Fri, Dec 18, 2015 at 01:31:18AM -0800, Stephane Eranian wrote:
> >> Why force --per-core when HT is on? I know you need to aggregate
> >> per core, but
> >> you could still display globally. And then if user requests
> >> --per-core, then display per core.
> >
> > Global TopDown doesn't make much sense. Suppose you have two programs
> > running on different cores, one frontend bound and one backend bound.
> > What would the union of the two mean? And you may well end up
> > with sums of ratios which are >100%.
> >
> How could that be if you consider that the machine is N-wide and not just 4-wide
> anymore?
>
> How is what you are describing here different when HT is off?

I was talking about cores, not CPU threads.

With global aggregation we would aggregate data from different cores,
which is highly dubious for TopDown.

CPU threads on a core are of course aggregated, that is why the patchkit
forces --per-core with HT on.

> If you force --per-core with HT-on, then you need to force it too when
> HT is off so that you get a similar per core breakdown. In the HT on
> case, each Sx-Cy represents 2 threads, compared to 1 in the non HT
> case.Right now, you have non-HT reporting global, HT reporting per-core.
> That does not make much sense to me.

Ok. I guess I can force --per-core in this case too. This would simplify
things because we can get rid of the agg-per-core attribute.
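
A minimal sketch of what that could look like on the tools side,
assuming builtin-stat.c's aggr_mode variable and AGGR_* enum
(hypothetical, not a posted patch):

        /* sketch: force core-level aggregation whenever --topdown is
         * given, with or without SMT */
        if (topdown_run && aggr_mode == AGGR_GLOBAL)
                aggr_mode = AGGR_CORE;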

> >> but it would be clearer and simpler for users to interpret.
> >
> > Same problem as above.
> >
> >>
> >> One bug I found when testing is that if you run this with HT on:
> >>
> >> $ perf stat -a --topdown -I 1000 --metric-only sleep 100
> >> Then you get data for frontend and backend but nothing for retiring or
> >> bad speculation.
> >
> > You see all the columns, but no data in some?
> >
> yes, and I don't like that. It is confusing especially when you do not
> know the threshold.
> Why are you suppressing the 'retiring' data when it is at 25% (1/4 of
> the maximum possible)
> when I am running a simple noploop? 25% is a sign of underutilization,
> that could be useful too.

It's what the TopDown specification uses and the paper describes.

The thresholds are needed when you have more than one level because
the lower levels become meaningless if their parents didn't cross the
threshold. Otherwise you may report something that looks like a
bottleneck, but isn't.

There is currently only level 1 in the patchkit, but if we ever
add more levels we absolutely need thresholds. So it's better to have
them from Day 1.

Utilization should be reported separately. TopDown cannot give
utilization because it doesn't know about idle time.

I can report '-' for empty fields if it helps you. It's not clear
to me why empty fields in CSV are a problem.

I don't think colors are useful here; this would have the problem
described above.

>
> > That's intended: the percentage is only printed when it crosses a
> > threshold. That's part of the top down specification.
> >
> I don't like that. I would rather see all the percentages.
> My remark applies to non topdown metrics as well, such as IPC.
> Clearly the IPC is awkward to use. You need to know you need to
> measure cycles, instructions to get ipc with --metric-only. Again,

Well it's the default (perf stat --metric-only), or with -d*, and it works fine
with --transaction too.

If you think there should be more predefined sets of metrics
that's fine for me too, but it would be a separate
patch.


-Andi
--
[email protected] -- Speaking for myself only.

2015-12-21 16:09:03

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 01/10] perf, tools: Dont stop PMU parsing on alias parse error

On Tue, Dec 15, 2015 at 04:54:17PM -0800, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> When an error happens during alias parsing currently the complete
> parsing of all attributes of the PMU is stopped. This breaks
> old perf on a newer kernel that may have not-yet-known
> alias attributes (such as .scale or .per-pkg).

I kind of recall seeing a similar patch, but it looks like it never made it in

>
> Continue when some attribute is unparseable.
>
> This is IMHO a stable candidate and should be backported
> to older versions to avoid problems with newer kernels.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> tools/perf/util/pmu.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
> index e4b173d..8a520e9 100644
> --- a/tools/perf/util/pmu.c
> +++ b/tools/perf/util/pmu.c
> @@ -283,13 +283,12 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
> {
> struct dirent *evt_ent;
> DIR *event_dir;
> - int ret = 0;
>
> event_dir = opendir(dir);
> if (!event_dir)
> return -EINVAL;
>
> - while (!ret && (evt_ent = readdir(event_dir))) {
> + while ((evt_ent = readdir(event_dir))) {
> char path[PATH_MAX];
> char *name = evt_ent->d_name;
> FILE *file;
> @@ -305,17 +304,16 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
>
> snprintf(path, PATH_MAX, "%s/%s", dir, name);
>
> - ret = -EINVAL;
> file = fopen(path, "r");
> if (!file)
> - break;
> + continue;

I think we should emit a warning in case of failure to let the user know

>
> - ret = perf_pmu__new_alias(head, dir, name, file);

same here

> + perf_pmu__new_alias(head, dir, name, file);
> fclose(file);
> }
>

thanks,
jirka
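
A minimal sketch of the warnings Jiri asks for, assuming perf's
pr_warning() helper from util/debug.h (not part of the posted patch):

        file = fopen(path, "r");
        if (!file) {
                pr_warning("Cannot open %s\n", path);
                continue;
        }

        /* sketch: warn instead of silently ignoring a broken alias */
        if (perf_pmu__new_alias(head, dir, name, file) < 0)
                pr_warning("Cannot set up alias %s\n", name);
        fclose(file);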

2015-12-21 16:14:24

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 02/10] perf, tools, stat: Force --per-core mode for .agg-per-core aliases

On Tue, Dec 15, 2015 at 04:54:18PM -0800, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> When an event alias is used that the kernel marked as .agg-per-core, force
> --per-core mode (and also require -a and forbid cgroups or per thread mode).
> This in turn means that --topdown forces --per-core mode.
>
> This is needed for TopDown in SMT mode, because it needs to measure
> all threads in a core together and merge the values to compute the correct
> percentages of how the pipeline is limited.
>
> We do this if any alias is agg-per-core.
>
> Add the code to parse the .agg-per-core attributes and propagate
> the information to the evsel. Then the main stat code does
> the necessary checks and forces per core mode.
>
> Open issue: in combination with -C ... we get wrong values. I think that's
> an existing bug that needs to be debugged/fixed separately.

could you please be more specific on what's failing for you?

thanks,
jirka

2015-12-21 16:16:09

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 02/10] perf, tools, stat: Force --per-core mode for .agg-per-core aliases

On Tue, Dec 15, 2015 at 04:54:18PM -0800, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> When an event alias is used that the kernel marked as .agg-per-core, force
> --per-core mode (and also require -a and forbid cgroups or per thread mode).
> This in turn means that --topdown forces --per-core mode.
>
> This is needed for TopDown in SMT mode, because it needs to measure
> all threads in a core together and merge the values to compute the correct
> percentages of how the pipeline is limited.
>
> We do this if any alias is agg-per-core.
>
> Add the code to parse the .agg-per-core attributes and propagate
> the information to the evsel. Then the main stat code does
> the necessary checks and forces per core mode.

please split it into 2 patches:
- support .agg-per-core alias parsing
- force --per-core mode for .agg-per-core aliases

thanks,
jirka