With the introduction of NUMA systems came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (i.e., a modified
cacheline) and you have the possibility of very long latencies. These latencies can
bottleneck a program.
The program added by these patches helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline. The term used throughout this program and
the various changelogs is HITM. This means nodeX went to read a cacheline
and it was discovered to be loaded in nodeY's LLC (hence the cache HIT). The
remote cacheline was also in a 'M'odified state, thus creating a 'HIT M' for a hit in
a modified state. HITMs can happen locally and remotely. This program's interest
is mainly in remote HITMs, as they cause the longest latencies.
Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline. Is the sharing intentional ("true") or unintentional ("false")? We have seen
lots of "false" sharing cases, which lead to simple solutions such as separating the data
onto different cachelines.
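To make the "false" sharing case concrete, below is a minimal, hypothetical C
sketch (not part of this series; names and loop counts are illustrative). Two
threads increment adjacent counters that share one 64-byte cacheline, which
ping-pongs the line in 'M'odified state and generates HITMs; the fix is simply
padding the counters onto separate cachelines:

#include <pthread.h>

/* Both counters land in the same 64-byte cacheline, so two threads
 * (ideally pinned to different nodes) ping-pong the line in Modified
 * state and generate HITMs, even though they never touch each
 * other's data. */
struct { long a; long b; } shared;

/* The fix: separate the data onto different cachelines. */
struct {
        long a __attribute__((aligned(64)));
        long b __attribute__((aligned(64)));
} fixed;

static void *bump_a(void *arg)
{
        for (long i = 0; i < 100000000L; i++)
                __sync_fetch_and_add(&shared.a, 1);
        return arg;
}

static void *bump_b(void *arg)
{
        for (long i = 0; i < 100000000L; i++)
                __sync_fetch_and_add(&shared.b, 1);
        return arg;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;       /* build with: gcc -O2 -pthread */
}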
This tool does not distinguish between 'true' and 'false' sharing; instead it just points to
the more expensive sharing situations under the current workload. It is up to the user
to understand what the workload is doing to determine whether a problem exists and
how to report it.
The data output is verbose, and there are lots of data tables that interpret the latencies
and data addresses in different ways to help show where bottlenecks might lie.
Most of this idea, work, and calculations were done by Dick Fowles. My work mainly
consisted of porting it to perf. Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool. Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.
I would also like to thank Stephane Eranian for his early help and guidance on
navigating the differences between the current perf tool and how similar tools
looked at HP, and for his tireless work in getting the MMAP2 interface to stick.
Also thanks to Arnaldo and Jiri Olsa for their help and suggestions for this tool.
I also have a test program that generates a controlled number of HITMs, which we used
frequently to validate our early work (the Intel docs were not always clear which
bits had to be set, and some arches do not work well). I would like to add it, but
didn't know how (nor did I spend any serious time looking).
This program has been tested primarily on Intel's Ivy Bridge platforms. The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge. We haven't tried Haswell, as
it has a re-worked latency event implementation.
A handful of patches include re-enabling MMAP2 support and some fixes to perf itself. One
in particular hacks up how standard deviation is calculated. It works with our calculations
but may break other tools' expectations.
Comments, feedback, anything else welcomed.
Signed-off-by: Don Zickus <[email protected]>
Arnaldo Carvalho de Melo (2):
perf c2c: Shared data analyser
perf c2c: Dump raw records, decode data_src bits
Don Zickus (19):
Revert "perf: Disable PERF_RECORD_MMAP2 support"
perf, machine: Use map as success in ip__resolve_ams
perf, session: Change header.misc dump from decimal to hex
perf, stat: FIXME Stddev calculation is incorrect
perf, callchain: Add generic callchain print handler for stdio
perf, c2c: Rework setup code to prepare for features
perf, c2c: Add rbtree sorted on mmap2 data
perf, c2c: Add stats to track data source bits and cpu to node maps
perf, c2c: Sort based on hottest cache line
perf, c2c: Display cacheline HITM analysis to stdout
perf, c2c: Add callchain support
perf, c2c: Output summary stats
perf, c2c: Dump rbtree for debugging
perf, c2c: Fixup tid because of perf map is broken
perf, c2c: Add symbol count table
perf, c2c: Add shared cacheline summary table
perf, c2c: Add framework to analyze latency and display summary stats
perf, c2c: Add selected extreme latencies to output cacheline stats
table
perf, c2c: Add summary latency table for various parts of caches
kernel/events/core.c | 4 -
tools/perf/Documentation/perf-c2c.c | 22 +
tools/perf/Makefile.perf | 1 +
tools/perf/builtin-c2c.c | 2963 +++++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
tools/perf/ui/stdio/hist.c | 37 +
tools/perf/util/event.c | 36 +-
tools/perf/util/evlist.c | 37 +
tools/perf/util/evlist.h | 7 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/hist.h | 4 +
tools/perf/util/machine.c | 2 +-
tools/perf/util/session.c | 2 +-
tools/perf/util/stat.c | 3 +-
15 files changed, 3097 insertions(+), 24 deletions(-)
create mode 100644 tools/perf/Documentation/perf-c2c.c
create mode 100644 tools/perf/builtin-c2c.c
--
1.7.11.7
When printing the raw dump of a data file, the header.misc field is
printed as a decimal. Unfortunately, that field is a bit mask, so it is
hard to interpret as a decimal.
Print it in hex instead, so the user can easily see which bits are set and, more
importantly, what type of info they convey (for example, 16386 is much easier
to read as 0x4002, i.e. PERF_RECORD_MISC_USER | PERF_RECORD_MISC_EXACT_IP).
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/session.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 0b39a48..d1ad10f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
if (!dump_trace)
return;
- printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
+ printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
event->header.misc, sample->pid, sample->tid, sample->ip,
sample->period, sample->addr);
--
1.7.11.7
When trying to map a bunch of instruction addresses to their respective
threads, I kept getting a lot of bogus entries [I forget the exact reason
as I patched my code months ago].
Looking through ip__resolve_ams, I noticed the check for
if (al.sym)
and realized that most times I have an al.map definition, but sometimes
al.sym is undefined. In the cases where al.sym is undefined, the loop
keeps going even though a valid al.map exists.
Modify this check to use the more reliable al.map. This fixed my bogus
entries.
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/machine.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index c872991..620a198 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1213,7 +1213,7 @@ static void ip__resolve_ams(struct machine *machine, struct thread *thread,
*/
thread__find_addr_location(thread, machine, m, MAP__FUNCTION,
ip, &al);
- if (al.sym)
+ if (al.map)
goto found;
}
found:
--
1.7.11.7
Fix the standard deviation calculation. The current code computes the
standard deviation of the mean (the standard error), which is slightly
different.
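For reference, a minimal sketch of the distinction, using the same
bookkeeping fields as perf's struct stats in util/stat.c (the struct mirrors
that file; the helper names here are illustrative):

#include <math.h>

struct stats { double n, mean, M2; };   /* as kept by update_stats() */

/* Sample standard deviation: the spread of the individual samples.
 * This is what the fixed stddev_stats() returns. */
static double sample_stddev(struct stats *s)
{
        return s->n < 2 ? 0.0 : sqrt(s->M2 / (s->n - 1));
}

/* Standard deviation of the mean (standard error): what the old code
 * computed; it shrinks as n grows and answers a different question. */
static double stddev_of_mean(struct stats *s)
{
        return s->n < 2 ? 0.0 : sqrt(s->M2 / (s->n - 1) / s->n);
}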
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/stat.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6506b3d..58aa661 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -47,7 +47,8 @@ double stddev_stats(struct stats *stats)
return 0.0;
variance = stats->M2 / (stats->n - 1);
- variance_mean = variance / stats->n;
+ //variance_mean = variance / stats->n;
+ variance_mean = variance;
return sqrt(variance_mean);
}
--
1.7.11.7
Sometimes you want to verify the rbtree sorting on a unique id
is working correctly. This allows you to dump it.
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c8e76dc..0760f6a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -900,6 +900,34 @@ err:
#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+static void dump_rb_tree(struct rb_root *tree,
+ struct perf_c2c *c2c __maybe_unused)
+{
+ struct rb_node *next = rb_first(tree);
+ struct c2c_entry *n;
+
+ printf("%3s %3s %8s %8s %6s %16s %16s %16s %16s %16s %8s\n",
+ "Maj", "Min", "Ino", "InoGen", "Pid", "Start",
+ "Vaddr", "al_addr", "ip addr", "pgoff", "cpumode");
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, rb_node);
+ next = rb_next(&n->rb_node);
+
+ printf("%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %16lx %16lx %8x\n",
+ n->mi->daddr.map->maj,
+ n->mi->daddr.map->min,
+ n->mi->daddr.map->ino,
+ n->mi->daddr.map->ino_generation,
+ n->thread->pid_,
+ n->mi->daddr.map->start,
+ n->mi->daddr.addr,
+ n->mi->daddr.al_addr,
+ n->mi->iaddr.al_addr,
+ n->mi->daddr.map->pgoff,
+ n->cpumode);
+ }
+}
+
static void c2c_hit__update_stats(struct c2c_stats *new,
struct c2c_stats *old)
{
@@ -1494,6 +1522,8 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}
+ if (verbose > 2)
+ dump_rb_tree(&c2c->tree_physid, c2c);
print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);
--
1.7.11.7
Just a simple summary table of latencies for the different parts of the
hardware cache hierarchy (L1, LFB, L2, LLC [local/remote], DRAM [local/remote]).
Of course, this is based on the original ldlat filter level (set via the event's
ldlat parameter, e.g. cpu/mem-loads,ldlat=30/pp), which is 30 cycles as of this
writing; samples below that latency are never recorded, which makes the L1, LFB,
and L2 numbers slightly misleading.
Originally done by Dick Fowles and ported to perf by me.
Suggested-by: Joe Mario <[email protected]>
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 215 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 1fa21b4..a73535a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -122,6 +122,41 @@ typedef struct {
void *analyze;
} stats_t;
+enum {
+ LD_L1HIT_NONE,
+ LD_LFBHIT_NONE,
+ LD_L2HIT_NONE,
+ LD_L3HIT_NONE,
+ LD_L3HIT_MISS, /* other core snoop miss */
+ LD_L3HIT_HIT, /* hit on other core within socket, no fwd */
+ LD_L3HIT_HITM, /* hitm on other core within socket */
+ LD_L3MISS_HIT_CACHE, /* remote cache hit, fwd data? */
+ LD_L3MISS_HITM_CACHE, /* remote cache hitm, C2C, implicit WB, invalidate */
+ LD_L3MISS_HIT_LDRAM, /* load shared from local dram */
+ LD_L3MISS_HIT_RDRAM, /* load shared from remote dram */
+ LD_L3MISS_MISS_LDRAM, /* load exclusive from local dram */
+ LD_L3MISS_MISS_RDRAM, /* load exclusive from remote dram */
+ LD_L3MISS_NA,
+ LD_UNCACHED,
+ LOAD_CATAGORIES,
+ ST_L1HIT_NA,
+ ST_L1MISS_NA,
+ ST_UNCACHED,
+ LOCK, /* defines a bit flag to represent locked events */
+ ALL_CATAGORIES
+};
+
+struct ld_lat_stats {
+ struct stats stats;
+ u64 total;
+};
+
+struct ld_lat_stats ld_lat_stats[ALL_CATAGORIES];
+
+typedef struct {
+ const char *name;
+ int id;
+} xref_t;
enum { EMPTY, SYMBOL, OBJECT };
enum { OVERALL, EXTREMES, ANALYZE, SCOPES };
@@ -131,6 +166,16 @@ struct c2c_latency_stats hist_info[SCOPES];
enum { OP, LVL, SNP, LCK, TLB };
+#define LOAD_OP(a) ((a) & PERF_MEM_OP_LOAD )
+#define STORE_OP(a) ((a) & PERF_MEM_OP_STORE )
+#define LOCKED_OP(a) ((a) & PERF_MEM_LOCK_LOCKED)
+
+#define SNOOP_NA(a) ((a) & PERF_MEM_SNOOP_NA)
+#define SNOOP_NONE(a) ((a) & PERF_MEM_SNOOP_NONE)
+#define SNOOP_MISS(a) ((a) & PERF_MEM_SNOOP_MISS)
+#define SNOOP_HIT(a) ((a) & PERF_MEM_SNOOP_HIT)
+#define SNOOP_HITM(a) ((a) & PERF_MEM_SNOOP_HITM)
+
#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
#define RMT_LLC (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
@@ -1066,6 +1111,87 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
new->total_period += old->total_period;
}
+xref_t names[LOAD_CATAGORIES] = {
+ { "L1 Hit - Snp None ", LD_L1HIT_NONE },
+ { "LFB Hit - Snp None ", LD_LFBHIT_NONE },
+ { "L2 Hit - Snp None ", LD_L2HIT_NONE },
+ { "L3 Hit - Snp None ", LD_L3HIT_NONE },
+ { "L3 Hit - Snp Miss ", LD_L3HIT_MISS },
+ { "L3 Hit - Snp Hit - Lcl Cache", LD_L3HIT_HIT },
+ { "L3 Hit - Snp Hitm - Lcl Cache", LD_L3HIT_HITM },
+ { "L3 Miss - Snp Hit - Rmt Cache", LD_L3MISS_HIT_CACHE },
+ { "L3 Miss - Snp Hitm - Rmt Cache", LD_L3MISS_HITM_CACHE },
+ { "L3 Miss - Snp Hit - Lcl Dram ", LD_L3MISS_HIT_LDRAM },
+ { "L3 Miss - Snp Hit - Rmt Dram ", LD_L3MISS_HIT_RDRAM },
+ { "L3 Miss - Snp Miss - Lcl Dram ", LD_L3MISS_MISS_LDRAM },
+ { "L3 Miss - Snp Miss - Rmt Dram ", LD_L3MISS_MISS_RDRAM },
+ { "L3 Miss - Snp NA ", LD_L3MISS_NA },
+ { "Ld UNC - Snp None ", LD_UNCACHED },
+};
+
+static void print_latency_load_info(void)
+{
+#define TITLE "Load Access & Excute Latency Information"
+
+ char title_str[256];
+ double stddev;
+ double mean;
+ double covar;
+ uint64_t cycles;
+ int pad;
+ int idx;
+ int i;
+
+
+ cycles = 0;
+
+ for (i = 0; i < LOAD_CATAGORIES; i++)
+ cycles += ld_lat_stats[i].total;
+
+ sprintf(title_str, "%32s %10s %10s %10s %10s %10s %10s",
+ " ",
+ "Count",
+ "Minmum",
+ "Average",
+ "CV ",
+ "Maximum",
+ "%dist");
+
+ pad = (strlen(title_str)/2) - (strlen(TITLE)/2);
+
+ printf("\n\n");
+ for (i = 0; i < (int)strlen(title_str); i++) printf("=");
+ printf("\n");
+ for (i = 0; i < pad; i++) printf(" ");
+ printf("%s\n", TITLE);
+ printf("\n");
+ printf("%s\n", title_str);
+ for (i = 0; i < (int)strlen(title_str); i++) printf("=");
+ printf("\n");
+
+ for (i = 0; i < LOAD_CATAGORIES; i++) {
+
+ idx = names[i].id;
+
+ mean = avg_stats(&ld_lat_stats[idx].stats);
+ stddev = stddev_stats(&ld_lat_stats[idx].stats);
+ covar = stddev / mean;
+
+ printf("%-32s %10lu %10lu %10.0f %10.4f %10lu %10.1f%%\n",
+ names[i].name,
+ (u64)ld_lat_stats[idx].stats.n,
+ ld_lat_stats[idx].stats.min,
+ ld_lat_stats[idx].stats.mean,
+ covar,
+ ld_lat_stats[idx].stats.max,
+ 100. * ((double)ld_lat_stats[idx].total / (double)cycles));
+
+ }
+
+ printf("\n");
+
+}
+
LIST_HEAD(ref_tree);
LIST_HEAD(ref_tree_sorted);
struct refs {
@@ -1721,6 +1847,88 @@ static void calculate_latency_info(struct rb_root *tree,
selected->mode = mode;
}
+static int decode_src(union perf_mem_data_src dsrc)
+{
+ if (LOAD_OP(dsrc.mem_op)) {
+
+ if (FILLBUF_HIT(dsrc.mem_lvl)) return(LD_LFBHIT_NONE);
+ if (L1CACHE_HIT(dsrc.mem_lvl)) return(LD_L1HIT_NONE);
+ if (L2CACHE_HIT(dsrc.mem_lvl)) return(LD_L2HIT_NONE);
+
+ if (L3CACHE_HIT(dsrc.mem_lvl)) {
+
+ if (SNOOP_HITM(dsrc.mem_snoop)) return(LD_L3HIT_HITM);
+ if (SNOOP_HIT(dsrc.mem_snoop)) return(LD_L3HIT_HIT);
+ if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3HIT_MISS);
+ if (SNOOP_NONE(dsrc.mem_snoop)) return(LD_L3HIT_NONE);
+
+ }
+
+ if (L3CACHE_MISS(dsrc.mem_lvl)) {
+
+ if (SNOOP_NA(dsrc.mem_snoop)) return(LD_L3MISS_NA);
+
+ }
+
+ if (RMT_LLCHIT(dsrc.mem_lvl)) {
+
+ if (SNOOP_HITM(dsrc.mem_snoop)) return(LD_L3MISS_HITM_CACHE);
+ if (SNOOP_HIT(dsrc.mem_snoop)) return(LD_L3MISS_HIT_CACHE);
+
+ }
+
+
+ if (LCL_MEM(dsrc.mem_lvl)) {
+
+ if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3MISS_MISS_LDRAM);
+ if (SNOOP_HIT(dsrc.mem_snoop)) return(LD_L3MISS_HIT_LDRAM);
+
+ }
+
+
+ if (RMT_MEM(dsrc.mem_lvl)) {
+
+ if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3MISS_MISS_RDRAM);
+ if (SNOOP_HIT(dsrc.mem_snoop)) return(LD_L3MISS_HIT_RDRAM);
+
+ }
+
+ if (LD_UNCACHED(dsrc.mem_lvl)) {
+ if (SNOOP_NONE(dsrc.mem_snoop)) return(LD_UNCACHED);
+ }
+
+ }
+
+
+ if (STORE_OP(dsrc.mem_op)) {
+
+ if (SNOOP_NA(dsrc.mem_snoop)) {
+
+ if (L1CACHE_HIT(dsrc.mem_lvl)) return(ST_L1HIT_NA);
+ if (L1CACHE_MISS(dsrc.mem_lvl)) return(ST_L1MISS_NA);
+
+ }
+
+ }
+ return -1;
+}
+
+static void latency_update_stats(union perf_mem_data_src src,
+ u64 weight)
+{
+ int id = decode_src(src);
+
+ if (id < 0) {
+ pr_err("Bad data_src: %llx\n", src.val);
+ return;
+ }
+
+ update_stats(&ld_lat_stats[id].stats, weight);
+ ld_lat_stats[id].total += weight;
+
+ return;
+}
+
static void c2c_analyze_latency(struct perf_c2c *c2c)
{
@@ -1742,6 +1950,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
extremes = &hist_info[EXTREMES];
selected = &hist_info[ANALYZE];
+ for (i = 0; i < LOAD_CATAGORIES; i++)
+ init_stats(&ld_lat_stats[i].stats);
+
/* sort on latency */
while (next) {
n = rb_entry(next, struct c2c_entry, rb_node);
@@ -1749,6 +1960,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
snoop = n->mi->data_src.mem_snoop;
+ /* piggy back updating load latency stats */
+ latency_update_stats(n->mi->data_src, n->weight);
+
/* filter out HITs as un-interesting */
if ((snoop & P(SNOOP, HIT)) ||
(snoop & P(SNOOP, HITM)) ||
@@ -1765,6 +1979,7 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
calculate_latency_selected_info(&lat_select_tree, selected->start, &lat_stats);
print_latency_select_info(&lat_select_tree, &lat_stats);
+ print_latency_load_info();
return;
}
--
1.7.11.7
Just another table that displays the referenced symbols in the analysis
report. The table lists the most frequently used symbols first.
It is just another way to look at similar data to figure out who
is causing the most contention (based on the workload used).
Originally done by Dick Fowles and ported by me.
Suggested-by: Joe Mario <[email protected]>
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 32c2319..979187f 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
new->total_period += old->total_period;
}
+LIST_HEAD(ref_tree);
+LIST_HEAD(ref_tree_sorted);
+struct refs {
+ struct list_head list;
+ int nr;
+ const char *name;
+ char *long_name;
+};
+
+static int update_ref_tree(struct c2c_entry *entry)
+{
+ struct refs *p;
+ struct dso *dso = entry->mi->iaddr.map->dso;
+ const char *name = dso->short_name;
+
+ list_for_each_entry(p, &ref_tree, list) {
+ if (!strcmp(p->name, name))
+ goto found;
+ }
+
+ p = zalloc(sizeof(struct refs));
+ if (!p)
+ return -1;
+ p->name = name;
+ p->long_name = dso->long_name;
+ list_add_tail(&p->list, &ref_tree);
+
+found:
+ p->nr++;
+ return 0;
+}
+
+static void print_symbol_record_count(struct rb_root *tree)
+{
+ struct rb_node *next = rb_first(tree);
+ struct c2c_entry *n;
+ struct refs *p, *q, *pn;
+ char string[256];
+ char delimit[256];
+ int i;
+ int idx = 0;
+
+ /* gather symbol references */
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, rb_node);
+ next = rb_next(&n->rb_node);
+
+ if (update_ref_tree(n)) {
+ pr_err("Could not update reference tree\n");
+ goto cleanup;
+ }
+ }
+
+ /* sort on number of references per symbol */
+ list_for_each_entry_safe(p, pn, &ref_tree, list) {
+ list_del_init(&p->list);
+ list_for_each_entry(q, &ref_tree_sorted, list) {
+ if (p->nr > q->nr) {
+ list_add_tail(&p->list, &q->list);
+ break;
+ }
+ }
+ if (list_empty(&p->list))
+ list_add_tail(&p->list, &ref_tree_sorted);
+ }
+
+ /* print header info */
+ sprintf(string, "%5s %8s %-32s %-80s",
+ "Index",
+ "Records",
+ "Object Name",
+ "Object Path");
+
+ delimit[0] = '\0';
+ for (i = 0; i < (int)strlen(string); i++) strcat(delimit, "=");
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("%50s %s\n", " ", "Object Name, Path & Reference Counts");
+ printf("\n");
+ printf("%s\n", string);
+ printf("%s\n", delimit);
+
+ /* print out table */
+ list_for_each_entry(p, &ref_tree_sorted, list) {
+ printf("%5d %8d %-32s %-80s\n",
+ idx, p->nr, p->name, p->long_name);
+ idx++;
+ }
+ printf("\n");
+
+cleanup:
+ list_for_each_entry_safe(p, pn, &ref_tree_sorted, list) {
+ list_del(&p->list);
+ free(p);
+ }
+}
+
static void print_hitm_cacheline_header(void)
{
#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
@@ -1528,6 +1626,7 @@ static int perf_c2c__process_events(struct perf_session *session,
dump_rb_tree(&c2c->tree_physid, c2c);
print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);
+ print_symbol_record_count(&c2c->tree_physid);
err:
return err;
--
1.7.11.7
This adds a quick summary of the hottest cache contention lines based
on the input data. This summarizes what the broken-out table shows you,
so you can see at a quick glance which cachelines are interesting.
Originally done by Dick Fowles and ported by me.
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 136 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 979187f..014c9b0 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1048,6 +1048,141 @@ cleanup:
}
}
+static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
+ struct c2c_stats *shared_stats __maybe_unused,
+ struct c2c_stats *c2c_stats __maybe_unused)
+{
+#define SHM_TITLE "Shared Data Cache Line Table"
+
+ struct rb_node *next = rb_first(hitm_tree);
+ struct c2c_hit *h;
+ char header[256];
+ char delimit[256];
+ u32 crecords;
+ u32 lclmiss;
+ u32 ldcnt;
+ double p_hitm;
+ double p_all;
+ int totmiss;
+ int rmt_hitm;
+ int len;
+ int pad;
+ int i;
+
+ sprintf(header,"%28s %8s %8s %8s %8s %28s %18s %28s %18s %8s %28s",
+ " ",
+ "Total",
+ "%All ",
+ " ",
+ "Total",
+ "---- Core Load Hit ----",
+ "-- LLC Load Hit --",
+ "----- LLC Load Hitm -----",
+ "-- Load Dram --",
+ "LLC ",
+ "---- Store Reference ----");
+
+ len = strlen(header);
+ delimit[0] = '\0';
+
+ for (i = 0; i < len; i++)
+ strcat(delimit, "=");
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("\n");
+ pad = (strlen(header)/2) - (strlen(SHM_TITLE)/2);
+ for (i = 0; i < pad; i++)
+ printf(" ");
+ printf("%s\n", SHM_TITLE);
+ printf("\n");
+ printf("%s\n", header);
+
+ sprintf(header, "%8s %18s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s",
+ "Index",
+ "Phys Adrs",
+ "Records",
+ "Ld Miss",
+ "%hitm",
+ "Loads",
+ "FB",
+ "L1D",
+ "L2D",
+ "Lcl",
+ "Rmt",
+ "Total",
+ "Lcl",
+ "Rmt",
+ "Lcl",
+ "Rmt",
+ "Ld Miss",
+ "Total",
+ "L1Hit",
+ "L1Miss");
+
+ printf("%s\n", header);
+ printf("%s\n", delimit);
+
+ rmt_hitm = c2c_stats->t.rmt_hitm;
+ totmiss = c2c_stats->t.lcl_dram +
+ c2c_stats->t.rmt_dram +
+ c2c_stats->t.rmt_hit +
+ c2c_stats->t.rmt_hitm;
+
+ i = 0;
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+
+ lclmiss = h->stats.t.lcl_dram +
+ h->stats.t.rmt_dram +
+ h->stats.t.rmt_hitm +
+ h->stats.t.rmt_hit;
+
+ ldcnt = lclmiss +
+ h->stats.t.ld_fbhit +
+ h->stats.t.ld_l1hit +
+ h->stats.t.ld_l2hit +
+ h->stats.t.ld_llchit +
+ h->stats.t.lcl_hitm;
+
+ crecords = ldcnt +
+ h->stats.t.st_l1hit +
+ h->stats.t.st_l1miss;
+
+ p_hitm = (double)h->stats.t.rmt_hitm / (double)rmt_hitm;
+ p_all = (double)h->stats.t.rmt_hitm / (double)totmiss;
+
+ /* stop when the percentage gets too low */
+ if (p_hitm < DISPLAY_LINE_LIMIT)
+ break;
+
+ printf("%8d %#18lx %8u %7.2f%% %7.2f%% %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u\n",
+ i,
+ h->cacheline,
+ crecords,
+ 100. * p_all,
+ 100. * p_hitm,
+ ldcnt,
+ h->stats.t.ld_fbhit,
+ h->stats.t.ld_l1hit,
+ h->stats.t.ld_l2hit,
+ h->stats.t.ld_llchit,
+ h->stats.t.rmt_hit,
+ h->stats.t.lcl_hitm + h->stats.t.rmt_hitm,
+ h->stats.t.lcl_hitm,
+ h->stats.t.rmt_hitm,
+ h->stats.t.lcl_dram,
+ h->stats.t.rmt_dram,
+ lclmiss,
+ h->stats.t.store,
+ h->stats.t.st_l1hit,
+ h->stats.t.st_l1miss);
+
+ i++;
+ }
+}
+
static void print_hitm_cacheline_header(void)
{
#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
@@ -1555,6 +1690,7 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
free(h);
print_shared_cacheline_info(&hitm_stats, shared_clines);
+ print_c2c_shared_cacheline_report(&hitm_tree, &hitm_stats, &c2c->stats);
print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
cleanup:
--
1.7.11.7
This patch adds a bunch of stats that will be used later in post-processing
to determine where and with what frequency the HITMs are coming from.
Most of the stats are decoded from the data source response. Another
piece of the stats is tracking which cpu the record came in on.
In order to properly build a cpu map showing where interesting events are coming
from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.
As HITMs are most expensive when going across NUMA nodes, it only made sense
to create a quick cpu->NUMA lookup to use when processing the records.
Credit to Dick Fowles for determining which bits are important and how to
properly track them. Ported to perf by me.
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 327 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 326 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a9c536b..360fbcf 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,15 +5,54 @@
#include "util/parse-options.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/stat.h"
+#include "util/cpumap.h"
#include "util/debug.h"
#include <linux/compiler.h>
#include <linux/kernel.h>
+#include <sched.h>
+
+typedef struct {
+ int locks; /* count of 'lock' transactions */
+ int store; /* count of all stores in trace */
+ int st_uncache; /* stores to uncacheable address */
+ int st_noadrs; /* cacheable store with no address */
+ int st_l1hit; /* count of stores that hit L1D */
+ int st_l1miss; /* count of stores that miss L1D */
+ int load; /* count of all loads in trace */
+ int ld_excl; /* exclusive loads, rmt/lcl DRAM - snp none/miss */
+ int ld_shared; /* shared loads, rmt/lcl DRAM - snp hit */
+ int ld_uncache; /* loads to uncacheable address */
+ int ld_noadrs; /* cacheable load with no address */
+ int ld_fbhit; /* count of loads hitting Fill Buffer */
+ int ld_l1hit; /* count of loads that hit L1D */
+ int ld_l2hit; /* count of loads that hit L2D */
+ int ld_llchit; /* count of loads that hit LLC */
+ int lcl_hitm; /* count of loads with local HITM */
+ int rmt_hitm; /* count of loads with remote HITM */
+ int rmt_hit; /* count of loads with remote hit clean; */
+ int lcl_dram; /* count of loads miss to local DRAM */
+ int rmt_dram; /* count of loads miss to remote DRAM */
+ int nomap; /* count of load/stores with no phys adrs */
+ int remap; /* count of virt->phys remappings */
+} trinfo_t;
+
+struct c2c_stats {
+ cpu_set_t cpuset;
+ int nr_entries;
+ u64 total_period;
+ trinfo_t t;
+ struct stats stats;
+};
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
struct rb_root tree_physid;
+
+ /* stats */
+ struct c2c_stats stats;
};
#define REGION_SAME 1 << 0;
@@ -31,6 +70,179 @@ struct c2c_entry {
enum { OP, LVL, SNP, LCK, TLB };
+#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
+#define RMT_LLC (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
+
+#define L1CACHE_HIT(a) (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_HIT))
+#define FILLBUF_HIT(a) (((a) & PERF_MEM_LVL_LFB) && ((a) & PERF_MEM_LVL_HIT))
+#define L2CACHE_HIT(a) (((a) & PERF_MEM_LVL_L2 ) && ((a) & PERF_MEM_LVL_HIT))
+#define L3CACHE_HIT(a) (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_HIT))
+
+#define L1CACHE_MISS(a) (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_MISS))
+#define L3CACHE_MISS(a) (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_MISS))
+
+#define LD_UNCACHED(a) (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+#define ST_UNCACHED(a) (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+
+#define RMT_LLCHIT(a) (((a) & RMT_LLC) && ((a) & PERF_MEM_LVL_HIT))
+#define RMT_HIT(a,b) (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HIT))
+#define RMT_HITM(a,b) (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HITM))
+#define RMT_MEM(a) (((a) & RMT_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+#define LCL_HIT(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HIT))
+#define LCL_HITM(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
+#define LCL_MEM(a) (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+static int max_cpu_num;
+static int max_node_num;
+static int *cpunode_map;
+
+#define PATH_SYS_NODE "/sys/devices/system/node"
+
+/* Determine highest possible cpu in the system for sparse allocation */
+static void set_max_cpu_num(void)
+{
+ FILE *fp;
+ char buf[256];
+ int num;
+
+ /* set up default */
+ max_cpu_num = 4096;
+
+ /* get the highest possible cpu number for a sparse allocation */
+ fp = fopen("/sys/devices/system/cpu/possible", "r");
+ if (!fp)
+ goto out;
+
+ num = fread(&buf, 1, sizeof(buf), fp);
+ if (!num)
+ goto out_close;
+ buf[num] = '\0';
+
+ /* start on the right, to find highest cpu num */
+ while (--num) {
+ if ((buf[num] == ',') || (buf[num] == '-')) {
+ num++;
+ break;
+ }
+ }
+ if (sscanf(&buf[num], "%d", &max_cpu_num) < 1)
+ goto out_close;
+
+ max_cpu_num++;
+
+ fclose(fp);
+ return;
+
+out_close:
+ fclose(fp);
+out:
+ pr_err("Failed to read max cpus, using default of %d\n",
+ max_cpu_num);
+ return;
+}
+
+/* Determine highest possible node in the system for sparse allocation */
+static void set_max_node_num(void)
+{
+ FILE *fp;
+ char buf[256];
+ int num;
+
+ /* set up default */
+ max_node_num = 8;
+
+ /* get the highest possible cpu number for a sparse allocation */
+ fp = fopen("/sys/devices/system/node/possible", "r");
+ if (!fp)
+ goto out;
+
+ num = fread(&buf, 1, sizeof(buf), fp);
+ if (!num)
+ goto out_close;
+ buf[num] = '\0';
+
+ /* start on the right, to find highest node num */
+ while (--num) {
+ if ((buf[num] == ',') || (buf[num] == '-')) {
+ num++;
+ break;
+ }
+ }
+ if (sscanf(&buf[num], "%d", &max_node_num) < 1)
+ goto out_close;
+
+ max_node_num++;
+
+ fclose(fp);
+ return;
+
+out_close:
+ fclose(fp);
+out:
+ pr_err("Failed to read max nodes, using default of %d\n",
+ max_node_num);
+ return;
+}
+
+static int init_cpunode_map(void)
+{
+ int i;
+
+ set_max_cpu_num();
+ set_max_node_num();
+
+ cpunode_map = calloc(max_cpu_num, sizeof(int));
+ if (!cpunode_map) {
+ pr_err("%s: calloc failed\n", __func__);
+ goto out;
+ }
+
+ for (i = 0; i < max_cpu_num; i++)
+ cpunode_map[i] = -1;
+
+ return 0;
+out:
+ return -1;
+}
+
+static int setup_cpunode_map(void)
+{
+ struct dirent *dent1, *dent2;
+ DIR *dir1, *dir2;
+ unsigned int cpu, mem;
+ char buf[PATH_MAX];
+
+ /* initialize globals */
+ if (init_cpunode_map())
+ return -1;
+
+ dir1 = opendir(PATH_SYS_NODE);
+ if (!dir1)
+ return 0;
+
+ /* walk tree and setup map */
+ while ((dent1 = readdir(dir1)) != NULL) {
+ if (dent1->d_type != DT_DIR ||
+ sscanf(dent1->d_name, "node%u", &mem) < 1)
+ continue;
+
+ snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
+ dir2 = opendir(buf);
+ if (!dir2)
+ continue;
+ while ((dent2 = readdir(dir2)) != NULL) {
+ if (dent2->d_type != DT_LNK ||
+ sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
+ continue;
+ cpunode_map[cpu] = mem;
+ }
+ closedir(dir2);
+ }
+ closedir(dir1);
+ return 0;
+}
+
static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
{
#define PREFIX "["
@@ -303,17 +515,120 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
return entry;
}
+static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
+{
+ union perf_mem_data_src *data_src = &entry->mi->data_src;
+ u64 daddr = entry->mi->daddr.addr;
+ u64 weight = entry->weight;
+ int err = 0;
+
+ u64 op = data_src->mem_op;
+ u64 lvl = data_src->mem_lvl;
+ u64 snoop = data_src->mem_snoop;
+ u64 lock = data_src->mem_lock;
+
+#define P(a,b) PERF_MEM_##a##_##b
+
+ stats->nr_entries++;
+ stats->total_period += entry->period;
+
+ if (lock & P(LOCK,LOCKED)) stats->t.locks++;
+
+ if (op & P(OP,LOAD)) {
+ stats->t.load++;
+
+ if (!daddr) {
+ stats->t.ld_noadrs++;
+ return -1;
+ }
+
+ if (lvl & P(LVL,HIT)) {
+ if (lvl & P(LVL,UNC)) stats->t.ld_uncache++;
+ if (lvl & P(LVL,LFB)) stats->t.ld_fbhit++;
+ if (lvl & P(LVL,L1 )) stats->t.ld_l1hit++;
+ if (lvl & P(LVL,L2 )) stats->t.ld_l2hit++;
+ if (lvl & P(LVL,L3 )) {
+ if (snoop & P(SNOOP,HITM))
+ stats->t.lcl_hitm++;
+ else
+ stats->t.ld_llchit++;
+ }
+
+ if (lvl & P(LVL,LOC_RAM)) {
+ stats->t.lcl_dram++;
+ if (snoop & P(SNOOP,HIT))
+ stats->t.ld_shared++;
+ else
+ stats->t.ld_excl++;
+ }
+
+ if ((lvl & P(LVL,REM_RAM1)) ||
+ (lvl & P(LVL,REM_RAM2))) {
+ stats->t.rmt_dram++;
+ if (snoop & P(SNOOP,HIT))
+ stats->t.ld_shared++;
+ else
+ stats->t.ld_excl++;
+ }
+ }
+
+ if ((lvl & P(LVL,REM_CCE1)) ||
+ (lvl & P(LVL,REM_CCE2))) {
+ if (snoop & P(SNOOP, HIT))
+ stats->t.rmt_hit++;
+ else if (snoop & P(SNOOP, HITM)) {
+ stats->t.rmt_hitm++;
+ update_stats(&stats->stats, weight);
+ }
+ }
+
+ } else if (op & P(OP,STORE)) {
+ /* store */
+ stats->t.store++;
+
+ if (!daddr) {
+ stats->t.st_noadrs++;
+ return -1;
+ }
+
+ if (lvl & P(LVL,HIT)) {
+ if (lvl & P(LVL,UNC)) stats->t.st_uncache++;
+ if (lvl & P(LVL,L1 )) stats->t.st_l1hit++;
+ }
+ if (lvl & P(LVL,MISS))
+ if (lvl & P(LVL,L1)) stats->t.st_l1miss++;
+ } else {
+ /* unparsable data_src? */
+ return -1;
+ }
+
+ if (!entry->mi->daddr.map || !entry->mi->iaddr.map)
+ return -1;
+
+ return err;
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct perf_sample *sample __maybe_unused,
struct c2c_entry *entry)
{
+ int err = 0;
+
+ err = c2c_decode_stats(&c2c->stats, entry);
+ if (err < 0) {
+ err = 1;
+ goto err;
+ }
+ err = 0;
+
c2c_entry__add_to_list(c2c, entry);
/* don't lose the maps if remapped */
entry->mi->iaddr.map->referenced = true;
entry->mi->daddr.map->referenced = true;
- return 0;
+err:
+ return err;
}
static const struct perf_evsel_str_handler handlers[] = {
@@ -403,6 +718,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
goto out;
}
+ if (symbol__init() < 0)
+ goto out_delete;
+
if (perf_evlist__set_handlers(session->evlist, handlers))
goto out_delete;
@@ -416,6 +734,13 @@ out:
static int perf_c2c__init(struct perf_c2c *c2c)
{
+ /* setup cpu map */
+ if (setup_cpunode_map() < 0) {
+ pr_err("can not setup cpu map\n");
+ return -1;
+ }
+
+ CPU_ZERO(&c2c->stats.cpuset);
c2c->tree_physid = RB_ROOT;
return 0;
--
1.7.11.7
This just takes the previously calculated extreme latencies and prints them
in a pretty table, with the cacheline and its offsets exposed to help
further understand where they are coming from.
Original work done by Dick Fowles, ported to perf by me.
Suggested-by: Joe Mario <[email protected]>
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 265 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 265 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index b1d4a8b..1fa21b4 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -76,6 +76,7 @@ struct perf_c2c {
struct c2c_entry {
struct rb_node rb_node;
struct rb_node latency;
+ struct rb_node latency_scratch;
struct list_head scratch; /* scratch list for resorting */
struct thread *thread;
int tid; /* FIXME perf maps broken */
@@ -571,6 +572,62 @@ static int c2c_latency__add_to_list(struct rb_root *root, struct c2c_entry *n)
return 0;
}
+static struct c2c_entry *c2c_latency__add_to_list_physid(struct rb_root *root,
+ struct c2c_entry *entry)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_entry *ce;
+ int64_t cmp;
+
+ p = &root->rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ ce = rb_entry(parent, struct c2c_entry, latency_scratch);
+
+ cmp = physid_cmp(ce, entry);
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&entry->latency_scratch, parent, p);
+ rb_insert_color(&entry->latency_scratch, root);
+
+ return entry;
+}
+
+static int c2c_latency__add_to_list_count(struct rb_root *root,
+ struct c2c_hit *h)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_hit *he;
+ int64_t cmp;
+
+ p = &root->rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ he = rb_entry(parent, struct c2c_hit, rb_node);
+
+ cmp = h->stats.stats.n - he->stats.stats.n;
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&h->rb_node, parent, p);
+ rb_insert_color(&h->rb_node, root);
+
+ return 0;
+}
+
static int perf_c2c__fprintf_header(FILE *fp)
{
int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
@@ -1107,6 +1164,209 @@ cleanup:
}
}
+static void print_latency_select_cacheline_offset(struct c2c_hit *offset,
+ int total)
+{
+ struct stats *s = &offset->stats.stats;
+ struct addr_map_symbol *ams = &offset->mi->iaddr;
+
+ printf("%5s %6s %6s %7.1f%% %14s0x%02lx %#18lx %8ld %7.1f %8ld %7.1f %7.1f%% %-30s %-20s\n",
+ " ",
+ " ",
+ " ",
+ ((double) s->n / (double)total) * 100.0,
+ " ",
+ (cloffset == LVL2) ? (offset->mi->daddr.addr & 0xff) : CLOFFSET(offset->mi->daddr.addr),
+ offset->mi->iaddr.addr,
+ s->min,
+ 0.0,
+ s->max,
+ avg_stats(s),
+ (stddev_stats(s)/avg_stats(s) * 100.0),
+ (ams->sym ? ams->sym->name : "?????"),
+ ams->map->dso->short_name);
+}
+
+static void print_latency_select_header(void)
+{
+#define EXCESS_LATENCY_TITLE "Non Shared Data Loads With Excessive Execution Latency"
+
+ static char delimit[MAXTITLE_SZ];
+ static char title[MAXTITLE_SZ];
+ int pad;
+ int i;
+
+ sprintf(title, "%5s %6s %6s %8s %18s %18s %8s %8s %8s %8s %8s %-30s %-20s",
+ "Num",
+ "%dist",
+ "%cumm",
+ "Count",
+ "Data Address",
+ "Inst Address",
+ "Min",
+ "Median",
+ "Max",
+ "Mean",
+ "CV",
+ "Symbol",
+ "Object");
+
+ memset(delimit, 0, sizeof(delimit));
+ for (i = 0; i < (int)strlen(title); i++) delimit[i] = '=';
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+
+ pad = (strlen(title)/2) - (strlen(EXCESS_LATENCY_TITLE)/2);
+ for (i = 0; i < pad; i++) printf(" ");
+ printf("%s\n", EXCESS_LATENCY_TITLE);
+ printf("\n");
+
+ printf("%5s %6s %6s %8s %18s %18s %44s %-30s %-20s\n",
+ " ",
+ " ",
+ " ",
+ "Load",
+ " ",
+ " ",
+ "------ Load Inst Execute Latency ------",
+ " ",
+ " ");
+
+ printf("%s\n", title);
+ printf("%s\n", delimit);
+}
+
+static void print_latency_select_info(struct rb_root *root,
+ struct c2c_stats *stats)
+{
+#define XLAT_DIST_LIMIT 0.1
+
+ struct rb_node *next = rb_first(root);
+ struct c2c_hit *h, *clo = NULL;
+ struct c2c_entry *entry;
+ double tot_dist, tot_cumm;
+ int idx = 0, j;
+ static char delimit[MAXTITLE_SZ];
+ static char summary[MAXTITLE_SZ];
+
+ print_latency_select_header();
+
+ tot_cumm = 0.0;
+
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+
+ tot_dist = ((double)h->stats.stats.n / stats->stats.n);
+ tot_cumm += tot_dist;
+
+ /*
+ * don't display lines with insignificant sharing contribution
+ */
+ if (tot_dist*100.0 < XLAT_DIST_LIMIT)
+ break;
+
+ sprintf(summary, "%5d %5.1f%% %5.1f%% %8d %#18lx",
+ idx,
+ tot_dist*100.0,
+ tot_cumm*100.0,
+ (int)h->stats.stats.n,
+ h->cacheline);
+
+ if (delimit[0] != '-') {
+ memset(delimit, 0, sizeof(delimit));
+ for (j = 0; j < (int)strlen(summary); j++) delimit[j] = '-';
+ }
+
+ printf("%s\n", delimit);
+ printf("%s\n", summary);
+ printf("%s\n", delimit);
+
+ list_for_each_entry(entry, &h->list, scratch) {
+
+ if (!clo || !matching_coalescing(clo, entry)) {
+ u64 addr;
+
+ if (clo)
+ print_latency_select_cacheline_offset(clo, h->stats.stats.n);
+
+ free(clo);
+ addr = entry->mi->iaddr.al_addr;
+ clo = c2c_hit__new(addr, entry);
+ }
+ update_stats(&clo->stats.stats, entry->weight);
+ }
+ if (clo) {
+ print_latency_select_cacheline_offset(clo, h->stats.stats.n);
+ free(clo);
+ clo = NULL;
+ }
+
+ idx++;
+ }
+ printf("\n\n");
+}
+
+static void calculate_latency_selected_info(struct rb_root *root,
+ struct rb_node *start,
+ struct c2c_stats *lat_stats)
+{
+ struct rb_node *next = start;
+ struct rb_root lat_tree = RB_ROOT;
+ struct c2c_hit *h = NULL;
+ struct c2c_entry *n;
+ u64 cl;
+
+ /* new sort of 'selected' tree using physid_cmp */
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, latency);
+ next = rb_next(&n->latency);
+
+ c2c_latency__add_to_list_physid(&lat_tree, n);
+ }
+
+ /* resort based on number of entries in each cacheline */
+ next = rb_first(&lat_tree);
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, latency_scratch);
+ next = rb_next(&n->latency_scratch);
+
+ cl = n->mi->daddr.al_addr;
+
+ /* switch cache line objects */
+ /* 'color' forces a boundary change based on the original sort */
+ if (!h || !n->color || (CLADRS(cl) != h->cacheline)) {
+ if (h)
+ c2c_latency__add_to_list_count(root, h);
+
+ h = c2c_hit__new(CLADRS(cl), n);
+ if (!h)
+ goto cleanup;
+ }
+
+ update_stats(&h->stats.stats, n->weight);
+ update_stats(&lat_stats->stats, n->weight);
+
+ /* save the entry for later processing */
+ list_add_tail(&n->scratch, &h->list);
+ }
+ /* last chunk */
+ if (h)
+ c2c_latency__add_to_list_count(root, h);
+ return;
+
+cleanup:
+ next = rb_first(root);
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+ rb_erase(&h->rb_node, root);
+
+ free(h);
+ }
+}
+
stats_t data[] = {
{ "Samples ", "%20d", &hist_info[OVERALL].cnt, &hist_info[EXTREMES].cnt, &hist_info[ANALYZE].cnt },
{ " ", NULL, NULL, NULL, NULL },
@@ -1471,6 +1731,8 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
struct c2c_stats lat_stats;
u64 snoop;
struct stats s;
+ int i;
+ struct rb_root lat_select_tree = RB_ROOT;
init_stats(&s);
memset(&lat_stats, 0, sizeof(struct c2c_stats));
@@ -1500,6 +1762,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
calculate_latency_info(&lat_tree, &s, overall, extremes, selected);
print_latency_info();
+ calculate_latency_selected_info(&lat_select_tree, selected->start, &lat_stats);
+ print_latency_select_info(&lat_select_tree, &lat_stats);
+
return;
}
--
1.7.11.7
The overall goal of the cache-to-cache contention tool is to find extreme latencies and
help point out the problem so they can be fixed. The big assumption is that remote cache
hits cause the biggest contentions.
Those are summarized by previous patches. However, we still have non-remote cache hits with
high latency. We display those here in a table: we identify the outliers (samples beyond
mean + 3 standard deviations, applied iteratively) and focus on them.
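The outlier selection below is a repeated three-sigma cut; a tiny sketch of
the idea (the struct mirrors util/stat.c, the helper name is illustrative):

#include <math.h>

struct stats { double n, mean, M2; };   /* mirrors util/stat.c */

/* Each scope keeps only the samples above mean + 3 * stddev of the
 * enclosing scope: "Extremes" is the tail of "Overall", and
 * "Selected" is in turn the tail of "Extremes". */
static double next_threshold(struct stats *s)
{
        double stddev = s->n < 2 ? 0.0 : sqrt(s->M2 / (s->n - 1));

        return s->mean + 3.0 * stddev;
}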
Original work done by Dick Fowles. I just ported it over to the perf framework.
Suggested-by: Joe Mario <[email protected]>
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 456 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 456 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 014c9b0..b1d4a8b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -12,6 +12,7 @@
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <sched.h>
+#include <math.h>
typedef struct {
int locks; /* count of 'lock' transactions */
@@ -46,6 +47,20 @@ struct c2c_stats {
struct stats stats;
};
+struct c2c_latency_stats {
+ int min;
+ int mode;
+ int max;
+ int cnt;
+ int thrshld;
+ double median;
+ double stddev;
+ double cv;
+ double ci;
+ double mean;
+ struct rb_node *start;
+};
+
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
@@ -60,6 +75,7 @@ struct perf_c2c {
struct c2c_entry {
struct rb_node rb_node;
+ struct rb_node latency;
struct list_head scratch; /* scratch list for resorting */
struct thread *thread;
int tid; /* FIXME perf maps broken */
@@ -97,6 +113,21 @@ struct c2c_hit {
struct callchain_root callchain[0]; /* must be last member */
};
+typedef struct {
+ const char *label;
+ const char *fmt;
+ void *overall;
+ void *extremes;
+ void *analyze;
+} stats_t;
+
+
+enum { EMPTY, SYMBOL, OBJECT };
+enum { OVERALL, EXTREMES, ANALYZE, SCOPES };
+enum { INVALID, NONE, INTEGER, DOUBLE };
+
+struct c2c_latency_stats hist_info[SCOPES];
+
enum { OP, LVL, SNP, LCK, TLB };
#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -512,6 +543,34 @@ static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
return 0;
}
+static int c2c_latency__add_to_list(struct rb_root *root, struct c2c_entry *n)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_entry *ne;
+ int64_t cmp;
+
+ p = &root->rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ ne = rb_entry(parent, struct c2c_entry, latency);
+
+ /* sort on weight */
+ cmp = ne->weight - n->weight;
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&n->latency, parent, p);
+ rb_insert_color(&n->latency, root);
+
+ return 0;
+}
+
static int perf_c2c__fprintf_header(FILE *fp)
{
int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
@@ -1048,6 +1107,402 @@ cleanup:
}
}
+stats_t data[] = {
+ { "Samples ", "%20d", &hist_info[OVERALL].cnt, &hist_info[EXTREMES].cnt, &hist_info[ANALYZE].cnt },
+ { " ", NULL, NULL, NULL, NULL },
+ { "Minimum ", "%20d", &hist_info[OVERALL].min, &hist_info[EXTREMES].min, &hist_info[ANALYZE].min },
+ { "Maximum ", "%20d", &hist_info[OVERALL].max, &hist_info[EXTREMES].max, &hist_info[ANALYZE].max },
+ { "Threshold ", "%20d", &hist_info[OVERALL].thrshld, &hist_info[EXTREMES].thrshld, &hist_info[ANALYZE].thrshld },
+ { " ", NULL, NULL, NULL, NULL },
+ { "Mode ", "%20d", &hist_info[OVERALL].mode, &hist_info[EXTREMES].mode, &hist_info[ANALYZE].mode },
+ { "Median ", "%20.0f", &hist_info[OVERALL].median, &hist_info[EXTREMES].median, &hist_info[ANALYZE].median },
+ { "Mean ", "%20.0f", &hist_info[OVERALL].mean, &hist_info[EXTREMES].mean, &hist_info[ANALYZE].mean },
+ { " ", NULL, NULL, NULL, NULL },
+ { "Std Dev ", "%20.1f", &hist_info[OVERALL].stddev, &hist_info[EXTREMES].stddev, &hist_info[ANALYZE].stddev },
+ { "Coeff of Variation", "%20.3f", &hist_info[OVERALL].cv, &hist_info[EXTREMES].cv, &hist_info[ANALYZE].cv },
+ { "Confid Interval ", "%20.1f", &hist_info[OVERALL].ci, &hist_info[EXTREMES].ci, &hist_info[ANALYZE].ci },
+};
+
+#define STATS_ENTRIES (sizeof(data) / sizeof(stats_t))
+enum { C90, C95 };
+
+static double tdist(int ci, int num_samples)
+{
+ #define MAX_ENTRIES 32
+ #define INFINITE_SAMPLES (MAX_ENTRIES-1)
+
+ /*
+ * Student's T-distribution for 90% & 95% confidence intervals
+ * The last entry is the value for infinite degrees of freedom
+ */
+
+ static double t_dist[MAX_ENTRIES][2] = {
+ { NAN, NAN }, /* 0 */
+ { 6.31, 12.71 }, /* 1 */
+ { 2.92, 4.30 }, /* 2 */
+ { 2.35, 3.18 }, /* 3 */
+ { 2.13, 2.78 }, /* 4 */
+ { 2.02, 2.57 }, /* 5 */
+ { 1.94, 2.45 }, /* 6 */
+ { 1.90, 2.36 }, /* 7 */
+ { 1.86, 2.31 }, /* 8 */
+ { 1.83, 2.26 }, /* 9 */
+ { 1.81, 2.23 }, /* 10 */
+ { 1.80, 2.20 }, /* 11 */
+ { 1.78, 2.18 }, /* 12 */
+ { 1.77, 2.16 }, /* 13 */
+ { 1.76, 2.14 }, /* 14 */
+ { 1.75, 2.13 }, /* 15 */
+ { 1.75, 2.12 }, /* 16 */
+ { 1.74, 2.11 }, /* 17 */
+ { 1.73, 2.10 }, /* 18 */
+ { 1.73, 2.09 }, /* 19 */
+ { 1.72, 2.09 }, /* 20 */
+ { 1.72, 2.08 }, /* 21 */
+ { 1.72, 2.07 }, /* 22 */
+ { 1.71, 2.07 }, /* 23 */
+ { 1.71, 2.06 }, /* 24 */
+ { 1.71, 2.06 }, /* 25 */
+ { 1.71, 2.06 }, /* 26 */
+ { 1.70, 2.05 }, /* 27 */
+ { 1.70, 2.05 }, /* 28 */
+ { 1.70, 2.04 }, /* 29 */
+ { 1.70, 2.04 }, /* 30 */
+ { 1.645, 1.96 }, /* 31 */
+ };
+
+ double tvalue;
+
+ tvalue = 0;
+
+ switch (ci) {
+
+ case C90: /* 90% CI */
+ tvalue = ((num_samples-1) > 30) ? t_dist[INFINITE_SAMPLES][ci] : t_dist[num_samples-1][ci];
+ break;
+
+ case C95: /* 95% CI */
+ tvalue = ((num_samples-1) > 30) ? t_dist[INFINITE_SAMPLES][ci] : t_dist[num_samples-1][ci];
+ break;
+
+ default:
+ printf("internal error - invalid confidence interval value specified");
+ break;
+ }
+
+ return tvalue;
+}
+
+static inline void print_latency_info_header(void)
+{
+ const char *title;
+ char hdrstr[256];
+ int twidth;
+ int size;
+ int pad;
+ int i;
+
+ twidth = sprintf(hdrstr, "%-20s%20s%20s%20s",
+ "Metric", "Overall", "Extremes", "Selected");
+ title = "Execution Latency For Loads to Non Shared Memory";
+ size = strlen(title);
+ pad = (twidth - size)/2;
+
+ printf("\n\n");
+ for (i = 0; i < twidth; i++) printf("=");
+ printf("\n");
+
+ if (pad > 0) {
+ for (i = 0; i < pad; i++) printf(" ");
+ }
+ printf("%s\n", title);
+ printf("\n");
+
+ printf("%s\n", hdrstr);
+
+ for (i = 0; i < twidth; i++) printf("=");
+ printf("\n");
+}
+
+static void print_latency_info(void)
+{
+#define LBLFMT "%-20s"
+
+ char fmtstr[32];
+ int i, dtype;
+ stats_t *ptr;
+
+ print_latency_info_header();
+
+ for (i = 0; i < (int)STATS_ENTRIES; i++) {
+
+ ptr = &data[i];
+
+ dtype = INVALID;
+
+ if (ptr->fmt == NULL) {
+
+ dtype = EMPTY;
+
+ } else {
+
+ if (strchr(ptr->fmt, 'd') != NULL) dtype = INTEGER;
+ if (strchr(ptr->fmt, 'f') != NULL) dtype = DOUBLE;
+
+ strcpy(fmtstr, ptr->fmt);
+ strtok(fmtstr, ".d");
+ strcat(fmtstr, "s");
+
+ }
+
+ switch (dtype) {
+
+ case INTEGER:
+ printf(LBLFMT, ptr->label);
+ (ptr->overall != NULL) ? printf(ptr->fmt, *((int *)ptr->overall)) : printf(fmtstr, "na");
+ (ptr->extremes != NULL) ? printf(ptr->fmt, *((int *)ptr->extremes)) : printf(fmtstr, "na");
+ (ptr->analyze != NULL) ? printf(ptr->fmt, *((int *)ptr->analyze)) : printf(fmtstr, "na");
+ printf("\n");
+ break;
+
+
+ case DOUBLE:
+ printf(LBLFMT, ptr->label);
+ (ptr->overall != NULL) ? printf(ptr->fmt, *((double *)ptr->overall)) : printf(fmtstr, "na");
+ (ptr->extremes != NULL) ? printf(ptr->fmt, *((double *)ptr->extremes)) : printf(fmtstr, "na");
+ (ptr->analyze != NULL) ? printf(ptr->fmt, *((double *)ptr->analyze)) : printf(fmtstr, "na");
+ printf("\n");
+ break;
+
+
+ case EMPTY:
+ printf("\n");
+ break;
+
+
+ default:
+ printf("internal error - unsupported formtat specifier : %s\n", ptr->fmt);
+ break;
+
+ };
+
+ }
+ printf("\n\n");
+}
+
+static void calculate_latency_info(struct rb_root *tree,
+ struct stats *stats,
+ struct c2c_latency_stats *overall,
+ struct c2c_latency_stats *extremes,
+ struct c2c_latency_stats *selected)
+{
+ struct rb_node *next = rb_first(tree);
+ struct rb_node *start = NULL;
+ struct c2c_entry *n;
+ int count = 0, weight = 0;
+ int mode = 0, mode_count = 0, idx = 0;
+ int median;
+ double threshold;
+ struct stats s;
+
+
+ median = stats->n / 2.0;
+
+ overall->cnt = stats->n;
+ overall->min = stats->min;
+ overall->max = stats->max;
+ overall->thrshld = 0;
+ overall->mean = avg_stats(stats);
+ overall->stddev = stddev_stats(stats);
+ overall->cv = overall->stddev / overall->mean;
+ overall->ci = (tdist(C90, stats->n) * stddev_stats(stats)) / sqrt(stats->n);
+ overall->start = next;
+
+ /* set threshold to mean + 3 * stddev of stats */
+ threshold = avg_stats(stats) + 3 * stddev_stats(stats);
+ init_stats(&s);
+
+ /* calculate overall latency */
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, latency);
+ next = rb_next(&n->latency);
+
+ /* sorted on weight, makes counting easy, look for boundary */
+ if (n->weight != weight) {
+ if (count > mode_count) {
+ mode = weight;
+ mode_count = count;
+ }
+ count = 0;
+ weight = n->weight;
+ }
+ count++;
+
+ if (idx == median)
+ overall->median = n->weight;
+
+ /* save start for extreme latency calculation */
+ if (n->weight > threshold) {
+ if (!start)
+ start = next;
+
+ update_stats(&s, n->weight);
+ }
+
+ idx++;
+ }
+ /* count last set */
+ if (count > mode_count)
+ mode = weight;
+
+ overall->mode = mode;
+
+ /* calculate extreme latency */
+ next = start;
+ start = NULL;
+ idx = 0;
+ count = 0;
+ mode_count = 0;
+ mode = 0;
+ weight = 0;
+ median = s.n / 2.0;
+
+ extremes->cnt = s.n;
+ extremes->min = s.min;
+ extremes->max = s.max;
+ extremes->thrshld = threshold;
+ extremes->mean = avg_stats(&s);
+ extremes->stddev = stddev_stats(&s);
+ extremes->cv = extremes->stddev / extremes->mean;
+ extremes->ci = (tdist(C90, s.n) * stddev_stats(&s)) / sqrt(s.n);
+ extremes->start = next;
+
+ /* set threshold to mean + 3 * stddev of stats */
+ threshold = avg_stats(&s) + 3 * stddev_stats(&s);
+ init_stats(&s);
+
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, latency);
+ next = rb_next(&n->latency);
+
+ /* sorted on weight, makes counting easy, look for boundary */
+ if (n->weight != weight) {
+ if (count > mode_count) {
+ mode = weight;
+ mode_count = count;
+ }
+ count = 0;
+ weight = n->weight;
+ }
+ count++;
+
+ if (idx == median)
+ extremes->median = n->weight;
+
+ /* save start for extreme latency calculation */
+ if (n->weight > threshold) {
+ if (!start)
+ start = next;
+
+ update_stats(&s, n->weight);
+ }
+
+ idx++;
+ }
+ /* count last set */
+ if (count > mode_count)
+ mode = weight;
+
+ extremes->mode = mode;
+
+ /* calculate analyze latency */
+ next = start;
+ idx = 0;
+ count = 0;
+ mode_count = 0;
+ mode = 0;
+ weight = 0;
+ median = s.n / 2.0;
+
+ selected->cnt = s.n;
+ selected->min = s.min;
+ selected->max = s.max;
+ selected->thrshld = threshold;
+ selected->mean = avg_stats(&s);
+ selected->stddev = stddev_stats(&s);
+ selected->cv = selected->stddev / selected->mean;
+ selected->ci = (tdist(C90, s.n) * stddev_stats(stats)) / sqrt(s.n);
+ selected->start = next;
+
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, latency);
+ next = rb_next(&n->latency);
+
+ /* sorted on weight, makes counting easy, look for boundary */
+ if (n->weight != weight) {
+ if (count > mode_count) {
+ mode = weight;
+ mode_count = count;
+ }
+ count = 0;
+ weight = n->weight;
+ }
+ count++;
+
+ if (idx == median)
+ selected->median = n->weight;
+
+ idx++;
+ }
+ /* count last set */
+ if (count > mode_count)
+ mode = weight;
+
+ selected->mode = mode;
+}
+
+static void c2c_analyze_latency(struct perf_c2c *c2c)
+{
+
+ struct rb_node *next = rb_first(&c2c->tree_physid);
+ struct c2c_entry *n;
+ struct c2c_latency_stats *overall, *extremes, *selected;
+ struct rb_root lat_tree = RB_ROOT;
+ struct c2c_stats lat_stats;
+ u64 snoop;
+ struct stats s;
+
+ init_stats(&s);
+ memset(&lat_stats, 0, sizeof(struct c2c_stats));
+ memset(&hist_info, 0, sizeof(struct c2c_latency_stats) * SCOPES);
+
+ overall = &hist_info[OVERALL];
+ extremes = &hist_info[EXTREMES];
+ selected = &hist_info[ANALYZE];
+
+ /* sort on latency */
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, rb_node);
+ next = rb_next(&n->rb_node);
+
+ snoop = n->mi->data_src.mem_snoop;
+
+ /* filter out HITs as un-interesting */
+ if ((snoop & P(SNOOP, HIT)) ||
+ (snoop & P(SNOOP, HITM)) ||
+ (snoop & P(SNOOP, NA)))
+ continue;
+
+ c2c_latency__add_to_list(&lat_tree, n);
+ update_stats(&s, n->weight);
+ }
+
+ calculate_latency_info(&lat_tree, &s, overall, extremes, selected);
+ print_latency_info();
+
+ return;
+}
+
static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
struct c2c_stats *shared_stats __maybe_unused,
struct c2c_stats *c2c_stats __maybe_unused)
@@ -1761,6 +2216,7 @@ static int perf_c2c__process_events(struct perf_session *session,
if (verbose > 2)
dump_rb_tree(&c2c->tree_physid, c2c);
print_c2c_trace_report(c2c);
+ c2c_analyze_latency(c2c);
c2c_analyze_hitms(c2c);
print_symbol_record_count(&c2c->tree_physid);
--
1.7.11.7
When perf tries to load the initial mmaps, it grabs data from the /proc/<pid>/maps files
and feeds it into perf as generated MMAP events. Unfortunately, when the system
has threads running, those thread maps are not generated (because perf doesn't know the
history of the fork events leading to the threads).
As a result, when trying to map a data source address (not an IP) to a thread map, the
lookup fails and returns NULL. Feeding perf the pid instead gets us the correct map;
however, the TID is now incorrect for the thread struct (it now has a PID for a TID).
Performing any cache-to-cache contention analysis leads to problems, so for now save the
TID for local use and do not rely on the perf maps' TID.
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 0760f6a..32c2319 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -62,6 +62,7 @@ struct c2c_entry {
struct rb_node rb_node;
struct list_head scratch; /* scratch list for resorting */
struct thread *thread;
+ int tid; /* FIXME perf maps broken */
struct mem_info *mi;
u32 cpu;
u8 cpumode;
@@ -397,8 +398,8 @@ static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
if (left->thread->pid_ > right->thread->pid_) return 1;
if (left->thread->pid_ < right->thread->pid_) return -1;
- if (left->thread->tid > right->thread->tid) return 1;
- if (left->thread->tid < right->thread->tid) return -1;
+ if (left->tid > right->tid) return 1;
+ if (left->tid < right->tid) return -1;
} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
/* kernel mapped areas where 'start' doesn't matter */
@@ -416,15 +417,15 @@ static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
if (left->thread->pid_ > right->thread->pid_) return 1;
if (left->thread->pid_ < right->thread->pid_) return -1;
- if (left->thread->tid > right->thread->tid) return 1;
- if (left->thread->tid < right->thread->tid) return -1;
+ if (left->tid > right->tid) return 1;
+ if (left->tid < right->tid) return -1;
} else {
/* userspace anonymous */
if (left->thread->pid_ > right->thread->pid_) return 1;
if (left->thread->pid_ < right->thread->pid_) return -1;
- if (left->thread->tid > right->thread->tid) return 1;
- if (left->thread->tid < right->thread->tid) return -1;
+ if (left->tid > right->tid) return 1;
+ if (left->tid < right->tid) return -1;
/* hack to mark similar regions, 'right' is new entry */
/* userspace anonymous address space is contained within pid */
@@ -574,6 +575,7 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
if (entry != NULL) {
entry->thread = thread;
entry->mi = mi;
+ entry->tid = sample->tid;
entry->cpu = sample->cpu;
entry->cpumode = cpumode;
entry->weight = sample->weight;
@@ -695,7 +697,7 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
h->tree = RB_ROOT;
h->cacheline = cacheline;
h->pid = entry->thread->pid_;
- h->tid = entry->thread->tid;
+ h->tid = entry->tid;
if (symbol_conf.use_callchain)
callchain_init(h->callchain);
@@ -713,7 +715,7 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
if (h->pid != n->thread->pid_)
h->pid = -1;
- if (h->tid != n->thread->tid)
+ if (h->tid != n->tid)
h->tid = -1;
/* use original addresses here, not adjusted al_addr */
@@ -743,7 +745,7 @@ static inline bool matching_coalescing(struct c2c_hit *h,
case LVL1:
value = ((h->daddr == mi->daddr.addr) &&
(h->pid == e->thread->pid_) &&
- (h->tid == e->thread->tid) &&
+ (h->tid == e->tid) &&
(h->iaddr == mi->iaddr.addr));
break;
@@ -775,7 +777,7 @@ static inline bool matching_coalescing(struct c2c_hit *h,
case LVL0:
value = ((h->daddr == mi->daddr.addr) &&
(h->pid == e->thread->pid_) &&
- (h->tid == e->thread->tid) &&
+ (h->tid == e->tid) &&
(h->iaddr == mi->iaddr.addr));
break;
@@ -850,7 +852,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
if (evsel->handler.func == NULL)
return 0;
- thread = machine__find_thread(machine, sample->tid);
+ thread = machine__find_thread(machine, sample->pid);
if (thread == NULL)
goto err;
--
1.7.11.7
In order for the c2c tool to work correctly, it needs to properly
sort all the records on uniquely identifiable data addresses. These
unique addresses are converted from the virtual addresses provided by the
hardware into a kernel address, using an mmap2 record as the decoder.
Once converted, the records can be sorted on these unique addresses based on
various rules. Then it becomes clear which addresses are overlapping
with each other across mmap regions or pid spaces.
This patch just creates the rules and inserts the records into an
rbtree for safe keeping until later patches process them.
The general sorting rule is:
o group cpumodes together
o group similar major, minor, inode, inode generation numbers together
o if (nonzero major/minor number - ie mmap'd areas)
o sort on data addresses
o sort on instruction address
o sort on pid
o sort on tid
o if cpumode is kernel
o sort on data addresses
o sort on instruction address
o sort on pid
o sort on tid
o else (private to pid space)
o sort on pid
o sort on tid
o sort on data addresses
o sort on instruction address
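Viewed another way, the rules above amount to comparing a composite key; a hedged
sketch of what that key contains (illustrative only — the real comparator,
physid_cmp(), is in the diff below):

/* Illustrative composite sort key, not actual patch code: */
struct physid_key {
	u8    cpumode;                 /* event type grouping */
	u32   maj, min;                /* backing file, if any */
	u64   ino, ino_generation;
	u64   daddr, iaddr;            /* adjusted al_addr values */
	pid_t pid, tid;                /* compared first for anon maps */
};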
I also hacked in the concept of 'color'. The purpose of that bit is to
provide a hint during later processing that a new unique address has been
encountered. Because later processing only checks the data addresses, there
is a theoretical scenario where similar sequential data addresses (when
walking the rbtree) could be misinterpreted as overlapping when in fact
they are not.
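As a hypothetical illustration of the aliasing the color bit guards against (the
addresses here are made up):

/*
 * pid 100, anon: daddr 0x7f0000001000
 * pid 200, anon: daddr 0x7f0000001000
 *
 * Both entries can end up adjacent in the rbtree with identical data
 * addresses, yet they live in different address spaces. Only the
 * 'color' mark set at insert time (REGION_SAME) tells later passes
 * whether a neighbor really shares the same region.
 */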
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 144 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index b062485..a9c536b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -13,15 +13,20 @@
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
+ struct rb_root tree_physid;
};
+#define REGION_SAME (1 << 0)
+
struct c2c_entry {
+ struct rb_node rb_node;
struct thread *thread;
struct mem_info *mi;
u32 cpu;
u8 cpumode;
int weight;
int period;
+ int color;
};
enum { OP, LVL, SNP, LCK, TLB };
@@ -96,6 +101,133 @@ static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
return printed;
}
+static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
+{
+ u64 l, r;
+ struct map *l_map = left->mi->daddr.map;
+ struct map *r_map = right->mi->daddr.map;
+
+ /* group event types together */
+ if (left->cpumode > right->cpumode) return 1;
+ if (left->cpumode < right->cpumode) return -1;
+
+ if (l_map->maj > r_map->maj) return 1;
+ if (l_map->maj < r_map->maj) return -1;
+
+ if (l_map->min > r_map->min) return 1;
+ if (l_map->min < r_map->min) return -1;
+
+ if (l_map->ino > r_map->ino) return 1;
+ if (l_map->ino < r_map->ino) return -1;
+
+ if (l_map->ino_generation > r_map->ino_generation) return 1;
+ if (l_map->ino_generation < r_map->ino_generation) return -1;
+
+ /*
+ * Addresses with no major/minor numbers are assumed to be
+ * anonymous in userspace. Sort those on pid then address.
+ *
+ * The kernel and non-zero major/minor mapped areas are
+ * assumed to be unity mapped. Sort those on address then pid.
+ */
+
+ /* al_addr does all the right addr - start + offset calculations */
+ l = left->mi->daddr.al_addr;
+ r = right->mi->daddr.al_addr;
+
+ if (l_map->maj || l_map->min) {
+ /* mmapped areas */
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* entries with same maj/min/ino/inogen are in same address space */
+ right->color = REGION_SAME;
+
+ if (l > r) return 1;
+ if (l < r) return -1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+ if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+
+ if (left->thread->pid_ > right->thread->pid_) return 1;
+ if (left->thread->pid_ < right->thread->pid_) return -1;
+
+ if (left->thread->tid > right->thread->tid) return 1;
+ if (left->thread->tid < right->thread->tid) return -1;
+ } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
+ /* kernel mapped areas where 'start' doesn't matter */
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* whole kernel region is in the same address space */
+ right->color = REGION_SAME;
+
+ if (l > r) return 1;
+ if (l < r) return -1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+ if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+
+ if (left->thread->pid_ > right->thread->pid_) return 1;
+ if (left->thread->pid_ < right->thread->pid_) return -1;
+
+ if (left->thread->tid > right->thread->tid) return 1;
+ if (left->thread->tid < right->thread->tid) return -1;
+ } else {
+ /* userspace anonymous */
+ if (left->thread->pid_ > right->thread->pid_) return 1;
+ if (left->thread->pid_ < right->thread->pid_) return -1;
+
+ if (left->thread->tid > right->thread->tid) return 1;
+ if (left->thread->tid < right->thread->tid) return -1;
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* userspace anonymous address space is contained within pid */
+ right->color = REGION_SAME;
+
+ if (l > r) return 1;
+ if (l < r) return -1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+ if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+ }
+
+ return 0;
+}
+static struct c2c_entry *c2c_entry__add_to_list(struct perf_c2c *c2c, struct c2c_entry *entry)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_entry *ce;
+ int64_t cmp;
+
+ p = &c2c->tree_physid.rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ ce = rb_entry(parent, struct c2c_entry, rb_node);
+
+ cmp = physid_cmp(ce, entry);
+
+ /* FIXME wrap this with a #ifdef debug or something */
+ if (!cmp)
+ if ((entry->mi->daddr.map != ce->mi->daddr.map) &&
+ !entry->mi->daddr.map->maj && !entry->mi->daddr.map->min)
+ pr_err("Similar entries have different maps\n");
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&entry->rb_node, parent, p);
+ rb_insert_color(&entry->rb_node, &c2c->tree_physid);
+
+ return entry;
+}
+
static int perf_c2c__fprintf_header(FILE *fp)
{
int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
@@ -171,10 +303,12 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
return entry;
}
-static int perf_c2c__process_load_store(struct perf_c2c *c2c __maybe_unused,
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct perf_sample *sample __maybe_unused,
struct c2c_entry *entry)
{
+ c2c_entry__add_to_list(c2c, entry);
+
/* don't lose the maps if remapped */
entry->mi->iaddr.map->referenced = true;
entry->mi->daddr.map->referenced = true;
@@ -280,10 +414,19 @@ out:
return err;
}
+static int perf_c2c__init(struct perf_c2c *c2c)
+{
+ c2c->tree_physid = RB_ROOT;
+
+ return 0;
+}
static int perf_c2c__report(struct perf_c2c *c2c)
{
setup_pager();
+ if (perf_c2c__init(c2c))
+ return -1;
+
if (c2c->raw_records)
perf_c2c__fprintf_header(stdout);
--
1.7.11.7
Output some summary stats based on the processed records.
Mainly for diagnostic use.
Stats done by Dick Fowles, backported by me.
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 047fe26..c8e76dc 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1247,7 +1247,6 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
23, stdout);
}
}
-
static void print_c2c_hitm_report(struct rb_root *hitm_tree,
struct c2c_stats *hitm_stats __maybe_unused,
struct c2c_stats *c2c_stats)
@@ -1442,6 +1441,48 @@ cleanup:
return;
}
+static void print_c2c_trace_report(struct perf_c2c *c2c)
+{
+ int llc_misses;
+ struct c2c_stats *stats = &c2c->stats;
+
+ llc_misses = stats->t.lcl_dram +
+ stats->t.rmt_dram +
+ stats->t.rmt_hit +
+ stats->t.rmt_hitm;
+
+ printf("=================================================\n");
+ printf(" Trace Event Information \n");
+ printf("=================================================\n");
+ printf(" Total records : %10d\n", c2c->stats.nr_entries);
+ printf(" Locked Load/Store Operations : %10d\n", stats->t.locks);
+ printf(" Load Operations : %10d\n", stats->t.load);
+ printf(" Loads - uncacheable : %10d\n", stats->t.ld_uncache);
+ printf(" Loads - no mapping : %10d\n", stats->t.ld_noadrs);
+ printf(" Load Fill Buffer Hit : %10d\n", stats->t.ld_fbhit);
+ printf(" Load L1D hit : %10d\n", stats->t.ld_l1hit);
+ printf(" Load L2D hit : %10d\n", stats->t.ld_l2hit);
+ printf(" Load LLC hit : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+ printf(" Load Local HITM : %10d\n", stats->t.lcl_hitm);
+ printf(" Load Remote HITM : %10d\n", stats->t.rmt_hitm);
+ printf(" Load Remote HIT : %10d\n", stats->t.rmt_hit);
+ printf(" Load Local DRAM : %10d\n", stats->t.lcl_dram);
+ printf(" Load Remote DRAM : %10d\n", stats->t.rmt_dram);
+ printf(" Load MESI State Exclusive : %10d\n", stats->t.ld_excl);
+ printf(" Load MESI State Shared : %10d\n", stats->t.ld_shared);
+ printf(" Load LLC Misses : %10d\n", llc_misses);
+ printf(" LLC Misses to Local DRAM : %10.1f%%\n", ((double)stats->t.lcl_dram/(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote DRAM : %10.1f%%\n", ((double)stats->t.rmt_dram/(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote cache (HIT) : %10.1f%%\n", ((double)stats->t.rmt_hit /(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote cache (HITM) : %10.1f%%\n", ((double)stats->t.rmt_hitm/(double)llc_misses) * 100.);
+ printf(" Store Operations : %10d\n", stats->t.store);
+ printf(" Store - uncacheable : %10d\n", stats->t.st_uncache);
+ printf(" Store - no mapping : %10d\n", stats->t.st_noadrs);
+ printf(" Store L1D Hit : %10d\n", stats->t.st_l1hit);
+ printf(" Virt -> Phys Remap Rejects : %10d\n", stats->t.remap);
+ printf(" No Page Map Rejects : %10d\n", stats->t.nomap);
+}
+
static int perf_c2c__process_events(struct perf_session *session,
struct perf_c2c *c2c)
{
@@ -1453,6 +1494,7 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}
+ print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);
err:
--
1.7.11.7
Now that we have all the events sorted on a unique address, we can walk
the rbtree sequentially and count up all the HITMs for each cacheline
fairly easily.
Once we encounter a new event on a different cacheline, process the previous
cacheline. That includes determining if any HITMs were present on that
cacheline and if so, add it to another rbtree sorted on the number of HITMs.
This second rbtree, sorted on the number of HITMs, will be the interesting data
we want to report and will be displayed in a follow-up patch.
For now, organize the data properly.
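For reference, the cacheline grouping relies on simple mask arithmetic, assuming the
64-byte line size the patch hardcodes; for example:

#define CACHE_LINESIZE   64
#define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)

/* e.g. addr 0x1234 -> cacheline 0x1200, offset 0x34 */
u64 cacheline = addr & ~(u64)CLINE_OFFSET_MSK;
int offset    = (int)(addr & CLINE_OFFSET_MSK);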
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 219 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 219 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 360fbcf..8b26ea2 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -59,6 +59,7 @@ struct perf_c2c {
struct c2c_entry {
struct rb_node rb_node;
+ struct list_head scratch; /* scratch list for resorting */
struct thread *thread;
struct mem_info *mi;
u32 cpu;
@@ -68,6 +69,25 @@ struct c2c_entry {
int color;
};
+#define CACHE_LINESIZE 64
+#define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)
+#define CLADRS(a) ((a) & ~(CLINE_OFFSET_MSK))
+#define CLOFFSET(a) (int)((a) & (CLINE_OFFSET_MSK))
+
+struct c2c_hit {
+ struct rb_node rb_node;
+ struct rb_root tree;
+ struct list_head list;
+ u64 cacheline;
+ int color;
+ struct c2c_stats stats;
+ pid_t pid;
+ pid_t tid;
+ u64 daddr;
+ u64 iaddr;
+ struct mem_info *mi;
+};
+
enum { OP, LVL, SNP, LCK, TLB };
#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -440,6 +460,44 @@ static struct c2c_entry *c2c_entry__add_to_list(struct perf_c2c *c2c, struct c2c
return entry;
}
+static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_hit *he;
+ int64_t cmp;
+ u64 l_hitms, r_hitms;
+
+ p = &root->rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ he = rb_entry(parent, struct c2c_hit, rb_node);
+
+ /* sort on remote hitms first */
+ l_hitms = he->stats.t.rmt_hitm;
+ r_hitms = h->stats.t.rmt_hitm;
+ cmp = r_hitms - l_hitms;
+
+ if (!cmp) {
+ /* sort on local hitms */
+ l_hitms = he->stats.t.lcl_hitm;
+ r_hitms = h->stats.t.lcl_hitm;
+ cmp = r_hitms - l_hitms;
+ }
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&h->rb_node, parent, p);
+ rb_insert_color(&h->rb_node, root);
+
+ return 0;
+}
+
static int perf_c2c__fprintf_header(FILE *fp)
{
int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
@@ -608,6 +666,50 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
return err;
}
+static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
+{
+ struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+
+ if (!h) {
+ pr_err("Could not allocate c2c_hit memory\n");
+ return NULL;
+ }
+
+ CPU_ZERO(&h->stats.cpuset);
+ INIT_LIST_HEAD(&h->list);
+ init_stats(&h->stats.stats);
+ h->tree = RB_ROOT;
+ h->cacheline = cacheline;
+ h->pid = entry->thread->pid_;
+ h->tid = entry->thread->tid;
+
+ /* use original addresses here, not adjusted al_addr */
+ h->iaddr = entry->mi->iaddr.addr;
+ h->daddr = entry->mi->daddr.addr;
+
+ h->mi = entry->mi;
+ return h;
+}
+
+static void c2c_hit__update_strings(struct c2c_hit *h,
+ struct c2c_entry *n)
+{
+ if (h->pid != n->thread->pid_)
+ h->pid = -1;
+
+ if (h->tid != n->thread->tid)
+ h->tid = -1;
+
+ /* use original addresses here, not adjusted al_addr */
+ if (h->iaddr != n->mi->iaddr.addr)
+ h->iaddr = -1;
+
+ if (CLADRS(h->daddr) != CLADRS(n->mi->daddr.addr))
+ h->daddr = -1;
+
+ CPU_SET(n->cpu, &h->stats.cpuset);
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct perf_sample *sample __maybe_unused,
struct c2c_entry *entry)
@@ -692,6 +794,121 @@ err:
return err;
}
+#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+
+static void c2c_hit__update_stats(struct c2c_stats *new,
+ struct c2c_stats *old)
+{
+ new->t.load += old->t.load;
+ new->t.ld_fbhit += old->t.ld_fbhit;
+ new->t.ld_l1hit += old->t.ld_l1hit;
+ new->t.ld_l2hit += old->t.ld_l2hit;
+ new->t.ld_llchit += old->t.ld_llchit;
+ new->t.locks += old->t.locks;
+ new->t.lcl_dram += old->t.lcl_dram;
+ new->t.rmt_dram += old->t.rmt_dram;
+ new->t.lcl_hitm += old->t.lcl_hitm;
+ new->t.rmt_hitm += old->t.rmt_hitm;
+ new->t.rmt_hit += old->t.rmt_hit;
+ new->t.store += old->t.store;
+ new->t.st_l1hit += old->t.st_l1hit;
+
+ new->total_period += old->total_period;
+}
+
+static void dump_tree_hitm(struct rb_root *tree,
+ struct perf_c2c *c2c __maybe_unused)
+{
+ struct rb_node *next = rb_first(tree);
+ struct c2c_hit *h;
+
+ printf("%16s %8s %8s %8s\n",
+ "Cacheline", "nr", "loads", "stores");
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+
+ printf("%16lx %8d %8d %8d\n",
+ h->cacheline,
+ h->stats.nr_entries,
+ h->stats.t.load,
+ h->stats.t.store);
+ }
+}
+
+static void c2c_analyze_hitms(struct perf_c2c *c2c)
+{
+
+ struct rb_node *next = rb_first(&c2c->tree_physid);
+ struct c2c_entry *n;
+ struct c2c_hit *h = NULL;
+ struct c2c_stats hitm_stats;
+ struct rb_root hitm_tree = RB_ROOT;
+ int shared_clines = 0;
+ u64 cl = 0;
+
+ memset(&hitm_stats, 0, sizeof(struct c2c_stats));
+
+ /* find HITMs */
+ while (next) {
+ n = rb_entry(next, struct c2c_entry, rb_node);
+ next = rb_next(&n->rb_node);
+
+ cl = n->mi->daddr.al_addr;
+
+ /* switch cache line objects */
+ /* 'color' forces a boundary change based on the original sort */
+ if (!h || !n->color || (CLADRS(cl) != h->cacheline)) {
+ if (h && HAS_HITMS(h)) {
+ c2c_hit__update_stats(&hitm_stats, &h->stats);
+
+ /* sort based on hottest cacheline */
+ c2c_hitm__add_to_list(&hitm_tree, h);
+ shared_clines++;
+ } else {
+ /* stores-only are un-interesting */
+ free(h);
+ }
+ h = c2c_hit__new(CLADRS(cl), n);
+ if (!h)
+ goto cleanup;
+ }
+
+
+ c2c_decode_stats(&h->stats, n);
+
+ /* filter out non-hitms as un-interesting noise */
+ if (valid_hitm_or_store(&n->mi->data_src)) {
+ /* save the entry for later processing */
+ list_add_tail(&n->scratch, &h->list);
+
+ c2c_hit__update_strings(h, n);
+ }
+ }
+
+ /* last chunk */
+ if (HAS_HITMS(h)) {
+ c2c_hit__update_stats(&hitm_stats, &h->stats);
+ c2c_hitm__add_to_list(&hitm_tree, h);
+ shared_clines++;
+ } else
+ free(h);
+
+ if (verbose > 2)
+ dump_tree_hitm(&hitm_tree, c2c);
+
+cleanup:
+ next = rb_first(&hitm_tree);
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+ rb_erase(&h->rb_node, &hitm_tree);
+
+ free(h);
+ }
+ return;
+}
+
static int perf_c2c__process_events(struct perf_session *session,
struct perf_c2c *c2c)
{
@@ -703,6 +920,8 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}
+ c2c_analyze_hitms(c2c);
+
err:
return err;
}
--
1.7.11.7
Seeing cacheline statistics is useful by itself. Seeing the callchain
for these cache contentions saves time tracking things down.
This patch tries to add callchain support. I had to use the generic
interface from a previous patch to output things to stdout easily.
Other than displaying the results, collecting the callchain and
merging it was fairly straightforward.
I used a lot of copying-n-pasting from other builtin tools to get
the initial parameter setup correct and the automatic reading of
'symbol_conf.use_callchain' from the data file.
Hopefully this is all correct. The amount of memory corruption (from the
callchain dynamic array) seems to have dwindled down to nothing. :-)
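One detail worth noting: the callchain storage hangs off the end of the structs via a
zero-length trailing array, so it costs nothing when callchains are disabled. A minimal
sketch of the pattern used in the diff below:

struct c2c_entry {
	/* ... fixed members ... */
	struct callchain_root callchain[0]; /* must be last member */
};

/* allocate the tail only when callchains were requested */
size_t extra = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
struct c2c_entry *entry = zalloc(sizeof(*entry) + extra);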
Suggested-by: Joe Mario <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 159 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 39fd233..047fe26 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -49,6 +49,7 @@ struct c2c_stats {
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
+ bool call_graph;
struct rb_root tree_physid;
/* stats */
@@ -67,6 +68,8 @@ struct c2c_entry {
int weight;
int period;
int color;
+
+ struct callchain_root callchain[0]; /* must be last member */
};
#define DISPLAY_LINE_LIMIT 0.0015
@@ -89,6 +92,8 @@ struct c2c_hit {
u64 daddr;
u64 iaddr;
struct mem_info *mi;
+
+ struct callchain_root callchain[0]; /* must be last member */
};
enum { OP, LVL, SNP, LCK, TLB };
@@ -676,7 +681,8 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
{
- struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+ size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+ struct c2c_hit *h = zalloc(sizeof(struct c2c_hit) + callchain_size);
if (!h) {
pr_err("Could not allocate c2c_hit memory\n");
@@ -690,6 +696,8 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
h->cacheline = cacheline;
h->pid = entry->thread->pid_;
h->tid = entry->thread->tid;
+ if (symbol_conf.use_callchain)
+ callchain_init(h->callchain);
/* use original addresses here, not adjusted al_addr */
h->iaddr = entry->mi->iaddr.addr;
@@ -834,6 +842,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
struct mem_info *mi;
struct thread *thread;
+ struct symbol *parent = NULL;
struct c2c_entry *entry;
sample_handler f;
int err = -1;
@@ -864,6 +873,19 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
if (err)
goto err_entry;
+ /* attach callchain if everything is good */
+ if (symbol_conf.use_callchain && sample->callchain) {
+ callchain_init(entry->callchain);
+
+ err = machine__resolve_callchain(machine, evsel, thread,
+ sample, &parent, NULL);
+ if (!err)
+ err = callchain_append(entry->callchain,
+ &callchain_cursor,
+ entry->period);
+ if (err)
+ pr_err("Could not attach callchain, skipping\n");
+ }
return 0;
err_entry:
@@ -1217,6 +1239,13 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
print_socket_shared_str(node_stats);
printf("\n");
+
+ if (symbol_conf.use_callchain) {
+ generic_entry_callchain__fprintf(clo->callchain,
+ h->stats.total_period,
+ clo->stats.total_period,
+ 23, stdout);
+ }
}
static void print_c2c_hitm_report(struct rb_root *hitm_tree,
@@ -1293,6 +1322,12 @@ static void print_c2c_hitm_report(struct rb_root *hitm_tree,
c2c_decode_stats(&node_stats[node], entry);
CPU_SET(entry->cpu, &(node_stats[node].cpuset));
}
+ if (symbol_conf.use_callchain) {
+ callchain_cursor_reset(&callchain_cursor);
+ callchain_merge(&callchain_cursor,
+ clo->callchain,
+ entry->callchain);
+ }
}
if (clo) {
@@ -1424,6 +1459,30 @@ err:
return err;
}
+static int perf_c2c__setup_sample_type(struct perf_c2c *c2c,
+ struct perf_session *session)
+{
+ u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+ if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
+ if (symbol_conf.use_callchain) {
+ printf("Selected -g but no callchain data. Did "
+ "you call 'perf c2c record' without -g?\n");
+ return -1;
+ }
+ } else if (callchain_param.mode != CHAIN_NONE &&
+ !symbol_conf.use_callchain) {
+ symbol_conf.use_callchain = true;
+ c2c->call_graph = true;
+ if (callchain_register_param(&callchain_param) < 0) {
+ printf("Can't register callchain params.\n");
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
static int perf_c2c__read_events(struct perf_c2c *c2c)
{
int err = -1;
@@ -1438,6 +1497,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
if (symbol__init() < 0)
goto out_delete;
+ if (perf_c2c__setup_sample_type(c2c, session) < 0)
+ goto out_delete;
+
if (perf_evlist__set_handlers(session->evlist, handlers))
goto out_delete;
@@ -1508,8 +1570,101 @@ static int perf_c2c__record(int argc, const char **argv)
return cmd_record(i, rec_argv, NULL);
}
+static int
+opt_callchain_cb(const struct option *opt, const char *arg, int unset)
+{
+ struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
+ char *tok, *tok2;
+ char *endptr;
+
+ /*
+ * --no-call-graph
+ */
+ if (unset) {
+ c2c->call_graph = false;
+ return 0;
+ }
+
+ symbol_conf.use_callchain = true;
+ c2c->call_graph = true;
+
+ if (!arg)
+ return 0;
+
+ tok = strtok((char *)arg, ",");
+ if (!tok)
+ return -1;
+
+ /* get the output mode */
+ if (!strncmp(tok, "graph", strlen(arg)))
+ callchain_param.mode = CHAIN_GRAPH_ABS;
+
+ else if (!strncmp(tok, "flat", strlen(arg)))
+ callchain_param.mode = CHAIN_FLAT;
+
+ else if (!strncmp(tok, "fractal", strlen(arg)))
+ callchain_param.mode = CHAIN_GRAPH_REL;
+
+ else if (!strncmp(tok, "none", strlen(arg))) {
+ callchain_param.mode = CHAIN_NONE;
+ symbol_conf.use_callchain = false;
+
+ return 0;
+ }
+
+ else
+ return -1;
+
+ /* get the min percentage */
+ tok = strtok(NULL, ",");
+ if (!tok)
+ goto setup;
+
+ callchain_param.min_percent = strtod(tok, &endptr);
+ if (tok == endptr)
+ return -1;
+
+ /* get the print limit */
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+
+ if (tok2[0] != 'c') {
+ callchain_param.print_limit = strtoul(tok2, &endptr, 0);
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+ }
+
+ /* get the call chain order */
+ if (!strncmp(tok2, "caller", strlen("caller")))
+ callchain_param.order = ORDER_CALLER;
+ else if (!strncmp(tok2, "callee", strlen("callee")))
+ callchain_param.order = ORDER_CALLEE;
+ else
+ return -1;
+
+ /* Get the sort key */
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+ if (!strncmp(tok2, "function", strlen("function")))
+ callchain_param.key = CCKEY_FUNCTION;
+ else if (!strncmp(tok2, "address", strlen("address")))
+ callchain_param.key = CCKEY_ADDRESS;
+ else
+ return -1;
+setup:
+ if (callchain_register_param(&callchain_param) < 0) {
+ fprintf(stderr, "Can't register callchain params\n");
+ return -1;
+ }
+ return 0;
+}
+
int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
{
+ char callchain_default_opt[] = "fractal,0.05,callee";
struct perf_c2c c2c = {
.tool = {
.sample = perf_c2c__process_sample,
@@ -1536,6 +1691,9 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
"separator",
"separator for columns, no spaces will be added"
" between columns '.' is reserved."),
+ OPT_CALLBACK_DEFAULT('g', "call-graph", &c2c, "output_type,min_percent[,print_limit],call_order",
+ "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
+ "Default: fractal,0.5,callee,function", &opt_callchain_cb, callchain_default_opt),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7
This patch mainly focuses on processing and displaying the collected
HITMs to stdout. Most of it is just printing data in a pretty way.
There is one trick used when walking the cacheline. When we get this
far we have two rbtrees. One rbtree holds every record sorted on a
unique id (using the mmap2 decoder), the other rbtree holds every
cacheline with at least one HITM sorted on number of HITMs.
To display the output, the tool walks the second rbtree to display
the hottest cachelines. Inside each hot cacheline, the tool displays
the offsets and the loads/stores it generates. To determine the
cacheline offset, it uses a linked list inside the cacheline element
to walk the first rbtree's elements for that particular cacheline.
The first rbtree's elements are already sorted correctly in offset order, so
processing the offsets is fairly trivial and is done sequentially.
This is why you will see two nested loops in print_c2c_hitm_report():
the outer one walks the cachelines, the inner one walks the offsets.
A knob has been added to display node information, which is useful
to see which cpus are involved in the contention and their nodes.
Another knob has been added to change the coalescing levels. You can
coalesce the output based on pid, tid, ip, and/or symbol.
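For orientation, the levels map to grouping keys roughly as follows (a sketch inferred
from matching_coalescing() in the diff below; kernel-mode entries drop the pid/tid keys
at LVL1-LVL3):

/* Coalescing keys per level (userspace entries), illustrative only: */
/* LVL0, LVL1: daddr + pid + tid + iaddr  (finest grain)             */
/* LVL2:       daddr + pid       + iaddr                             */
/* LVL3:       daddr             + iaddr                             */
/* LVL4:       daddr             + symbol (coarsest grain)           */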
Original output and statistics done by Dick Fowles, backported by me.
Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 528 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 515 insertions(+), 13 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 8b26ea2..39fd233 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -69,10 +69,13 @@ struct c2c_entry {
int color;
};
+#define DISPLAY_LINE_LIMIT 0.0015
#define CACHE_LINESIZE 64
#define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)
#define CLADRS(a) ((a) & ~(CLINE_OFFSET_MSK))
#define CLOFFSET(a) (int)((a) & (CLINE_OFFSET_MSK))
+#define MAXTITLE_SZ 400
+#define MAXLBL_SZ 256
struct c2c_hit {
struct rb_node rb_node;
@@ -113,6 +116,11 @@ enum { OP, LVL, SNP, LCK, TLB };
#define LCL_HITM(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
#define LCL_MEM(a) (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+enum { LVL0, LVL1, LVL2, LVL3, LVL4, MAX_LVL };
+static int cloffset = LVL1;
+static int node_info = 0;
+static int coalesce_level = LVL1;
+
static int max_cpu_num;
static int max_node_num;
static int *cpunode_map;
@@ -710,6 +718,80 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
CPU_SET(n->cpu, &h->stats.cpuset);
}
+static inline bool matching_coalescing(struct c2c_hit *h,
+ struct c2c_entry *e)
+{
+ bool value = false;
+ struct mem_info *mi = e->mi;
+
+ if (coalesce_level > MAX_LVL)
+ printf("DON: bad coalesce level %d\n", coalesce_level);
+
+ if (e->cpumode != PERF_RECORD_MISC_KERNEL) {
+
+ switch (coalesce_level) {
+
+ case LVL0:
+ case LVL1:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->tid == e->thread->tid) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL2:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL3:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL4:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->mi->iaddr.sym == mi->iaddr.sym));
+ break;
+
+ default:
+ break;
+
+ }
+
+ } else {
+
+ switch (coalesce_level) {
+
+ case LVL0:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->tid == e->thread->tid) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL1:
+ case LVL2:
+ case LVL3:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL4:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->mi->iaddr.sym == mi->iaddr.sym));
+ break;
+
+ default:
+ break;
+
+ }
+ }
+
+ return value;
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct perf_sample *sample __maybe_unused,
struct c2c_entry *entry)
@@ -816,26 +898,442 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
new->total_period += old->total_period;
}
-static void dump_tree_hitm(struct rb_root *tree,
- struct perf_c2c *c2c __maybe_unused)
+static void print_hitm_cacheline_header(void)
+{
+#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
+#define PARTICIPANTS1 "Node{cpus %hitms %stores} Node{cpus %hitms %stores} ..."
+#define PARTICIPANTS2 "Node{cpu list}; Node{cpu list}; Node{cpu list}; ..."
+
+ int i;
+ const char *docptr;
+ static char delimit[MAXTITLE_SZ];
+ static char title2[MAXTITLE_SZ];
+ int pad;
+
+ docptr = " ";
+ if (node_info == 1)
+ docptr = PARTICIPANTS1;
+ if (node_info == 2)
+ docptr = PARTICIPANTS2;
+
+ sprintf(title2, "%4s %6s %6s %6s %6s %8s %8s %8s %8s %18s %6s %6s %18s %8s %8s %8s %6s %-30s %-20s %s",
+ "Num",
+ "%dist",
+ "%cumm",
+ "%dist",
+ "%cumm",
+ "LLCmiss",
+ "LLChit",
+ "L1 hit",
+ "L1 Miss",
+ "Data Address",
+ "Pid",
+ "Tid",
+ "Inst Address",
+ "median",
+ "mean",
+ "CV ",
+ "cnt",
+ "Symbol",
+ "Object",
+ docptr);
+
+ for (i = 0; i < (int)strlen(title2); i++) strcat(delimit, "=");
+
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("\n");
+
+ pad = (strlen(title2)/2) - (strlen(SHARING_REPORT_TITLE)/2);
+ for (i = 0; i < pad; i++) printf(" ");
+ printf("%s\n", SHARING_REPORT_TITLE);
+
+ printf("\n");
+ printf("%4s %13s %13s %17s %8s %8s %18s %6s %6s %18s %26s %6s %30s %20s %s\n",
+ " ",
+ "---- All ----",
+ "-- Shared --",
+ "---- HITM ----",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ "Load Inst Execute Latency",
+ " ",
+ " ",
+ " ",
+ node_info ? "Shared Data Participants" : " ");
+
+
+ printf("%4s %13s %13s %8s %8s %17s %18s %6s %6s %17s %18s\n",
+ " ",
+ " Data Misses",
+ " Data Misses",
+ "Remote",
+ "Local",
+ "-- Store Refs --",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ");
+
+ printf("%4s %13s %13s %8s %8s %8s %8s %18s %6s %6s %17s %18s %8s %6s\n",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ "---- cycles ----",
+ " ",
+ "cpu");
+
+ printf("%s\n", title2);
+ printf("%s\n", delimit);
+}
+
+static void print_hitm_cacheline(struct c2c_hit *h,
+ int record,
+ double tot_cumm,
+ double ld_cumm,
+ double tot_dist,
+ double ld_dist)
+{
+ char pidstr[7];
+ char addrstr[20];
+ static char summary[MAXLBL_SZ];
+ int j;
+
+ if (h->pid > 0)
+ sprintf(pidstr, "%6d", h->pid);
+ else
+ sprintf(pidstr, "***");
+ /*
+ * It is possible to have multiple distinct virtual addresses
+ * pointing to the same System V shared memory region. If there
+ * are multiple virtual addresses, the address field will show
+ * asterisks. It would be possible to substitute the physical
+ * address instead, but that could be confusing: sometimes the
+ * field would hold a virtual address and other times a physical
+ * one, which may mislead the reader.
+ */
+ if (h->daddr != ~0UL)
+ sprintf(addrstr, "%#18lx", CLADRS(h->daddr));
+ else
+ sprintf(addrstr, "****************");
+
+
+ sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
+ record,
+ tot_dist * 100.,
+ tot_cumm * 100.,
+ ld_dist * 100.,
+ ld_cumm * 100.,
+ h->stats.t.rmt_hitm,
+ h->stats.t.lcl_hitm,
+ h->stats.t.st_l1hit,
+ h->stats.t.st_l1miss,
+ addrstr,
+ pidstr);
+
+ for (j = 0; j < (int)strlen(summary); j++) printf("-");
+ printf("\n");
+ printf("%s", summary);
+ for (j = 0; j < (int)strlen(summary); j++) printf("-");
+ printf("\n");
+}
+
+static void print_socket_stats_str(struct c2c_hit *clo,
+ struct c2c_stats *node_stats)
{
- struct rb_node *next = rb_first(tree);
- struct c2c_hit *h;
+ int i, j;
+
+ if (!node_stats)
+ return;
+
+ for (i = 0; i < max_node_num; i++) {
+ struct c2c_stats *stats = &node_stats[i];
+ int num = CPU_COUNT(&stats->cpuset);
+
+ if (!num) {
+ /* pad align socket info */
+ for (j = 0; j < 21; j++)
+ printf(" ");
+ continue;
+ }
+
+ printf("%2d{%2d ", i, num);
+
+ if (clo->stats.t.rmt_hitm > 0)
+ printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));
+ else
+ printf("%6s ", "n/a");
+
+ if (clo->stats.t.store > 0)
+ printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
+ else
+ printf("%6s} ", "n/a");
+ }
+}
+
+static void print_socket_shared_str(struct c2c_stats *node_stats)
+{
+ int i, j;
+
+ if (!node_stats)
+ return;
+
+ for (i = 0; i < max_node_num; i++) {
+ struct c2c_stats *stats = &node_stats[i];
+ int num = CPU_COUNT(&stats->cpuset);
+ int start = -1;
+ bool first = true;
+
+ if (!num)
+ continue;
+
+ printf("%d{", i);
+
+ for (j = 0; j < max_cpu_num; j++) {
+ if (!CPU_ISSET(j, &stats->cpuset)) {
+ if (start != -1) {
+ if ((j-1) - start)
+ /* print the range */
+ printf("%s%d-%d", (first ? "" : ","), start, j-1);
+ else
+ /* standalone */
+ printf("%s%d", (first ? "" : ",") , start);
+ start = -1;
+ first = false;
+ }
+ continue;
+ }
+
+ if (start == -1)
+ start = j;
+ }
+ /* last chunk */
+ if (start != -1) {
+ if ((j-1) - start)
+ /* print the range */
+ printf("%s%d-%d", (first ? "" : ","), start, j-1);
+ else
+ /* standalone */
+ printf("%s%d", (first ? "" : ",") , start);
+ }
+
+ printf("}; ");
+ }
+}
+
+static void print_hitm_cacheline_offset(struct c2c_hit *clo,
+ struct c2c_hit *h,
+ struct c2c_stats *node_stats)
+{
+#define SHORT_STR_LEN 7
+#define LONG_STR_LEN 30
+
+ char pidstr[SHORT_STR_LEN];
+ char tidstr[SHORT_STR_LEN];
+ char addrstr[LONG_STR_LEN];
+ char latstr[LONG_STR_LEN];
+ char objptr[LONG_STR_LEN];
+ char symptr[LONG_STR_LEN];
+ struct c2c_stats *stats = &clo->stats;
+ struct addr_map_symbol *ams;
+
+ ams = &clo->mi->iaddr;
+
+ if (clo->pid >= 0)
+ snprintf(pidstr, SHORT_STR_LEN, "%6d", clo->pid);
+ else
+ sprintf(pidstr, "***");
+
+ if (clo->tid >= 0)
+ snprintf(tidstr, SHORT_STR_LEN, "%6d", clo->tid);
+ else
+ sprintf(tidstr, "***");
+
+ if (clo->iaddr != ~0UL)
+ snprintf(addrstr, LONG_STR_LEN, "%#18lx", clo->iaddr);
+ else
+ sprintf(addrstr, "****************");
+ snprintf(objptr, LONG_STR_LEN, "%-18s", ams->map->dso->short_name);
+ snprintf(symptr, LONG_STR_LEN, "%-18s", (ams->sym ? ams->sym->name : "?????"));
+
+ if (stats->t.rmt_hitm > 0) {
+ double mean = avg_stats(&stats->stats);
+ double std = stddev_stats(&stats->stats);
+
+ sprintf(latstr, "%8.0f %8.0f %7.1f%%",
+ -1.0, /* FIXME */
+ mean,
+ rel_stddev_stats(std, mean));
+ } else {
+ sprintf(latstr, "%8s %8s %8s",
+ "n/a",
+ "n/a",
+ "n/a");
+
+ }
+
+ /*
+ * implicit assumption that we are not coalescing over IPs
+ */
+ printf("%4s %6s %6s %6s %6s %7.1f%% %7.1f%% %7.1f%% %7.1f%% %14s0x%02lx %6s %6s %18s %8s %6d %-30s %-20s",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ (stats->t.rmt_hitm > 0) ? (100. * ((double)stats->t.rmt_hitm / (double)h->stats.t.rmt_hitm)) : 0.0,
+ (stats->t.lcl_hitm > 0) ? (100. * ((double)stats->t.lcl_hitm / (double)h->stats.t.lcl_hitm)) : 0.0,
+ (stats->t.st_l1hit > 0) ? (100. * ((double)stats->t.st_l1hit / (double)h->stats.t.st_l1hit)) : 0.0,
+ (stats->t.st_l1miss > 0) ? (100. * ((double)stats->t.st_l1miss / (double)h->stats.t.st_l1miss)) : 0.0,
+ " ",
+ (cloffset == LVL2) ? (clo->daddr & 0xff) : CLOFFSET(clo->daddr),
+ pidstr,
+ tidstr,
+ addrstr,
+ latstr,
+ CPU_COUNT(&clo->stats.cpuset),
+ symptr,
+ objptr);
+
+ if (node_info == 0)
+ printf(" ");
+ else if (node_info == 1)
+ print_socket_stats_str(clo, node_stats);
+ else if (node_info == 2)
+ print_socket_shared_str(node_stats);
+
+ printf("\n");
+}
+
+static void print_c2c_hitm_report(struct rb_root *hitm_tree,
+ struct c2c_stats *hitm_stats __maybe_unused,
+ struct c2c_stats *c2c_stats)
+{
+ struct rb_node *next = rb_first(hitm_tree);
+ struct c2c_hit *h, *clo = NULL;
+ u64 addr;
+ double tot_dist, tot_cumm;
+ double ld_dist, ld_cumm;
+ int llc_misses;
+ int record = 0;
+ struct c2c_stats *node_stats = NULL;
+
+ if (node_info) {
+ node_stats = zalloc(sizeof(struct c2c_stats) * max_node_num);
+ if (!node_stats) {
+ printf("Can not allocate stats for node output\n");
+ return;
+ }
+ }
+
+ print_hitm_cacheline_header();
+
+ llc_misses = c2c_stats->t.lcl_dram +
+ c2c_stats->t.rmt_dram +
+ c2c_stats->t.rmt_hit +
+ c2c_stats->t.rmt_hitm;
+
+ /*
+ * generate distinct cache line report
+ */
+ tot_cumm = 0.0;
+ ld_cumm = 0.0;
- printf("%16s %8s %8s %8s\n",
- "Cacheline", "nr", "loads", "stores");
while (next) {
+ struct c2c_entry *entry;
+
h = rb_entry(next, struct c2c_hit, rb_node);
next = rb_next(&h->rb_node);
- printf("%16lx %8d %8d %8d\n",
- h->cacheline,
- h->stats.nr_entries,
- h->stats.t.load,
- h->stats.t.store);
+ tot_dist = ((double)h->stats.t.rmt_hitm / llc_misses);
+ tot_cumm += tot_dist;
+
+ ld_dist = ((double)h->stats.t.rmt_hitm / c2c_stats->t.rmt_hitm);
+ ld_cumm += ld_dist;
+
+ /*
+ * don't display lines with insignificant sharing contribution
+ */
+ if (ld_dist < DISPLAY_LINE_LIMIT)
+ break;
+
+ print_hitm_cacheline(h, record, tot_cumm, ld_cumm, tot_dist, ld_dist);
+
+ list_for_each_entry(entry, &h->list, scratch) {
+
+ if (!clo || !matching_coalescing(clo, entry)) {
+ if (clo)
+ print_hitm_cacheline_offset(clo, h, node_stats);
+
+ free(clo);
+ addr = entry->mi->iaddr.al_addr;
+ clo = c2c_hit__new(addr, entry);
+ if (node_info)
+ memset(node_stats, 0, sizeof(struct c2c_stats) * max_node_num);
+ }
+ c2c_decode_stats(&clo->stats, entry);
+ c2c_hit__update_strings(clo, entry);
+
+ if (node_info) {
+ int node = cpunode_map[entry->cpu];
+ c2c_decode_stats(&node_stats[node], entry);
+ CPU_SET(entry->cpu, &(node_stats[node].cpuset));
+ }
+
+ }
+ if (clo) {
+ print_hitm_cacheline_offset(clo, h, node_stats);
+ free(clo);
+ clo = NULL;
+ }
+
+ if (node_info)
+ memset(node_stats, 0, sizeof(struct c2c_stats) * max_node_num);
+
+ printf("\n");
+ record++;
}
}
+static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
+{
+ return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
+ (dsrc->mem_op & P(OP,STORE)));
+}
+
+static void print_shared_cacheline_info(struct c2c_stats *stats, int cline_cnt)
+{
+ int hitm_cnt = stats->t.lcl_hitm + stats->t.rmt_hitm;
+
+ printf("=================================================\n");
+ printf(" Global Shared Cache Line Event Information \n");
+ printf("=================================================\n");
+ printf(" Total Shared Cache Lines : %10d\n", cline_cnt);
+ printf(" Load HITs on shared lines : %10d\n", stats->t.load);
+ printf(" Fill Buffer Hits on shared lines : %10d\n", stats->t.ld_fbhit);
+ printf(" L1D hits on shared lines : %10d\n", stats->t.ld_l1hit);
+ printf(" L2D hits on shared lines : %10d\n", stats->t.ld_l2hit);
+ printf(" LLC hits on shared lines : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+ printf(" Locked Access on shared lines : %10d\n", stats->t.locks);
+ printf(" Store HITs on shared lines : %10d\n", stats->t.store);
+ printf(" Store L1D hits on shared lines : %10d\n", stats->t.st_l1hit);
+ printf(" Total Merged records : %10d\n", hitm_cnt + stats->t.store);
+}
+
static void c2c_analyze_hitms(struct perf_c2c *c2c)
{
@@ -894,8 +1392,8 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
} else
free(h);
- if (verbose > 2)
- dump_tree_hitm(&hitm_tree, c2c);
+ print_shared_cacheline_info(&hitm_stats, shared_clines);
+ print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
cleanup:
next = rb_first(&hitm_tree);
@@ -1026,6 +1524,10 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
};
const struct option c2c_options[] = {
OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+ OPT_INCR('N', "node-info", &node_info,
+ "show extra node info in report (repeat for more info)"),
+ OPT_INTEGER('c', "coalesce-level", &coalesce_level,
+ "how much coalescing for tid, pid, and ip is done (repeat for more coalescing)"),
OPT_INCR('v', "verbose", &verbose,
"be more verbose (show counter open errors, etc)"),
OPT_STRING('i', "input", &input_name, "file",
--
1.7.11.7
A basic patch that re-arranges some of the c2c code and adds a couple
of small features to lay the ground work for the rest of the patch
series.
Changes include:
o reworking the report path
o creating an initial entry struct
o replace preprocess_sample with simpler calls
o rework raw output to handle separators
o remove phys id gunk
o add some generic options
There isn't much meat in this patch, just a bunch of code movement and cleanups.
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 163 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 129 insertions(+), 34 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a5dc412..b062485 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,6 +5,7 @@
#include "util/parse-options.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/debug.h"
#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -14,6 +15,15 @@ struct perf_c2c {
bool raw_records;
};
+struct c2c_entry {
+ struct thread *thread;
+ struct mem_info *mi;
+ u32 cpu;
+ u8 cpumode;
+ int weight;
+ int period;
+};
+
enum { OP, LVL, SNP, LCK, TLB };
static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
@@ -105,34 +115,69 @@ static int perf_c2c__fprintf_header(FILE *fp)
}
static int perf_sample__fprintf(struct perf_sample *sample, char tag,
- const char *reason, struct addr_location *al, FILE *fp)
+ const char *reason, struct mem_info *mi, FILE *fp)
{
char data_src[61];
+ const char *fmt, *sep;
+ struct map *map = mi->iaddr.map;
perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
- return fprintf(fp, "%c %-16s %6d %6d %4d %#18" PRIx64 " %#18" PRIx64 " %#18" PRIx64 " %6" PRIu64 " %#10" PRIx64 " %-60.60s %s:%s\n",
- tag,
- reason ?: "valid record",
- sample->pid,
- sample->tid,
- sample->cpu,
- sample->ip,
- sample->addr,
- 0UL,
- sample->weight,
- sample->data_src,
- data_src,
- al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
- al->sym ? al->sym->name : "???");
+ if (symbol_conf.field_sep) {
+ fmt = "%c%s%s%s%d%s%d%s%d%s%#"PRIx64"%s%#"PRIx64"%s"
+ "%"PRIu64"%s%#"PRIx64"%s%s%s%s:%s\n";
+ sep = symbol_conf.field_sep;
+ } else {
+ fmt = "%c%s%-16s%s%6d%s%6d%s%4d%s%#18"PRIx64"%s%#18"PRIx64"%s"
+ "%6"PRIu64"%s%#10"PRIx64"%s%-60.60s%s%s:%s\n";
+ sep = " ";
+ }
+
+ return fprintf(fp, fmt,
+ tag, sep,
+ reason ?: "valid record", sep,
+ sample->pid, sep,
+ sample->tid, sep,
+ sample->cpu, sep,
+ sample->ip, sep,
+ sample->addr, sep,
+ sample->weight, sep,
+ sample->data_src, sep,
+ data_src, sep,
+ map ? (map->dso ? map->dso->long_name : "???") : "???",
+ mi->iaddr.sym ? mi->iaddr.sym->name : "???");
}
-static int perf_c2c__process_load_store(struct perf_c2c *c2c,
- struct perf_sample *sample,
- struct addr_location *al)
+static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
+ struct thread *thread,
+ struct mem_info *mi,
+ u8 cpumode)
{
- if (c2c->raw_records)
- perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+ size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+ struct c2c_entry *entry = zalloc(sizeof(*entry) + callchain_size);
+
+ if (entry != NULL) {
+ entry->thread = thread;
+ entry->mi = mi;
+ entry->cpu = sample->cpu;
+ entry->cpumode = cpumode;
+ entry->weight = sample->weight;
+ if (sample->period)
+ entry->period = sample->period;
+ else
+ entry->period = 1;
+ }
+
+ return entry;
+}
+
+static int perf_c2c__process_load_store(struct perf_c2c *c2c __maybe_unused,
+ struct perf_sample *sample __maybe_unused,
+ struct c2c_entry *entry)
+{
+ /* don't lose the maps if remapped */
+ entry->mi->iaddr.map->referenced = true;
+ entry->mi->daddr.map->referenced = true;
return 0;
}
@@ -144,7 +189,7 @@ static const struct perf_evsel_str_handler handlers[] = {
typedef int (*sample_handler)(struct perf_c2c *c2c,
struct perf_sample *sample,
- struct addr_location *al);
+ struct c2c_entry *entry);
static int perf_c2c__process_sample(struct perf_tool *tool,
union perf_event *event,
@@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
struct machine *machine)
{
struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
- struct addr_location al;
- int err = 0;
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ struct mem_info *mi;
+ struct thread *thread;
+ struct c2c_entry *entry;
+ sample_handler f;
+ int err = -1;
+
+ if (evsel->handler.func == NULL)
+ return 0;
+
+ thread = machine__find_thread(machine, sample->tid);
+ if (thread == NULL)
+ goto err;
+
+ mi = machine__resolve_mem(machine, thread, sample, cpumode);
+ if (mi == NULL)
+ goto err;
- if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
- pr_err("problem processing %d event, skipping it.\n",
- event->header.type);
- return -1;
+ if (c2c->raw_records) {
+ perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
+ free(mi);
+ return 0;
}
- if (evsel->handler.func != NULL) {
- sample_handler f = evsel->handler.func;
- err = f(c2c, sample, &al);
+ entry = c2c_entry__new(sample, thread, mi, cpumode);
+ if (entry == NULL)
+ goto err_mem;
+
+ f = evsel->handler.func;
+ err = f(c2c, sample, entry);
+ if (err)
+ goto err_entry;
+
+ return 0;
+
+err_entry:
+ free(entry);
+err_mem:
+ free(mi);
+err:
+ if (err > 0)
+ err = 0;
+ return err;
+}
+
+static int perf_c2c__process_events(struct perf_session *session,
+ struct perf_c2c *c2c)
+{
+ int err = -1;
+
+ err = perf_session__process_events(session, &c2c->tool);
+ if (err) {
+ pr_err("Failed to process count events, error %d\n", err);
+ goto err;
}
+err:
return err;
}
@@ -184,9 +272,7 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
if (perf_evlist__set_handlers(session->evlist, handlers))
goto out_delete;
- err = perf_session__process_events(session, &c2c->tool);
- if (err)
- pr_err("Failed to process events, error %d", err);
+ err = perf_c2c__process_events(session, c2c);
out_delete:
perf_session__delete(session);
@@ -210,7 +296,6 @@ static int perf_c2c__record(int argc, const char **argv)
const char **rec_argv;
const char * const record_args[] = {
"record",
- /* "--phys-addr", */
"-W",
"-d",
"-a",
@@ -243,6 +328,8 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
struct perf_c2c c2c = {
.tool = {
.sample = perf_c2c__process_sample,
+ .mmap2 = perf_event__process_mmap2,
+ .mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
.exit = perf_event__process_exit,
.fork = perf_event__process_fork,
@@ -252,6 +339,14 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
};
const struct option c2c_options[] = {
OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+ OPT_INCR('v', "verbose", &verbose,
+ "be more verbose (show counter open errors, etc)"),
+ OPT_STRING('i', "input", &input_name, "file",
+ "the input file to process"),
+ OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+ "separator",
+ "separator for columns, no spaces will be added"
+ " between columns '.' is reserved."),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7
This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.
With the introduction of the c2c tool, we now have a user of MMAP2.
Signed-off-by: Don Zickus <[email protected]>
---
kernel/events/core.c | 4 ----
tools/perf/util/event.c | 36 +++++++++++++++++++-----------------
tools/perf/util/evsel.c | 1 +
3 files changed, 20 insertions(+), 21 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 56003c6..f18cbb8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6832,10 +6832,6 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (ret)
return -EFAULT;
- /* disabled for now */
- if (attr->mmap2)
- return -EINVAL;
-
if (attr->__reserved_1)
return -EINVAL;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index b0f3ca8..086c7c8 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -201,13 +201,14 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
return -1;
}
- event->header.type = PERF_RECORD_MMAP;
+ event->header.type = PERF_RECORD_MMAP2;
while (1) {
char bf[BUFSIZ];
char prot[5];
char execname[PATH_MAX];
char anonstr[] = "//anon";
+ unsigned int ino;
size_t size;
ssize_t n;
@@ -218,14 +219,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
strcpy(execname, "");
/* 00400000-0040c000 r-xp 00000000 fd:01 41038 /bin/cat */
- n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %*x:%*x %*u %s\n",
- &event->mmap.start, &event->mmap.len, prot,
- &event->mmap.pgoff,
- execname);
- /*
- * Anon maps don't have the execname.
- */
- if (n < 4)
+ n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %x:%x %u %s\n",
+ &event->mmap2.start, &event->mmap2.len, prot,
+ &event->mmap2.pgoff, &event->mmap2.maj,
+ &event->mmap2.min,
+ &ino, execname);
+
+ event->mmap2.ino = (u64)ino;
+
+ if (n < 7)
continue;
/*
* Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
@@ -246,15 +248,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
strcpy(execname, anonstr);
size = strlen(execname) + 1;
- memcpy(event->mmap.filename, execname, size);
+ memcpy(event->mmap2.filename, execname, size);
size = PERF_ALIGN(size, sizeof(u64));
- event->mmap.len -= event->mmap.start;
- event->mmap.header.size = (sizeof(event->mmap) -
- (sizeof(event->mmap.filename) - size));
- memset(event->mmap.filename + size, 0, machine->id_hdr_size);
- event->mmap.header.size += machine->id_hdr_size;
- event->mmap.pid = tgid;
- event->mmap.tid = pid;
+ event->mmap2.len -= event->mmap.start;
+ event->mmap2.header.size = (sizeof(event->mmap2) -
+ (sizeof(event->mmap2.filename) - size));
+ memset(event->mmap2.filename + size, 0, machine->id_hdr_size);
+ event->mmap2.header.size += machine->id_hdr_size;
+ event->mmap2.pid = tgid;
+ event->mmap2.tid = pid;
if (process(tool, event, &synth_sample, machine) != 0) {
rc = -1;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 55407c5..65db757 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -640,6 +640,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
perf_evsel__set_sample_bit(evsel, WEIGHT);
attr->mmap = track;
+ attr->mmap2 = track && !perf_missing_features.mmap2;
attr->comm = track;
if (opts->sample_transaction)
--
1.7.11.7
My initial implementation for rbtree sorting in the c2c tool does not use the
normal history elements. As a result, adding callchain support (which is
deeply integrated with history elements) is more challenging when trying to
display its output.
To make things simpler for myself (and to avoid rewriting the same code into
the c2c tool), I provided a generic interface that takes an unsorted callchain
list along with its total and relative sample size, and sorts it locally based
on period and calls the appropriate graph function (passing the correct sample
size).
This makes things easier because the c2c tool can be dumber and just collect
callchains and not worry about the magic needed to sort and display them
correctly.
Unfortunately, this assumes stdio output only and does not work with the other
GUI-type outputs.
Regardless, this patch provides useful info for the tool right now. Tweaks and
recommendations for a better approach are welcomed. :-)
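A hedged usage sketch, mirroring how the c2c patches call it (the total vs relative
periods select the percentage base for CHAIN_GRAPH_REL):

if (symbol_conf.use_callchain)
	generic_entry_callchain__fprintf(clo->callchain,
					 h->stats.total_period,   /* total samples */
					 clo->stats.total_period, /* relative samples */
					 23, stdout);             /* left margin */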
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/ui/stdio/hist.c | 37 +++++++++++++++++++++++++++++++++++++
tools/perf/util/hist.h | 4 ++++
2 files changed, 41 insertions(+)
diff --git a/tools/perf/ui/stdio/hist.c b/tools/perf/ui/stdio/hist.c
index 831fbb7..0a40f59 100644
--- a/tools/perf/ui/stdio/hist.c
+++ b/tools/perf/ui/stdio/hist.c
@@ -536,3 +536,40 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp)
return ret;
}
+
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+ u64 total_samples, u64 relative_samples,
+ int left_margin, FILE *fp)
+{
+ struct rb_root sorted_chain;
+ u64 min_callchain_hits;
+
+ if (!symbol_conf.use_callchain)
+ return 0;
+
+ min_callchain_hits = total_samples * (callchain_param.min_percent / 100);
+
+ callchain_param.sort(&sorted_chain, unsorted_callchain,
+ min_callchain_hits, &callchain_param);
+
+ switch (callchain_param.mode) {
+ case CHAIN_GRAPH_REL:
+ return callchain__fprintf_graph(fp, &sorted_chain, relative_samples,
+ left_margin);
+ break;
+ case CHAIN_GRAPH_ABS:
+ return callchain__fprintf_graph(fp, &sorted_chain, total_samples,
+ left_margin);
+ break;
+ case CHAIN_FLAT:
+ return callchain__fprintf_flat(fp, &sorted_chain, total_samples);
+ break;
+ case CHAIN_NONE:
+ break;
+ default:
+ pr_err("Bad callchain mode\n");
+ }
+
+ return 0;
+}
+
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a59743f..4fe5de5 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -111,6 +111,10 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);
size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
int max_cols, float min_pcnt, FILE *fp);
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+ u64 total_samples, u64 relative_samples,
+ int left_margin, FILE *fp);
+
void hists__filter_by_dso(struct hists *hists);
void hists__filter_by_thread(struct hists *hists);
void hists__filter_by_symbol(struct hists *hists);
--
1.7.11.7
This can be really useful for us performance folks, thanks. It seems
however that the first two patches in the series are missing.
On Mon, Feb 10, 2014 at 10:59:30AM -0800, Davidlohr Bueso wrote:
> This can be really useful for us performance folks, thanks. It seems
> however that the first two patches in the series are missing.
Odd, yes. For some reason they cc'd to me fine, just never made it to
lkml. Let me resend them. Thanks.
Cheers,
Don
From: Arnaldo Carvalho de Melo <[email protected]>
This is the start of a new perf tool that will collect information about
memory accesses and analyse it to find things like hot cachelines, etc.
This is basically trying to get a prototype written by Richard Fowles
written using the tools/perf coding style and libraries.
Started from a copy of 'perf sched', this patch begins the process by adding
the 'record' subcommand to collect the needed mem loads and stores samples.
It also has the basic 'report' skeleton, resolving the sample address
and hooking up the events found in a perf.data file with methods to handle
them, right now just printing the resolved perf_sample data structure
after each event name.
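For the record side, the command line built internally (see
perf_c2c__record() below) boils down to:

	perf record -W -d -a -e cpu/mem-loads,ldlat=30/pp -e cpu/mem-stores/pp <command>

i.e. system-wide sampling of the two precise memory events, with weight
(latency) and data address recording enabled.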
Cc: David Ahern <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joe Mario <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Fowles <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf-c2c.txt | 22 +++++
tools/perf/Makefile.perf | 1 +
tools/perf/builtin-c2c.c | 174 ++++++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
tools/perf/util/evlist.c | 37 ++++++++
tools/perf/util/evlist.h | 7 ++
7 files changed, 243 insertions(+)
create mode 100644 tools/perf/Documentation/perf-c2c.txt
create mode 100644 tools/perf/builtin-c2c.c
diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
new file mode 100644
index 0000000..4d52798
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -0,0 +1,22 @@
+perf-c2c(1)
+===========
+
+NAME
+----
+perf-c2c - Shared Data C2C/HITM Analyzer.
+
+SYNOPSIS
+--------
+[verse]
+'perf c2c' record
+
+DESCRIPTION
+-----------
+These are the variants of perf c2c:
+
+ 'perf c2c record <command>' to record the memory accesses of an arbitrary
+ workload.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-mem[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 7257e7e..3b21f5b 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -421,6 +421,7 @@ endif
BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)builtin-c2c.o
BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
BUILTIN_OBJS += $(OUTPUT)builtin-help.o
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
new file mode 100644
index 0000000..897eadb
--- /dev/null
+++ b/tools/perf/builtin-c2c.c
@@ -0,0 +1,174 @@
+#include "builtin.h"
+#include "cache.h"
+
+#include "util/evlist.h"
+#include "util/parse-options.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct perf_c2c {
+ struct perf_tool tool;
+};
+
+static int perf_sample__fprintf(struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct addr_location *al, FILE *fp)
+{
+ return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x%016" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
+ perf_evsel__name(evsel),
+ sample->pid, sample->tid, sample->ip, sample->addr,
+ sample->weight, sample->data_src,
+ al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+ al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al)
+{
+ perf_sample__fprintf(sample, evsel, al, stdout);
+ return 0;
+}
+
+static int perf_c2c__process_store(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al)
+{
+ perf_sample__fprintf(sample, evsel, al, stdout);
+ return 0;
+}
+
+static const struct perf_evsel_str_handler handlers[] = {
+ { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
+ { "cpu/mem-stores/pp", perf_c2c__process_store, },
+};
+
+typedef int (*sample_handler)(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al);
+
+static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct machine *machine)
+{
+ struct addr_location al;
+ int err = 0;
+
+ if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
+ pr_err("problem processing %d event, skipping it.\n",
+ event->header.type);
+ return -1;
+ }
+
+ if (evsel->handler.func != NULL) {
+ sample_handler f = evsel->handler.func;
+ err = f(evsel, sample, &al);
+ }
+
+ return err;
+}
+
+static int perf_c2c__read_events(struct perf_c2c *c2c)
+{
+ int err = -1;
+ struct perf_session *session;
+
+ session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
+ if (session == NULL) {
+ pr_debug("No memory for session\n");
+ goto out;
+ }
+
+ if (perf_evlist__set_handlers(session->evlist, handlers))
+ goto out_delete;
+
+ err = perf_session__process_events(session, &c2c->tool);
+ if (err)
+ pr_err("Failed to process events, error %d", err);
+
+out_delete:
+ perf_session__delete(session);
+out:
+ return err;
+}
+
+static int perf_c2c__report(struct perf_c2c *c2c)
+{
+ setup_pager();
+ return perf_c2c__read_events(c2c);
+}
+
+static int perf_c2c__record(int argc, const char **argv)
+{
+ unsigned int rec_argc, i, j;
+ const char **rec_argv;
+ const char * const record_args[] = {
+ "record",
+ /* "--phys-addr", */
+ "-W",
+ "-d",
+ "-a",
+ };
+
+ rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+ if (rec_argv == NULL)
+ return -ENOMEM;
+
+ for (i = 0; i < ARRAY_SIZE(record_args); i++)
+ rec_argv[i] = strdup(record_args[i]);
+
+ for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+ rec_argv[i++] = strdup("-e");
+ rec_argv[i++] = strdup(handlers[j].name);
+ }
+
+ for (j = 1; j < (unsigned int)argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+ struct perf_c2c c2c = {
+ .tool = {
+ .sample = perf_c2c__process_sample,
+ .comm = perf_event__process_comm,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
+ .lost = perf_event__process_lost,
+ .ordered_samples = true,
+ },
+ };
+ const struct option c2c_options[] = {
+ OPT_END()
+ };
+ const char * const c2c_usage[] = {
+ "perf c2c {record|report}",
+ NULL
+ };
+
+ argc = parse_options(argc, argv, c2c_options, c2c_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(c2c_usage, c2c_options);
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return perf_c2c__record(argc, argv);
+ } else if (!strncmp(argv[0], "rep", 3)) {
+ return perf_c2c__report(&c2c);
+ } else {
+ usage_with_options(c2c_usage, c2c_options);
+ }
+
+ return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index b210d62..2d0b1b5 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -17,6 +17,7 @@ extern int cmd_annotate(int argc, const char **argv, const char *prefix);
extern int cmd_bench(int argc, const char **argv, const char *prefix);
extern int cmd_buildid_cache(int argc, const char **argv, const char *prefix);
extern int cmd_buildid_list(int argc, const char **argv, const char *prefix);
+extern int cmd_c2c(int argc, const char **argv, const char *prefix);
extern int cmd_diff(int argc, const char **argv, const char *prefix);
extern int cmd_evlist(int argc, const char **argv, const char *prefix);
extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 431798a..c7012a3 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -35,6 +35,7 @@ struct cmd_struct {
static struct cmd_struct commands[] = {
{ "buildid-cache", cmd_buildid_cache, 0 },
{ "buildid-list", cmd_buildid_list, 0 },
+ { "c2c", cmd_c2c, 0 },
{ "diff", cmd_diff, 0 },
{ "evlist", cmd_evlist, 0 },
{ "help", cmd_help, 0 },
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 59ef280..faf29b0 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1243,3 +1243,40 @@ void perf_evlist__to_front(struct perf_evlist *evlist,
list_splice(&move, &evlist->entries);
}
+
+static struct perf_evsel *perf_evlist__find_by_name(struct perf_evlist *evlist,
+ const char *name)
+{
+ struct perf_evsel *evsel;
+
+ list_for_each_entry(evsel, &evlist->entries, node) {
+ if (strcmp(name, perf_evsel__name(evsel)) == 0)
+ return evsel;
+ }
+
+ return NULL;
+}
+
+int __perf_evlist__set_handlers(struct perf_evlist *evlist,
+ const struct perf_evsel_str_handler *assocs,
+ size_t nr_assocs)
+{
+ struct perf_evsel *evsel;
+ size_t i;
+ int err = -EEXIST;
+
+ for (i = 0; i < nr_assocs; i++) {
+ evsel = perf_evlist__find_by_name(evlist, assocs[i].name);
+ if (evsel == NULL)
+ continue;
+
+ if (evsel->handler.func != NULL)
+ goto out;
+
+ evsel->handler.func = assocs[i].handler;
+ }
+
+ err = 0;
+out:
+ return err;
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index f5173cd..76f77c8 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
void *handler;
};
+int __perf_evlist__set_handlers(struct perf_evlist *evlist,
+ const struct perf_evsel_str_handler *assocs,
+ size_t nr_assocs);
+
+#define perf_evlist__set_handlers(evlist, array) \
+ __perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
+
struct perf_evlist *perf_evlist__new(void);
struct perf_evlist *perf_evlist__new_default(void);
void perf_evlist__init(struct perf_evlist *evlist, struct cpu_map *cpus,
--
1.7.11.7
From: Arnaldo Carvalho de Melo <[email protected]>
From the c2c prototype:
[root@sandy ~]# perf c2c -r report | head -7
T Status Pid Tid CPU Inst Adrs Virt Data Adrs Phys Data Adrs Cycles Source Decoded Source ObJect:Symbol
--------------------------------------------------------------------------------------------------------------------------------------------
raw input 779 779 7 0xffffffff810865dd 0xffff8803f4d75ec8 0 370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA] [kernel.kallsyms]:try_to_wake_up
raw input 779 779 7 0xffffffff8107acb3 0xffff8802a5b73158 0 297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
raw input 779 779 7 0x3b7e009814 0x7fff87429ea0 0 925 0x68100142 [LOAD,L1,HIT,SNP NONE] ???:???
raw input 0 0 1 0xffffffff8108bf81 0xffff8803eafebf50 0 172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM] [kernel.kallsyms]:update_stats_wait_end
raw input 779 779 7 0x3b7e0097cc 0x7fac94b69068 0 228 0x68100242 [LOAD,LFB,HIT,SNP NONE] ???:???
[root@sandy ~]#
The "Phys Data Adrs" column is not available at this point.
Cc: David Ahern <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joe Mario <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Fowles <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-c2c.c | 148 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 125 insertions(+), 23 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 897eadb..a5dc412 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -11,51 +11,148 @@
struct perf_c2c {
struct perf_tool tool;
+ bool raw_records;
};
-static int perf_sample__fprintf(struct perf_sample *sample,
- struct perf_evsel *evsel,
- struct addr_location *al, FILE *fp)
+enum { OP, LVL, SNP, LCK, TLB };
+
+static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
{
- return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x%016" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
- perf_evsel__name(evsel),
- sample->pid, sample->tid, sample->ip, sample->addr,
- sample->weight, sample->data_src,
- al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
- al->sym ? al->sym->name : "???");
+#define PREFIX "["
+#define SUFFIX "]"
+#define ELLIPSIS "..."
+ static const struct {
+ uint64_t bit;
+ int64_t field;
+ const char *name;
+ } decode_bits[] = {
+ { PERF_MEM_OP_LOAD, OP, "LOAD" },
+ { PERF_MEM_OP_STORE, OP, "STORE" },
+ { PERF_MEM_OP_NA, OP, "OP_NA" },
+ { PERF_MEM_LVL_LFB, LVL, "LFB" },
+ { PERF_MEM_LVL_L1, LVL, "L1" },
+ { PERF_MEM_LVL_L2, LVL, "L2" },
+ { PERF_MEM_LVL_L3, LVL, "LCL_LLC" },
+ { PERF_MEM_LVL_LOC_RAM, LVL, "LCL_RAM" },
+ { PERF_MEM_LVL_REM_RAM1, LVL, "RMT_RAM" },
+ { PERF_MEM_LVL_REM_RAM2, LVL, "RMT_RAM" },
+ { PERF_MEM_LVL_REM_CCE1, LVL, "RMT_LLC" },
+ { PERF_MEM_LVL_REM_CCE2, LVL, "RMT_LLC" },
+ { PERF_MEM_LVL_IO, LVL, "I/O" },
+ { PERF_MEM_LVL_UNC, LVL, "UNCACHED" },
+ { PERF_MEM_LVL_NA, LVL, "N" },
+ { PERF_MEM_LVL_HIT, LVL, "HIT" },
+ { PERF_MEM_LVL_MISS, LVL, "MISS" },
+ { PERF_MEM_SNOOP_NONE, SNP, "SNP NONE" },
+ { PERF_MEM_SNOOP_HIT, SNP, "SNP HIT" },
+ { PERF_MEM_SNOOP_MISS, SNP, "SNP MISS" },
+ { PERF_MEM_SNOOP_HITM, SNP, "SNP HITM" },
+ { PERF_MEM_SNOOP_NA, SNP, "SNP NA" },
+ { PERF_MEM_LOCK_LOCKED, LCK, "LOCKED" },
+ { PERF_MEM_LOCK_NA, LCK, "LOCK_NA" },
+ };
+ union perf_mem_data_src dsrc = { .val = val, };
+ int printed = scnprintf(bf, size, PREFIX);
+ size_t i;
+ bool first_present = true;
+
+ for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
+ int bitval;
+
+ switch (decode_bits[i].field) {
+ case OP: bitval = decode_bits[i].bit & dsrc.mem_op; break;
+ case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl; break;
+ case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
+ case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock; break;
+ case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb; break;
+ default: bitval = 0; break;
+ }
+
+ if (!bitval)
+ continue;
+
+ if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
+ sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
+ printed = size - sizeof(SUFFIX);
+ break;
+ }
+
+ printed += scnprintf(bf + printed, size - printed, "%s%s",
+ first_present ? "" : ",", decode_bits[i].name);
+ first_present = false;
+ }
+
+ printed += scnprintf(bf + printed, size - printed, SUFFIX);
+ return printed;
}
-static int perf_c2c__process_load(struct perf_evsel *evsel,
- struct perf_sample *sample,
- struct addr_location *al)
+static int perf_c2c__fprintf_header(FILE *fp)
{
- perf_sample__fprintf(sample, evsel, al, stdout);
- return 0;
+ int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
+ 'T',
+ "Status",
+ "Pid",
+ "Tid",
+ "CPU",
+ "Inst Adrs",
+ "Virt Data Adrs",
+ "Phys Data Adrs",
+ "Cycles",
+ "Source",
+ " Decoded Source",
+ "ObJect:Symbol");
+ return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
}
-static int perf_c2c__process_store(struct perf_evsel *evsel,
- struct perf_sample *sample,
- struct addr_location *al)
+static int perf_sample__fprintf(struct perf_sample *sample, char tag,
+ const char *reason, struct addr_location *al, FILE *fp)
{
- perf_sample__fprintf(sample, evsel, al, stdout);
+ char data_src[61];
+
+ perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
+
+ return fprintf(fp, "%c %-16s %6d %6d %4d %#18" PRIx64 " %#18" PRIx64 " %#18" PRIx64 " %6" PRIu64 " %#10" PRIx64 " %-60.60s %s:%s\n",
+ tag,
+ reason ?: "valid record",
+ sample->pid,
+ sample->tid,
+ sample->cpu,
+ sample->ip,
+ sample->addr,
+ 0UL,
+ sample->weight,
+ sample->data_src,
+ data_src,
+ al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+ al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+ struct perf_sample *sample,
+ struct addr_location *al)
+{
+ if (c2c->raw_records)
+ perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+
return 0;
}
static const struct perf_evsel_str_handler handlers[] = {
- { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
- { "cpu/mem-stores/pp", perf_c2c__process_store, },
+ { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
+ { "cpu/mem-stores/pp", perf_c2c__process_load_store, },
};
-typedef int (*sample_handler)(struct perf_evsel *evsel,
+typedef int (*sample_handler)(struct perf_c2c *c2c,
struct perf_sample *sample,
struct addr_location *al);
-static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+static int perf_c2c__process_sample(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct perf_evsel *evsel,
struct machine *machine)
{
+ struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
struct addr_location al;
int err = 0;
@@ -67,7 +164,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
if (evsel->handler.func != NULL) {
sample_handler f = evsel->handler.func;
- err = f(evsel, sample, &al);
+ err = f(c2c, sample, &al);
}
return err;
@@ -100,6 +197,10 @@ out:
static int perf_c2c__report(struct perf_c2c *c2c)
{
setup_pager();
+
+ if (c2c->raw_records)
+ perf_c2c__fprintf_header(stdout);
+
return perf_c2c__read_events(c2c);
}
@@ -150,6 +251,7 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
},
};
const struct option c2c_options[] = {
+ OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7
On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> With the introduction of NUMA systems, came the possibility of remote memory accesses.
> Combine those remote memory accesses with contention on the remote node (ie a modified
> cacheline) and you have a possibility for very long latencies. These latencies can
> bottleneck a program.
>
> The program added by these patches, helps detect the situation where two nodes are
> 'tugging' on the same _data_ cacheline. The term used through out this program and
> the various changelogs is called a HITM. This means nodeX went to read a cacheline
> and it was discovered to be loaded in nodeY's LLC cache (hence the cacheHIT). The
> remote cacheline was also in a 'M'odified state thus creating a 'HIT M' for hit in
> a modified state. HITMs can happen locally and remotely. This program's interest
> is mainly in remote HITMs as they cause the longest latencies.
All of that is true of the traditional SMP system too. Just use lower
level caches.
On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> The data output is verbose and there are lots of data tables that interprit the latencies
> and data addresses in different ways to help see where bottlenecks might be lying.
Would be good to see what the output looks like.
What I haven't seen, and what I would find most useful, is using the IP
+ dwarf info to map it back to a data structure member.
Since you're already using the PEBS data-source fields, you can also
have a precise IP. For many cases it's possible to reconstruct the exact
data member the instruction is modifying.
At that point you can do pahole-like output of data structures, showing
which members are 'hot' on misses etc.
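I.e. something like this (a mock-up; the struct and the numbers are
invented):

	struct foo {
		spinlock_t	lock;	/*  0  4 */	/* 83% of misses, mostly HITM */
		/* XXX 4 bytes hole */
		u64		count;	/*  8  8 */	/* 17% of misses */
	};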
On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <[email protected]>
>
> This is the start of a new perf tool that will collect information about
> memory accesses and analyse it to find things like hot cachelines, etc.
>
> This is basically trying to get a prototype written by Richard Fowles
> written using the tools/perf coding style and libraries.
>
> Start it from 'perf sched', this patch starts the process by adding the
> 'record' subcommand to collect the needed mem loads and stores samples.
>
> It also have the basic 'report' skeleton, resolving the sample address
> and hooking the events found in a perf.data file with methods to handle
> them, right now just printing the resolved perf_sample data structure
> after each event name.
What tree/branch is this developed against? I'm getting the following
with Linus' latest and tip tree:
builtin-c2c.c: In function ‘perf_c2c__process_sample’:
builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
builtin-c2c.c: In function ‘perf_c2c__read_events’:
builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
In file included from builtin-c2c.c:6:0:
util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
In file included from builtin-c2c.c:6:0:
util/session.h:52:22: note: declared here
On Mon, Feb 10, 2014 at 10:18:25PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> > With the introduction of NUMA systems, came the possibility of remote memory accesses.
> > Combine those remote memory accesses with contention on the remote node (ie a modified
> > cacheline) and you have a possibility for very long latencies. These latencies can
> > bottleneck a program.
> >
> > The program added by these patches, helps detect the situation where two nodes are
> > 'tugging' on the same _data_ cacheline. The term used through out this program and
> > the various changelogs is called a HITM. This means nodeX went to read a cacheline
> > and it was discovered to be loaded in nodeY's LLC cache (hence the cacheHIT). The
> > remote cacheline was also in a 'M'odified state thus creating a 'HIT M' for hit in
> > a modified state. HITMs can happen locally and remotely. This program's interest
> > is mainly in remote HITMs as they cause the longest latencies.
>
> All of that is true of the traditional SMP system too. Just use lower
> level caches.
Yup. We just focused on the longer latencies, which is the remote case. I
think the idea was that overflowing an L1 and L2 isn't that hard, so the gain
from solving local LLC HITMs wouldn't be that much. Maybe we are wrong.
Anyway, if this tool can help solve any bottlenecks, NUMA or non-NUMA,
that would be great. :-)
Cheers,
Don
On Mon, Feb 10, 2014 at 10:29:55PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> > The data output is verbose and there are lots of data tables that interprit the latencies
> > and data addresses in different ways to help see where bottlenecks might be lying.
>
> Would be good to see what the output looks like.
hehe. Unfortunately, my node info is causing double frees now, but the
output is attached below (without node info).
>
> What I haven't seen; and what I would find most useful; is using the IP
> + dwarf info to map it back to a data structure member.
Yeah, we would like that too. :-)
>
> Since you're already using the PEBS data-source fields, you can also
> have a precise IP. For many cases its possible to reconstruct the exact
> data member the instruction is modifying.
>
> At that point you can do pahole like output of data structures, showing
> which members are 'hot' on misses etc.
Yeah, Arnaldo promised to look into that. I think Stephane was doing some
research into that too.
Cheers,
Don
=================================================
Trace Event Information
=================================================
Total records : 1322047
Locked Load/Store Operations : 206317
Load Operations : 355701
Loads - uncacheable : 590
Loads - no mapping : 207
Load Fill Buffer Hit : 100214
Load L1D hit : 148454
Load L2D hit : 15170
Load LLC hit : 53872
Load Local HITM : 15388
Load Remote HITM : 26760
Load Remote HIT : 3910
Load Local DRAM : 2436
Load Remote DRAM : 3648
Load MESI State Exclusive : 2883
Load MESI State Shared : 3201
Load LLC Misses : 36754
LLC Misses to Local DRAM : 6.6%
LLC Misses to Remote DRAM : 9.9%
LLC Misses to Remote cache (HIT) : 10.6%
LLC Misses to Remote cache (HITM) : 72.8%
Store Operations : 966322
Store - uncacheable : 0
Store - no mapping : 42931
Store L1D Hit : 915696
Virt -> Phys Remap Rejects : 0
No Page Map Rejects : 0
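(Note: "Load LLC Misses" above is the sum of the remote HIT/HITM and
local/remote DRAM loads: 26760 + 3910 + 2436 + 3648 = 36754, and the four
percentages are each line's share of that total, e.g. 26760 / 36754 = 72.8%.)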
================================================================================
Execution Latency For Loads to Non Shared Memory
Metric Overall Extremes Selected
================================================================================
Samples 301189 3454 104
Minimum 32 1006 4095
Maximum 8149 8149 8149
Threshold 0 1005 4042
Mode 34 1152 4556
Median 136 1250 5163
Mean 236 1524 5337
Std Dev 256.3 839.7 979.5
Coeff of Variation 1.086 0.551 0.184
Confid Interval 0.8 23.5 41.3
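(On the derived rows, under the usual definitions: Coeff of Variation is
Std Dev / Mean, e.g. 256.3 / 236 = 1.086 for Overall; Confid Interval is
consistent with a z * StdDev / sqrt(Samples) half-width at z ~ 1.645 (90%)
for the Overall and Extremes columns, e.g. 1.645 * 839.7 / sqrt(3454) = 23.5,
though the Selected column does not obviously match -- possibly related to
the stddev FIXME carried elsewhere in this series.)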
====================================================================================================================================================================
Non Shared Data Loads With Excessive Execution Latency
Load ------ Load Inst Execute Latency ------
Num %dist %cumm Count Data Address Inst Address Min Median Max Mean CV Symbol Object
====================================================================================================================================================================
-----------------------------------------------
0 57.3% 57.3% 59 0xffffffff81c57ac0
-----------------------------------------------
33.9% 0x00 0xffffffff81098c43 4169 0.0 8149 5479.0 20.5% update_cfs_shares [kernel.kallsyms]
66.1% 0x20 0xffffffff81094929 4155 0.0 7492 5286.6 16.9% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
1 5.8% 63.1% 6 0xffffffff818a1180
-----------------------------------------------
50.0% 0x04 0xffffffff815c4f1e 4866 0.0 8116 6745.7 25.0% _raw_spin_lock [kernel.kallsyms]
50.0% 0x04 0xffffffff815c4f47 4556 0.0 7389 6099.3 23.5% _raw_spin_lock [kernel.kallsyms]
-----------------------------------------------
2 4.9% 68.0% 5 0xffff881fbf608ac0
-----------------------------------------------
100.0% 0x24 0xffffffff810981dd 4201 0.0 6653 5429.4 17.4% task_tick_numa [kernel.kallsyms]
-----------------------------------------------
3 2.9% 70.9% 3 0xffff881fa55b0140
-----------------------------------------------
100.0% 0x38 0xffffffff81082e8a 4317 0.0 4571 4418.7 3.0% mspin_lock [kernel.kallsyms]
-----------------------------------------------
4 2.9% 73.8% 3 0xffff883fff834700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 4906 0.0 6078 5532.3 10.7% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
5 2.9% 76.7% 3 0xffff885fff834700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 4921 0.0 6703 5828.7 15.3% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
6 2.9% 79.6% 3 0xffff885fff8b4700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 4101 0.0 6022 5166.7 18.9% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
7 2.9% 82.5% 3 0xffff885fff9d4700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 4319 0.0 4486 4381.7 2.1% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
8 1.9% 84.5% 2 0xffff885fff854700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 5434 0.0 6075 5754.5 7.9% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
9 1.9% 86.4% 2 0xffff885fff974700
-----------------------------------------------
100.0% 0x30 0xffffffff810948e4 5326 0.0 5589 5457.5 3.4% update_cfs_rq_blocked_load [kernel.kallsyms]
-----------------------------------------------
10 1.9% 88.3% 2 0xffffffff819bd400
-----------------------------------------------
50.0% 0x04 0xffffffff810ae886 5427 0.0 5427 5427.0 0.0% ktime_get [kernel.kallsyms]
50.0% 0x04 0xffffffff810aed07 4334 0.0 4334 4334.0 0.0% update_wall_time [kernel.kallsyms]
-----------------------------------------------
11 1.0% 89.3% 1 0xffff881fc5600040
-----------------------------------------------
100.0% 0x20 0xffffffffa162cb70 5606 0.0 5606 5606.0 0.0% mtip_irq_handler [mtip32xx]
<snip 11 records>
...
========================================================================================================
Load Access & Execute Latency Information
Count Minimum Average CV Maximum %dist
========================================================================================================
L1 Hit - Snp None 148444 32 242 1.1501 7492 39.5%
LFB Hit - Snp None 100190 32 271 0.8572 8149 29.8%
L2 Hit - Snp None 15154 32 227 0.9071 1682 3.8%
L3 Hit - Snp None 32029 32 70 1.7761 6353 2.5%
L3 Hit - Snp Miss 2489 38 373 0.7307 5306 1.0%
L3 Hit - Snp Hit - Lcl Cache 3802 32 150 0.9289 3225 0.6%
L3 Hit - Snp Hitm - Lcl Cache 15388 32 187 1.0542 8485 3.2%
L3 Miss - Snp Hit - Rmt Cache 3910 32 355 0.3318 3972 1.5%
L3 Miss - Snp Hitm - Rmt Cache 26760 32 493 0.5116 6236 14.5%
L3 Miss - Snp Hit - Lcl Dram 1029 32 400 0.5783 3578 0.5%
L3 Miss - Snp Hit - Rmt Dram 2170 32 541 0.7315 9967 1.3%
L3 Miss - Snp Miss - Lcl Dram 1406 32 431 0.9117 8116 0.7%
L3 Miss - Snp Miss - Rmt Dram 1477 34 554 0.4437 2956 0.9%
L3 Miss - Snp NA 440 0 607 0.6940 2717 0.3%
Ld UNC - Snp None 0 18446744073709551615 0 -nan 0 0.0%
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 1327
Load HITs on shared lines : 167131
Fill Buffer Hits on shared lines : 43469
L1D hits on shared lines : 50497
L2D hits on shared lines : 960
LLC hits on shared lines : 38467
Locked Access on shared lines : 100032
Store HITs on shared lines : 118659
Store L1D hits on shared lines : 113783
Total Merged records : 160807
================================================================================================================================================================================================================
Shared Data Cache Line Table
Total %All Total ---- Core Load Hit ---- -- LLC Load Hit -- ----- LLC Load Hitm ----- -- Load Dram -- LLC ---- Store Reference ----
Index Phys Adrs Records Ld Miss %hitm Loads FB L1D L2D Lcl Rmt Total Lcl Rmt Lcl Rmt Ld Miss Total L1Hit L1Miss
================================================================================================================================================================================================================
0 0xffff881fa55b0140 72006 16.97% 23.31% 43095 13591 16860 45 2651 25 9526 3288 6238 266 131 6660 28911 28098 813
1 0xffff881fba47f000 21854 5.29% 7.26% 13938 3887 6941 15 1 7 3087 1143 1944 0 0 1951 7916 7916 0
2 0xffff881fc21b9cc0 2153 1.61% 2.21% 862 32 70 0 15 1 740 148 592 0 4 597 1291 1235 56
3 0xffff881fc7d91cc0 1957 1.40% 1.92% 866 34 94 0 14 3 720 207 513 0 1 517 1091 1028 63
4 0xffff881fba539cc0 1813 1.35% 1.85% 808 33 84 3 14 1 665 170 495 8 0 504 1005 967 38
5 0xffff881fc770bcc0 1939 1.30% 1.78% 827 36 70 0 16 1 700 223 477 4 0 482 1112 1058 54
6 0xffff881fbc8adcc0 1854 1.25% 1.72% 788 21 77 1 12 2 674 215 459 0 1 462 1066 965 101
7 0xffff881fc6c03cc0 1825 1.19% 1.63% 800 20 80 1 16 3 677 240 437 1 2 443 1025 973 52
8 0xffff881fb93f1cc0 1934 1.17% 1.61% 846 42 79 2 7 2 711 280 431 0 3 436 1088 1022 66
9 0xffff881fd1391cc0 1901 1.16% 1.59% 840 35 97 0 10 0 693 267 426 0 5 431 1061 1000 61
10 0xffff881fd0771cc0 1731 1.14% 1.57% 779 19 62 1 22 9 663 244 419 1 2 431 952 890 62
11 0xffff881fc7d31cc0 1971 1.13% 1.56% 826 29 91 2 19 5 677 260 417 0 3 425 1145 1064 81
12 0xffff881fb9dcdcc0 1821 1.13% 1.55% 784 31 67 0 13 8 663 249 414 0 2 424 1037 973 64
13 0xffff881fc3febcc0 1795 1.10% 1.51% 788 33 74 1 11 4 663 258 405 1 1 411 1007 936 71
14 0xffff881fc7d29cc0 1837 1.10% 1.51% 756 30 71 0 19 1 634 229 405 1 0 407 1081 1023 58
15 0xffff881fbc365cc0 1961 1.06% 1.46% 850 23 73 0 16 0 736 345 391 2 0 393 1111 1047 64
16 0xffff881fd0259cc0 1896 1.04% 1.43% 779 28 68 1 15 0 665 282 383 1 1 385 1117 1052 65
17 0xffff881fd0589cc0 1848 1.03% 1.42% 860 26 95 0 22 2 714 334 380 0 1 383 988 921 67
18 0xffff881fd01a1cc0 1822 1.02% 1.40% 823 38 78 0 17 1 688 313 375 1 0 377 999 926 73
19 0xffff881fb7ce1cc0 1833 1.00% 1.38% 761 28 72 0 11 2 642 274 368 1 5 376 1072 975 97
20 0xffff881fb8099cc0 1846 0.94% 1.30% 779 23 52 1 11 2 684 337 347 1 5 355 1067 991 76
21 0xffff881fc7e91cc0 1751 0.91% 1.25% 792 29 79 0 20 1 662 327 335 0 1 337 959 905 54
22 0xffff881fbf608000 625 0.33% 0.45% 551 162 17 27 189 18 134 13 121 4 0 143 74 64 10
<snip 40 lines>
....
================================================================================================================================================================================================================
Shared Cache Line Distribution Pareto
---- All ---- -- Shared -- ---- HITM ---- Load Inst Execute Latency
Data Misses Data Misses Remote Local -- Store Refs --
---- cycles ---- cpu
Num %dist %cumm %dist %cumm LLCmiss LLChit L1 hit L1 Miss Data Address Pid Tid Inst Address median mean CV cnt Symbol Object
================================================================================================================================================================================================================
-----------------------------------------------------------------------------------------------
0 17.0% 17.0% 23.3% 23.3% 6238 3288 28098 813 0xffff881fa55b0140 ***
-----------------------------------------------------------------------------------------------
0.0% 0.0% 0.0% 0.0% 0x00 375 375 0xffffffffa018ff5b n/a n/a n/a 1 ext4_bio_write_page [ext4]
0.0% 0.0% 0.0% 0.0% 0x08 18156 18165 0xffffffffa018b7f9 -1 384 0.0% 1 ext4_mark_iloc_dirty [ext4]
0.2% 0.0% 0.0% 0.0% 0x10 18156 *** 0xffffffff811ca1aa -1 387 10.7% 7 __mark_inode_dirty [kernel.kallsyms]
0.7% 0.1% 15.9% 0.0% 0x18 18156 *** 0xffffffff815c15b1 -1 1241 24.0% 51 mutex_unlock [kernel.kallsyms]
0.0% 0.0% 23.2% 0.0% 0x18 18156 *** 0xffffffff815c1615 -1 684 0.0% 50 mutex_lock [kernel.kallsyms]
0.0% 0.0% 0.5% 3.1% 0x18 18156 *** 0xffffffff815c2082 n/a n/a n/a 38 __mutex_unlock_slowpath [kernel.kallsyms]
0.2% 3.3% 0.0% 0.0% 0x18 18156 *** 0xffffffff815c2139 -1 496 22.7% 31 __mutex_lock_slowpath [kernel.kallsyms]
0.2% 3.4% 5.1% 0.0% 0x18 18156 *** 0xffffffff815c2142 -1 821 13.2% 50 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x18 18156 *** 0xffffffff815c21bf -1 1203 0.0% 18 __mutex_lock_slowpath [kernel.kallsyms]
1.2% 0.1% 0.0% 0.0% 0x18 18156 *** 0xffffffff815c21ed -1 671 42.9% 37 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.1% 0.1% 0.1% 0x18 18156 *** 0xffffffff815c21f6 -1 971 16.5% 25 __mutex_lock_slowpath [kernel.kallsyms]
10.9% 6.2% 2.5% 89.3% 0x1c 18156 *** 0xffffffff815c4e4a -1 478 51.7% 50 _raw_spin_unlock [kernel.kallsyms]
11.8% 2.0% 18.4% 0.0% 0x1c 18156 *** 0xffffffff815c4f1e -1 1276 22.0% 50 _raw_spin_lock [kernel.kallsyms]
3.2% 2.6% 0.0% 0.0% 0x1c 18156 *** 0xffffffff815c4f47 -1 831 54.2% 48 _raw_spin_lock [kernel.kallsyms]
0.8% 0.4% 0.0% 0.0% 0x20 18156 *** 0xffffffff815c207f -1 669 26.3% 26 __mutex_unlock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.0% 0.6% 0x28 18156 *** 0xffffffff812b7e4e n/a n/a n/a 5 __list_add [kernel.kallsyms]
0.0% 0.1% 0.0% 0.0% 0x30 18156 *** 0xffffffff81082f19 -1 738 58.4% 5 mutex_spin_on_owner [kernel.kallsyms]
0.1% 8.8% 0.0% 0.0% 0x30 18156 *** 0xffffffff81082f55 -1 730 33.9% 23 mutex_spin_on_owner [kernel.kallsyms]
0.0% 0.0% 0.2% 0.4% 0x30 18156 *** 0xffffffff815c15a6 n/a n/a n/a 27 mutex_unlock [kernel.kallsyms]
0.0% 0.0% 2.8% 6.5% 0x30 18156 *** 0xffffffff815c1628 n/a n/a n/a 50 mutex_lock [kernel.kallsyms]
0.4% 0.4% 0.0% 0.0% 0x30 18156 *** 0xffffffff815c20d6 -1 860 50.5% 17 __mutex_lock_slowpath [kernel.kallsyms]
60.3% 66.7% 0.0% 0.0% 0x30 18156 *** 0xffffffff815c211d -1 471 46.6% 50 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.0% 0.0% 0x30 18156 18165 0xffffffff815c2157 n/a n/a n/a 1 __mutex_lock_slowpath [kernel.kallsyms]
9.8% 5.8% 31.0% 0.0% 0x38 18156 *** 0xffffffff81082e8a -1 960 33.3% 50 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x38 18156 *** 0xffffffff81082ec4 -1 588 42.1% 18 mspin_unlock [kernel.kallsyms]
-----------------------------------------------------------------------------------------------
1 5.3% 22.3% 7.3% 30.6% 1944 1143 7916 0 0xffff881fba47f000 18156
-----------------------------------------------------------------------------------------------
100.0% 100.0% 0.0% 0.0% 0x00 18156 *** 0xffffffffa01b410e -1 401 13.5% 50 __ext4_journal_start_sb [ext4]
0.0% 0.0% 10.1% 0.0% 0x28 18156 *** 0xffffffffa0167409 n/a n/a n/a 50 start_this_handle [jbd2]
0.0% 0.0% 89.9% 0.0% 0x28 18156 *** 0xffffffff815c4be9 n/a n/a n/a 50 _raw_read_lock [kernel.kallsyms]
-----------------------------------------------------------------------------------------------
2 1.6% 23.9% 2.2% 32.8% 592 148 1235 56 0xffff881fc21b9cc0 18156
-----------------------------------------------------------------------------------------------
0.0% 0.0% 0.2% 0.0% 0x00 18156 18172 0xffffffff81082e75 n/a n/a n/a 2 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x00 18156 18172 0xffffffff81082f15 n/a n/a n/a 2 mutex_spin_on_owner [kernel.kallsyms]
0.3% 0.0% 0.0% 0.0% 0x00 18156 18172 0xffffffff81082f5a -1 449 0.3% 1 mutex_spin_on_owner [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x00 18156 18172 0xffffffff810908ac n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 0.9% 0.0% 0x00 18156 18172 0xffffffff81090a73 n/a n/a n/a 5 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 2.3% 0.0% 0x00 18156 18172 0xffffffff8113524c n/a n/a n/a 7 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 1.1% 0.0% 0x00 18156 18172 0xffffffff81135f09 n/a n/a n/a 3 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 3.8% 0.0% 0x00 18156 18172 0xffffffff811ba059 n/a n/a n/a 6 file_update_time [kernel.kallsyms]
0.0% 0.0% 0.0% 1.8% 0x00 18156 18172 0xffffffff815c4275 n/a n/a n/a 1 schedule_preempt_disabled [kernel.kallsyms]
2.9% 0.0% 0.0% 0.0% 0x00 18156 18172 0xffffffff815c4299 -1 353 15.2% 6 schedule_preempt_disabled [kernel.kallsyms]
0.0% 0.0% 1.9% 0.0% 0x08 18156 18172 0xffffffff81135245 n/a n/a n/a 6 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 0.5% 0.0% 0x08 18156 18172 0xffffffff81135f05 n/a n/a n/a 3 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 10.0% 0.0% 0x08 18156 18172 0xffffffff811b9cb5 n/a n/a n/a 8 file_remove_suid [kernel.kallsyms]
0.0% 0.0% 2.3% 0.0% 0x08 18156 18172 0xffffffff811ba055 n/a n/a n/a 8 file_update_time [kernel.kallsyms]
0.0% 0.0% 2.8% 3.6% 0x08 18156 18172 0xffffffff815c2106 n/a n/a n/a 5 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 12.2% 0.0% 0x08 18156 18172 0xffffffff815c2118 n/a n/a n/a 9 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 2.9% 33.9% 0x08 18156 18172 0xffffffff815c212c n/a n/a n/a 8 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.0% 7.1% 0x08 18156 18172 0xffffffff815c21e8 n/a n/a n/a 3 __mutex_lock_slowpath [kernel.kallsyms]
0.3% 0.0% 0.0% 0.0% 0x08 18156 18172 0xffffffff815c429a -1 314 28.6% 2 schedule_preempt_disabled [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x10 18156 18172 0xffffffff81082e80 n/a n/a n/a 1 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.0% 17.9% 0x10 18156 *** 0xffffffff81082e9b n/a n/a n/a 8 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x10 18156 18172 0xffffffff810908a2 n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 1.4% 0.0% 0x10 18156 18172 0xffffffff81137c4d n/a n/a n/a 7 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 7.4% 0.0% 0x10 18156 18172 0xffffffff81137d7b n/a n/a n/a 9 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x10 18156 18172 0xffffffff81137dbc n/a n/a n/a 2 __generic_file_aio_write [kernel.kallsyms]
0.2% 0.7% 0.0% 0.0% 0x10 18156 *** 0xffffffff812b7e3b -1 507 0.0% 2 __list_add [kernel.kallsyms]
89.0% 98.0% 0.0% 0.0% 0x18 18156 18172 0xffffffff81082ea7 -1 471 28.5% 9 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.0% 33.9% 0x18 18156 *** 0xffffffff81082eda n/a n/a n/a 15 mspin_unlock [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x18 18156 18172 0xffffffff810908a0 n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 13.8% 1.8% 0x18 18156 18172 0xffffffff811360de n/a n/a n/a 9 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 6.6% 0.0% 0x18 18156 18172 0xffffffff81137dab n/a n/a n/a 9 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x20 18156 18172 0xffffffff8109089e n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
7.3% 1.4% 0.0% 0.0% 0x20 18156 *** 0xffffffff815c208e -1 402 13.6% 29 __mutex_unlock_slowpath [kernel.kallsyms]
0.0% 0.0% 19.5% 0.0% 0x28 18156 18172 0xffffffff81137d63 n/a n/a n/a 9 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x30 18156 18172 0xffffffff81137c49 n/a n/a n/a 1 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 5.4% 0.0% 0x30 18156 18172 0xffffffff81137dc1 n/a n/a n/a 6 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 2.6% 0.0% 0x38 18156 18172 0xffffffff81090b4e n/a n/a n/a 7 wake_up_process [kernel.kallsyms]
0.0% 0.0% 1.3% 0.0% 0x38 18156 18172 0xffffffff81137c2c n/a n/a n/a 4 __generic_file_aio_write [kernel.kallsyms]
-----------------------------------------------------------------------------------------------
3 1.4% 25.3% 1.9% 34.7% 513 207 1028 63 0xffff881fc7d91cc0 18156
-----------------------------------------------------------------------------------------------
0.0% 0.0% 0.2% 0.0% 0x00 18156 18159 0xffffffff81082e75 n/a n/a n/a 2 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18159 0xffffffff81082f15 n/a n/a n/a 1 mutex_spin_on_owner [kernel.kallsyms]
0.4% 0.5% 0.0% 0.0% 0x00 18156 18159 0xffffffff81082f5a -1 446 2.5% 2 mutex_spin_on_owner [kernel.kallsyms]
0.0% 0.0% 0.3% 0.0% 0x00 18156 18159 0xffffffff810908ac n/a n/a n/a 3 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18159 0xffffffff810908fc n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 1.6% 0.0% 0x00 18156 18159 0xffffffff81090a73 n/a n/a n/a 7 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 2.9% 0.0% 0x00 18156 18159 0xffffffff8113524c n/a n/a n/a 6 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18159 0xffffffff81135f09 n/a n/a n/a 1 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x00 18156 18159 0xffffffff811ba059 n/a n/a n/a 1 file_update_time [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18159 0xffffffff815c4275 n/a n/a n/a 1 schedule_preempt_disabled [kernel.kallsyms]
3.1% 0.0% 0.0% 0.0% 0x00 18156 18159 0xffffffff815c4299 -1 376 18.4% 5 schedule_preempt_disabled [kernel.kallsyms]
0.2% 0.0% 0.0% 0.0% 0x00 18156 18159 0xffffffff815c4f4f -1 431 0.0% 1 _raw_spin_lock [kernel.kallsyms]
0.0% 0.0% 1.9% 0.0% 0x08 18156 18159 0xffffffff81135245 n/a n/a n/a 5 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 1.6% 0.0% 0x08 18156 18159 0xffffffff811b9cb5 n/a n/a n/a 5 file_remove_suid [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x08 18156 18159 0xffffffff811ba055 n/a n/a n/a 1 file_update_time [kernel.kallsyms]
0.0% 0.0% 2.9% 3.2% 0x08 18156 18159 0xffffffff815c2106 n/a n/a n/a 5 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 17.6% 0.0% 0x08 18156 18159 0xffffffff815c2118 n/a n/a n/a 7 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 6.2% 31.7% 0x08 18156 18159 0xffffffff815c212c n/a n/a n/a 7 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x08 18156 18159 0xffffffff815c21aa n/a n/a n/a 1 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.0% 9.5% 0x08 18156 18159 0xffffffff815c21e8 n/a n/a n/a 5 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x08 18156 18159 0xffffffff815c2246 n/a n/a n/a 1 __mutex_lock_slowpath [kernel.kallsyms]
0.4% 0.0% 0.0% 0.0% 0x08 18156 18159 0xffffffff815c429a -1 358 35.7% 2 schedule_preempt_disabled [kernel.kallsyms]
0.0% 0.0% 0.0% 23.8% 0x10 18156 *** 0xffffffff81082e9b n/a n/a n/a 10 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x10 18156 18159 0xffffffff810908a2 n/a n/a n/a 2 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 1.3% 0.0% 0x10 18156 18159 0xffffffff81137c4d n/a n/a n/a 4 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 4.2% 0.0% 0x10 18156 18159 0xffffffff81137d7b n/a n/a n/a 6 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 2.0% 0.0% 0x10 18156 18159 0xffffffff81137dbc n/a n/a n/a 5 __generic_file_aio_write [kernel.kallsyms]
0.2% 0.0% 0.0% 0.0% 0x10 18156 18175 0xffffffff812b7e3b -1 502 0.0% 1 __list_add [kernel.kallsyms]
90.8% 92.3% 0.0% 0.0% 0x18 18156 18159 0xffffffff81082ea7 -1 481 25.2% 7 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.0% 31.7% 0x18 18156 *** 0xffffffff81082eda n/a n/a n/a 12 mspin_unlock [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x18 18156 18159 0xffffffff810908a0 n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 24.2% 0.0% 0x18 18156 18159 0xffffffff811360de n/a n/a n/a 7 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 7.8% 0.0% 0x18 18156 18159 0xffffffff81137dab n/a n/a n/a 6 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x20 18156 18159 0xffffffff8109089e n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
4.9% 7.2% 0.0% 0.0% 0x20 18156 *** 0xffffffff815c208e -1 381 15.7% 28 __mutex_unlock_slowpath [kernel.kallsyms]
0.0% 0.0% 15.9% 0.0% 0x28 18156 18159 0xffffffff81137d63 n/a n/a n/a 7 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x30 18156 18159 0xffffffff81090895 n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x30 18156 18159 0xffffffff81137c49 n/a n/a n/a 1 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 4.3% 0.0% 0x30 18156 18159 0xffffffff81137dc1 n/a n/a n/a 7 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x30 18156 18159 0xffffffff815ce68c n/a n/a n/a 1 apic_timer_interrupt [kernel.kallsyms]
0.0% 0.0% 2.3% 0.0% 0x38 18156 18159 0xffffffff81090b4e n/a n/a n/a 5 wake_up_process [kernel.kallsyms]
0.0% 0.0% 1.3% 0.0% 0x38 18156 18159 0xffffffff81137c2c n/a n/a n/a 6 __generic_file_aio_write [kernel.kallsyms]
-----------------------------------------------------------------------------------------------
4 1.3% 26.6% 1.8% 36.6% 495 170 967 38 0xffff881fba539cc0 18156
-----------------------------------------------------------------------------------------------
0.0% 0.0% 0.2% 0.0% 0x00 18156 18169 0xffffffff81082e75 n/a n/a n/a 2 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x00 18156 18169 0xffffffff81082f15 n/a n/a n/a 2 mutex_spin_on_owner [kernel.kallsyms]
0.0% 0.6% 0.0% 0.0% 0x00 18156 18169 0xffffffff81082f5a n/a n/a n/a 1 mutex_spin_on_owner [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18169 0xffffffff810908ac n/a n/a n/a 1 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 1.0% 0.0% 0x00 18156 18169 0xffffffff81090a73 n/a n/a n/a 5 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 2.1% 0.0% 0x00 18156 18169 0xffffffff8113524c n/a n/a n/a 7 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18169 0xffffffff81135f09 n/a n/a n/a 1 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 0.4% 0.0% 0x00 18156 18169 0xffffffff811ba059 n/a n/a n/a 3 file_update_time [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x00 18156 18169 0xffffffff815c4275 n/a n/a n/a 1 schedule_preempt_disabled [kernel.kallsyms]
1.2% 0.0% 0.0% 0.0% 0x00 18156 18169 0xffffffff815c4299 -1 298 8.7% 6 schedule_preempt_disabled [kernel.kallsyms]
0.2% 0.0% 0.0% 0.0% 0x00 18156 18169 0xffffffff815c4f2c -1 441 0.0% 1 _raw_spin_lock [kernel.kallsyms]
0.4% 0.0% 0.0% 0.0% 0x00 18156 18169 0xffffffff815c4f4f -1 353 8.8% 2 _raw_spin_lock [kernel.kallsyms]
0.0% 0.0% 1.2% 0.0% 0x08 18156 18169 0xffffffff81135245 n/a n/a n/a 9 generic_segment_checks [kernel.kallsyms]
0.0% 0.0% 1.3% 0.0% 0x08 18156 18169 0xffffffff811b9cb5 n/a n/a n/a 8 file_remove_suid [kernel.kallsyms]
0.0% 0.0% 0.6% 0.0% 0x08 18156 18169 0xffffffff811ba055 n/a n/a n/a 4 file_update_time [kernel.kallsyms]
0.0% 0.0% 4.2% 13.2% 0x08 18156 18169 0xffffffff815c2106 n/a n/a n/a 9 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 13.8% 0.0% 0x08 18156 18169 0xffffffff815c2118 n/a n/a n/a 12 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 5.0% 50.0% 0x08 18156 18169 0xffffffff815c212c n/a n/a n/a 11 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x08 18156 18169 0xffffffff815c21e0 n/a n/a n/a 1 __mutex_lock_slowpath [kernel.kallsyms]
0.0% 0.0% 0.1% 15.8% 0x08 18156 18169 0xffffffff815c21e8 n/a n/a n/a 5 __mutex_lock_slowpath [kernel.kallsyms]
1.0% 0.0% 0.0% 0.0% 0x08 18156 18169 0xffffffff815c429a -1 286 5.8% 3 schedule_preempt_disabled [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x10 18156 18169 0xffffffff81082e80 n/a n/a n/a 1 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x10 18156 18169 0xffffffff810908a2 n/a n/a n/a 2 try_to_wake_up [kernel.kallsyms]
0.0% 0.0% 1.2% 0.0% 0x10 18156 18169 0xffffffff81137c4d n/a n/a n/a 6 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 5.2% 0.0% 0x10 18156 18169 0xffffffff81137d7b n/a n/a n/a 12 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 1.9% 0.0% 0x10 18156 18169 0xffffffff81137dbc n/a n/a n/a 8 __generic_file_aio_write [kernel.kallsyms]
0.2% 0.0% 0.0% 0.0% 0x10 18156 18175 0xffffffff812b7e3b -1 355 0.0% 1 __list_add [kernel.kallsyms]
90.9% 93.5% 0.0% 0.0% 0x18 18156 18169 0xffffffff81082ea7 -1 613 28.9% 13 mspin_lock [kernel.kallsyms]
0.0% 0.0% 0.0% 21.1% 0x18 18156 *** 0xffffffff81082eda n/a n/a n/a 7 mspin_unlock [kernel.kallsyms]
0.0% 0.0% 26.0% 0.0% 0x18 18156 18169 0xffffffff811360de n/a n/a n/a 12 generic_file_buffered_write [kernel.kallsyms]
0.0% 0.0% 9.0% 0.0% 0x18 18156 18169 0xffffffff81137dab n/a n/a n/a 12 __generic_file_aio_write [kernel.kallsyms]
6.1% 5.9% 0.0% 0.0% 0x20 18156 *** 0xffffffff815c208e -1 377 19.3% 25 __mutex_unlock_slowpath [kernel.kallsyms]
0.0% 0.0% 18.2% 0.0% 0x28 18156 18169 0xffffffff81137d63 n/a n/a n/a 12 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.2% 0.0% 0x30 18156 18169 0xffffffff81137c49 n/a n/a n/a 2 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 0.1% 0.0% 0x30 18156 18169 0xffffffff81137d67 n/a n/a n/a 1 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 3.1% 0.0% 0x30 18156 18169 0xffffffff81137dc1 n/a n/a n/a 8 __generic_file_aio_write [kernel.kallsyms]
0.0% 0.0% 3.1% 0.0% 0x38 18156 18169 0xffffffff81090b4e n/a n/a n/a 11 wake_up_process [kernel.kallsyms]
0.0% 0.0% 1.1% 0.0% 0x38 18156 18169 0xffffffff81137c2c n/a n/a n/a 8 __generic_file_aio_write [kernel.kallsyms]
<snip 58 records>
....
=====================================================================================================================================
Object Name, Path & Reference Counts
Index Records Object Name Object Path
=====================================================================================================================================
0 931379 [kernel.kallsyms] [kernel.kallsyms]
1 192258 fio /home/joe/old_fio-2.0.15/fio
2 80302 [jbd2] /lib/modules/3.10.0c2c_all+/kernel/fs/jbd2/jbd2.ko
3 65392 [ext4] /lib/modules/3.10.0c2c_all+/kernel/fs/ext4/ext4.ko
4 8236 libpthread-2.17.so /usr/lib64/libpthread-2.17.so
5 19 [ip_tables] /lib/modules/3.10.0c2c_all+/kernel/net/ipv4/netfilter/ip_tables.ko
6 17 [ixgbe] /lib/modules/3.10.0c2c_all+/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
7 17 perf /home/root/git/rhel7.don/tools/perf/perf
8 13 [ipmi_si] /lib/modules/3.10.0c2c_all+/kernel/drivers/char/ipmi/ipmi_si.ko
9 11 libc-2.17.so /usr/lib64/libc-2.17.so
10 10 [megaraid_sas] /lib/modules/3.10.0c2c_all+/kernel/drivers/scsi/megaraid/megaraid_sas.ko
11 9 [mtip32xx] /lib/modules/3.10.0c2c_all+/kernel/drivers/block/mtip32xx/mtip32xx.ko
12 6 irqbalance /usr/sbin/irqbalance
13 6 libpython2.7.so.1.0 /usr/lib64/libpython2.7.so.1.0
14 5 [ip6_tables] /lib/modules/3.10.0c2c_all+/kernel/net/ipv6/netfilter/ip6_tables.ko
15 4 [nf_conntrack] /lib/modules/3.10.0c2c_all+/kernel/net/netfilter/nf_conntrack.ko
16 2 [dm_mod] /lib/modules/3.10.0c2c_all+/kernel/drivers/md/dm-mod.ko
17 2 sshd /usr/sbin/sshd
18 1 [iptable_raw] /lib/modules/3.10.0c2c_all+/kernel/net/ipv4/netfilter/iptable_raw.ko
19 1 [sb_edac] /lib/modules/3.10.0c2c_all+/kernel/drivers/edac/sb_edac.ko
20 1 [edac_core] /lib/modules/3.10.0c2c_all+/kernel/drivers/edac/edac_core.ko
21 1 libdbus-1.so.3.7.4 /usr/lib64/libdbus-1.so.3.7.4
On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
>> The data output is verbose and there are lots of data tables that interprit the latencies
>> and data addresses in different ways to help see where bottlenecks might be lying.
>
> Would be good to see what the output looks like.
>
> What I haven't seen; and what I would find most useful; is using the IP
> + dwarf info to map it back to a data structure member.
>
> Since you're already using the PEBS data-source fields, you can also
> have a precise IP. For many cases its possible to reconstruct the exact
> data member the instruction is modifying.
>
The tool already uses precise=2 to get the precise IP.
To get from IP to data member, you'd need some debug info which is not
yet emitted
by the compiler.
> At that point you can do pahole like output of data structures, showing
> which members are 'hot' on misses etc.
On Mon, Feb 10, 2014 at 11:21:53PM +0100, Stephane Eranian wrote:
> On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> >> The data output is verbose and there are lots of data tables that interprit the latencies
> >> and data addresses in different ways to help see where bottlenecks might be lying.
> >
> > Would be good to see what the output looks like.
> >
> > What I haven't seen; and what I would find most useful; is using the IP
> > + dwarf info to map it back to a data structure member.
> >
> > Since you're already using the PEBS data-source fields, you can also
> > have a precise IP. For many cases its possible to reconstruct the exact
> > data member the instruction is modifying.
> >
> The tool already uses precise=2 to get the precise IP.
>
> To get from IP to data member, you'd need some debug info which is not
> yet emitted
> by the compiler.
That blows; how much is missing?
On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Feb 10, 2014 at 11:21:53PM +0100, Stephane Eranian wrote:
>> On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
>> >> The data output is verbose and there are lots of data tables that interprit the latencies
>> >> and data addresses in different ways to help see where bottlenecks might be lying.
>> >
>> > Would be good to see what the output looks like.
>> >
>> > What I haven't seen; and what I would find most useful; is using the IP
>> > + dwarf info to map it back to a data structure member.
>> >
>> > Since you're already using the PEBS data-source fields, you can also
>> > have a precise IP. For many cases its possible to reconstruct the exact
>> > data member the instruction is modifying.
>> >
>> The tool already uses precise=2 to get the precise IP.
>>
>> To get from IP to data member, you'd need some debug info which is not
>> yet emitted
>> by the compiler.
>
> That blows; how much is missing?
They need to annotate loads and stores. I asked for that feature a while ago.
It will come.
On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > That blows; how much is missing?
>
> They need to annotate load and stores. I asked for that feature a while ago.
> It will come.
And there is no way to deduce the information? We have type information
for all arguments and local variables, right? So we can follow that.
struct foo {
int ponies;
int moar_ponies;
};
struct bar {
int my_ponies;
struct foo *foo;
};
int moo(struct bar *bar)
{
return bar->foo->moar_ponies;
}
Since we have the argument type, we can find the type for both loads,
the first load:
*bar+8, we know is: struct foo * bar::foo
*foo+4, we know is: int foo::moar_ponies
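For reference, the x86-64 code a compiler would typically emit for moo() is
something like (illustrative):

	moo:
		mov    0x8(%rdi),%rax	# load bar->foo (bar arrives in %rdi)
		mov    0x4(%rax),%eax	# load foo->moar_ponies
		ret

so the sampled instruction gives you a base register plus offset, and dwarf
can type the register.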
Or am I missing something?
On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > That blows; how much is missing?
>>
>> They need to annotate load and stores. I asked for that feature a while ago.
>> It will come.
>
> And there is no way to deduce the information? We have type information
> for all arguments and local variables, right? So we can follow that.
>
> struct foo {
> int ponies;
> int moar_ponies;
> };
>
> struct bar {
> int my_ponies;
> struct foo *foo;
> };
>
> int moo(struct bar *bar)
> {
> return bar->foo->moar_ponies;
> }
>
> Since we have the argument type, we can find the type for both loads,
> the first load:
>
> *bar+8, we know is: struct foo * bar::foo
> *foo+4, we know is: int foo::moar_ponies
>
> Or am I missing something?
How do you know that load at addr 0x1000 is accessing variable bar?
The IP gives you line number, and then what?
I think dwarf has the mapping regs -> variable and yes, the type info.
But I am not sure that's enough.
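For reference, DWARF does carry the regs -> variable mapping through
DW_AT_location. A minimal libdw sketch of "which variable lives in register
'reg' at this PC"; var_in_reg is illustrative only, handles just the simple
DW_OP_reg<n> locations, and elides location lists (dwarf_getlocation_addr)
and CFI-based locations:

#include <dwarf.h>
#include <elfutils/libdw.h>
#include <stdlib.h>

static const char *var_in_reg(Dwarf_Die *cudie, Dwarf_Addr pc, unsigned int reg)
{
        const char *name = NULL;
        Dwarf_Die *scopes = NULL;
        int i, n = dwarf_getscopes(cudie, pc, &scopes);

        for (i = 0; i < n && name == NULL; i++) {
                Dwarf_Die die;

                if (dwarf_child(&scopes[i], &die) != 0)
                        continue;
                do {
                        Dwarf_Attribute attr;
                        Dwarf_Op *expr;
                        size_t len;
                        int tag = dwarf_tag(&die);

                        if (tag != DW_TAG_variable &&
                            tag != DW_TAG_formal_parameter)
                                continue;
                        if (dwarf_attr(&die, DW_AT_location, &attr) == NULL)
                                continue;
                        if (dwarf_getlocation(&attr, &expr, &len) != 0 || len == 0)
                                continue;
                        /* variable currently lives in DWARF register 'reg' */
                        if (expr[0].atom == DW_OP_reg0 + reg) {
                                name = dwarf_diename(&die);
                                break;
                        }
                } while (dwarf_siblingof(&die, &die) == 0);
        }
        free(scopes);
        return name;
}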
On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
> >> >
> >> > That blows; how much is missing?
> >>
> >> They need to annotate load and stores. I asked for that feature a while ago.
> >> It will come.
> >
> > And there is no way to deduce the information? We have type information
> > for all arguments and local variables, right? So we can follow that.
> >
> > struct foo {
> > int ponies;
> > int moar_ponies;
> > };
> >
> > struct bar {
> > int my_ponies;
> > struct foo *foo;
> > };
> >
> > int moo(struct bar *bar)
> > {
> > return bar->foo->moar_ponies;
> > }
> >
> > Since we have the argument type, we can find the type for both loads,
> > the first load:
> >
> > *bar+8, we know is: struct foo * bar::foo
> > *foo+4, we know is: int foo::moar_ponies
> >
> > Or am I missing something?
>
> How do you know that load at addr 0x1000 is accessing variable bar?
> The IP gives you line number, and then what?
> I think dwarf has the mapping regs -> variable and yes, the type info.
> But I am not sure that's enough.
Ah, but if you have the instruction, you can decode it and obtain the
reg and thus type-info, no?
On Tue, Feb 11, 2014 at 12:02 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
>> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
>> >> >
>> >> > That blows; how much is missing?
>> >>
>> >> They need to annotate load and stores. I asked for that feature a while ago.
>> >> It will come.
>> >
>> > And there is no way to deduce the information? We have type information
>> > for all arguments and local variables, right? So we can follow that.
>> >
>> > struct foo {
>> > int ponies;
>> > int moar_ponies;
>> > };
>> >
>> > struct bar {
>> > int my_ponies;
>> > struct foo *foo;
>> > };
>> >
>> > int moo(struct bar *bar)
>> > {
>> > return bar->foo->moar_ponies;
>> > }
>> >
>> > Since we have the argument type, we can find the type for both loads,
>> > the first load:
>> >
>> > *bar+8, we know is: struct foo * bar::foo
>> > *foo+4, we know is: int foo::moar_ponies
>> >
>> > Or am I missing something?
>>
>> How do you know that load at addr 0x1000 is accessing variable bar?
>> The IP gives you line number, and then what?
>> I think dwarf has the mapping regs -> variable and yes, the type info.
>> But I am not sure that's enough.
>
> Ah, but if you have the instruction, you can decode it and obtain the
> reg and thus type-info, no?
>
But on x86, you can load directly from memory; you'd only have the
target reg for the load. Not enough.
On Tue, Feb 11, 2014 at 12:04:23PM +0100, Stephane Eranian wrote:
> >> How do you know that load at addr 0x1000 is accessing variable bar?
> >> The IP gives you line number, and then what?
> >> I think dwarf has the mapping regs -> variable and yes, the type info.
> >> But I am not sure that's enough.
> >
> > Ah, but if you have the instruction, you can decode it and obtain the
> > reg and thus type-info, no?
> >
> But on x86, you can load directly from memory, you'd only have the
> target reg for the load. Not enough.
But if you load an immediate, you should be able to find it in the
symbol table.
Any other load will have a register base and will thus have type-info
therefrom.
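A sketch of that two-way dispatch; struct insn_mem_op and symtab_lookup are
hypothetical placeholders for a disassembler's decoded operand and a
symbol-table lookup, not perf APIs, and var_in_reg is the libdw sketch above:

#include <stdbool.h>
#include <elfutils/libdw.h>

/* hypothetical: resolve an absolute address via the symbol table */
extern const char *symtab_lookup(unsigned long addr);
/* the libdw sketch above */
extern const char *var_in_reg(Dwarf_Die *cudie, Dwarf_Addr pc, unsigned int reg);

/* Hypothetical decoded memory operand of the sampled instruction. */
struct insn_mem_op {
        bool has_base_reg;      /* true for e.g. 0x8(%rdi) */
        unsigned int base_reg;  /* DWARF register number of the base */
        unsigned long disp;     /* displacement, or absolute address */
};

static const char *resolve_load(struct insn_mem_op *op, Dwarf_Die *cudie,
                                Dwarf_Addr pc)
{
        if (!op->has_base_reg)
                /* immediate/absolute address: the symbol table should have it */
                return symtab_lookup(op->disp);
        /* register base: the base reg's DWARF location names the variable */
        return var_in_reg(cudie, pc, op->base_reg);
}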
On Tue, Feb 11, 2014 at 12:04 PM, Stephane Eranian <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 12:02 PM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
>>> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <[email protected]> wrote:
>>> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>>> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <[email protected]> wrote:
>>> >> >
>>> >> > That blows; how much is missing?
>>> >>
>>> >> They need to annotate load and stores. I asked for that feature a while ago.
>>> >> It will come.
>>> >
>>> > And there is no way to deduce the information? We have type information
>>> > for all arguments and local variables, right? So we can follow that.
>>> >
>>> > struct foo {
>>> > int ponies;
>>> > int moar_ponies;
>>> > };
>>> >
>>> > struct bar {
>>> > int my_ponies;
>>> > struct foo *foo;
>>> > };
>>> >
>>> > int moo(struct bar *bar)
>>> > {
>>> > return bar->foo->moar_ponies;
>>> > }
>>> >
>>> > Since we have the argument type, we can find the type for both loads,
>>> > the first load:
>>> >
>>> > *bar+8, we know is: struct foo * bar::foo
>>> > *foo+4, we know is: int foo::moar_ponies
>>> >
>>> > Or am I missing something?
>>>
>>> How do you know that load at addr 0x1000 is accessing variable bar?
>>> The IP gives you line number, and then what?
>>> I think dwarf has the mapping regs -> variable and yes, the type info.
>>> But I am not sure that's enough.
>>
>> Ah, but if you have the instruction, you can decode it and obtain the
>> reg and thus type-info, no?
>>
> But on x86, you can load directly from memory, you'd only have the
> target reg for the load. Not enough.
Assuming you can decode and get the info about the base registers used,
you'd have to do this for each arch with load/store sampling capabilities.
This is painful compared to getting the portable info from dwarf directly.
On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> Assuming you can decode and get the info about the base registers used,
> you'd have to do this for each arch with load/store sampling capabilities.
> this is painful compared to getting the portable info from dwarf directly.
But it's useful now, as compared to whenever GCC gets around to
implementing more dwarves and that GCC gets used widely enough to
actually rely on it.
All you need for the decode is a disassembler, and every arch should
already have multiple of those. Should be easy to reuse one, right?
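With a disassembler library the decode itself is indeed small. An untested
capstone sketch for x86-64 that pulls out the base register of a memory
operand; mapping capstone's register enum to DWARF register numbers would
still need a small per-arch table:

#include <capstone/capstone.h>

/* Return the capstone base register of the first memory operand of the
 * instruction at 'code', or X86_REG_INVALID if there is none. */
static unsigned int mem_base_reg(const uint8_t *code, size_t size, uint64_t ip)
{
        csh handle;
        cs_insn *insn;
        unsigned int reg = X86_REG_INVALID;

        if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
                return reg;
        cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);

        if (cs_disasm(handle, code, size, ip, 1, &insn) == 1) {
                cs_x86 *x86 = &insn->detail->x86;
                int i;

                for (i = 0; i < x86->op_count; i++) {
                        if (x86->operands[i].type == X86_OP_MEM) {
                                reg = x86->operands[i].mem.base;
                                break;
                        }
                }
                cs_free(insn, 1);
        }
        cs_close(&handle);
        return reg;
}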
On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <[email protected]>
> >
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> >
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> >
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> >
> > It also has the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
>
> What tree/branch is this developed against? I'm getting the following
> with Linus' latest and tip tree:
>
> builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c: In function ‘perf_c2c__read_events’:
> builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: declared here
got the same one.. Don, what did you base it on?
please, rebase to latest acme's perf/core or send your HEAD ;-)
thanks,
jirka
On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
>> Assuming you can decode and get the info about the base registers used,
>> you'd have to do this for each arch with load/store sampling capabilities.
>> this is painful compared to getting the portable info from dwarf directly.
>
> But its useful now, as compared to whenever GCC gets around to
> implementing more dwarves and that GCC getting used widely enough to
> actually rely on it.
>
> All you need for the decode is a disassembler, and every arch should
> already have multiple of those. Should be easy to reuse one, right?
I know, and you want to pull this into the perf tool?
On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <[email protected]>
> >
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> >
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> >
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> >
> > It also has the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
>
> What tree/branch is this developed against? I'm getting the following
> with Linus' latest and tip tree:
I'll try refreshing it on top of my perf/core branch today
> builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c: In function ‘perf_c2c__read_events’:
> builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: declared here
>
On Tue, Feb 11, 2014 at 12:28:47PM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> >> Assuming you can decode and get the info about the base registers used,
> >> you'd have to do this for each arch with load/store sampling capabilities.
> >> this is painful compared to getting the portable info from dwarf directly.
> >
> > But its useful now, as compared to whenever GCC gets around to
> > implementing more dwarves and that GCC getting used widely enough to
> > actually rely on it.
> >
> > All you need for the decode is a disassembler, and every arch should
> > already have multiple of those. Should be easy to reuse one, right?
>
> I know, and you want to pull this into the perf tool?
Sure, why not; it's already got the world and then some :/
It would be just another dynamic lib.
On Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > Assuming you can decode and get the info about the base registers used,
> > you'd have to do this for each arch with load/store sampling capabilities.
> > this is painful compared to getting the portable info from dwarf directly.
> But its useful now, as compared to whenever GCC gets around to
> implementing more dwarves and that GCC getting used widely enough to
> actually rely on it.
> All you need for the decode is a disassembler, and every arch should
> already have multiple of those. Should be easy to reuse one, right?
Yeah, I never got around to actually trying to implement this, but my
feeling was that all the bits and pieces were there already:
1) the precise IP for the instruction, which, when disassembled, would tell
which registers were being operated on, or memory that we would "reverse
map" to a register
2) DWARF expression locations that allows us to go from registers to a
variable/parameter and thus to a type
3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
it? Jiri?)
4) libunwind has register maps for various arches, so probably
something there could be reused here as well (Jiri?)
Get that and generate a series of (type,offset) tuples for the samples
and get pahole to highlight the members with different colours, just
like 'annotate' does with source code/asm.
That way we would reuse 'pahole' in much the same way as we reuse
'objdump'. Given some more time to revisit the libdwarves APIs, we could
then use it directly in perf, or perhaps extract just what is needed
and merge it into the kernel sources.
- Arnaldo
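Roughly, steps 1-4 might chain together like this; every helper below is a
hypothetical stand-in for the component named in the comment, not an
existing perf, libdw or libunwind call:

#include <stdint.h>

/* hypothetical stand-ins for the numbered components */
extern int disasm_base_and_disp(uint64_t ip, unsigned int *reg, uint64_t *disp);
extern int dwarf_type_for_reg(uint64_t ip, unsigned int reg, const char **name);
extern int sampled_reg_valid(unsigned int reg);

static int sample_to_member(uint64_t ip, const char **type_name,
                            uint64_t *member_off)
{
        unsigned int base_reg;
        uint64_t disp;

        if (disasm_base_and_disp(ip, &base_reg, &disp))
                return -1;                      /* 1) disassemble the precise IP */
        if (dwarf_type_for_reg(ip, base_reg, type_name))
                return -1;                      /* 2) DWARF: reg -> variable -> type */
        if (!sampled_reg_valid(base_reg))
                return -1;                      /* 3) PERF_SAMPLE_REGS_* state */
        *member_off = disp;                     /* 4) (type, offset) tuple for pahole */
        return 0;
}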
On Tue, Feb 11, 2014 at 12:31:49PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 12:28:47PM +0100, Stephane Eranian wrote:
> > On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > >> Assuming you can decode and get the info about the base registers used,
> > >> you'd have to do this for each arch with load/store sampling capabilities.
> > >> this is painful compared to getting the portable info from dwarf directly.
> > >
> > > But its useful now, as compared to whenever GCC gets around to
> > > implementing more dwarves and that GCC getting used widely enough to
> > > actually rely on it.
> > >
> > > All you need for the decode is a disassembler, and every arch should
> > > already have multiple of those. Should be easy to reuse one, right?
> >
> > I know, and you want to pull this into the perf tool?
>
> Sure why not, its already got the world and then some :/
>
> It would be just another dynamic lib.
The added benefit is that we could get rid of the objdump usage for
annotate if we find a usable disasm lib. At which point we can start
improving the annotations with these variable/type information as well.
On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
> it? Jiri?)
Note that the regs are in the POST-instruction state, so any op that
does something like:
MOV %edx, $(eax+edx*8)
will have lost the original index value.
On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <[email protected]>
> > >
> > > This is the start of a new perf tool that will collect information about
> > > memory accesses and analyse it to find things like hot cachelines, etc.
> > >
> > > This is basically trying to get a prototype written by Richard Fowles
> > > written using the tools/perf coding style and libraries.
> > >
> > > Start it from 'perf sched', this patch starts the process by adding the
> > > 'record' subcommand to collect the needed mem loads and stores samples.
> > >
> > > It also has the basic 'report' skeleton, resolving the sample address
> > > and hooking the events found in a perf.data file with methods to handle
> > > them, right now just printing the resolved perf_sample data structure
> > > after each event name.
> >
> > What tree/branch is this developed against? I'm getting the following
> > with Linus' latest and tip tree:
>
> I'll try refreshing it on top of my perf/core branch today
Sorry everyone, I managed to rebase it on top of Ingo's master branch
f58a0b1790e3973b23548e297db60c18b29b0818
Let me find your perf/core branch.
Cheers,
Don
>
> > builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> > builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> > builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> > builtin-c2c.c: In function ‘perf_c2c__read_events’:
> > builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> > In file included from builtin-c2c.c:6:0:
> > util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> > builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> > In file included from builtin-c2c.c:6:0:
> > util/session.h:52:22: note: declared here
> >
On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <[email protected]>
> > >
> > > This is the start of a new perf tool that will collect information about
> > > memory accesses and analyse it to find things like hot cachelines, etc.
> > >
> > > This is basically trying to get a prototype written by Richard Fowles
> > > written using the tools/perf coding style and libraries.
> > >
> > > Start it from 'perf sched', this patch starts the process by adding the
> > > 'record' subcommand to collect the needed mem loads and stores samples.
> > >
> > > It also has the basic 'report' skeleton, resolving the sample address
> > > and hooking the events found in a perf.data file with methods to handle
> > > them, right now just printing the resolved perf_sample data structure
> > > after each event name.
> >
> > What tree/branch is this developed against? I'm getting the following
> > with Linus' latest and tip tree:
>
> I'll try refreshing it on top of my perf/core branch today
Sorry for the trouble. I guess I missed all the function cleanups from
last month. Attached is a patch that gets things to compile.
I'll split this patch up to the right spots on my next refresh.
Cheers,
Don
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a73535a..b55f281 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1009,15 +1009,20 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
struct c2c_entry *entry;
sample_handler f;
int err = -1;
+ struct addr_location al = {
+ .machine = machine,
+ .cpumode = cpumode,
+ };
- if (evsel->handler.func == NULL)
+ if (evsel->handler == NULL)
return 0;
thread = machine__find_thread(machine, sample->pid);
if (thread == NULL)
goto err;
- mi = machine__resolve_mem(machine, thread, sample, cpumode);
+ al.thread = thread;
+ mi = sample__resolve_mem(sample, &al);
if (mi == NULL)
goto err;
@@ -1031,7 +1036,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
if (entry == NULL)
goto err_mem;
- f = evsel->handler.func;
+ f = evsel->handler;
err = f(c2c, sample, entry);
if (err)
goto err_entry;
@@ -1040,8 +1045,8 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
if (symbol_conf.use_callchain && sample->callchain) {
callchain_init(entry->callchain);
- err = machine__resolve_callchain(machine, evsel, thread,
- sample, &parent, NULL);
+ err = sample__resolve_callchain(sample, &parent, evsel, &al,
+ PERF_MAX_STACK_DEPTH);
if (!err)
err = callchain_append(entry->callchain,
&callchain_cursor,
@@ -1198,7 +1203,7 @@ struct refs {
struct list_head list;
int nr;
const char *name;
- char *long_name;
+ const char *long_name;
};
static int update_ref_tree(struct c2c_entry *entry)
@@ -2732,8 +2737,12 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
{
int err = -1;
struct perf_session *session;
+ struct perf_data_file file = {
+ .path = input_name,
+ .mode = PERF_DATA_MODE_READ,
+ };
- session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
+ session = perf_session__new(&file, 0, &c2c->tool);
if (session == NULL) {
pr_debug("No memory for session\n");
goto out;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index faf29b0..49e0328 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1270,10 +1270,10 @@ int __perf_evlist__set_handlers(struct perf_evlist *evlist,
if (evsel == NULL)
continue;
- if (evsel->handler.func != NULL)
+ if (evsel->handler != NULL)
goto out;
- evsel->handler.func = assocs[i].handler;
+ evsel->handler = assocs[i].handler;
}
err = 0;
On Tue, Feb 11, 2014 at 09:36:43AM -0500, Don Zickus wrote:
> On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> > > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > > From: Arnaldo Carvalho de Melo <[email protected]>
> > > >
> > > > This is the start of a new perf tool that will collect information about
> > > > memory accesses and analyse it to find things like hot cachelines, etc.
> > > >
> > > > This is basically trying to get a prototype written by Richard Fowles
> > > > written using the tools/perf coding style and libraries.
> > > >
> > > > Start it from 'perf sched', this patch starts the process by adding the
> > > > 'record' subcommand to collect the needed mem loads and stores samples.
> > > >
> > > > It also has the basic 'report' skeleton, resolving the sample address
> > > > and hooking the events found in a perf.data file with methods to handle
> > > > them, right now just printing the resolved perf_sample data structure
> > > > after each event name.
> > >
> > > What tree/branch is this developed against? I'm getting the following
> > > with Linus' latest and tip tree:
> >
> > I'll try refreshing it on top of my perf/core branch today
>
> Sorry for the trouble. I guess I missed all the function cleanups from
> last month. Attached is a patch that gets things to compile.
>
> I'll split this patch up to the right spots on my next refresh.
I keep cleaning up the libraries, trying to simplify their use as I see
how new tools use the core, hopefully making them easier/less
boilerplate'ish.
Will look at how you used it to see what can be folded into the
libraries, thanks!
- Arnaldo
> Cheers,
> Don
>
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index a73535a..b55f281 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -1009,15 +1009,20 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> struct c2c_entry *entry;
> sample_handler f;
> int err = -1;
> + struct addr_location al = {
> + .machine = machine,
> + .cpumode = cpumode,
> + };
>
> - if (evsel->handler.func == NULL)
> + if (evsel->handler == NULL)
> return 0;
>
> thread = machine__find_thread(machine, sample->pid);
> if (thread == NULL)
> goto err;
>
> - mi = machine__resolve_mem(machine, thread, sample, cpumode);
> + al.thread = thread;
> + mi = sample__resolve_mem(sample, &al);
> if (mi == NULL)
> goto err;
>
> @@ -1031,7 +1036,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> if (entry == NULL)
> goto err_mem;
>
> - f = evsel->handler.func;
> + f = evsel->handler;
> err = f(c2c, sample, entry);
> if (err)
> goto err_entry;
> @@ -1040,8 +1045,8 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> if (symbol_conf.use_callchain && sample->callchain) {
> callchain_init(entry->callchain);
>
> - err = machine__resolve_callchain(machine, evsel, thread,
> - sample, &parent, NULL);
> + err = sample__resolve_callchain(sample, &parent, evsel, &al,
> + PERF_MAX_STACK_DEPTH);
> if (!err)
> err = callchain_append(entry->callchain,
> &callchain_cursor,
> @@ -1198,7 +1203,7 @@ struct refs {
> struct list_head list;
> int nr;
> const char *name;
> - char *long_name;
> + const char *long_name;
> };
>
> static int update_ref_tree(struct c2c_entry *entry)
> @@ -2732,8 +2737,12 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
> {
> int err = -1;
> struct perf_session *session;
> + struct perf_data_file file = {
> + .path = input_name,
> + .mode = PERF_DATA_MODE_READ,
> + };
>
> - session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
> + session = perf_session__new(&file, 0, &c2c->tool);
> if (session == NULL) {
> pr_debug("No memory for session\n");
> goto out;
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index faf29b0..49e0328 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1270,10 +1270,10 @@ int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> if (evsel == NULL)
> continue;
>
> - if (evsel->handler.func != NULL)
> + if (evsel->handler != NULL)
> goto out;
>
> - evsel->handler.func = assocs[i].handler;
> + evsel->handler = assocs[i].handler;
> }
>
> err = 0;
On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > > Assuming you can decode and get the info about the base registers used,
> > > you'd have to do this for each arch with load/store sampling capabilities.
> > > this is painful compared to getting the portable info from dwarf directly.
>
> > But its useful now, as compared to whenever GCC gets around to
> > implementing more dwarves and that GCC getting used widely enough to
> > actually rely on it.
>
> > All you need for the decode is a disassembler, and every arch should
> > already have multiple of those. Should be easy to reuse one, right?
>
> Yeah, I never got around to actually try to implement this, but my
> feeling was that all the bits and pieces were there already:
>
> 1) the precise IP for the instruction, that disassembled would tell
> which registers were being operated on, or memory that we would "reverse
> map" to a register
>
> 2) DWARF expression locations that allows us to go from registers to a
> variable/parameter and thus to a type
>
> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
> it? Jiri?)
well, it was meant to store user registers only,
to assist the user DWARF unwind
we can add PERF_SAMPLE_REGS_KERNEL
>
> 4) libunwind have register maps for various arches, so probably
> something there could be reused here as well (Jiri?)
not sure what you mean by 'something' here.. but yep,
libunwind does have register maps for various arches
jirka
On Thu, Feb 13, 2014 at 2:02 PM, Jiri Olsa <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
>> On Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra wrote:
>> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
>> > > Assuming you can decode and get the info about the base registers used,
>> > > you'd have to do this for each arch with load/store sampling capabilities.
>> > > this is painful compared to getting the portable info from dwarf directly.
>>
>> > But its useful now, as compared to whenever GCC gets around to
>> > implementing more dwarves and that GCC getting used widely enough to
>> > actually rely on it.
>>
>> > All you need for the decode is a disassembler, and every arch should
>> > already have multiple of those. Should be easy to reuse one, right?
>>
>> Yeah, I never got around to actually try to implement this, but my
>> feeling was that all the bits and pieces were there already:
>>
>> 1) the precise IP for the instruction, that disassembled would tell
>> which registers were being operated on, or memory that we would "reverse
>> map" to a register
>>
>> 2) DWARF expression locations that allows us to go from registers to a
>> variable/parameter and thus to a type
>>
>> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
>> it? Jiri?)
>
> well, it was meant to store user registers only,
> to assist the user DWARF unwind
>
But it does capture the user state regardless of the stack snapshotting.
> we can add PERF_SAMPLE_REGS_KERNEL
>
I have a patch series to do this and more. It will be ready next month
hopefully.
>>
>> 4) libunwind have register maps for various arches, so probably
>> something there could be reused here as well (Jiri?)
>
> not sure what you mean by 'something' here.. but yep,
> libunwind does have register maps for various arches
>
> jirka
On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <[email protected]>
>
> This is the start of a new perf tool that will collect information about
> memory accesses and analyse it to find things like hot cachelines, etc.
>
> This is basically trying to get a prototype written by Richard Fowles
> written using the tools/perf coding style and libraries.
>
> Start it from 'perf sched', this patch starts the process by adding the
> 'record' subcommand to collect the needed mem loads and stores samples.
>
> It also has the basic 'report' skeleton, resolving the sample address
> and hooking the events found in a perf.data file with methods to handle
> them, right now just printing the resolved perf_sample data structure
> after each event name.
SNIP
> + evsel->handler.func = assocs[i].handler;
> + }
> +
> + err = 0;
> +out:
> + return err;
> +}
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index f5173cd..76f77c8 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
> void *handler;
> };
>
> +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> + const struct perf_evsel_str_handler *assocs,
> + size_t nr_assocs);
> +
> +#define perf_evlist__set_handlers(evlist, array) \
> + __perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> +
this is already implemented in the session object.. it just needs to be
changed to work over any event so it's globally usable
jirka
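For reference, usage as the quoted evlist.h hunk suggests, with the handler
names the later patches wire up in builtin-c2c.c (a sketch, not the final
form):

/* in builtin-c2c.c, given the __perf_evlist__set_handlers() helper above */
static const struct perf_evsel_str_handler handlers[] = {
        { "cpu/mem-loads,ldlat=30/pp",  perf_c2c__process_load, },
        { "cpu/mem-stores/pp",          perf_c2c__process_store, },
};

static int c2c_setup_handlers(struct perf_evlist *evlist)
{
        /* associate each event-name string with its sample handler */
        return perf_evlist__set_handlers(evlist, handlers);
}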
On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <[email protected]>
>
> From the c2c prototype:
>
> [root@sandy ~]# perf c2c -r report | head -7
> T Status Pid Tid CPU Inst Adrs Virt Data Adrs Phys Data Adrs Cycles Source Decoded Source ObJect:Symbol
> --------------------------------------------------------------------------------------------------------------------------------------------
> raw input 779 779 7 0xffffffff810865dd 0xffff8803f4d75ec8 0 370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA] [kernel.kallsyms]:try_to_wake_up
> raw input 779 779 7 0xffffffff8107acb3 0xffff8802a5b73158 0 297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
> raw input 779 779 7 0x3b7e009814 0x7fff87429ea0 0 925 0x68100142 [LOAD,L1,HIT,SNP NONE] ???:???
> raw input 0 0 1 0xffffffff8108bf81 0xffff8803eafebf50 0 172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM] [kernel.kallsyms]:update_stats_wait_end
> raw input 779 779 7 0x3b7e0097cc 0x7fac94b69068 0 228 0x68100242 [LOAD,LFB,HIT,SNP NONE] ???:???
> [root@sandy ~]#
>
> The "Phys Data Adrs" column is not available at this point.
SNIP
> + sample->data_src,
> + data_src,
> + al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
> + al->sym ? al->sym->name : "???");
> +}
> +
> +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> + struct perf_sample *sample,
> + struct addr_location *al)
> +{
> + if (c2c->raw_records)
> + perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> +
> return 0;
> }
>
> static const struct perf_evsel_str_handler handlers[] = {
> - { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> - { "cpu/mem-stores/pp", perf_c2c__process_store, },
> + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> + { "cpu/mem-stores/pp", perf_c2c__process_load_store, },
hm.. so it's only one function for both handlers.. no need
to use handlers at all then, right?
jirka
On Tue, Feb 18, 2014 at 01:52:06PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <[email protected]>
<SNIP>
> > +++ b/tools/perf/util/evlist.h
> > @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
> > void *handler;
> > };
> >
> > +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> > + const struct perf_evsel_str_handler *assocs,
> > + size_t nr_assocs);
> > +
> > +#define perf_evlist__set_handlers(evlist, array) \
> > + __perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> > +
>
> this is already implemented in the session object.. it just needs to be
> changed to work over any event so it's globally usable
This probably has some historical baggage, i.e. I introduced this in
this cset and perhaps later brought it to the trunk; I need to review
this to figure it out, but I'm busy right now.
- Arnaldo
On Mon, Feb 10, 2014 at 12:29:00PM -0500, Don Zickus wrote:
> When printing the raw dump of a data file, the header.misc is
> printed as a decimal. Unfortunately, that field is a bit mask, so
> it is hard to interpret as a decimal.
>
> Print in hex, so the user can easily see what bits are set and more
> importantly what type of info it is conveying.
>
> Signed-off-by: Don Zickus <[email protected]>
> ---
> tools/perf/util/session.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 0b39a48..d1ad10f 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
> if (!dump_trace)
> return;
>
> - printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> + printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> event->header.misc, sample->pid, sample->tid, sample->ip,
> sample->period, sample->addr);
nit, maybe use '0x%x' ? hum, but probably nobody is actually parsing this..
jirka
On Mon, Feb 10, 2014 at 12:29:03PM -0500, Don Zickus wrote:
> A basic patch that re-arranges some of the c2c code and adds a couple
> of small features to lay the ground work for the rest of the patch
> series.
>
> Changes include:
>
> o reworking the report path
> o creating an initial entry struct
> o replace preprocess_sample with simpler calls
> o rework raw output to handle separators
> o remove phys id gunk
> o add some generic options
>
> There isn't much meat in this patch, just a bunch of code movement and cleanups.
>
> Signed-off-by: Don Zickus <[email protected]>
> ---
SNIP
> static int perf_c2c__process_sample(struct perf_tool *tool,
> union perf_event *event,
> @@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> struct machine *machine)
> {
> struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
> - struct addr_location al;
> - int err = 0;
> + u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> + struct mem_info *mi;
> + struct thread *thread;
> + struct c2c_entry *entry;
> + sample_handler f;
> + int err = -1;
> +
> + if (evsel->handler.func == NULL)
> + return 0;
> +
> + thread = machine__find_thread(machine, sample->tid);
> + if (thread == NULL)
> + goto err;
> +
> + mi = machine__resolve_mem(machine, thread, sample, cpumode);
> + if (mi == NULL)
> + goto err;
>
> - if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
> - pr_err("problem processing %d event, skipping it.\n",
> - event->header.type);
> - return -1;
> + if (c2c->raw_records) {
> + perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
> + free(mi);
> + return 0;
> }
>
> - if (evsel->handler.func != NULL) {
> - sample_handler f = evsel->handler.func;
> - err = f(c2c, sample, &al);
> + entry = c2c_entry__new(sample, thread, mi, cpumode);
> + if (entry == NULL)
> + goto err_mem;
> +
> + f = evsel->handler.func;
> + err = f(c2c, sample, entry);
> + if (err)
> + goto err_entry;
> +
> + return 0;
this looks like a new mode for namhyung's iterator patchset
http://marc.info/?l=linux-kernel&m=138967747319160&w=2
git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
perf/cumulate-v8
jirka
On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> In order for the c2c tool to work correctly, it needs to properly
> sort all the records on uniquely identifiable data addresses. These
> unique addresses are converted from virtual addresses provided by the
> hardware into a kernel address using an mmap2 record as the decoder.
>
SNIP
> +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> +{
> + u64 l, r;
> + struct map *l_map = left->mi->daddr.map;
> + struct map *r_map = right->mi->daddr.map;
> +
> + /* group event types together */
> + if (left->cpumode > right->cpumode) return 1;
> + if (left->cpumode < right->cpumode) return -1;
> +
> + if (l_map->maj > r_map->maj) return 1;
> + if (l_map->maj < r_map->maj) return -1;
> +
> + if (l_map->min > r_map->min) return 1;
> + if (l_map->min < r_map->min) return -1;
> +
> + if (l_map->ino > r_map->ino) return 1;
> + if (l_map->ino < r_map->ino) return -1;
> +
> + if (l_map->ino_generation > r_map->ino_generation) return 1;
> + if (l_map->ino_generation < r_map->ino_generation) return -1;
> +
> + /*
> + * Addresses with no major/minor numbers are assumed to be
> + * anonymous in userspace. Sort those on pid then address.
> + *
> + * The kernel and non-zero major/minor mapped areas are
> + * assumed to be unity mapped. Sort those on address then pid.
> + */
> +
> + /* al_addr does all the right addr - start + offset calculations */
> + l = left->mi->daddr.al_addr;
> + r = right->mi->daddr.al_addr;
> +
> + if (l_map->maj || l_map->min) {
> + /* mmapped areas */
> +
> + /* hack to mark similar regions, 'right' is new entry */
> + /* entries with same maj/min/ino/inogen are in same address space */
> + right->color = REGION_SAME;
> +
> + if (l > r) return 1;
> + if (l < r) return -1;
> +
> + /* sorting by iaddr makes calculations easier later */
> + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> +
> + if (left->thread->pid_ > right->thread->pid_) return 1;
> + if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> + if (left->thread->tid > right->thread->tid) return 1;
> + if (left->thread->tid < right->thread->tid) return -1;
> + } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> + /* kernel mapped areas where 'start' doesn't matter */
> +
> + /* hack to mark similar regions, 'right' is new entry */
> + /* whole kernel region is in the same address space */
> + right->color = REGION_SAME;
> +
> + if (l > r) return 1;
> + if (l < r) return -1;
> +
> + /* sorting by iaddr makes calculations easier later */
> + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> +
> + if (left->thread->pid_ > right->thread->pid_) return 1;
> + if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> + if (left->thread->tid > right->thread->tid) return 1;
> + if (left->thread->tid < right->thread->tid) return -1;
> + } else {
> + /* userspace anonymous */
> + if (left->thread->pid_ > right->thread->pid_) return 1;
> + if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> + if (left->thread->tid > right->thread->tid) return 1;
> + if (left->thread->tid < right->thread->tid) return -1;
> +
> + /* hack to mark similar regions, 'right' is new entry */
> + /* userspace anonymous address space is contained within pid */
> + right->color = REGION_SAME;
> +
> + if (l > r) return 1;
> + if (l < r) return -1;
> +
> + /* sorting by iaddr makes calculations easier later */
> + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> + }
> +
> + return 0;
> +}
there's a sort object doing exactly this over hist_entry's
Is there any reason not to use hist_entries?
jirka
On Mon, Feb 10, 2014 at 12:29:05PM -0500, Don Zickus wrote:
> This patch adds a bunch of stats that will be used later in post-processing
> to determine where and with what frequency the HITMs are coming from.
>
> Most of the stats are decoded from the data source response. Another
> piece of the stats is tracking which cpu the record came in on.
>
> In order to properly build a cpu map to map where interesting events are coming
> from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.
>
> As HITMs are most expensive when going across NUMA nodes, it only made sense
> to create a quick cpu->NUMA lookup for when processing the records.
>
> Credit to Dick Fowles for determining which bits are important and how to
> properly track them. Ported to perf by me.
>
> Original-by: Dick Fowles <[email protected]>
> Signed-off-by: Don Zickus <[email protected]>
> ---
SNIP
> +
> +static int setup_cpunode_map(void)
> +{
> + struct dirent *dent1, *dent2;
> + DIR *dir1, *dir2;
> + unsigned int cpu, mem;
> + char buf[PATH_MAX];
> +
> + /* initialize globals */
> + if (init_cpunode_map())
> + return -1;
> +
> + dir1 = opendir(PATH_SYS_NODE);
> + if (!dir1)
> + return 0;
> +
> + /* walk tree and setup map */
> + while ((dent1 = readdir(dir1)) != NULL) {
> + if (dent1->d_type != DT_DIR ||
> + sscanf(dent1->d_name, "node%u", &mem) < 1)
> + continue;
> +
> + snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> + dir2 = opendir(buf);
> + if (!dir2)
> + continue;
> + while ((dent2 = readdir(dir2)) != NULL) {
> + if (dent2->d_type != DT_LNK ||
> + sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> + continue;
> + cpunode_map[cpu] = mem;
> + }
> + closedir(dir2);
> + }
> + closedir(dir1);
> + return 0;
> +}
There's already a setup_cpunode_map interface in builtin-kmem.c.
Please make it global (maybe place it in a separate object?)
and use that one.
jirka
On Mon, Feb 10, 2014 at 12:29:08PM -0500, Don Zickus wrote:
> Seeing cacheline statistics is useful by itself. Seeing the callchain
> for these cache contentions saves time tracking things down.
>
> This patch tries to add callchain support. I had to use the generic
> interface from a previous patch to output things to stdout easily.
>
> Other than the displaying the results, collecting the callchain and
> merging it was fairly straightforward.
>
> I used a lot of copying-n-pasting from other builtin tools to get
> the initial parameter setup correctly and the automatic reading of
> 'symbol_conf.use_callchain' from the data file.
>
> Hopefully this is all correct. The amount of memory corruption (from the
> callchain dynamic array) seems to have dwindled down to nothing. :-)
hum.. the report command already has all this.. if we could go the
hist_entry way, there'd be no need to reimplement this
jirka
On Mon, Feb 10, 2014 at 12:29:12PM -0500, Don Zickus wrote:
> Just another table that displays the referenced symbols in the analysis
> report. The table lists the most frequently used symbols first.
>
> It is just another way to look at similar data to figure out who
> is causing the most contention (based on the workload used).
>
> Originally done by Dick Fowles and ported by me.
>
> Suggested-by: Joe Mario <[email protected]>
> Original-by: Dick Fowles <[email protected]>
> Signed-off-by: Don Zickus <[email protected]>
> ---
> tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index 32c2319..979187f 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
> new->total_period += old->total_period;
> }
>
> +LIST_HEAD(ref_tree);
> +LIST_HEAD(ref_tree_sorted);
> +struct refs {
> + struct list_head list;
> + int nr;
> + const char *name;
> + char *long_name;
> +};
> +
> +static int update_ref_tree(struct c2c_entry *entry)
> +{
> + struct refs *p;
> + struct dso *dso = entry->mi->iaddr.map->dso;
> + const char *name = dso->short_name;
> +
> + list_for_each_entry(p, &ref_tree, list) {
> + if (!strcmp(p->name, name))
> + goto found;
> + }
> +
> + p = zalloc(sizeof(struct refs));
> + if (!p)
> + return -1;
> + p->name = name;
> + p->long_name = dso->long_name;
> + list_add_tail(&p->list, &ref_tree);
so this is a tree, which is actually a list ;-)
jirka
On Tue, Feb 18, 2014 at 01:53:35PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <[email protected]>
<SNIP>
> > +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > + struct perf_sample *sample,
> > + struct addr_location *al)
> > +{
> > + if (c2c->raw_records)
> > + perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > +
> > return 0;
> > }
> > static const struct perf_evsel_str_handler handlers[] = {
> > - { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > - { "cpu/mem-stores/pp", perf_c2c__process_store, },
> > + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> > + { "cpu/mem-stores/pp", perf_c2c__process_load_store, },
> hm.. so it's only one function for both handlers.. no need
> to use handlers at all then, right?
This was just a skeleton from which to continue, so no, the idea isn't
to have just one function for both.
- Arnaldo
On Tue, Feb 18, 2014 at 01:56:44PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:00PM -0500, Don Zickus wrote:
> > When printing the raw dump of a data file, the header.misc is
> > printed as a decimal. Unfortunately, that field is a bit mask, so
> > it is hard to interpret as a decimal.
> >
> > Print in hex, so the user can easily see what bits are set and more
> > importantly what type of info it is conveying.
> >
> > Signed-off-by: Don Zickus <[email protected]>
> > ---
> > tools/perf/util/session.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> > index 0b39a48..d1ad10f 100644
> > --- a/tools/perf/util/session.c
> > +++ b/tools/perf/util/session.c
> > @@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
> > if (!dump_trace)
> > return;
> >
> > - printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> > + printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> > event->header.misc, sample->pid, sample->tid, sample->ip,
> > sample->period, sample->addr);
>
> nit, maybe use '0x%x' ? hum, but probably nobody is actually parsing this..
Fair enough. :-)
Cheers,
Don
On Tue, Feb 18, 2014 at 09:56:47AM -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Feb 18, 2014 at 01:52:06PM +0100, Jiri Olsa wrote:
> > On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <[email protected]>
>
> <SNIP>
>
> > > +++ b/tools/perf/util/evlist.h
> > > @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
> > > void *handler;
> > > };
> > >
> > > +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> > > + const struct perf_evsel_str_handler *assocs,
> > > + size_t nr_assocs);
> > > +
> > > +#define perf_evlist__set_handlers(evlist, array) \
> > > + __perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> > > +
> >
> > this is already implemented in the session object.. it just needs to be
> > changed to work over any event so it's globally usable
>
> This probably has some historical baggage, i.e. I introduced this in
> this cset and perhaps later brought it to the trunk; I need to review
> this to figure it out, but I'm busy right now.
No worries, I will dig into it and try to use the right function calls.
Cheers,
Don
On Tue, Feb 18, 2014 at 02:02:23PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:03PM -0500, Don Zickus wrote:
> > A basic patch that re-arranges some of the c2c code and adds a couple
> > of small features to lay the ground work for the rest of the patch
> > series.
> >
> > Changes include:
> >
> > o reworking the report path
> > o creating an initial entry struct
> > o replace preprocess_sample with simpler calls
> > o rework raw output to handle separators
> > o remove phys id gunk
> > o add some generic options
> >
> > There isn't much meat in this patch just a bunch of code movement and cleanups.
> >
> > Signed-off-by: Don Zickus <[email protected]>
> > ---
>
> SNIP
>
> > static int perf_c2c__process_sample(struct perf_tool *tool,
> > union perf_event *event,
> > @@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> > struct machine *machine)
> > {
> > struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
> > - struct addr_location al;
> > - int err = 0;
> > + u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> > + struct mem_info *mi;
> > + struct thread *thread;
> > + struct c2c_entry *entry;
> > + sample_handler f;
> > + int err = -1;
> > +
> > + if (evsel->handler.func == NULL)
> > + return 0;
> > +
> > + thread = machine__find_thread(machine, sample->tid);
> > + if (thread == NULL)
> > + goto err;
> > +
> > + mi = machine__resolve_mem(machine, thread, sample, cpumode);
> > + if (mi == NULL)
> > + goto err;
> >
> > - if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
> > - pr_err("problem processing %d event, skipping it.\n",
> > - event->header.type);
> > - return -1;
> > + if (c2c->raw_records) {
> > + perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
> > + free(mi);
> > + return 0;
> > }
> >
> > - if (evsel->handler.func != NULL) {
> > - sample_handler f = evsel->handler.func;
> > - err = f(c2c, sample, &al);
> > + entry = c2c_entry__new(sample, thread, mi, cpumode);
> > + if (entry == NULL)
> > + goto err_mem;
> > +
> > + f = evsel->handler.func;
> > + err = f(c2c, sample, entry);
> > + if (err)
> > + goto err_entry;
> > +
> > + return 0;
>
> this looks like new mode for namhyung's iterator patchset
> http://marc.info/?l=linux-kernel&m=138967747319160&w=2
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> perf/cumulate-v8
I'll take a look at it and seem what is useful for me. Thanks!
Cheers,
Don
On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> > In order for the c2c tool to work correctly, it needs to properly
> > sort all the records on uniquely identifiable data addresses. These
> > unique addresses are converted from virtual addresses provided by the
> > hardware into a kernel address using an mmap2 record as the decoder.
> >
>
> SNIP
>
> > +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> > +{
> > + u64 l, r;
> > + struct map *l_map = left->mi->daddr.map;
> > + struct map *r_map = right->mi->daddr.map;
> > +
> > + /* group event types together */
> > + if (left->cpumode > right->cpumode) return 1;
> > + if (left->cpumode < right->cpumode) return -1;
> > +
> > + if (l_map->maj > r_map->maj) return 1;
> > + if (l_map->maj < r_map->maj) return -1;
> > +
> > + if (l_map->min > r_map->min) return 1;
> > + if (l_map->min < r_map->min) return -1;
> > +
> > + if (l_map->ino > r_map->ino) return 1;
> > + if (l_map->ino < r_map->ino) return -1;
> > +
> > + if (l_map->ino_generation > r_map->ino_generation) return 1;
> > + if (l_map->ino_generation < r_map->ino_generation) return -1;
> > +
> > + /*
> > + * Addresses with no major/minor numbers are assumed to be
> > + * anonymous in userspace. Sort those on pid then address.
> > + *
> > + * The kernel and non-zero major/minor mapped areas are
> > + * assumed to be unity mapped. Sort those on address then pid.
> > + */
> > +
> > + /* al_addr does all the right addr - start + offset calculations */
> > + l = left->mi->daddr.al_addr;
> > + r = right->mi->daddr.al_addr;
> > +
> > + if (l_map->maj || l_map->min) {
> > + /* mmapped areas */
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* entries with same maj/min/ino/inogen are in same address space */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > + } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> > + /* kernel mapped areas where 'start' doesn't matter */
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* whole kernel region is in the same address space */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > + } else {
> > + /* userspace anonymous */
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* userspace anonymous address space is contained within pid */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > + }
> > +
> > + return 0;
> > +}
>
> there's a sort object doing exactly this over hist_entry's
>
> Is there any reason not to use hist_entries?
I started there but had trouble wrapping my head around how I wanted the
above implemented (it took several iterations to sort correctly), so I
took the standalone approach first.
I need to double check how easy it is to manipulate the hist_entry tree
once sorted. I have to resort the objects into another rbtree based on
cacheline hitms.
Cheers,
Don
On Tue, Feb 18, 2014 at 02:05:31PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:05PM -0500, Don Zickus wrote:
> > This patch adds a bunch of stats that will be used later in post-processing
> > to determine where and with what frequency the HITMs are coming from.
> >
> > Most of the stats are decoded from the data source response. Another
> > piece of the stats is tracking which cpu the record came in on.
> >
> > In order to properly build a cpu map to map where interesting events are coming
> > from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.
> >
> > As HITMs are most expensive when going across NUMA nodes, it only made sense
> > to create a quick cpu->NUMA lookup for when processing the records.
> >
> > Credit to Dick Fowles for determining which bits are important and how to
> > properly track them. Ported to perf by me.
> >
> > Original-by: Dick Fowles <[email protected]>
> > Signed-off-by: Don Zickus <[email protected]>
> > ---
>
> SNIP
>
> > +
> > +static int setup_cpunode_map(void)
> > +{
> > + struct dirent *dent1, *dent2;
> > + DIR *dir1, *dir2;
> > + unsigned int cpu, mem;
> > + char buf[PATH_MAX];
> > +
> > + /* initialize globals */
> > + if (init_cpunode_map())
> > + return -1;
> > +
> > + dir1 = opendir(PATH_SYS_NODE);
> > + if (!dir1)
> > + return 0;
> > +
> > + /* walk tree and setup map */
> > + while ((dent1 = readdir(dir1)) != NULL) {
> > + if (dent1->d_type != DT_DIR ||
> > + sscanf(dent1->d_name, "node%u", &mem) < 1)
> > + continue;
> > +
> > + snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> > + dir2 = opendir(buf);
> > + if (!dir2)
> > + continue;
> > + while ((dent2 = readdir(dir2)) != NULL) {
> > + if (dent2->d_type != DT_LNK ||
> > + sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> > + continue;
> > + cpunode_map[cpu] = mem;
> > + }
> > + closedir(dir2);
> > + }
> > + closedir(dir1);
> > + return 0;
> > +}
>
>
> There's already a setup_cpunode_map interface in builtin-kmem.c.
> Please make it global (maybe place it in a separate object?)
> and use that one.
Heh, where do you think I got this from? :-) Though I did tweak it for my
needs, namely I used 'possible' cpus as opposed to 'online' cpus to deal
with hotplug.
I also ran into a bug here, where this code populates an array based on
what is on the running system, not the system where the data was
collected. Is it possible to have perf-archive add this info?
I'll try to make this function global in the next version.
Cheers,
Don
>
> jirka
On Tue, Feb 18, 2014 at 02:07:31PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:08PM -0500, Don Zickus wrote:
> > Seeing cacheline statistics is useful by itself. Seeing the callchain
> > for these cache contentions saves time tracking things down.
> >
> > This patch tries to add callchain support. I had to use the generic
> > interface from a previous patch to output things to stdout easily.
> >
> > Other than displaying the results, collecting the callchain and
> > merging it was fairly straightforward.
> >
> > I used a lot of copying-n-pasting from other builtin tools to get
> > the initial parameter setup correct and the automatic reading of
> > 'symbol_conf.use_callchain' from the data file.
> >
> > Hopefully this is all correct. The amount of memory corruption (from the
> > callchain dynamic array) seems to have dwindled down to nothing. :-)
>
> hum.. report command already has all this.. if we could go the
> hist_entry way, there'd be no need to reimplement this
Sure. I will look into hist_entry. It would also be nice if the
callchain support had a better API, so commands that can't hook in
through hist_entry had a simpler way to utilize callchains. :-)
Cheers,
Don
On Tue, Feb 18, 2014 at 02:09:29PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:12PM -0500, Don Zickus wrote:
> > Just another table that displays the referenced symbols in the analysis
> > report. The table lists the most frequently used symbols first.
> >
> > It is just another way to look at similar data to figure out who
> > is causing the most contention (based on the workload used).
> >
> > Originally done by Dick Fowles and ported by me.
> >
> > Suggested-by: Joe Mario <[email protected]>
> > Original-by: Dick Fowles <[email protected]>
> > Signed-off-by: Don Zickus <[email protected]>
> > ---
> > tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 99 insertions(+)
> >
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 32c2319..979187f 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
> > new->total_period += old->total_period;
> > }
> >
> > +LIST_HEAD(ref_tree);
> > +LIST_HEAD(ref_tree_sorted);
> > +struct refs {
> > + struct list_head list;
> > + int nr;
> > + const char *name;
> > + char *long_name;
> > +};
> > +
> > +static int update_ref_tree(struct c2c_entry *entry)
> > +{
> > + struct refs *p;
> > + struct dso *dso = entry->mi->iaddr.map->dso;
> > + const char *name = dso->short_name;
> > +
> > + list_for_each_entry(p, &ref_tree, list) {
> > + if (!strcmp(p->name, name))
> > + goto found;
> > + }
> > +
> > + p = zalloc(sizeof(struct refs));
> > + if (!p)
> > + return -1;
> > + p->name = name;
> > + p->long_name = dso->long_name;
> > + list_add_tail(&p->list, &ref_tree);
>
> so this is a tree, which is actually a list ;-)
It used to be a tree; now it is a stick. :-)
It's old code that needs to be renamed. I can update that.
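Something like this, plus the matching renames at the use sites:

-LIST_HEAD(ref_tree);
-LIST_HEAD(ref_tree_sorted);
+LIST_HEAD(ref_list);
+LIST_HEAD(ref_list_sorted);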
Cheers,
Don
On Tue, Feb 18, 2014 at 01:53:35PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <[email protected]>
> >
> > From the c2c prototype:
> >
> > [root@sandy ~]# perf c2c -r report | head -7
> > T Status Pid Tid CPU Inst Adrs Virt Data Adrs Phys Data Adrs Cycles Source Decoded Source ObJect:Symbol
> > --------------------------------------------------------------------------------------------------------------------------------------------
> > raw input 779 779 7 0xffffffff810865dd 0xffff8803f4d75ec8 0 370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA] [kernel.kallsyms]:try_to_wake_up
> > raw input 779 779 7 0xffffffff8107acb3 0xffff8802a5b73158 0 297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
> > raw input 779 779 7 0x3b7e009814 0x7fff87429ea0 0 925 0x68100142 [LOAD,L1,HIT,SNP NONE] ???:???
> > raw input 0 0 1 0xffffffff8108bf81 0xffff8803eafebf50 0 172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM] [kernel.kallsyms]:update_stats_wait_end
> > raw input 779 779 7 0x3b7e0097cc 0x7fac94b69068 0 228 0x68100242 [LOAD,LFB,HIT,SNP NONE] ???:???
> > [root@sandy ~]#
> >
> > The "Phys Data Adrs" column is not available at this point.
>
> SNIP
>
> > + sample->data_src,
> > + data_src,
> > + al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
> > + al->sym ? al->sym->name : "???");
> > +}
> > +
> > +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > + struct perf_sample *sample,
> > + struct addr_location *al)
> > +{
> > + if (c2c->raw_records)
> > + perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > +
> > return 0;
> > }
> >
> > static const struct perf_evsel_str_handler handlers[] = {
> > - { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > - { "cpu/mem-stores/pp", perf_c2c__process_store, },
> > + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> > + { "cpu/mem-stores/pp", perf_c2c__process_load_store, },
>
> hm.. so it's only one function for both handlers.. no need
> to use handlers at all then, right?
I implemented them separately but then realized they looked identical once
everything was working, so I combined them again. I keep thinking there
has to be some advantage to having them separate, but I haven't found a
use case.
You still need the handlers, though, in case you want to add other events
into the mix and have this tool filter them out.
However, I do have the problem of figuring out a good way to dynamically
adjust the '30' above. Since Intel doesn't publish L1, LFB and L2
latency numbers, we have been guessing at 30 cycles for an LLC hit. It
would be nicer to adjust that on the command line than to recompile.
Small issue.
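What I'm picturing is the usual parse-options plumbing, something like
this sketch (the option name and default here are made up):

static int ldlat = 30;	/* cycles; our guess at an LLC hit */

static const struct option c2c_options[] = {
	OPT_INTEGER(0, "ldlat", &ldlat,
		    "mem-loads latency threshold in cycles"),
	OPT_END()
};

/* then build the event string instead of hardcoding the 30: */
static char mem_loads_name[64];

static void setup_mem_loads_event(void)
{
	snprintf(mem_loads_name, sizeof(mem_loads_name),
		 "cpu/mem-loads,ldlat=%d/pp", ldlat);
}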
Cheers,
Don
On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> > In order for the c2c tool to work correctly, it needs to properly
> > sort all the records on uniquely identifiable data addresses. These
> > unique addresses are converted from virtual addresses provided by the
> > hardware into a kernel address using an mmap2 record as the decoder.
> >
>
> SNIP
>
> > +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> > +{
> > + u64 l, r;
> > + struct map *l_map = left->mi->daddr.map;
> > + struct map *r_map = right->mi->daddr.map;
> > +
> > + /* group event types together */
> > + if (left->cpumode > right->cpumode) return 1;
> > + if (left->cpumode < right->cpumode) return -1;
> > +
> > + if (l_map->maj > r_map->maj) return 1;
> > + if (l_map->maj < r_map->maj) return -1;
> > +
> > + if (l_map->min > r_map->min) return 1;
> > + if (l_map->min < r_map->min) return -1;
> > +
> > + if (l_map->ino > r_map->ino) return 1;
> > + if (l_map->ino < r_map->ino) return -1;
> > +
> > + if (l_map->ino_generation > r_map->ino_generation) return 1;
> > + if (l_map->ino_generation < r_map->ino_generation) return -1;
> > +
> > + /*
> > + * Addresses with no major/minor numbers are assumed to be
> > + * anonymous in userspace. Sort those on pid then address.
> > + *
> > + * The kernel and non-zero major/minor mapped areas are
> > + * assumed to be unity mapped. Sort those on address then pid.
> > + */
> > +
> > + /* al_addr does all the right addr - start + offset calculations */
> > + l = left->mi->daddr.al_addr;
> > + r = right->mi->daddr.al_addr;
> > +
> > + if (l_map->maj || l_map->min) {
> > + /* mmapped areas */
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* entries with same maj/min/ino/inogen are in same address space */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > + } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> > + /* kernel mapped areas where 'start' doesn't matter */
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* whole kernel region is in the same address space */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > + } else {
> > + /* userspace anonymous */
> > + if (left->thread->pid_ > right->thread->pid_) return 1;
> > + if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > + if (left->thread->tid > right->thread->tid) return 1;
> > + if (left->thread->tid < right->thread->tid) return -1;
> > +
> > + /* hack to mark similar regions, 'right' is new entry */
> > + /* userspace anonymous address space is contained within pid */
> > + right->color = REGION_SAME;
> > +
> > + if (l > r) return 1;
> > + if (l < r) return -1;
> > +
> > + /* sorting by iaddr makes calculations easier later */
> > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > + }
> > +
> > + return 0;
> > +}
>
> there's a sort object doing exactly this over hist_entry's
>
> Is there any reason not to use hist_entries?
So looking over hist_entry, I have to ask: what do I gain? I implemented it
and realized I had to add 'cpumode', 'tid' and a 'private' field to
struct hist_entry. Then, because I have my own report implementation, I
still have to copy and paste a ton of stuff from builtin-report over to
here (including callchain support).
Not unless you are expecting me to add giant chunks of code to
builtin-report.c?
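(For reference, the additions I tried amounted to roughly this; a
sketch, with approximate types:)

struct hist_entry {
	/* ... existing fields ... */
	u8	cpumode;	/* keep kernel/user samples apart */
	pid_t	tid;		/* the physid sort needs the thread id */
	void	*private;	/* per-tool data, e.g. cacheline stats */
};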
Cheers,
Don
On Thu, Feb 20, 2014 at 09:45:53PM -0500, Don Zickus wrote:
> On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> > On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
SNIP
> > > +
> > > + if (l > r) return 1;
> > > + if (l < r) return -1;
> > > +
> > > + /* sorting by iaddr makes calculations easier later */
> > > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > > + }
> > > +
> > > + return 0;
> > > +}
> >
> > there's a sort object doing exactly this over hist_entry's
> >
> > Is there any reason not to use hist_entries?
>
> So looking over hist_entry, I have to ask: what do I gain? I implemented it
> and realized I had to add 'cpumode', 'tid' and a 'private' field to
> struct hist_entry. Then, because I have my own report implementation, I
> still have to copy and paste a ton of stuff from builtin-report over to
> here (including callchain support).
you mean new sort_entry objects?
>
> Not unless you are expecting me to add giant chunks of code to
> builtin-report.c?
it can be a separate object, implementing a new report iterator
I think we should go on with the existing sort code we have..
but I understand you might need some special usage.. I'll dive
in and try to find an answer ;-)
jirka
On Fri, Feb 21, 2014 at 05:59:28PM +0100, Jiri Olsa wrote:
> On Thu, Feb 20, 2014 at 09:45:53PM -0500, Don Zickus wrote:
> > On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> > > On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
>
> SNIP
>
> > > > +
> > > > + if (l > r) return 1;
> > > > + if (l < r) return -1;
> > > > +
> > > > + /* sorting by iaddr makes calculations easier later */
> > > > + if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > > > + if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > >
> > > there's a sort object doing exactly this over hist_entry's
> > >
> > > Is there any reason not to use hist_entries?
> >
> > So looking over hist_entry, I have to ask: what do I gain? I implemented it
> > and realized I had to add 'cpumode', 'tid' and a 'private' field to
> > struct hist_entry. Then, because I have my own report implementation, I
> > still have to copy and paste a ton of stuff from builtin-report over to
> > here (including callchain support).
>
> you mean new sort_entry objects?
>
> >
> > Not unless you are expecting me to add giant chunks of code to
> > builtin-report.c?
>
> it can be a separate object, implementing a new report iterator
>
> I think we should go on with the existing sort code we have..
> but I understand you might need some special usage.. I'll dive
> in and try to find an answer ;-)
Do things fall apart if I do not use evsel->hists to store the hist_entry
tree? I need to combine two events (store and load) onto the same tree.
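In other words, something along these lines (a sketch;
c2c_hists__add_entry is a placeholder for whatever the real insertion
helper turns out to be):

/* one tree shared by both evsels, instead of evsel->hists */
static struct hists c2c_hists;

static int process_sample(struct perf_evsel *evsel __maybe_unused,
			  struct perf_sample *sample,
			  struct addr_location *al)
{
	/* mem-loads and mem-stores records both land here, so the
	 * loads and stores for a cacheline get merged for free */
	return c2c_hists__add_entry(&c2c_hists, sample, al);
}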
Cheers,
Don