2014-02-28 17:43:37

by Don Zickus

Subject: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems

With the introduction of NUMA systems came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (i.e. a modified
cacheline) and you have the potential for very long latencies. These latencies can
bottleneck a program.

The program added by these patches helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline. The term used throughout this program and
the various changelogs is HITM. This means nodeX went to read a cacheline
and it was discovered to be loaded in nodeY's LLC cache (hence the cache HIT). The
remote cacheline was also in a 'M'odified state, thus creating a 'HIT M', a hit in
a modified state. HITMs can happen locally and remotely. This program is mainly
interested in remote HITMs, as they cause the longest latencies.

Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline. Is the sharing intentional ("true") or unintentional ("false")? We have seen
lots of "false" sharing cases, which lead to simple solutions such as separating the data
onto different cachelines.
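
As an illustration (not part of these patches; the struct names are made up and a
64-byte cacheline is assumed), this is the kind of "false" sharing the tool flags,
along with the usual fix of separating the writers onto different cachelines:

    /*
     * Two logically independent counters that happen to share one cacheline.
     * A thread on nodeX updating 'a' while a thread on nodeY updates 'b' will
     * ping-pong the modified line between the nodes and show up as remote
     * HITMs, even though no data is actually shared.
     */
    struct counters_false_sharing {
        unsigned long a;    /* written only by a thread on nodeX */
        unsigned long b;    /* written only by a thread on nodeY */
    };

    /* The fix: force each hot field onto its own cacheline. */
    struct counters_separated {
        unsigned long a __attribute__((aligned(64)));
        unsigned long b __attribute__((aligned(64)));
    };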

This tool does not distinguish between 'true' and 'false' sharing; instead it just points to
the more expensive sharing situations under the current workload. It is up to the user
to understand what the workload is doing, determine whether a problem exists, and decide
how to report it.

The data output is verbose and there are lots of data tables that interpret the latencies
and data addresses in different ways to help show where bottlenecks might lie.
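
For a rough idea of the workflow (illustrative invocations based on the subcommands
added in this series; see the perf-c2c documentation patch for details):

    perf c2c record <command>    # system-wide mem-loads/mem-stores sampling (-W -d -a)
    perf c2c report              # read perf.data and print the tables described above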

Most of this idea, work and calculations were done by Dick Fowles. My work mainly
includes porting it to perf. Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool. Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.

I would also like to thank Stephane Eranian for his early help and guidance on
navigating the differences between the current perf tool and how similar tools
looked at HP, and for his tireless work in getting the MMAP2 interface to stick.

Also thanks to Arnaldo and Jiri Olsa for their help and suggestions for this tool.

I also have a test program that generates a controlled number of HITMs, which we used
frequently to validate our early work (the Intel docs were not always clear which
bits had to be set, and some arches do not work well). I would like to add it, but
don't know how (nor have I spent any serious time looking).
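
The test program itself is not included here. Purely as a sketch of the idea (this is
not that program; the function names, CPU numbers and loop count below are made up,
and the two CPUs must sit on different nodes for remote HITMs to show up), something
like this will generate them on demand:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* a single cacheline kept in the Modified state by both writers */
    static unsigned long shared_line __attribute__((aligned(64)));

    static void pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
    }

    static void *hammer(void *arg)
    {
        unsigned long i;

        pin_to_cpu((int)(long)arg);
        for (i = 0; i < 100000000UL; i++)
            __sync_fetch_and_add(&shared_line, 1);
        return NULL;
    }

    int main(void)
    {
        long cpus[] = { 0, 8 };    /* assumption: cpu 0 and cpu 8 are on different nodes */
        pthread_t t[2];
        int i;

        for (i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, hammer, (void *)cpus[i]);
        for (i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%lu\n", shared_line);
        return 0;
    }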

This program has been tested primarily on Intel's Ivy Bridge platforms. The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge. We haven't tried Haswell as
that has a re-worked latency event implementation.

A handful of patches include re-enabling MMAP2 support and some fixes to perf itself. One
in particular hacks up how standard deviation is calculated. It works with our calculations
but may break other tools' expectations. Feedback is welcomed.

Comments, feedback, anything else is welcome.

V2: updated to latest perf/core branch 1029f9fedf87fa6
    switched to hist_entry based on Jiri O's suggestion
    dropped the latency analysis for now until this patchset is accepted
    small fixes and tweaks

Signed-off-by: Don Zickus <[email protected]>

Arnaldo Carvalho de Melo (2):
perf c2c: Shared data analyser
perf c2c: Dump raw records, decode data_src bits

Don Zickus (19):
Revert "perf: Disable PERF_RECORD_MMAP2 support"
perf, machine: Use map as success in ip__resolve_ams
perf, session: Change header.misc dump from decimal to hex
perf, stat: FIXME Stddev calculation is incorrect
perf, callchain: Add generic callchain print handler for stdio
perf, c2c: Rework setup code to prepare for features
perf, c2c: Add rbtree sorted on mmap2 data
perf, c2c: Add stats to track data source bits and cpu to node maps
perf, c2c: Sort based on hottest cache line
perf, c2c: Display cacheline HITM analysis to stdout
perf, c2c: Add callchain support
perf, c2c: Output summary stats
perf, c2c: Dump rbtree for debugging
perf, c2c: Fixup tid because of perf map is broken
perf, c2c: Add symbol count table
perf, c2c: Add shared cacheline summary table
perf, c2c: Add framework to analyze latency and display summary stats
perf, c2c: Add selected extreme latencies to output cacheline stats
table
perf, c2c: Add summary latency table for various parts of caches

kernel/events/core.c | 4 -
tools/perf/Documentation/perf-c2c.c | 22 +
tools/perf/Makefile.perf | 1 +
tools/perf/builtin-c2c.c | 2963 +++++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
tools/perf/ui/stdio/hist.c | 37 +
tools/perf/util/event.c | 36 +-
tools/perf/util/evlist.c | 37 +
tools/perf/util/evlist.h | 7 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/hist.h | 4 +
tools/perf/util/machine.c | 2 +-
tools/perf/util/session.c | 2 +-
tools/perf/util/stat.c | 3 +-
15 files changed, 3097 insertions(+), 24 deletions(-)
create mode 100644 tools/perf/Documentation/perf-c2c.c
create mode 100644 tools/perf/builtin-c2c.c

--
1.7.11.7

Arnaldo Carvalho de Melo (2):
perf c2c: Shared data analyser
perf c2c: Dump raw records, decode data_src bits

Don Zickus (17):
Revert "perf: Disable PERF_RECORD_MMAP2 support"
perf, sort: Add physid sorting based on mmap2 data
perf, sort: Allow unique sorting instead of combining hist_entries
perf: Allow ability to map cpus to nodes easily
perf, kmem: Utilize the new generic cpunode_map
perf: Fix stddev calculation
perf, callchain: Add generic callchain print handler for stdio
perf, c2c: Rework setup code to prepare for features
perf, c2c: Add in sort on physid
perf, c2c: Add stats to track data source bits and cpu to node maps
perf, c2c: Sort based on hottest cache line
perf, c2c: Display cacheline HITM analysis to stdout
perf, c2c: Add callchain support
perf, c2c: Output summary stats
perf, c2c: Dump rbtree for debugging
perf, c2c: Add symbol count table
perf, c2c: Add shared cacheline summary table

kernel/events/core.c | 4 -
tools/perf/Documentation/perf-c2c.c | 22 +
tools/perf/Makefile.perf | 1 +
tools/perf/builtin-c2c.c | 1787 +++++++++++++++++++++++++++++++++++
tools/perf/builtin-kmem.c | 78 +-
tools/perf/builtin-report.c | 2 +-
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
tools/perf/ui/stdio/hist.c | 37 +
tools/perf/util/cpumap.c | 150 +++
tools/perf/util/cpumap.h | 35 +
tools/perf/util/event.c | 36 +-
tools/perf/util/evsel.c | 1 +
tools/perf/util/hist.c | 10 +-
tools/perf/util/hist.h | 5 +
tools/perf/util/sort.c | 149 +++
tools/perf/util/sort.h | 4 +
tools/perf/util/stat.c | 13 +
tools/perf/util/stat.h | 1 +
19 files changed, 2236 insertions(+), 101 deletions(-)
create mode 100644 tools/perf/Documentation/perf-c2c.c
create mode 100644 tools/perf/builtin-c2c.c

--
1.7.11.7


2014-02-28 17:43:39

by Don Zickus

Subject: [PATCH 02/19] perf, sort: Add physid sorting based on mmap2 data

In order for the c2c tool to work correctly, it needs to properly
sort all the records on uniquely identifiable data addresses. These
unique addresses are converted from virtual addresses provided by the
hardware into a kernel address using an mmap2 record as the decoder.

Once a unique address is converted, we can sort on it based on
various rules. Then it becomes clear which addresses overlap
with each other across mmap regions or pid spaces.

This patch just creates the rules and inserts the records into a
sort entry for safekeeping until later patches process them.

The general sorting rule is:

o group cpumodes together
o group similar major, minor, inode, inode generation numbers together
o if (nonzero major/minor number - ie mmap'd areas)
    o sort on data addresses
    o sort on instruction address
    o sort on pid
    o sort on tid
o if cpumode is kernel
    o sort on data addresses
    o sort on instruction address
    o sort on pid
    o sort on tid
o else (private to pid space)
    o sort on pid
    o sort on tid
    o sort on data addresses
    o sort on instruction address

I also hacked in the concept of 'color'. The purpose of that bit is to
provide hints later, when processing these records, that a new unique
address has been encountered. Because later processing only checks the data
addresses, there is a theoretical scenario where similar sequential data
addresses (when walking the rbtree) could be misinterpreted as overlapping
when in fact they are not.

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-report.c | 2 +-
tools/perf/util/hist.c | 7 ++-
tools/perf/util/hist.h | 1 +
tools/perf/util/sort.c | 148 ++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/sort.h | 3 +
5 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d882b6f..ec797da 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -755,7 +755,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline,"
" dso_to, dso_from, symbol_to, symbol_from, mispredict,"
" weight, local_weight, mem, symbol_daddr, dso_daddr, tlb, "
- "snoop, locked, abort, in_tx, transaction"),
+ "snoop, locked, abort, in_tx, transaction, physid"),
OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
"Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 0466efa..ea54db3 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -420,9 +420,10 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
.map = al->map,
.sym = al->sym,
},
- .cpu = al->cpu,
- .ip = al->addr,
- .level = al->level,
+ .cpu = al->cpu,
+ .cpumode = al->cpumode,
+ .ip = al->addr,
+ .level = al->level,
.stat = {
.nr_events = 1,
.period = period,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a59743f..d226c5b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -62,6 +62,7 @@ enum hist_column {
HISTC_MEM_LVL,
HISTC_MEM_SNOOP,
HISTC_TRANSACTION,
+ HISTC_PHYSID,
HISTC_NR_COLS, /* Last entry */
};

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 635cd8f..0cb43a5 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -977,6 +977,151 @@ struct sort_entry sort_transaction = {
.se_width_idx = HISTC_TRANSACTION,
};

+static int64_t
+sort__physid_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ u64 l, r;
+ struct map *l_map = left->mem_info->daddr.map;
+ struct map *r_map = right->mem_info->daddr.map;
+
+ /* store all NULL mem maps at the bottom */
+ /* shouldn't even need this check, should have stubs */
+ if (!left->mem_info->daddr.map || !right->mem_info->daddr.map)
+ return 1;
+
+ /* group event types together */
+ if (left->cpumode > right->cpumode) return -1;
+ if (left->cpumode < right->cpumode) return 1;
+
+ if (l_map->maj > r_map->maj) return -1;
+ if (l_map->maj < r_map->maj) return 1;
+
+ if (l_map->min > r_map->min) return -1;
+ if (l_map->min < r_map->min) return 1;
+
+ if (l_map->ino > r_map->ino) return -1;
+ if (l_map->ino < r_map->ino) return 1;
+
+ if (l_map->ino_generation > r_map->ino_generation) return -1;
+ if (l_map->ino_generation < r_map->ino_generation) return 1;
+
+ /*
+ * Addresses with no major/minor numbers are assumed to be
+ * anonymous in userspace. Sort those on pid then address.
+ *
+ * The kernel and non-zero major/minor mapped areas are
+ * assumed to be unity mapped. Sort those on address then pid.
+ */
+
+ /* al_addr does all the right addr - start + offset calculations */
+ l = left->mem_info->daddr.al_addr;
+ r = right->mem_info->daddr.al_addr;
+
+ if (l_map->maj || l_map->min || l_map->ino || l_map-> ino_generation) {
+ /* mmapped areas */
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* entries with same maj/min/ino/inogen are in same address space */
+ right->color = TRUE;
+
+ if (l > r) return -1;
+ if (l < r) return 1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+ if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+
+ if (left->thread->pid_ > right->thread->pid_) return -1;
+ if (left->thread->pid_ < right->thread->pid_) return 1;
+
+ if (left->thread->tid > right->thread->tid) return -1;
+ if (left->thread->tid < right->thread->tid) return 1;
+ } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
+ /* kernel mapped areas where 'start' doesn't matter */
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* whole kernel region is in the same address space */
+ right->color = TRUE;
+
+ if (l > r) return -1;
+ if (l < r) return 1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+ if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+
+ if (left->thread->pid_ > right->thread->pid_) return -1;
+ if (left->thread->pid_ < right->thread->pid_) return 1;
+
+ if (left->thread->tid > right->thread->tid) return -1;
+ if (left->thread->tid < right->thread->tid) return 1;
+ } else {
+ /* userspace anonymous */
+ if (left->thread->pid_ > right->thread->pid_) return -1;
+ if (left->thread->pid_ < right->thread->pid_) return 1;
+
+ if (left->thread->tid > right->thread->tid) return -1;
+ if (left->thread->tid < right->thread->tid) return 1;
+
+ /* hack to mark similar regions, 'right' is new entry */
+ /* userspace anonymous address space is contained within pid */
+ right->color = TRUE;
+
+ if (l > r) return -1;
+ if (l < r) return 1;
+
+ /* sorting by iaddr makes calculations easier later */
+ if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+ if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+ }
+
+ /* sanity check the maps; only mmaped areas should have different maps */
+ if ((left->mem_info->daddr.map != right->mem_info->daddr.map) &&
+ !right->mem_info->daddr.map->maj && !right->mem_info->daddr.map->min)
+ pr_debug("physid_cmp: Similar entries have different maps\n");
+
+ return 0;
+}
+
+static int hist_entry__physid_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ char buf[256];
+ char *p = buf;
+
+ if (!he->mem_info->daddr.map) {
+ sprintf(p, "%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %8x\n",
+ -1,
+ -1,
+ -1UL,
+ -1UL,
+ he->thread->pid_,
+ -1UL,
+ he->mem_info->daddr.addr,
+ he->mem_info->iaddr.al_addr,
+ he->cpumode);
+ } else {
+ sprintf(p, "%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %8x\n",
+ he->mem_info->daddr.map->maj,
+ he->mem_info->daddr.map->min,
+ he->mem_info->daddr.map->ino,
+ he->mem_info->daddr.map->ino_generation,
+ he->thread->pid_,
+ he->mem_info->daddr.map->start,
+ he->mem_info->daddr.addr,
+ he->mem_info->iaddr.al_addr,
+ he->cpumode);
+ }
+ return repsep_snprintf(bf, size, "%-*s", width, buf);
+}
+
+struct sort_entry sort_physid = {
+ .se_header = "Physid (major, minor, inode, inode generation, pid, start, Data addr, IP, cpumode)",
+ .se_cmp = sort__physid_cmp,
+ .se_snprintf = hist_entry__physid_snprintf,
+ .se_width_idx = HISTC_PHYSID,
+};
+
struct sort_dimension {
const char *name;
struct sort_entry *entry;
@@ -1023,6 +1168,7 @@ static struct sort_dimension memory_sort_dimensions[] = {
DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
DIM(SORT_MEM_LVL, "mem", sort_mem_lvl),
DIM(SORT_MEM_SNOOP, "snoop", sort_mem_snoop),
+ DIM(SORT_MEM_PHYSID, "physid", sort_physid),
};

#undef DIM
@@ -1182,6 +1328,8 @@ void sort__setup_elide(FILE *output)
"tlb", output);
sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list,
"snoop", output);
+ sort_entry__setup_elide(&sort_physid, symbol_conf.dso_list,
+ "physid", output);
}

/*
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 43e5ff4..eb8cd50 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -87,11 +87,13 @@ struct hist_entry {
u64 ip;
u64 transaction;
s32 cpu;
+ u8 cpumode;

struct hist_entry_diff diff;

/* We are added by hists__add_dummy_entry. */
bool dummy;
+ bool color;

/* XXX These two should move to some tree widget lib */
u16 row_offset;
@@ -166,6 +168,7 @@ enum sort_type {
SORT_MEM_TLB,
SORT_MEM_LVL,
SORT_MEM_SNOOP,
+ SORT_MEM_PHYSID,
};

/*
--
1.7.11.7

2014-02-28 17:43:44

by Don Zickus

Subject: [PATCH 10/19] perf, c2c: Rework setup code to prepare for features

A basic patch that re-arranges some of the c2c code and adds a couple
of small features to lay the groundwork for the rest of the patch
series.

Changes include:

o reworking the report path
o replace preprocess_sample with simpler calls
o rework raw output to handle separators
o remove phys id gunk
o add some generic options

There isn't much meat in this patch, just a bunch of code movement and cleanups.

V2: refresh to latest upstream changes

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 125 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 7082913..367d6c1 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,6 +5,7 @@
#include "util/parse-options.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/debug.h"

#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -105,34 +106,55 @@ static int perf_c2c__fprintf_header(FILE *fp)
}

static int perf_sample__fprintf(struct perf_sample *sample, char tag,
- const char *reason, struct addr_location *al, FILE *fp)
+ const char *reason, struct mem_info *mi, FILE *fp)
{
char data_src[61];
+ const char *fmt, *sep;
+ struct map *map = mi->iaddr.map;

perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);

- return fprintf(fp, "%c %-16s %6d %6d %4d %#18" PRIx64 " %#18" PRIx64 " %#18" PRIx64 " %6" PRIu64 " %#10" PRIx64 " %-60.60s %s:%s\n",
- tag,
- reason ?: "valid record",
- sample->pid,
- sample->tid,
- sample->cpu,
- sample->ip,
- sample->addr,
- 0UL,
- sample->weight,
- sample->data_src,
- data_src,
- al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
- al->sym ? al->sym->name : "???");
+ if (symbol_conf.field_sep) {
+ fmt = "%c%s%s%s%d%s%d%s%d%s%#"PRIx64"%s%#"PRIx64"%s"
+ "%"PRIu64"%s%#"PRIx64"%s%s%s%s:%s\n";
+ sep = symbol_conf.field_sep;
+ } else {
+ fmt = "%c%s%-16s%s%6d%s%6d%s%4d%s%#18"PRIx64"%s%#18"PRIx64"%s"
+ "%6"PRIu64"%s%#10"PRIx64"%s%-60.60s%s%s:%s\n";
+ sep = " ";
+ }
+
+ return fprintf(fp, fmt,
+ tag, sep,
+ reason ?: "valid record", sep,
+ sample->pid, sep,
+ sample->tid, sep,
+ sample->cpu, sep,
+ sample->ip, sep,
+ sample->addr, sep,
+ sample->weight, sep,
+ sample->data_src, sep,
+ data_src, sep,
+ map ? (map->dso ? map->dso->long_name : "???") : "???",
+ mi->iaddr.sym ? mi->iaddr.sym->name : "???");
}

static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+ struct addr_location *al,
struct perf_sample *sample,
- struct addr_location *al)
+ struct perf_evsel *evsel)
{
- if (c2c->raw_records)
- perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+ struct mem_info *mi;
+
+ mi = sample__resolve_mem(sample, al);
+ if (!mi)
+ return -ENOMEM;
+
+ if (c2c->raw_records) {
+ perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
+ free(mi);
+ return 0;
+ }

return 0;
}
@@ -143,8 +165,9 @@ static const struct perf_evsel_str_handler handlers[] = {
};

typedef int (*sample_handler)(struct perf_c2c *c2c,
+ struct addr_location *al,
struct perf_sample *sample,
- struct addr_location *al);
+ struct perf_evsel *evsel);

static int perf_c2c__process_sample(struct perf_tool *tool,
union perf_event *event,
@@ -153,20 +176,49 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
struct machine *machine)
{
struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
- struct addr_location al;
- int err = 0;
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ struct thread *thread;
+ sample_handler f;
+ int err = -1;
+ struct addr_location al = {
+ .machine = machine,
+ .cpu = sample->cpu,
+ .cpumode = cpumode,
+ };

- if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
- pr_err("problem processing %d event, skipping it.\n",
- event->header.type);
- return -1;
- }
+ if (evsel->handler == NULL)
+ return 0;
+
+ thread = machine__find_thread(machine, sample->tid);
+ if (thread == NULL)
+ goto err;
+
+ al.thread = thread;

- if (evsel->handler != NULL) {
- sample_handler f = evsel->handler;
- err = f(c2c, sample, &al);
+ f = evsel->handler;
+ err = f(c2c, &al, sample, evsel);
+ if (err)
+ goto err;
+
+ return 0;
+err:
+ if (err > 0)
+ err = 0;
+ return err;
+}
+
+static int perf_c2c__process_events(struct perf_session *session,
+ struct perf_c2c *c2c)
+{
+ int err = -1;
+
+ err = perf_session__process_events(session, &c2c->tool);
+ if (err) {
+ pr_err("Failed to process count events, error %d\n", err);
+ goto err;
}

+err:
return err;
}

@@ -197,9 +249,7 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
}
}

- err = perf_session__process_events(session, &c2c->tool);
- if (err)
- pr_err("Failed to process events, error %d", err);
+ err = perf_c2c__process_events(session, c2c);

out:
return err;
@@ -221,7 +271,6 @@ static int perf_c2c__record(int argc, const char **argv)
const char **rec_argv;
const char * const record_args[] = {
"record",
- /* "--phys-addr", */
"-W",
"-d",
"-a",
@@ -254,6 +303,8 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
struct perf_c2c c2c = {
.tool = {
.sample = perf_c2c__process_sample,
+ .mmap2 = perf_event__process_mmap2,
+ .mmap = perf_event__process_mmap,
.comm = perf_event__process_comm,
.exit = perf_event__process_exit,
.fork = perf_event__process_fork,
@@ -263,6 +314,14 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
};
const struct option c2c_options[] = {
OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+ OPT_INCR('v', "verbose", &verbose,
+ "be more verbose (show counter open errors, etc)"),
+ OPT_STRING('i', "input", &input_name, "file",
+ "the input file to process"),
+ OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+ "separator",
+ "separator for columns, no spaces will be added"
+ " between columns '.' is reserved."),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7

2014-02-28 17:43:56

by Don Zickus

Subject: [PATCH 17/19] perf, c2c: Dump rbtree for debugging

Sometimes you want to verify that the rbtree sorting on a unique id
is working correctly. This patch allows you to dump the tree.

Sample output:

Idx Hit Maj Min Ino InoGen Pid Daddr Iaddr Data Src (string) cpumode
0 0 0 0 0 0 22 ffffffff813044cf 48080184 [STORE,L1,MISS,SNP NA] 1

1 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1
2 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1
3 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1
4 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1
5 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1
6 0 0 0 0 2332 ca3 ffffffffa0226032 48080184 [STORE,L1,MISS,SNP NA] 1

7 0 0 0 0 18179 135f860 ffffffff812ad509 68100242 [LOAD,LFB,HIT,SNP NONE] 1

8 0 0 0 0 18179 7ff9d7fbaf98 ffffffff812ad509 68100242 [LOAD,LFB,HIT,SNP NONE] 1

V2: refresh with hist_entry

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 3b0e0b2..c095f1b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -616,6 +616,55 @@ err:

#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)

+static void dump_rb_tree(struct rb_root *tree,
+ struct perf_c2c *c2c __maybe_unused)
+{
+ struct rb_node *next = rb_first(tree);
+ struct hist_entry *he;
+ u64 cl = 0;
+ int idx = 0;
+
+ printf("# Summary: Total entries - %d\n", c2c->stats.nr_entries);
+ printf("# HITMs: Local - %d Remote - %d Total - %d\n",
+ c2c->stats.t.lcl_hitm, c2c->stats.t.rmt_hitm,
+ (c2c->stats.t.lcl_hitm + c2c->stats.t.rmt_hitm));
+
+ printf("%6s %3s %3s %3s %8s %16s %6s %16s %16s %16s %32s %8s\n",
+ "Idx", "Hit", "Maj", "Min", "Ino", "InoGen", "Pid",
+ "Daddr", "Iaddr", "Data Src", "(string)", "cpumode");
+ while (next) {
+ char data_src[32];
+ u64 val;
+
+ he = rb_entry(next, struct hist_entry, rb_node_in);
+ next = rb_next(&he->rb_node_in);
+
+ if ((!he->color) || (cl != CLADRS(he->mem_info->daddr.al_addr))) {
+ printf("\n");
+ cl = CLADRS(he->mem_info->daddr.al_addr);
+ }
+
+ val = he->mem_info->data_src.val;
+ perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), val);
+
+ printf("%6d %3s %3x %3x %8lx %16lx %6d %16lx %16lx %16lx %32s %8x\n",
+ idx,
+ (PERF_MEM_S(SNOOP,HITM) & val) ? " * " : " ",
+ he->mem_info->daddr.map->maj,
+ he->mem_info->daddr.map->min,
+ he->mem_info->daddr.map->ino,
+ he->mem_info->daddr.map->ino_generation,
+ he->thread->pid_,
+ he->mem_info->daddr.addr,
+ he->mem_info->iaddr.addr,
+ val,
+ data_src,
+ he->cpumode);
+
+ idx++;
+ }
+}
+
static void c2c_hit__update_stats(struct c2c_stats *new,
struct c2c_stats *old)
{
@@ -1213,6 +1262,8 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}

+ if (verbose > 2)
+ dump_rb_tree(c2c->hists.entries_in, c2c);
print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);

--
1.7.11.7

2014-02-28 17:43:59

by Don Zickus

Subject: [PATCH 19/19] perf, c2c: Add shared cacheline summary table

This adds a quick summary of the hottest cache contention lines based
on the input data. It condenses what the broken-out table shows you,
so you can see at a quick glance which cachelines are interesting.

Originally done by Dick Fowles, backported by me.

Sample output (width trimmed):

===================================================================================================================================================

Shared Data Cache Line Table

Total %All Total ---- Core Load Hit ---- -- LLC Load Hit -- ----- LLC Load Hitm -----
Index Phys Adrs Records Ld Miss %hitm Loads FB L1D L2D Lcl Rmt Total Lcl Rmt
====================================================================================================================================================
0 0xffff881fa55b0140 72006 16.97% 23.31% 43095 13591 16860 45 2651 25 9526 3288 6238
1 0xffff881fba47f000 21854 5.29% 7.26% 13938 3887 6941 15 1 7 3087 1143 1944
2 0xffff881fc21b9cc0 2153 1.61% 2.21% 862 32 70 0 15 1 740 148 592
3 0xffff881fc7d91cc0 1957 1.40% 1.92% 866 34 94 0 14 3 720 207 513
4 0xffff881fba539cc0 1813 1.35% 1.85% 808 33 84 3 14 1 665 170 495

Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 136 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 0749ea6..57441b9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -783,6 +783,141 @@ cleanup:
}
}

+static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
+ struct c2c_stats *shared_stats __maybe_unused,
+ struct c2c_stats *c2c_stats __maybe_unused)
+{
+#define SHM_TITLE "Shared Data Cache Line Table"
+
+ struct rb_node *next = rb_first(hitm_tree);
+ struct c2c_hit *h;
+ char header[256];
+ char delimit[256];
+ u32 crecords;
+ u32 lclmiss;
+ u32 ldcnt;
+ double p_hitm;
+ double p_all;
+ int totmiss;
+ int rmt_hitm;
+ int len;
+ int pad;
+ int i;
+
+ sprintf(header,"%28s %8s %8s %8s %8s %28s %18s %28s %18s %8s %28s",
+ " ",
+ "Total",
+ "%All ",
+ " ",
+ "Total",
+ "---- Core Load Hit ----",
+ "-- LLC Load Hit --",
+ "----- LLC Load Hitm -----",
+ "-- Load Dram --",
+ "LLC ",
+ "---- Store Reference ----");
+
+ len = strlen(header);
+ delimit[0] = '\0';
+
+ for (i = 0; i < len; i++)
+ strcat(delimit, "=");
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("\n");
+ pad = (strlen(header)/2) - (strlen(SHM_TITLE)/2);
+ for (i = 0; i < pad; i++)
+ printf(" ");
+ printf("%s\n", SHM_TITLE);
+ printf("\n");
+ printf("%s\n", header);
+
+ sprintf(header, "%8s %18s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s",
+ "Index",
+ "Phys Adrs",
+ "Records",
+ "Ld Miss",
+ "%hitm",
+ "Loads",
+ "FB",
+ "L1D",
+ "L2D",
+ "Lcl",
+ "Rmt",
+ "Total",
+ "Lcl",
+ "Rmt",
+ "Lcl",
+ "Rmt",
+ "Ld Miss",
+ "Total",
+ "L1Hit",
+ "L1Miss");
+
+ printf("%s\n", header);
+ printf("%s\n", delimit);
+
+ rmt_hitm = c2c_stats->t.rmt_hitm;
+ totmiss = c2c_stats->t.lcl_dram +
+ c2c_stats->t.rmt_dram +
+ c2c_stats->t.rmt_hit +
+ c2c_stats->t.rmt_hitm;
+
+ i = 0;
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+
+ lclmiss = h->stats.t.lcl_dram +
+ h->stats.t.rmt_dram +
+ h->stats.t.rmt_hitm +
+ h->stats.t.rmt_hit;
+
+ ldcnt = lclmiss +
+ h->stats.t.ld_fbhit +
+ h->stats.t.ld_l1hit +
+ h->stats.t.ld_l2hit +
+ h->stats.t.ld_llchit +
+ h->stats.t.lcl_hitm;
+
+ crecords = ldcnt +
+ h->stats.t.st_l1hit +
+ h->stats.t.st_l1miss;
+
+ p_hitm = (double)h->stats.t.rmt_hitm / (double)rmt_hitm;
+ p_all = (double)h->stats.t.rmt_hitm / (double)totmiss;
+
+ /* stop when the percentage gets to low */
+ if (p_hitm < DISPLAY_LINE_LIMIT)
+ break;
+
+ printf("%8d %#18lx %8u %7.2f%% %7.2f%% %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u %8u\n",
+ i,
+ h->cacheline,
+ crecords,
+ 100. * p_all,
+ 100. * p_hitm,
+ ldcnt,
+ h->stats.t.ld_fbhit,
+ h->stats.t.ld_l1hit,
+ h->stats.t.ld_l2hit,
+ h->stats.t.ld_llchit,
+ h->stats.t.rmt_hit,
+ h->stats.t.lcl_hitm + h->stats.t.rmt_hitm,
+ h->stats.t.lcl_hitm,
+ h->stats.t.rmt_hitm,
+ h->stats.t.lcl_dram,
+ h->stats.t.rmt_dram,
+ lclmiss,
+ h->stats.t.store,
+ h->stats.t.st_l1hit,
+ h->stats.t.st_l1miss);
+
+ i++;
+ }
+}
+
static void print_hitm_cacheline_header(void)
{
#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
@@ -1290,6 +1425,7 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
free(h);

print_shared_cacheline_info(&hitm_stats, shared_clines);
+ print_c2c_shared_cacheline_report(&hitm_tree, &hitm_stats, &c2c->stats);
print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);

cleanup:
--
1.7.11.7

2014-02-28 17:44:24

by Don Zickus

Subject: [PATCH 08/19] perf c2c: Shared data analyser

From: Arnaldo Carvalho de Melo <[email protected]>

This is the start of a new perf tool that will collect information about
memory accesses and analyse it to find things like hot cachelines, etc.

This is basically a prototype written by Richard Fowles, reworked to use
the tools/perf coding style and libraries.

Started from 'perf sched', this patch begins the process by adding the
'record' subcommand to collect the needed mem loads and stores samples.

It also has the basic 'report' skeleton, resolving the sample address
and hooking the events found in a perf.data file up with methods to handle
them, right now just printing the resolved perf_sample data structure
after each event name.
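
For reference, the 'record' subcommand added here effectively expands to something
like this (based on record_args[] and the handlers[] table in the patch):

    perf record -W -d -a -e cpu/mem-loads,ldlat=30/pp -e cpu/mem-stores/pp <command>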

[dcz: refreshed to latest upstream changes]

Cc: David Ahern <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joe Mario <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Fowles <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf-c2c.c | 22 +++++
tools/perf/Makefile.perf | 1 +
tools/perf/builtin-c2c.c | 185 ++++++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/perf.c | 1 +
5 files changed, 210 insertions(+)
create mode 100644 tools/perf/Documentation/perf-c2c.c
create mode 100644 tools/perf/builtin-c2c.c

diff --git a/tools/perf/Documentation/perf-c2c.c b/tools/perf/Documentation/perf-c2c.c
new file mode 100644
index 0000000..4d52798
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.c
@@ -0,0 +1,22 @@
+perf-c2c(1)
+===========
+
+NAME
+----
+perf-c2c - Shared Data C2C/HITM Analyzer.
+
+SYNOPSIS
+--------
+[verse]
+'perf c2c' record
+
+DESCRIPTION
+-----------
+These are the variants of perf c2c:
+
+ 'perf c2c record <command>' to record the memory accesses of an arbitrary
+ workload.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-mem[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 1f7ec48..a9eebb4 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -427,6 +427,7 @@ endif
BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o

+BUILTIN_OBJS += $(OUTPUT)builtin-c2c.o
BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
BUILTIN_OBJS += $(OUTPUT)builtin-help.o
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
new file mode 100644
index 0000000..2935484
--- /dev/null
+++ b/tools/perf/builtin-c2c.c
@@ -0,0 +1,185 @@
+#include "builtin.h"
+#include "cache.h"
+
+#include "util/evlist.h"
+#include "util/parse-options.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct perf_c2c {
+ struct perf_tool tool;
+};
+
+static int perf_sample__fprintf(struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct addr_location *al, FILE *fp)
+{
+ return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
+ perf_evsel__name(evsel),
+ sample->pid, sample->tid, sample->ip, sample->addr,
+ sample->weight, sample->data_src,
+ al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+ al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al)
+{
+ perf_sample__fprintf(sample, evsel, al, stdout);
+ return 0;
+}
+
+static int perf_c2c__process_store(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al)
+{
+ perf_sample__fprintf(sample, evsel, al, stdout);
+ return 0;
+}
+
+static const struct perf_evsel_str_handler handlers[] = {
+ { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
+ { "cpu/mem-stores/pp", perf_c2c__process_store, },
+};
+
+typedef int (*sample_handler)(struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct addr_location *al);
+
+static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct machine *machine)
+{
+ struct addr_location al;
+ int err = 0;
+
+ if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
+ pr_err("problem processing %d event, skipping it.\n",
+ event->header.type);
+ return -1;
+ }
+
+ if (evsel->handler != NULL) {
+ sample_handler f = evsel->handler;
+ err = f(evsel, sample, &al);
+ }
+
+ return err;
+}
+
+static int perf_c2c__read_events(struct perf_c2c *c2c)
+{
+ int err = -1;
+ struct perf_session *session;
+ struct perf_data_file file = {
+ .path = input_name,
+ .mode = PERF_DATA_MODE_READ,
+ };
+ struct perf_evsel *evsel;
+
+ session = perf_session__new(&file, 0, &c2c->tool);
+ if (session == NULL) {
+ pr_debug("No memory for session\n");
+ goto out;
+ }
+
+ /* setup the evsel handlers for each event type */
+ evlist__for_each(session->evlist, evsel) {
+ const char *name = perf_evsel__name(evsel);
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(handlers); i++) {
+ if (!strcmp(name, handlers[i].name))
+ evsel->handler = handlers[i].handler;
+ }
+ }
+
+ err = perf_session__process_events(session, &c2c->tool);
+ if (err)
+ pr_err("Failed to process events, error %d", err);
+
+out:
+ return err;
+}
+
+static int perf_c2c__report(struct perf_c2c *c2c)
+{
+ setup_pager();
+ return perf_c2c__read_events(c2c);
+}
+
+static int perf_c2c__record(int argc, const char **argv)
+{
+ unsigned int rec_argc, i, j;
+ const char **rec_argv;
+ const char * const record_args[] = {
+ "record",
+ /* "--phys-addr", */
+ "-W",
+ "-d",
+ "-a",
+ };
+
+ rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+ if (rec_argv == NULL)
+ return -ENOMEM;
+
+ for (i = 0; i < ARRAY_SIZE(record_args); i++)
+ rec_argv[i] = strdup(record_args[i]);
+
+ for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+ rec_argv[i++] = strdup("-e");
+ rec_argv[i++] = strdup(handlers[j].name);
+ }
+
+ for (j = 1; j < (unsigned int)argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+ struct perf_c2c c2c = {
+ .tool = {
+ .sample = perf_c2c__process_sample,
+ .comm = perf_event__process_comm,
+ .exit = perf_event__process_exit,
+ .fork = perf_event__process_fork,
+ .lost = perf_event__process_lost,
+ .ordered_samples = true,
+ },
+ };
+ const struct option c2c_options[] = {
+ OPT_END()
+ };
+ const char * const c2c_usage[] = {
+ "perf c2c {record|report}",
+ NULL
+ };
+
+ argc = parse_options(argc, argv, c2c_options, c2c_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(c2c_usage, c2c_options);
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return perf_c2c__record(argc, argv);
+ } else if (!strncmp(argv[0], "rep", 3)) {
+ return perf_c2c__report(&c2c);
+ } else {
+ usage_with_options(c2c_usage, c2c_options);
+ }
+
+ return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index b210d62..2d0b1b5 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -17,6 +17,7 @@ extern int cmd_annotate(int argc, const char **argv, const char *prefix);
extern int cmd_bench(int argc, const char **argv, const char *prefix);
extern int cmd_buildid_cache(int argc, const char **argv, const char *prefix);
extern int cmd_buildid_list(int argc, const char **argv, const char *prefix);
+extern int cmd_c2c(int argc, const char **argv, const char *prefix);
extern int cmd_diff(int argc, const char **argv, const char *prefix);
extern int cmd_evlist(int argc, const char **argv, const char *prefix);
extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 431798a..c7012a3 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -35,6 +35,7 @@ struct cmd_struct {
static struct cmd_struct commands[] = {
{ "buildid-cache", cmd_buildid_cache, 0 },
{ "buildid-list", cmd_buildid_list, 0 },
+ { "c2c", cmd_c2c, 0 },
{ "diff", cmd_diff, 0 },
{ "evlist", cmd_evlist, 0 },
{ "help", cmd_help, 0 },
--
1.7.11.7

2014-02-28 17:44:34

by Don Zickus

Subject: [PATCH 09/19] perf c2c: Dump raw records, decode data_src bits

From: Arnaldo Carvalho de Melo <[email protected]>

From the c2c prototype:

[root@sandy ~]# perf c2c -r report | head -7
T Status Pid Tid CPU Inst Adrs Virt Data Adrs Phys Data Adrs Cycles Source Decoded Source ObJect:Symbol
--------------------------------------------------------------------------------------------------------------------------------------------
raw input 779 779 7 0xffffffff810865dd 0xffff8803f4d75ec8 0 370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA] [kernel.kallsyms]:try_to_wake_up
raw input 779 779 7 0xffffffff8107acb3 0xffff8802a5b73158 0 297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
raw input 779 779 7 0x3b7e009814 0x7fff87429ea0 0 925 0x68100142 [LOAD,L1,HIT,SNP NONE] ???:???
raw input 0 0 1 0xffffffff8108bf81 0xffff8803eafebf50 0 172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM] [kernel.kallsyms]:update_stats_wait_end
raw input 779 779 7 0x3b7e0097cc 0x7fac94b69068 0 228 0x68100242 [LOAD,LFB,HIT,SNP NONE] ???:???
[root@sandy ~]#

The "Phys Data Adrs" column is not available at this point.

Cc: David Ahern <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joe Mario <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Fowles <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-c2c.c | 148 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 125 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 2935484..7082913 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -11,51 +11,148 @@

struct perf_c2c {
struct perf_tool tool;
+ bool raw_records;
};

-static int perf_sample__fprintf(struct perf_sample *sample,
- struct perf_evsel *evsel,
- struct addr_location *al, FILE *fp)
+enum { OP, LVL, SNP, LCK, TLB };
+
+static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
{
- return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
- perf_evsel__name(evsel),
- sample->pid, sample->tid, sample->ip, sample->addr,
- sample->weight, sample->data_src,
- al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
- al->sym ? al->sym->name : "???");
+#define PREFIX "["
+#define SUFFIX "]"
+#define ELLIPSIS "..."
+ static const struct {
+ uint64_t bit;
+ int64_t field;
+ const char *name;
+ } decode_bits[] = {
+ { PERF_MEM_OP_LOAD, OP, "LOAD" },
+ { PERF_MEM_OP_STORE, OP, "STORE" },
+ { PERF_MEM_OP_NA, OP, "OP_NA" },
+ { PERF_MEM_LVL_LFB, LVL, "LFB" },
+ { PERF_MEM_LVL_L1, LVL, "L1" },
+ { PERF_MEM_LVL_L2, LVL, "L2" },
+ { PERF_MEM_LVL_L3, LVL, "LCL_LLC" },
+ { PERF_MEM_LVL_LOC_RAM, LVL, "LCL_RAM" },
+ { PERF_MEM_LVL_REM_RAM1, LVL, "RMT_RAM" },
+ { PERF_MEM_LVL_REM_RAM2, LVL, "RMT_RAM" },
+ { PERF_MEM_LVL_REM_CCE1, LVL, "RMT_LLC" },
+ { PERF_MEM_LVL_REM_CCE2, LVL, "RMT_LLC" },
+ { PERF_MEM_LVL_IO, LVL, "I/O" },
+ { PERF_MEM_LVL_UNC, LVL, "UNCACHED" },
+ { PERF_MEM_LVL_NA, LVL, "N" },
+ { PERF_MEM_LVL_HIT, LVL, "HIT" },
+ { PERF_MEM_LVL_MISS, LVL, "MISS" },
+ { PERF_MEM_SNOOP_NONE, SNP, "SNP NONE" },
+ { PERF_MEM_SNOOP_HIT, SNP, "SNP HIT" },
+ { PERF_MEM_SNOOP_MISS, SNP, "SNP MISS" },
+ { PERF_MEM_SNOOP_HITM, SNP, "SNP HITM" },
+ { PERF_MEM_SNOOP_NA, SNP, "SNP NA" },
+ { PERF_MEM_LOCK_LOCKED, LCK, "LOCKED" },
+ { PERF_MEM_LOCK_NA, LCK, "LOCK_NA" },
+ };
+ union perf_mem_data_src dsrc = { .val = val, };
+ int printed = scnprintf(bf, size, PREFIX);
+ size_t i;
+ bool first_present = true;
+
+ for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
+ int bitval;
+
+ switch (decode_bits[i].field) {
+ case OP: bitval = decode_bits[i].bit & dsrc.mem_op; break;
+ case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl; break;
+ case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
+ case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock; break;
+ case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb; break;
+ default: bitval = 0; break;
+ }
+
+ if (!bitval)
+ continue;
+
+ if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
+ sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
+ printed = size - sizeof(SUFFIX);
+ break;
+ }
+
+ printed += scnprintf(bf + printed, size - printed, "%s%s",
+ first_present ? "" : ",", decode_bits[i].name);
+ first_present = false;
+ }
+
+ printed += scnprintf(bf + printed, size - printed, SUFFIX);
+ return printed;
}

-static int perf_c2c__process_load(struct perf_evsel *evsel,
- struct perf_sample *sample,
- struct addr_location *al)
+static int perf_c2c__fprintf_header(FILE *fp)
{
- perf_sample__fprintf(sample, evsel, al, stdout);
- return 0;
+ int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
+ 'T',
+ "Status",
+ "Pid",
+ "Tid",
+ "CPU",
+ "Inst Adrs",
+ "Virt Data Adrs",
+ "Phys Data Adrs",
+ "Cycles",
+ "Source",
+ " Decoded Source",
+ "ObJect:Symbol");
+ return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
+}
+
+static int perf_sample__fprintf(struct perf_sample *sample, char tag,
+ const char *reason, struct addr_location *al, FILE *fp)
+{
+ char data_src[61];
+
+ perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
+
+ return fprintf(fp, "%c %-16s %6d %6d %4d %#18" PRIx64 " %#18" PRIx64 " %#18" PRIx64 " %6" PRIu64 " %#10" PRIx64 " %-60.60s %s:%s\n",
+ tag,
+ reason ?: "valid record",
+ sample->pid,
+ sample->tid,
+ sample->cpu,
+ sample->ip,
+ sample->addr,
+ 0UL,
+ sample->weight,
+ sample->data_src,
+ data_src,
+ al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+ al->sym ? al->sym->name : "???");
}

-static int perf_c2c__process_store(struct perf_evsel *evsel,
- struct perf_sample *sample,
- struct addr_location *al)
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+ struct perf_sample *sample,
+ struct addr_location *al)
{
- perf_sample__fprintf(sample, evsel, al, stdout);
+ if (c2c->raw_records)
+ perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+
return 0;
}

static const struct perf_evsel_str_handler handlers[] = {
- { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
- { "cpu/mem-stores/pp", perf_c2c__process_store, },
+ { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
+ { "cpu/mem-stores/pp", perf_c2c__process_load_store, },
};

-typedef int (*sample_handler)(struct perf_evsel *evsel,
+typedef int (*sample_handler)(struct perf_c2c *c2c,
struct perf_sample *sample,
struct addr_location *al);

-static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+static int perf_c2c__process_sample(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct perf_evsel *evsel,
struct machine *machine)
{
+ struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
struct addr_location al;
int err = 0;

@@ -67,7 +164,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,

if (evsel->handler != NULL) {
sample_handler f = evsel->handler;
- err = f(evsel, sample, &al);
+ err = f(c2c, sample, &al);
}

return err;
@@ -111,6 +208,10 @@ out:
static int perf_c2c__report(struct perf_c2c *c2c)
{
setup_pager();
+
+ if (c2c->raw_records)
+ perf_c2c__fprintf_header(stdout);
+
return perf_c2c__read_events(c2c);
}

@@ -161,6 +262,7 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
},
};
const struct option c2c_options[] = {
+ OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7

2014-02-28 17:43:54

by Don Zickus

Subject: [PATCH 15/19] perf, c2c: Add callchain support

Seeing cacheline statistics is useful by itself. Seeing the callchain
for these cache contentions saves time tracking things down.

This patch tries to add callchain support. I had to use the generic
interface from a previous patch to output things to stdout easily.

Other than displaying the results, collecting the callchain and
merging it was fairly straightforward.

I used a lot of copying-and-pasting from other builtin tools to get
the initial parameter setup correct and the automatic reading of
'symbol_conf.use_callchain' from the data file.

Hopefully this is all correct. The amount of memory corruption (from the
callchain dynamic array) seems to have dwindled down to nothing. :-)
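
An illustrative invocation of the new option (the values are just an example, and the
data must have been recorded with call graphs):

    perf c2c report -g graph,0.5,callee,function -i perf.data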

V2: update to latest api

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 150 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c5f4b5a..8756ca5 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -52,6 +52,7 @@ struct c2c_stats {
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
+ bool call_graph;
struct hists hists;

/* stats */
@@ -78,6 +79,8 @@ struct c2c_hit {
u64 daddr;
u64 iaddr;
struct mem_info *mi;
+
+ struct callchain_root callchain[0]; /* must be last member */
};

enum { OP, LVL, SNP, LCK, TLB };
@@ -372,7 +375,8 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)

static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
{
- struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+ size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+ struct c2c_hit *h = zalloc(sizeof(struct c2c_hit) + callchain_size);

if (!h) {
pr_err("Could not allocate c2c_hit memory\n");
@@ -386,6 +390,8 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
h->cacheline = cacheline;
h->pid = entry->thread->pid_;
h->tid = entry->thread->tid;
+ if (symbol_conf.use_callchain)
+ callchain_init(h->callchain);

/* use original addresses here, not adjusted al_addr */
h->iaddr = entry->mem_info->iaddr.addr;
@@ -509,6 +515,10 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
return 0;
}

+ err = sample__resolve_callchain(sample, &parent, evsel, al, PERF_MAX_STACK_DEPTH);
+ if (err)
+ return err;
+
cost = sample->weight;
if (!cost)
cost = 1;
@@ -544,8 +554,9 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
if (err)
goto out;

- c2c->hists.stats.total_period += cost;
- hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+ c2c->hists.stats.total_period += cost;
+ hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+ err = hist_entry__append_callchain(he, sample);
return err;

out_mem:
@@ -944,6 +955,13 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
print_socket_shared_str(node_stats);

printf("\n");
+
+ if (symbol_conf.use_callchain) {
+ generic_entry_callchain__fprintf(clo->callchain,
+ h->stats.total_period,
+ clo->stats.total_period,
+ 23, stdout);
+ }
}

static void print_c2c_hitm_report(struct rb_root *hitm_tree,
@@ -1020,6 +1038,12 @@ static void print_c2c_hitm_report(struct rb_root *hitm_tree,
c2c_decode_stats(&node_stats[node], entry);
CPU_SET(entry->cpu, &(node_stats[node].cpuset));
}
+ if (symbol_conf.use_callchain) {
+ callchain_cursor_reset(&callchain_cursor);
+ callchain_merge(&callchain_cursor,
+ clo->callchain,
+ entry->callchain);
+ }

}
if (clo) {
@@ -1151,6 +1175,30 @@ err:
return err;
}

+static int perf_c2c__setup_sample_type(struct perf_c2c *c2c,
+ struct perf_session *session)
+{
+ u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+ if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
+ if (symbol_conf.use_callchain) {
+ printf("Selected -g but no callchain data. Did "
+ "you call 'perf c2c record' without -g?\n");
+ return -1;
+ }
+ } else if (callchain_param.mode != CHAIN_NONE &&
+ !symbol_conf.use_callchain) {
+ symbol_conf.use_callchain = true;
+ c2c->call_graph = true;
+ if (callchain_register_param(&callchain_param) < 0) {
+ printf("Can't register callchain params.\n");
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
static int perf_c2c__read_events(struct perf_c2c *c2c)
{
int err = -1;
@@ -1170,6 +1218,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
if (symbol__init() < 0)
goto out_delete;

+ if (perf_c2c__setup_sample_type(c2c, session) < 0)
+ goto out_delete;
+
/* setup the evsel handlers for each event type */
evlist__for_each(session->evlist, evsel) {
const char *name = perf_evsel__name(evsel);
@@ -1257,8 +1308,101 @@ static int perf_c2c__record(int argc, const char **argv)
return cmd_record(i, rec_argv, NULL);
}

+static int
+opt_callchain_cb(const struct option *opt, const char *arg, int unset)
+{
+ struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
+ char *tok, *tok2;
+ char *endptr;
+
+ /*
+ * --no-call-graph
+ */
+ if (unset) {
+ c2c->call_graph = false;
+ return 0;
+ }
+
+ symbol_conf.use_callchain = true;
+ c2c->call_graph = true;
+
+ if (!arg)
+ return 0;
+
+ tok = strtok((char *)arg, ",");
+ if (!tok)
+ return -1;
+
+ /* get the output mode */
+ if (!strncmp(tok, "graph", strlen(arg)))
+ callchain_param.mode = CHAIN_GRAPH_ABS;
+
+ else if (!strncmp(tok, "flat", strlen(arg)))
+ callchain_param.mode = CHAIN_FLAT;
+
+ else if (!strncmp(tok, "fractal", strlen(arg)))
+ callchain_param.mode = CHAIN_GRAPH_REL;
+
+ else if (!strncmp(tok, "none", strlen(arg))) {
+ callchain_param.mode = CHAIN_NONE;
+ symbol_conf.use_callchain = false;
+
+ return 0;
+ }
+
+ else
+ return -1;
+
+ /* get the min percentage */
+ tok = strtok(NULL, ",");
+ if (!tok)
+ goto setup;
+
+ callchain_param.min_percent = strtod(tok, &endptr);
+ if (tok == endptr)
+ return -1;
+
+ /* get the print limit */
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+
+ if (tok2[0] != 'c') {
+ callchain_param.print_limit = strtoul(tok2, &endptr, 0);
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+ }
+
+ /* get the call chain order */
+ if (!strncmp(tok2, "caller", strlen("caller")))
+ callchain_param.order = ORDER_CALLER;
+ else if (!strncmp(tok2, "callee", strlen("callee")))
+ callchain_param.order = ORDER_CALLEE;
+ else
+ return -1;
+
+ /* Get the sort key */
+ tok2 = strtok(NULL, ",");
+ if (!tok2)
+ goto setup;
+ if (!strncmp(tok2, "function", strlen("function")))
+ callchain_param.key = CCKEY_FUNCTION;
+ else if (!strncmp(tok2, "address", strlen("address")))
+ callchain_param.key = CCKEY_ADDRESS;
+ else
+ return -1;
+setup:
+ if (callchain_register_param(&callchain_param) < 0) {
+ fprintf(stderr, "Can't register callchain params\n");
+ return -1;
+ }
+ return 0;
+}
+
int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
{
+ char callchain_default_opt[] = "fractal,0.05,callee";
struct perf_c2c c2c = {
.tool = {
.sample = perf_c2c__process_sample,
@@ -1285,6 +1429,9 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
"separator",
"separator for columns, no spaces will be added"
" between columns '.' is reserved."),
+ OPT_CALLBACK_DEFAULT('g', "call-graph", &c2c, "output_type,min_percent[,print_limit],call_order",
+ "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
+ "Default: fractal,0.5,callee,function", &opt_callchain_cb, callchain_default_opt),
OPT_END()
};
const char * const c2c_usage[] = {
--
1.7.11.7

2014-02-28 17:43:52

by Don Zickus

Subject: [PATCH 14/19] perf, c2c: Display cacheline HITM analysis to stdout

This patch mainly focuses on processing and displaying the collected
HITMs to stdout. Most of it is just printing data in a pretty way.

There is one trick used when walking the cachelines. When we get this
far we have two rbtrees. One rbtree holds every record sorted on a
unique id (using the mmap2 decoder); the other rbtree holds every
cacheline with at least one HITM, sorted on the number of HITMs.

To display the output, the tool walks the second rbtree to display
the hottest cachelines. Inside each hot cacheline, the tool displays
the offsets and the loads/stores they generate. To determine the
cacheline offsets, it uses a linked list inside the cacheline element
to walk the first rbtree's elements for that particular cacheline.

The first rbtree's elements are already sorted correctly in offset order, so
processing the offsets is fairly trivial and is done sequentially.

This is why you will see two while loops in print_c2c_hitm_report(): the
outer one walks the cachelines, the inner one walks the offsets.

A knob has been added to display node information, which is useful
to see which cpus are involved in the contention and their nodes.

Another knob has been added to change the coalescing levels. You can
coalesce the output based on pid, tid, ip, and/or symbol.

Original output and statistics done by Dick Fowles, backported by me.

Sample output:

=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 1327
Load HITs on shared lines : 167131
Fill Buffer Hits on shared lines : 43469
L1D hits on shared lines : 50497
L2D hits on shared lines : 960
LLC hits on shared lines : 38467
Locked Access on shared lines : 100032
Store HITs on shared lines : 118659
Store L1D hits on shared lines : 113783
Total Merged records : 160807

===========================================================================================================================================================

Shared Cache Line Distribution Pareto

---- All ---- -- Shared -- ---- HITM ---- Load Inst Execute Latency
Data Misses Data Misses Remote Local -- Store Refs --
---- cycles ---- cpu
Num %dist %cumm %dist %cumm LLCmiss LLChit L1 hit L1 Miss Data Address Pid Tid Inst Address median mean CV cnt
==========================================================================================================================================================
-----------------------------------------------------------------------------------------------
0 17.0% 17.0% 23.3% 23.3% 6238 3288 28098 813 0xffff881fa55b0140 ***
-----------------------------------------------------------------------------------------------
0.0% 0.0% 0.0% 0.0% 0x00 375 375 0xffffffffa018ff5b n/a n/a n/a 1
0.0% 0.0% 0.0% 0.0% 0x08 18156 18156 0xffffffffa018b7f9 -1 384 0.0% 1
0.2% 0.0% 0.0% 0.0% 0x10 18156 18156 0xffffffff811ca1aa -1 387 10.7% 7
0.0% 0.0% 23.2% 0.0% 0x18 18156 18156 0xffffffff815c1615 -1 684 0.0% 50

-----------------------------------------------------------------------------------------------
1 5.3% 22.3% 7.3% 30.6% 1944 1143 7916 0 0xffff881fba47f000 18156
-----------------------------------------------------------------------------------------------
100.0% 100.0% 0.0% 0.0% 0x00 18156 18156 0xffffffffa01b410e -1 401 13.5% 50
0.0% 0.0% 10.1% 0.0% 0x28 18156 18156 0xffffffffa0167409 n/a n/a n/a 50
0.0% 0.0% 89.9% 0.0% 0x28 18156 18156 0xffffffff815c4be9 n/a n/a n/a 50

Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 519 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 519 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 30cb8c3..c5f4b5a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -58,10 +58,13 @@ struct perf_c2c {
struct c2c_stats stats;
};

+#define DISPLAY_LINE_LIMIT 0.0015
#define CACHE_LINESIZE 64
#define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)
#define CLADRS(a) ((a) & ~(CLINE_OFFSET_MSK))
#define CLOFFSET(a) (int)((a) & (CLINE_OFFSET_MSK))
+#define MAXTITLE_SZ 400
+#define MAXLBL_SZ 256

struct c2c_hit {
struct rb_node rb_node;
@@ -102,6 +105,11 @@ enum { OP, LVL, SNP, LCK, TLB };
#define LCL_HITM(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
#define LCL_MEM(a) (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))

+enum { LVL0, LVL1, LVL2, LVL3, LVL4, MAX_LVL };
+static int cloffset = LVL1;
+static int node_info = 0;
+static int coalesce_level = LVL1;
+
static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
{
#define PREFIX "["
@@ -406,6 +414,80 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
CPU_SET(n->cpu, &h->stats.cpuset);
}

+static inline bool matching_coalescing(struct c2c_hit *h,
+ struct hist_entry *e)
+{
+ bool value = false;
+ struct mem_info *mi = e->mem_info;
+
+ if (coalesce_level > MAX_LVL)
+ printf("DON: bad coalesce level %d\n", coalesce_level);
+
+ if (e->cpumode != PERF_RECORD_MISC_KERNEL) {
+
+ switch (coalesce_level) {
+
+ case LVL0:
+ case LVL1:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->tid == e->thread->tid) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL2:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL3:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL4:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->mi->iaddr.sym == mi->iaddr.sym));
+ break;
+
+ default:
+ break;
+
+ }
+
+ } else {
+
+ switch (coalesce_level) {
+
+ case LVL0:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->pid == e->thread->pid_) &&
+ (h->tid == e->thread->tid) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL1:
+ case LVL2:
+ case LVL3:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->iaddr == mi->iaddr.addr));
+ break;
+
+ case LVL4:
+ value = ((h->daddr == mi->daddr.addr) &&
+ (h->mi->iaddr.sym == mi->iaddr.sym));
+ break;
+
+ default:
+ break;
+
+ }
+ }
+
+ return value;
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct addr_location *al,
struct perf_sample *sample,
@@ -543,12 +625,442 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
new->total_period += old->total_period;
}

+static void print_hitm_cacheline_header(void)
+{
+#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
+#define PARTICIPANTS1 "Node{cpus %hitms %stores} Node{cpus %hitms %stores} ..."
+#define PARTICIPANTS2 "Node{cpu list}; Node{cpu list}; Node{cpu list}; ..."
+
+ int i;
+ const char *docptr;
+ static char delimit[MAXTITLE_SZ];
+ static char title2[MAXTITLE_SZ];
+ int pad;
+
+ docptr = " ";
+ if (node_info == 1)
+ docptr = PARTICIPANTS1;
+ if (node_info == 2)
+ docptr = PARTICIPANTS2;
+
+ sprintf(title2, "%4s %6s %6s %6s %6s %8s %8s %8s %8s %18s %6s %6s %18s %8s %8s %8s %6s %-30s %-20s %s",
+ "Num",
+ "%dist",
+ "%cumm",
+ "%dist",
+ "%cumm",
+ "LLCmiss",
+ "LLChit",
+ "L1 hit",
+ "L1 Miss",
+ "Data Address",
+ "Pid",
+ "Tid",
+ "Inst Address",
+ "median",
+ "mean",
+ "CV ",
+ "cnt",
+ "Symbol",
+ "Object",
+ docptr);
+
+ for (i = 0; i < (int)strlen(title2); i++) strcat(delimit, "=");
+
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("\n");
+
+ pad = (strlen(title2)/2) - (strlen(SHARING_REPORT_TITLE)/2);
+ for (i = 0; i < pad; i++) printf(" ");
+ printf("%s\n", SHARING_REPORT_TITLE);
+
+ printf("\n");
+ printf("%4s %13s %13s %17s %8s %8s %18s %6s %6s %18s %26s %6s %30s %20s %s\n",
+ " ",
+ "---- All ----",
+ "-- Shared --",
+ "---- HITM ----",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ "Load Inst Execute Latency",
+ " ",
+ " ",
+ " ",
+ node_info ? "Shared Data Participants" : " ");
+
+
+ printf("%4s %13s %13s %8s %8s %17s %18s %6s %6s %17s %18s\n",
+ " ",
+ " Data Misses",
+ " Data Misses",
+ "Remote",
+ "Local",
+ "-- Store Refs --",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ");
+
+ printf("%4s %13s %13s %8s %8s %8s %8s %18s %6s %6s %17s %18s %8s %6s\n",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ "---- cycles ----",
+ " ",
+ "cpu");
+
+ printf("%s\n", title2);
+ printf("%s\n", delimit);
+}
+
+static void print_hitm_cacheline(struct c2c_hit *h,
+ int record,
+ double tot_cumm,
+ double ld_cumm,
+ double tot_dist,
+ double ld_dist)
+{
+ char pidstr[7];
+ char addrstr[20];
+ static char summary[MAXLBL_SZ];
+ int j;
+
+ if (h->pid > 0)
+ sprintf(pidstr, "%6d", h->pid);
+ else
+ sprintf(pidstr, "***");
+ /*
+ * It is possible to have multiple distinct virtual addresses
+ * pointing to the same System V shared memory region.
+ * If there are multiple virtual addresses, the address
+ * field will be asterisks. It would be possible to substitute
+ * the physical address, but that could be confusing as
+ * sometimes the field would be a virtual address while other
+ * times it would be a physical address.
+ */
+ if (h->daddr != ~0UL)
+ sprintf(addrstr, "%#18lx", CLADRS(h->daddr));
+ else
+ sprintf(addrstr, "****************");
+
+
+ sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
+ record,
+ tot_dist * 100.,
+ tot_cumm * 100.,
+ ld_dist * 100.,
+ ld_cumm * 100.,
+ h->stats.t.rmt_hitm,
+ h->stats.t.lcl_hitm,
+ h->stats.t.st_l1hit,
+ h->stats.t.st_l1miss,
+ addrstr,
+ pidstr);
+
+ for (j = 0; j < (int)strlen(summary); j++) printf("-");
+ printf("\n");
+ printf("%s", summary);
+ for (j = 0; j < (int)strlen(summary); j++) printf("-");
+ printf("\n");
+}
+
+static void print_socket_stats_str(struct c2c_hit *clo,
+ struct c2c_stats *node_stats)
+{
+ int i, j;
+
+ if (!node_stats)
+ return;
+
+ for (i = 0; i < max_node_num; i++) {
+ struct c2c_stats *stats = &node_stats[i];
+ int num = CPU_COUNT(&stats->cpuset);
+
+ if (!num) {
+ /* pad align socket info */
+ for (j = 0; j < 21; j++)
+ printf(" ");
+ continue;
+ }
+
+ printf("%2d{%2d ", i, num);
+
+ if (clo->stats.t.rmt_hitm > 0)
+ printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));
+ else
+ printf("%6s ", "n/a");
+
+ if (clo->stats.t.store > 0)
+ printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
+ else
+ printf("%6s} ", "n/a");
+ }
+}
+
+static void print_socket_shared_str(struct c2c_stats *node_stats)
+{
+ int i, j;
+
+ if (!node_stats)
+ return;
+
+ for (i = 0; i < max_node_num; i++) {
+ struct c2c_stats *stats = &node_stats[i];
+ int num = CPU_COUNT(&stats->cpuset);
+ int start = -1;
+ bool first = true;
+
+ if (!num)
+ continue;
+
+ printf("%d{", i);
+
+ for (j = 0; j < max_cpu_num; j++) {
+ if (!CPU_ISSET(j, &stats->cpuset)) {
+ if (start != -1) {
+ if ((j-1) - start)
+ /* print the range */
+ printf("%s%d-%d", (first ? "" : ","), start, j-1);
+ else
+ /* standalone */
+ printf("%s%d", (first ? "" : ",") , start);
+ start = -1;
+ first = false;
+ }
+ continue;
+ }
+
+ if (start == -1)
+ start = j;
+ }
+ /* last chunk */
+ if (start != -1) {
+ if ((j-1) - start)
+ /* print the range */
+ printf("%s%d-%d", (first ? "" : ","), start, j-1);
+ else
+ /* standalone */
+ printf("%s%d", (first ? "" : ",") , start);
+ }
+
+ printf("}; ");
+ }
+}
+
+static void print_hitm_cacheline_offset(struct c2c_hit *clo,
+ struct c2c_hit *h,
+ struct c2c_stats *node_stats)
+{
+#define SHORT_STR_LEN 7
+#define LONG_STR_LEN 30
+
+ char pidstr[SHORT_STR_LEN];
+ char tidstr[SHORT_STR_LEN];
+ char addrstr[LONG_STR_LEN];
+ char latstr[LONG_STR_LEN];
+ char objptr[LONG_STR_LEN];
+ char symptr[LONG_STR_LEN];
+ struct c2c_stats *stats = &clo->stats;
+ struct addr_map_symbol *ams;
+
+ ams = &clo->mi->iaddr;
+
+ if (clo->pid >= 0)
+ snprintf(pidstr, SHORT_STR_LEN, "%6d", clo->pid);
+ else
+ sprintf(pidstr, "***");
+
+ if (clo->tid >= 0)
+ snprintf(tidstr, SHORT_STR_LEN, "%6d", clo->tid);
+ else
+ sprintf(tidstr, "***");
+
+ if (clo->iaddr != ~0UL)
+ snprintf(addrstr, LONG_STR_LEN, "%#18lx", clo->iaddr);
+ else
+ sprintf(addrstr, "****************");
+ snprintf(objptr, LONG_STR_LEN, "%-18s", ams->map->dso->short_name);
+ snprintf(symptr, LONG_STR_LEN, "%-18s", (ams->sym ? ams->sym->name : "?????"));
+
+ if (stats->t.rmt_hitm > 0) {
+ double mean = avg_stats(&stats->stats);
+ double std = stddev_stats(&stats->stats);
+
+ sprintf(latstr, "%8.0f %8.0f %7.1f%%",
+ -1.0, /* FIXME */
+ mean,
+ rel_stddev_stats(std, mean));
+ } else {
+ sprintf(latstr, "%8s %8s %8s",
+ "n/a",
+ "n/a",
+ "n/a");
+
+ }
+
+ /*
+ * implicit assumption that we are not coalescing over IPs
+ */
+ printf("%4s %6s %6s %6s %6s %7.1f%% %7.1f%% %7.1f%% %7.1f%% %14s0x%02lx %6s %6s %18s %8s %6d %-30s %-20s",
+ " ",
+ " ",
+ " ",
+ " ",
+ " ",
+ (stats->t.rmt_hitm > 0) ? (100. * ((double)stats->t.rmt_hitm / (double)h->stats.t.rmt_hitm)) : 0.0,
+ (stats->t.lcl_hitm > 0) ? (100. * ((double)stats->t.lcl_hitm / (double)h->stats.t.lcl_hitm)) : 0.0,
+ (stats->t.st_l1hit > 0) ? (100. * ((double)stats->t.st_l1hit / (double)h->stats.t.st_l1hit)) : 0.0,
+ (stats->t.st_l1miss > 0) ? (100. * ((double)stats->t.st_l1miss / (double)h->stats.t.st_l1miss)) : 0.0,
+ " ",
+ (cloffset == LVL2) ? (clo->daddr & 0xff) : CLOFFSET(clo->daddr),
+ pidstr,
+ tidstr,
+ addrstr,
+ latstr,
+ CPU_COUNT(&clo->stats.cpuset),
+ symptr,
+ objptr);
+
+ if (node_info == 0)
+ printf(" ");
+ else if (node_info == 1)
+ print_socket_stats_str(clo, node_stats);
+ else if (node_info == 2)
+ print_socket_shared_str(node_stats);
+
+ printf("\n");
+}
+
+static void print_c2c_hitm_report(struct rb_root *hitm_tree,
+ struct c2c_stats *hitm_stats __maybe_unused,
+ struct c2c_stats *c2c_stats)
+{
+ struct rb_node *next = rb_first(hitm_tree);
+ struct c2c_hit *h, *clo = NULL;
+ u64 addr;
+ double tot_dist, tot_cumm;
+ double ld_dist, ld_cumm;
+ int llc_misses;
+ int record = 0;
+ struct c2c_stats *node_stats = NULL;
+
+ if (node_info) {
+ node_stats = zalloc(sizeof(struct c2c_stats) * cpu_map__max_node());
+ if (!node_stats) {
+ printf("Can not allocate stats for node output\n");
+ return;
+ }
+ }
+
+ print_hitm_cacheline_header();
+
+ llc_misses = c2c_stats->t.lcl_dram +
+ c2c_stats->t.rmt_dram +
+ c2c_stats->t.rmt_hit +
+ c2c_stats->t.rmt_hitm;
+
+ /*
+ * generate distinct cache line report
+ */
+ tot_cumm = 0.0;
+ ld_cumm = 0.0;
+
+ while (next) {
+ struct hist_entry *entry;
+
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+
+ tot_dist = ((double)h->stats.t.rmt_hitm / llc_misses);
+ tot_cumm += tot_dist;
+
+ ld_dist = ((double)h->stats.t.rmt_hitm / c2c_stats->t.rmt_hitm);
+ ld_cumm += ld_dist;
+
+ /*
+ * don't display lines with insignificant sharing contribution
+ */
+ if (ld_dist < DISPLAY_LINE_LIMIT)
+ break;
+
+ print_hitm_cacheline(h, record, tot_cumm, ld_cumm, tot_dist, ld_dist);
+
+ list_for_each_entry(entry, &h->list, pairs.node) {
+
+ if (!clo || !matching_coalescing(clo, entry)) {
+ if (clo)
+ print_hitm_cacheline_offset(clo, h, node_stats);
+
+ free(clo);
+ addr = entry->mem_info->iaddr.al_addr;
+ clo = c2c_hit__new(addr, entry);
+ if (node_info)
+ memset(node_stats, 0, sizeof(struct c2c_stats) * cpu_map__max_node());
+ }
+ c2c_decode_stats(&clo->stats, entry);
+ c2c_hit__update_strings(clo, entry);
+
+ if (node_info) {
+ int node = cpu_map__get_node(entry->cpu);
+ c2c_decode_stats(&node_stats[node], entry);
+ CPU_SET(entry->cpu, &(node_stats[node].cpuset));
+ }
+
+ }
+ if (clo) {
+ print_hitm_cacheline_offset(clo, h, node_stats);
+ free(clo);
+ clo = NULL;
+ }
+
+ if (node_info)
+ memset(node_stats, 0, sizeof(struct c2c_stats) * cpu_map__max_node());
+
+ printf("\n");
+ record++;
+ }
+}
+
static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
{
return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
(dsrc->mem_op & P(OP,STORE)));
}

+static void print_shared_cacheline_info(struct c2c_stats *stats, int cline_cnt)
+{
+ int hitm_cnt = stats->t.lcl_hitm + stats->t.rmt_hitm;
+
+ printf("=================================================\n");
+ printf(" Global Shared Cache Line Event Information \n");
+ printf("=================================================\n");
+ printf(" Total Shared Cache Lines : %10d\n", cline_cnt);
+ printf(" Load HITs on shared lines : %10d\n", stats->t.load);
+ printf(" Fill Buffer Hits on shared lines : %10d\n", stats->t.ld_fbhit);
+ printf(" L1D hits on shared lines : %10d\n", stats->t.ld_l1hit);
+ printf(" L2D hits on shared lines : %10d\n", stats->t.ld_l2hit);
+ printf(" LLC hits on shared lines : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+ printf(" Locked Access on shared lines : %10d\n", stats->t.locks);
+ printf(" Store HITs on shared lines : %10d\n", stats->t.store);
+ printf(" Store L1D hits on shared lines : %10d\n", stats->t.st_l1hit);
+ printf(" Total Merged records : %10d\n", hitm_cnt + stats->t.store);
+}
+
static void c2c_analyze_hitms(struct perf_c2c *c2c)
{

@@ -607,6 +1119,9 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
} else
free(h);

+ print_shared_cacheline_info(&hitm_stats, shared_clines);
+ print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
+
cleanup:
next = rb_first(&hitm_tree);
while (next) {
@@ -758,6 +1273,10 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
};
const struct option c2c_options[] = {
OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+ OPT_INCR('N', "node-info", &node_info,
+ "show extra node info in report (repeat for more info)"),
+ OPT_INTEGER('c', "coalesce-level", &coalesce_level,
+ "how much coalescing for tid, pid, and ip is done (repeat for more coalescing)"),
OPT_INCR('v', "verbose", &verbose,
"be more verbose (show counter open errors, etc)"),
OPT_STRING('i', "input", &input_name, "file",
--
1.7.11.7

2014-02-28 17:43:51

by Don Zickus

[permalink] [raw]
Subject: [PATCH 13/19] perf, c2c: Sort based on hottest cache line

Now that all the events are sorted on a unique address, we can walk the
rbtree sequentially and count up all the HITMs for each cacheline fairly
easily.

Once we encounter an event on a different cacheline, the previous cacheline
is processed. That includes determining if any HITMs were present on that
cacheline and, if so, adding it to another rbtree sorted on the number of
HITMs.

This second rbtree, sorted on the number of HITMs, holds the interesting data
we want to report; it will be displayed in a follow-up patch.

For now, organize the data properly.
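
Condensed from c2c_analyze_hitms() in the patch below (the hitm_stats
accumulation and the per-entry list handling are omitted), the cacheline
boundary handling looks roughly like this:

    while (next) {
            he = rb_entry(next, struct hist_entry, rb_node_in);
            cl = he->mem_info->daddr.al_addr;

            /* crossed into a new cacheline (or hit a sort boundary)? */
            if (!h || !he->color || (CLADRS(cl) != h->cacheline)) {
                    if (h && HAS_HITMS(h))
                            c2c_hitm__add_to_list(&hitm_tree, h);  /* keep it */
                    else
                            free(h);                               /* stores only */
                    h = c2c_hit__new(CLADRS(cl), he);
            }

            c2c_decode_stats(&h->stats, he);
            next = rb_next(&he->rb_node_in);
    }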

V2: re-work using hist_entries

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 201 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 201 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 5deb7cc..30cb8c3 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -58,6 +58,25 @@ struct perf_c2c {
struct c2c_stats stats;
};

+#define CACHE_LINESIZE 64
+#define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)
+#define CLADRS(a) ((a) & ~(CLINE_OFFSET_MSK))
+#define CLOFFSET(a) (int)((a) & (CLINE_OFFSET_MSK))
+
+struct c2c_hit {
+ struct rb_node rb_node;
+ struct rb_root tree;
+ struct list_head list;
+ u64 cacheline;
+ int color;
+ struct c2c_stats stats;
+ pid_t pid;
+ pid_t tid;
+ u64 daddr;
+ u64 iaddr;
+ struct mem_info *mi;
+};
+
enum { OP, LVL, SNP, LCK, TLB };

#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -153,6 +172,44 @@ static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
return printed;
}

+static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct c2c_hit *he;
+ int64_t cmp;
+ u64 l_hitms, r_hitms;
+
+ p = &root->rb_node;
+
+ while (*p != NULL) {
+ parent = *p;
+ he = rb_entry(parent, struct c2c_hit, rb_node);
+
+ /* sort on remote hitms first */
+ l_hitms = he->stats.t.rmt_hitm;
+ r_hitms = h->stats.t.rmt_hitm;
+ cmp = r_hitms - l_hitms;
+
+ if (!cmp) {
+ /* sort on local hitms */
+ l_hitms = he->stats.t.lcl_hitm;
+ r_hitms = h->stats.t.lcl_hitm;
+ cmp = r_hitms - l_hitms;
+ }
+
+ if (cmp > 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&h->rb_node, parent, p);
+ rb_insert_color(&h->rb_node, root);
+
+ return 0;
+}
+
static int perf_c2c__fprintf_header(FILE *fp)
{
int printed = fprintf(fp, "%c %-16s %6s %6s %4s %18s %18s %18s %6s %-10s %-60s %s\n",
@@ -305,6 +362,50 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
return err;
}

+static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
+{
+ struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+
+ if (!h) {
+ pr_err("Could not allocate c2c_hit memory\n");
+ return NULL;
+ }
+
+ CPU_ZERO(&h->stats.cpuset);
+ INIT_LIST_HEAD(&h->list);
+ init_stats(&h->stats.stats);
+ h->tree = RB_ROOT;
+ h->cacheline = cacheline;
+ h->pid = entry->thread->pid_;
+ h->tid = entry->thread->tid;
+
+ /* use original addresses here, not adjusted al_addr */
+ h->iaddr = entry->mem_info->iaddr.addr;
+ h->daddr = entry->mem_info->daddr.addr;
+
+ h->mi = entry->mem_info;
+ return h;
+}
+
+static void c2c_hit__update_strings(struct c2c_hit *h,
+ struct hist_entry *n)
+{
+ if (h->pid != n->thread->pid_)
+ h->pid = -1;
+
+ if (h->tid != n->thread->tid)
+ h->tid = -1;
+
+ /* use original addresses here, not adjusted al_addr */
+ if (h->iaddr != n->mem_info->iaddr.addr)
+ h->iaddr = -1;
+
+ if (CLADRS(h->daddr) != CLADRS(n->mem_info->daddr.addr))
+ h->daddr = -1;
+
+ CPU_SET(n->cpu, &h->stats.cpuset);
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct addr_location *al,
struct perf_sample *sample,
@@ -420,6 +521,104 @@ err:
return err;
}

+#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+
+static void c2c_hit__update_stats(struct c2c_stats *new,
+ struct c2c_stats *old)
+{
+ new->t.load += old->t.load;
+ new->t.ld_fbhit += old->t.ld_fbhit;
+ new->t.ld_l1hit += old->t.ld_l1hit;
+ new->t.ld_l2hit += old->t.ld_l2hit;
+ new->t.ld_llchit += old->t.ld_llchit;
+ new->t.locks += old->t.locks;
+ new->t.lcl_dram += old->t.lcl_dram;
+ new->t.rmt_dram += old->t.rmt_dram;
+ new->t.lcl_hitm += old->t.lcl_hitm;
+ new->t.rmt_hitm += old->t.rmt_hitm;
+ new->t.rmt_hit += old->t.rmt_hit;
+ new->t.store += old->t.store;
+ new->t.st_l1hit += old->t.st_l1hit;
+
+ new->total_period += old->total_period;
+}
+
+static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
+{
+ return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
+ (dsrc->mem_op & P(OP,STORE)));
+}
+
+static void c2c_analyze_hitms(struct perf_c2c *c2c)
+{
+
+ struct rb_node *next = rb_first(c2c->hists.entries_in);
+ struct hist_entry *he;
+ struct c2c_hit *h = NULL;
+ struct c2c_stats hitm_stats;
+ struct rb_root hitm_tree = RB_ROOT;
+ int shared_clines = 0;
+ u64 cl = 0;
+
+ memset(&hitm_stats, 0, sizeof(struct c2c_stats));
+
+ /* find HITMs */
+ while (next) {
+ he = rb_entry(next, struct hist_entry, rb_node_in);
+ next = rb_next(&he->rb_node_in);
+
+ cl = he->mem_info->daddr.al_addr;
+
+ /* switch cache line objects */
+ /* 'color' forces a boundary change based on the original sort */
+ if (!h || !he->color || (CLADRS(cl) != h->cacheline)) {
+ if (h && HAS_HITMS(h)) {
+ c2c_hit__update_stats(&hitm_stats, &h->stats);
+
+ /* sort based on hottest cacheline */
+ c2c_hitm__add_to_list(&hitm_tree, h);
+ shared_clines++;
+ } else {
+ /* stores-only are un-interesting */
+ free(h);
+ }
+ h = c2c_hit__new(CLADRS(cl), he);
+ if (!h)
+ goto cleanup;
+ }
+
+
+ c2c_decode_stats(&h->stats, he);
+
+ /* filter out non-hitms as un-interesting noise */
+ if (valid_hitm_or_store(&he->mem_info->data_src)) {
+ /* save the entry for later processing */
+ list_add_tail(&he->pairs.node, &h->list);
+
+ c2c_hit__update_strings(h, he);
+ }
+ }
+
+ /* last chunk */
+ if (HAS_HITMS(h)) {
+ c2c_hit__update_stats(&hitm_stats, &h->stats);
+ c2c_hitm__add_to_list(&hitm_tree, h);
+ shared_clines++;
+ } else
+ free(h);
+
+cleanup:
+ next = rb_first(&hitm_tree);
+ while (next) {
+ h = rb_entry(next, struct c2c_hit, rb_node);
+ next = rb_next(&h->rb_node);
+ rb_erase(&h->rb_node, &hitm_tree);
+
+ free(h);
+ }
+ return;
+}
+
static int perf_c2c__process_events(struct perf_session *session,
struct perf_c2c *c2c)
{
@@ -431,6 +630,8 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}

+ c2c_analyze_hitms(c2c);
+
err:
return err;
}
--
1.7.11.7

2014-02-28 17:46:10

by Don Zickus

[permalink] [raw]
Subject: [PATCH 12/19] perf, c2c: Add stats to track data source bits and cpu to node maps

This patch adds a bunch of stats that will be used later in post-processing
to determine where and with what frequency the HITMs are coming from.

Most of the stats are decoded from the data source response. Another
piece is tracking which cpu each record came in on.
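
As a shorthand for the interesting case handled by c2c_decode_stats() below
(condensed; the full function classifies every load/store combination), a
remote HITM is recognized roughly like this:

    u64 lvl   = data_src->mem_lvl;
    u64 snoop = data_src->mem_snoop;

    if ((lvl & (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)) &&
        (snoop & PERF_MEM_SNOOP_HITM)) {
            stats->t.rmt_hitm++;                    /* remote cache, line Modified */
            update_stats(&stats->stats, weight);    /* remember the load latency */
    }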

Credit to Dick Fowles for determining which bits are important and how to
properly track them. Ported to perf by me.

V2: refresh with hist_entry

Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 472d4d9..5deb7cc 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,20 +5,84 @@
#include "util/parse-options.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/stat.h"
+#include "util/cpumap.h"
#include "util/debug.h"
#include "util/annotate.h"

#include <linux/compiler.h>
#include <linux/kernel.h>
+#include <sched.h>
+
+typedef struct {
+ int locks; /* count of 'lock' transactions */
+ int store; /* count of all stores in trace */
+ int st_uncache; /* stores to uncacheable address */
+ int st_noadrs; /* cacheable store with no address */
+ int st_l1hit; /* count of stores that hit L1D */
+ int st_l1miss; /* count of stores that miss L1D */
+ int load; /* count of all loads in trace */
+ int ld_excl; /* exclusive loads, rmt/lcl DRAM - snp none/miss */
+ int ld_shared; /* shared loads, rmt/lcl DRAM - snp hit */
+ int ld_uncache; /* loads to uncacheable address */
+ int ld_io; /* loads to io address */
+ int ld_miss; /* loads miss */
+ int ld_noadrs; /* cacheable load with no address */
+ int ld_fbhit; /* count of loads hitting Fill Buffer */
+ int ld_l1hit; /* count of loads that hit L1D */
+ int ld_l2hit; /* count of loads that hit L2D */
+ int ld_llchit; /* count of loads that hit LLC */
+ int lcl_hitm; /* count of loads with local HITM */
+ int rmt_hitm; /* count of loads with remote HITM */
+ int rmt_hit; /* count of loads with remote hit clean; */
+ int lcl_dram; /* count of loads miss to local DRAM */
+ int rmt_dram; /* count of loads miss to remote DRAM */
+ int nomap; /* count of load/stores with no phys adrs */
+ int noparse; /* count of unparsable data sources */
+} trinfo_t;
+
+struct c2c_stats {
+ cpu_set_t cpuset;
+ int nr_entries;
+ u64 total_period;
+ trinfo_t t;
+ struct stats stats;
+};

struct perf_c2c {
struct perf_tool tool;
bool raw_records;
struct hists hists;
+
+ /* stats */
+ struct c2c_stats stats;
};

enum { OP, LVL, SNP, LCK, TLB };

+#define RMT_RAM (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
+#define RMT_LLC (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
+
+#define L1CACHE_HIT(a) (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_HIT))
+#define FILLBUF_HIT(a) (((a) & PERF_MEM_LVL_LFB) && ((a) & PERF_MEM_LVL_HIT))
+#define L2CACHE_HIT(a) (((a) & PERF_MEM_LVL_L2 ) && ((a) & PERF_MEM_LVL_HIT))
+#define L3CACHE_HIT(a) (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_HIT))
+
+#define L1CACHE_MISS(a) (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_MISS))
+#define L3CACHE_MISS(a) (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_MISS))
+
+#define LD_UNCACHED(a) (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+#define ST_UNCACHED(a) (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+
+#define RMT_LLCHIT(a) (((a) & RMT_LLC) && ((a) & PERF_MEM_LVL_HIT))
+#define RMT_HIT(a,b) (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HIT))
+#define RMT_HITM(a,b) (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HITM))
+#define RMT_MEM(a) (((a) & RMT_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+#define LCL_HIT(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HIT))
+#define LCL_HITM(a,b) (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
+#define LCL_MEM(a) (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
{
#define PREFIX "["
@@ -141,6 +205,106 @@ static int perf_sample__fprintf(struct perf_sample *sample, char tag,
mi->iaddr.sym ? mi->iaddr.sym->name : "???");
}

+static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
+{
+ union perf_mem_data_src *data_src = &entry->mem_info->data_src;
+ u64 daddr = entry->mem_info->daddr.addr;
+ u64 weight = entry->stat.weight;
+ int err = 0;
+
+ u64 op = data_src->mem_op;
+ u64 lvl = data_src->mem_lvl;
+ u64 snoop = data_src->mem_snoop;
+ u64 lock = data_src->mem_lock;
+
+#define P(a,b) PERF_MEM_##a##_##b
+
+ stats->nr_entries++;
+ stats->total_period += entry->stat.period;
+
+ if (lock & P(LOCK,LOCKED)) stats->t.locks++;
+
+ if (op & P(OP,LOAD)) {
+ stats->t.load++;
+
+ if (!daddr) {
+ stats->t.ld_noadrs++;
+ return -1;
+ }
+
+ if (lvl & P(LVL,HIT)) {
+ if (lvl & P(LVL,UNC)) stats->t.ld_uncache++;
+ if (lvl & P(LVL,IO)) stats->t.ld_io++;
+ if (lvl & P(LVL,LFB)) stats->t.ld_fbhit++;
+ if (lvl & P(LVL,L1 )) stats->t.ld_l1hit++;
+ if (lvl & P(LVL,L2 )) stats->t.ld_l2hit++;
+ if (lvl & P(LVL,L3 )) {
+ if (snoop & P(SNOOP,HITM))
+ stats->t.lcl_hitm++;
+ else
+ stats->t.ld_llchit++;
+ }
+
+ if (lvl & P(LVL,LOC_RAM)) {
+ stats->t.lcl_dram++;
+ if (snoop & P(SNOOP,HIT))
+ stats->t.ld_shared++;
+ else
+ stats->t.ld_excl++;
+ }
+
+ if ((lvl & P(LVL,REM_RAM1)) ||
+ (lvl & P(LVL,REM_RAM2))) {
+ stats->t.rmt_dram++;
+ if (snoop & P(SNOOP,HIT))
+ stats->t.ld_shared++;
+ else
+ stats->t.ld_excl++;
+ }
+ }
+
+ if ((lvl & P(LVL,REM_CCE1)) ||
+ (lvl & P(LVL,REM_CCE2))) {
+ if (snoop & P(SNOOP, HIT))
+ stats->t.rmt_hit++;
+ else if (snoop & P(SNOOP, HITM)) {
+ stats->t.rmt_hitm++;
+ update_stats(&stats->stats, weight);
+ }
+ }
+
+ if ((lvl & P(LVL,MISS)))
+ stats->t.ld_miss++;
+
+ } else if (op & P(OP,STORE)) {
+ /* store */
+ stats->t.store++;
+
+ if (!daddr) {
+ stats->t.st_noadrs++;
+ return -1;
+ }
+
+ if (lvl & P(LVL,HIT)) {
+ if (lvl & P(LVL,UNC)) stats->t.st_uncache++;
+ if (lvl & P(LVL,L1 )) stats->t.st_l1hit++;
+ }
+ if (lvl & P(LVL,MISS))
+ if (lvl & P(LVL,L1)) stats->t.st_l1miss++;
+ } else {
+ /* unparsable data_src? */
+ stats->t.noparse++;
+ return -1;
+ }
+
+ if (!entry->mem_info->daddr.map || !entry->mem_info->iaddr.map) {
+ stats->t.nomap++;
+ return -1;
+ }
+
+ return err;
+}
+
static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct addr_location *al,
struct perf_sample *sample,
@@ -180,6 +344,14 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
goto out_mem;
}

+ err = c2c_decode_stats(&c2c->stats, he);
+ if (err < 0) {
+ err = 0;
+ rb_erase(&he->rb_node_in, c2c->hists.entries_in);
+ free(he);
+ goto out;
+ }
+
err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
if (err)
goto out;
@@ -279,6 +451,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
goto out;
}

+ if (symbol__init() < 0)
+ goto out_delete;
+
/* setup the evsel handlers for each event type */
evlist__for_each(session->evlist, evsel) {
const char *name = perf_evsel__name(evsel);
@@ -292,12 +467,20 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)

err = perf_c2c__process_events(session, c2c);

+out_delete:
+ perf_session__delete(session);
out:
return err;
}

static int perf_c2c__init(struct perf_c2c *c2c)
{
+ /* setup cpu map */
+ if (cpu_map__setup_cpunode_map() < 0) {
+ pr_err("can not setup cpu map\n");
+ return -1;
+ }
+
sort__mode = SORT_MODE__MEMORY;
sort__wants_unique = 1;
sort_order = "physid";
@@ -308,6 +491,7 @@ static int perf_c2c__init(struct perf_c2c *c2c)
}

hists__init(&c2c->hists);
+ CPU_ZERO(&c2c->stats.cpuset);

return 0;
}
--
1.7.11.7

2014-02-28 17:46:58

by Don Zickus

[permalink] [raw]
Subject: [PATCH 18/19] perf, c2c: Add symbol count table

Just another table, this one displaying the objects referenced in the
analysis report and how often each one is referenced. The most frequently
referenced objects are listed first.

It is just another way to look at similar data, to help figure out who
is causing the most contention (based on the workload used).
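
In short, the table is built with one pass over the entries (condensed from
update_ref_list() below; error handling omitted), followed by an insertion
sort on the per-object counts:

    struct dso *dso = entry->mem_info->iaddr.map->dso;
    struct refs *p;

    list_for_each_entry(p, &ref_tree, list)
            if (!strcmp(p->name, dso->short_name))
                    goto found;

    p = zalloc(sizeof(*p));             /* first reference to this object */
    p->name = dso->short_name;
    p->long_name = dso->long_name;
    list_add_tail(&p->list, &ref_tree);
found:
    p->nr++;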

Originally done by Dick Fowles, backported by me.

Sample output:

=======================================================================================================
Object Name, Path & Reference Counts

Index Records Object Name Object Path
=======================================================================================================
0 931379 [kernel.kallsyms] [kernel.kallsyms]
1 192258 fio /home/joe/old_fio-2.0.15/fio
2 80302 [jbd2] /lib/modules/3.10.0c2c_all+/kernel/fs/jbd2/jbd2.ko
3 65392 [ext4] /lib/modules/3.10.0c2c_all+/kernel/fs/ext4/ext4.ko

V2: refresh to latest upstream changes and hist_entry

Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c095f1b..0749ea6 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -685,6 +685,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
new->total_period += old->total_period;
}

+LIST_HEAD(ref_tree);
+LIST_HEAD(ref_tree_sorted);
+struct refs {
+ struct list_head list;
+ int nr;
+ const char *name;
+ const char *long_name;
+};
+
+static int update_ref_list(struct hist_entry *entry)
+{
+ struct refs *p;
+ struct dso *dso = entry->mem_info->iaddr.map->dso;
+ const char *name = dso->short_name;
+
+ list_for_each_entry(p, &ref_tree, list) {
+ if (!strcmp(p->name, name))
+ goto found;
+ }
+
+ p = zalloc(sizeof(struct refs));
+ if (!p)
+ return -1;
+ p->name = name;
+ p->long_name = dso->long_name;
+ list_add_tail(&p->list, &ref_tree);
+
+found:
+ p->nr++;
+ return 0;
+}
+
+static void print_symbol_record_count(struct rb_root *tree)
+{
+ struct rb_node *next = rb_first(tree);
+ struct hist_entry *he;
+ struct refs *p, *q, *pn;
+ char string[256];
+ char delimit[256];
+ int i;
+ int idx = 0;
+
+ /* gather symbol references */
+ while (next) {
+ he = rb_entry(next, struct hist_entry, rb_node_in);
+ next = rb_next(&he->rb_node_in);
+
+ if (update_ref_list(he)) {
+ pr_err("Could not update reference tree\n");
+ goto cleanup;
+ }
+ }
+
+ /* sort on number of references per symbol */
+ list_for_each_entry_safe(p, pn, &ref_tree, list) {
+ list_del_init(&p->list);
+ list_for_each_entry(q, &ref_tree_sorted, list) {
+ if (p->nr > q->nr) {
+ list_add_tail(&p->list, &q->list);
+ break;
+ }
+ }
+ if (list_empty(&p->list))
+ list_add_tail(&p->list, &ref_tree_sorted);
+ }
+
+ /* print header info */
+ sprintf(string, "%5s %8s %-32s %-80s",
+ "Index",
+ "Records",
+ "Object Name",
+ "Object Path");
+
+ delimit[0] = '\0';
+ for (i = 0; i < (int)strlen(string); i++) strcat(delimit, "=");
+
+ printf("\n\n");
+ printf("%s\n", delimit);
+ printf("%50s %s\n", " ", "Object Name, Path & Reference Counts");
+ printf("\n");
+ printf("%s\n", string);
+ printf("%s\n", delimit);
+
+ /* print out table */
+ list_for_each_entry(p, &ref_tree_sorted, list) {
+ printf("%5d %8d %-32s %-80s\n",
+ idx, p->nr, p->name, p->long_name);
+ idx++;
+ }
+ printf("\n");
+
+cleanup:
+ list_for_each_entry_safe(p, pn, &ref_tree_sorted, list) {
+ list_del(&p->list);
+ free(p);
+ }
+}
+
static void print_hitm_cacheline_header(void)
{
#define SHARING_REPORT_TITLE "Shared Cache Line Distribution Pareto"
@@ -1266,6 +1364,7 @@ static int perf_c2c__process_events(struct perf_session *session,
dump_rb_tree(c2c->hists.entries_in, c2c);
print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);
+ print_symbol_record_count(c2c->hists.entries_in);

err:
return err;
--
1.7.11.7

2014-02-28 17:47:27

by Don Zickus

[permalink] [raw]
Subject: [PATCH 16/19] perf, c2c: Output summary stats

Output some summary stats based on the processed records.
Mainly for diagnostic use.

Stats done by Dick Fowles, backported by me.

Sample output:

=================================================
Trace Event Information
=================================================
Total records : 1322047
Locked Load/Store Operations : 206317
Load Operations : 355701
Loads - uncacheable : 590
Loads - IO : 0
Loads - Miss : 440
Loads - no mapping : 207
Load Fill Buffer Hit : 100214
Load L1D hit : 148454
Load L2D hit : 15170
Load LLC hit : 53872
Load Local HITM : 15388
Load Remote HITM : 26760
Load Remote HIT : 3910
Load Local DRAM : 2436
Load Remote DRAM : 3648
Load MESI State Exclusive : 2883
Load MESI State Shared : 3201
Load LLC Misses : 36754
LLC Misses to Local DRAM : 6.6%
LLC Misses to Remote DRAM : 9.9%
LLC Misses to Remote cache (HIT) : 10.6%
LLC Misses to Remote cache (HITM) : 72.8%
Store Operations : 966322
Store - uncacheable : 0
Store - no mapping : 42931
Store L1D Hit : 915696
Store L1D Miss : 7695
No Page Map Rejects : 1193
Unable to parse data source : 24
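
For reference, the "Load LLC Misses" line is not a separate counter; it is
derived from the four miss sources, as computed in the patch below:

    llc_misses = lcl_dram + rmt_dram + rmt_hit + rmt_hitm
               = 2436 + 3648 + 3910 + 26760
               = 36754

Each "LLC Misses to ..." percentage is the matching component divided by that
total, e.g. 26760 / 36754 = 72.8% for remote HITM.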

V2: refresh to hist_entry

Original-by: Dick Fowles <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 8756ca5..3b0e0b2 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -963,7 +963,6 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
23, stdout);
}
}
-
static void print_c2c_hitm_report(struct rb_root *hitm_tree,
struct c2c_stats *hitm_stats __maybe_unused,
struct c2c_stats *c2c_stats)
@@ -1158,6 +1157,51 @@ cleanup:
return;
}

+static void print_c2c_trace_report(struct perf_c2c *c2c)
+{
+ int llc_misses;
+ struct c2c_stats *stats = &c2c->stats;
+
+ llc_misses = stats->t.lcl_dram +
+ stats->t.rmt_dram +
+ stats->t.rmt_hit +
+ stats->t.rmt_hitm;
+
+ printf("=================================================\n");
+ printf(" Trace Event Information \n");
+ printf("=================================================\n");
+ printf(" Total records : %10d\n", c2c->stats.nr_entries);
+ printf(" Locked Load/Store Operations : %10d\n", stats->t.locks);
+ printf(" Load Operations : %10d\n", stats->t.load);
+ printf(" Loads - uncacheable : %10d\n", stats->t.ld_uncache);
+ printf(" Loads - IO : %10d\n", stats->t.ld_io);
+ printf(" Loads - Miss : %10d\n", stats->t.ld_miss);
+ printf(" Loads - no mapping : %10d\n", stats->t.ld_noadrs);
+ printf(" Load Fill Buffer Hit : %10d\n", stats->t.ld_fbhit);
+ printf(" Load L1D hit : %10d\n", stats->t.ld_l1hit);
+ printf(" Load L2D hit : %10d\n", stats->t.ld_l2hit);
+ printf(" Load LLC hit : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+ printf(" Load Local HITM : %10d\n", stats->t.lcl_hitm);
+ printf(" Load Remote HITM : %10d\n", stats->t.rmt_hitm);
+ printf(" Load Remote HIT : %10d\n", stats->t.rmt_hit);
+ printf(" Load Local DRAM : %10d\n", stats->t.lcl_dram);
+ printf(" Load Remote DRAM : %10d\n", stats->t.rmt_dram);
+ printf(" Load MESI State Exclusive : %10d\n", stats->t.ld_excl);
+ printf(" Load MESI State Shared : %10d\n", stats->t.ld_shared);
+ printf(" Load LLC Misses : %10d\n", llc_misses);
+ printf(" LLC Misses to Local DRAM : %10.1f%%\n", ((double)stats->t.lcl_dram/(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote DRAM : %10.1f%%\n", ((double)stats->t.rmt_dram/(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote cache (HIT) : %10.1f%%\n", ((double)stats->t.rmt_hit /(double)llc_misses) * 100.);
+ printf(" LLC Misses to Remote cache (HITM) : %10.1f%%\n", ((double)stats->t.rmt_hitm/(double)llc_misses) * 100.);
+ printf(" Store Operations : %10d\n", stats->t.store);
+ printf(" Store - uncacheable : %10d\n", stats->t.st_uncache);
+ printf(" Store - no mapping : %10d\n", stats->t.st_noadrs);
+ printf(" Store L1D Hit : %10d\n", stats->t.st_l1hit);
+ printf(" Store L1D Miss : %10d\n", stats->t.st_l1miss);
+ printf(" No Page Map Rejects : %10d\n", stats->t.nomap);
+ printf(" Unable to parse data source : %10d\n", stats->t.noparse);
+}
+
static int perf_c2c__process_events(struct perf_session *session,
struct perf_c2c *c2c)
{
@@ -1169,6 +1213,7 @@ static int perf_c2c__process_events(struct perf_session *session,
goto err;
}

+ print_c2c_trace_report(c2c);
c2c_analyze_hitms(c2c);

err:
--
1.7.11.7

2014-02-28 17:47:25

by Don Zickus

[permalink] [raw]
Subject: [PATCH 06/19] perf: Fix stddev calculation

The stddev calculation as written actually computed the standard error.
As a result, when using it to find the relative stddev between runs, the
result was not accurate.

Update the formula to match traditional stddev. Then rename the old
stddev calculation to stderr_stats in case someone wants to use it.
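
In terms of the running-stats fields, the two quantities differ only by a
factor of sqrt(n); a minimal sketch of what the two functions return after
this patch:

    double s      = sqrt(stats->M2 / (stats->n - 1));  /* stddev_stats() */
    double s_mean = s / sqrt(stats->n);                 /* stderr_stats() */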

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/stat.c | 13 +++++++++++++
tools/perf/util/stat.h | 1 +
2 files changed, 14 insertions(+)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6506b3d..0cb4dbc 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -33,6 +33,7 @@ double avg_stats(struct stats *stats)
* http://en.wikipedia.org/wiki/Stddev
*
* The std dev of the mean is related to the std dev by:
+ * (also known as standard error)
*
* s
* s_mean = -------
@@ -41,6 +42,18 @@ double avg_stats(struct stats *stats)
*/
double stddev_stats(struct stats *stats)
{
+ double variance;
+
+ if (stats->n < 2)
+ return 0.0;
+
+ variance = stats->M2 / (stats->n - 1);
+
+ return sqrt(variance);
+}
+
+double stderr_stats(struct stats *stats)
+{
double variance, variance_mean;

if (stats->n < 2)
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index ae8ccd7..6f61615 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -12,6 +12,7 @@ struct stats
void update_stats(struct stats *stats, u64 val);
double avg_stats(struct stats *stats);
double stddev_stats(struct stats *stats);
+double stderr_stats(struct stats *stats);
double rel_stddev_stats(double stddev, double avg);

static inline void init_stats(struct stats *stats)
--
1.7.11.7

2014-02-28 17:48:29

by Don Zickus

[permalink] [raw]
Subject: [PATCH 11/19] perf, c2c: Add in sort on physid

Now that the infrastructure is set, add in the support to use
hist_entry to sort on physid.

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-c2c.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 367d6c1..472d4d9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -6,6 +6,7 @@
#include "util/session.h"
#include "util/tool.h"
#include "util/debug.h"
+#include "util/annotate.h"

#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -13,6 +14,7 @@
struct perf_c2c {
struct perf_tool tool;
bool raw_records;
+ struct hists hists;
};

enum { OP, LVL, SNP, LCK, TLB };
@@ -144,7 +146,11 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
struct perf_sample *sample,
struct perf_evsel *evsel)
{
- struct mem_info *mi;
+ struct symbol *parent = NULL;
+ struct hist_entry *he;
+ struct mem_info *mi, *mx;
+ uint64_t cost;
+ int err;

mi = sample__resolve_mem(sample, al);
if (!mi)
@@ -156,7 +162,42 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
return 0;
}

- return 0;
+ cost = sample->weight;
+ if (!cost)
+ cost = 1;
+
+ /*
+ * must pass period=weight in order to get the correct
+ * sorting from hists__collapse_resort() which is solely
+ * based on periods. We want sorting be done on nr_events * weight
+ * and this is indirectly achieved by passing period=weight here
+ * and the he_stat__add_period() function.
+ */
+ he = __hists__add_entry(&c2c->hists, al, parent, NULL, mi,
+ cost, cost, 0);
+ if (!he) {
+ err = -ENOMEM;
+ goto out_mem;
+ }
+
+ err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
+ if (err)
+ goto out;
+
+ mx = he->mem_info;
+ err = addr_map_symbol__inc_samples(&mx->daddr, evsel->idx);
+ if (err)
+ goto out;
+
+ c2c->hists.stats.total_period += cost;
+ hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+ return err;
+
+out_mem:
+ /* implicitly freed by __hists__add_entry */
+ free(mi);
+out:
+ return err;
}

static const struct perf_evsel_str_handler handlers[] = {
@@ -255,10 +296,28 @@ out:
return err;
}

+static int perf_c2c__init(struct perf_c2c *c2c)
+{
+ sort__mode = SORT_MODE__MEMORY;
+ sort__wants_unique = 1;
+ sort_order = "physid";
+
+ if (setup_sorting() < 0) {
+ pr_err("can not setup sorting\n");
+ return -1;
+ }
+
+ hists__init(&c2c->hists);
+
+ return 0;
+}
static int perf_c2c__report(struct perf_c2c *c2c)
{
setup_pager();

+ if (perf_c2c__init(c2c))
+ return -1;
+
if (c2c->raw_records)
perf_c2c__fprintf_header(stdout);

--
1.7.11.7

2014-02-28 17:48:49

by Don Zickus

[permalink] [raw]
Subject: [PATCH 05/19] perf, kmem: Utilize the new generic cpunode_map

Use the previous patch's implementation of cpunode_map for builtin-kmem.c.
There should not be any functional difference.

Cc: Li Zefan <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/builtin-kmem.c | 78 ++---------------------------------------------
1 file changed, 3 insertions(+), 75 deletions(-)

diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 929462a..9ff7892 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -14,6 +14,7 @@
#include "util/parse-options.h"
#include "util/trace-event.h"
#include "util/data.h"
+#include "util/cpumap.h"

#include "util/debug.h"

@@ -31,9 +32,6 @@ static int caller_lines = -1;

static bool raw_ip;

-static int *cpunode_map;
-static int max_cpu_num;
-
struct alloc_stat {
u64 call_site;
u64 ptr;
@@ -55,76 +53,6 @@ static struct rb_root root_caller_sorted;
static unsigned long total_requested, total_allocated;
static unsigned long nr_allocs, nr_cross_allocs;

-#define PATH_SYS_NODE "/sys/devices/system/node"
-
-static int init_cpunode_map(void)
-{
- FILE *fp;
- int i, err = -1;
-
- fp = fopen("/sys/devices/system/cpu/kernel_max", "r");
- if (!fp) {
- max_cpu_num = 4096;
- return 0;
- }
-
- if (fscanf(fp, "%d", &max_cpu_num) < 1) {
- pr_err("Failed to read 'kernel_max' from sysfs");
- goto out_close;
- }
-
- max_cpu_num++;
-
- cpunode_map = calloc(max_cpu_num, sizeof(int));
- if (!cpunode_map) {
- pr_err("%s: calloc failed\n", __func__);
- goto out_close;
- }
-
- for (i = 0; i < max_cpu_num; i++)
- cpunode_map[i] = -1;
-
- err = 0;
-out_close:
- fclose(fp);
- return err;
-}
-
-static int setup_cpunode_map(void)
-{
- struct dirent *dent1, *dent2;
- DIR *dir1, *dir2;
- unsigned int cpu, mem;
- char buf[PATH_MAX];
-
- if (init_cpunode_map())
- return -1;
-
- dir1 = opendir(PATH_SYS_NODE);
- if (!dir1)
- return 0;
-
- while ((dent1 = readdir(dir1)) != NULL) {
- if (dent1->d_type != DT_DIR ||
- sscanf(dent1->d_name, "node%u", &mem) < 1)
- continue;
-
- snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
- dir2 = opendir(buf);
- if (!dir2)
- continue;
- while ((dent2 = readdir(dir2)) != NULL) {
- if (dent2->d_type != DT_LNK ||
- sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
- continue;
- cpunode_map[cpu] = mem;
- }
- closedir(dir2);
- }
- closedir(dir1);
- return 0;
-}
-
static int insert_alloc_stat(unsigned long call_site, unsigned long ptr,
int bytes_req, int bytes_alloc, int cpu)
{
@@ -235,7 +163,7 @@ static int perf_evsel__process_alloc_node_event(struct perf_evsel *evsel,
int ret = perf_evsel__process_alloc_event(evsel, sample);

if (!ret) {
- int node1 = cpunode_map[sample->cpu],
+ int node1 = cpu_map__get_node(sample->cpu),
node2 = perf_evsel__intval(evsel, sample, "node");

if (node1 != node2)
@@ -770,7 +698,7 @@ int cmd_kmem(int argc, const char **argv, const char *prefix __maybe_unused)
if (!strncmp(argv[0], "rec", 3)) {
return __cmd_record(argc, argv);
} else if (!strcmp(argv[0], "stat")) {
- if (setup_cpunode_map())
+ if (cpu_map__setup_cpunode_map())
return -1;

if (list_empty(&caller_sort))
--
1.7.11.7

2014-02-28 17:49:09

by Don Zickus

[permalink] [raw]
Subject: [PATCH 07/19] perf, callchain: Add generic callchain print handler for stdio

My initial implementation of rbtree sorting in the c2c tool does not use the
normal hist_entry objects. As a result, adding callchain support (which is
deeply integrated with hist_entry) is more challenging when it comes to
displaying its output.

To make things simpler for myself (and to avoid rewriting the same code in
the c2c tool), I provided a generic interface that takes an unsorted callchain
list along with its total and relative sample sizes, sorts it locally based on
period, and calls the appropriate graph function (passing the correct sample
size).

This makes things easier because the c2c tool can be dumber and just collect
callchains and not worry about the magic needed to sort and display them
correctly.
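
As an illustrative sketch only (the variable names here are placeholders, not
taken from any patch in this series), a stdio-based report could hand its
collected chains straight to the helper:

    /* root:          unsorted callchain collected for one entry
     * total_samples: all samples, used for the min_percent threshold
     * entry_samples: this entry's samples, used for relative graph scaling */
    size_t printed = generic_entry_callchain__fprintf(root, total_samples,
                                                      entry_samples,
                                                      left_margin, stdout);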

Unfortunately, this assumes stdio output only and does not cover the other
GUI-type outputs.

Regardless, this patch provides useful info for the tool right now. Tweaks and
recommendations for a better approach are welcomed. :-)

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/ui/stdio/hist.c | 37 +++++++++++++++++++++++++++++++++++++
tools/perf/util/hist.h | 4 ++++
2 files changed, 41 insertions(+)

diff --git a/tools/perf/ui/stdio/hist.c b/tools/perf/ui/stdio/hist.c
index 831fbb7..0a40f59 100644
--- a/tools/perf/ui/stdio/hist.c
+++ b/tools/perf/ui/stdio/hist.c
@@ -536,3 +536,40 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp)

return ret;
}
+
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+ u64 total_samples, u64 relative_samples,
+ int left_margin, FILE *fp)
+{
+ struct rb_root sorted_chain;
+ u64 min_callchain_hits;
+
+ if (!symbol_conf.use_callchain)
+ return 0;
+
+ min_callchain_hits = total_samples * (callchain_param.min_percent / 100);
+
+ callchain_param.sort(&sorted_chain, unsorted_callchain,
+ min_callchain_hits, &callchain_param);
+
+ switch (callchain_param.mode) {
+ case CHAIN_GRAPH_REL:
+ return callchain__fprintf_graph(fp, &sorted_chain, relative_samples,
+ left_margin);
+ break;
+ case CHAIN_GRAPH_ABS:
+ return callchain__fprintf_graph(fp, &sorted_chain, total_samples,
+ left_margin);
+ break;
+ case CHAIN_FLAT:
+ return callchain__fprintf_flat(fp, &sorted_chain, total_samples);
+ break;
+ case CHAIN_NONE:
+ break;
+ default:
+ pr_err("Bad callchain mode\n");
+ }
+
+ return 0;
+}
+
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index d226c5b..df981ce 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -112,6 +112,10 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);
size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
int max_cols, float min_pcnt, FILE *fp);

+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+ u64 total_samples, u64 relative_samples,
+ int left_margin, FILE *fp);
+
void hists__filter_by_dso(struct hists *hists);
void hists__filter_by_thread(struct hists *hists);
void hists__filter_by_symbol(struct hists *hists);
--
1.7.11.7

2014-02-28 17:50:48

by Don Zickus

[permalink] [raw]
Subject: [PATCH 03/19] perf, sort: Allow unique sorting instead of combining hist_entries

The cache contention tool needs to keep all the perf records unique in order
to properly parse all the data. Currently add_hist_entry() will combine
duplicate records and add the weight/period to the existing record.
This throws away the unique data the cache contention tool needs (mainly
the data source). Create a flag to force the records to stay unique.
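
For reference, this is how the c2c tool opts in later in the series (see
perf_c2c__init() in patch 11):

    sort__mode         = SORT_MODE__MEMORY;
    sort__wants_unique = 1;             /* keep duplicate records separate */
    sort_order         = "physid";

    if (setup_sorting() < 0) {
            pr_err("can not setup sorting\n");
            return -1;
    }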

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/hist.c | 3 ++-
tools/perf/util/sort.c | 1 +
tools/perf/util/sort.h | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ea54db3..cf85d7d 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -365,7 +365,8 @@ static struct hist_entry *add_hist_entry(struct hists *hists,
*/
cmp = hist_entry__cmp(he, entry);

- if (!cmp) {
+ if (!cmp && !sort__wants_unique) {
+
he_stat__add_period(&he->stat, period, weight);

/*
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 0cb43a5..453b8f0 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -14,6 +14,7 @@ int sort__need_collapse = 0;
int sort__has_parent = 0;
int sort__has_sym = 0;
int sort__has_dso = 0;
+int sort__wants_unique = 0;
enum sort_mode sort__mode = SORT_MODE__NORMAL;

enum sort_type sort__first_dimension;
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index eb8cd50..4e960fd 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -33,6 +33,7 @@ extern int have_ignore_callees;
extern int sort__need_collapse;
extern int sort__has_parent;
extern int sort__has_sym;
+extern int sort__wants_unique;
extern enum sort_mode sort__mode;
extern struct sort_entry sort_comm;
extern struct sort_entry sort_dso;
--
1.7.11.7

2014-02-28 17:51:06

by Don Zickus

[permalink] [raw]
Subject: [PATCH 04/19] perf: Allow ability to map cpus to nodes easily

This patch figures out the max number of cpus and nodes on the system and
creates a cpu-to-node map. This allows us to take a cpu and quickly get the
node associated with it.

It was mostly copied from builtin-kmem.c and tweaked slightly to use less memory
(use possible cpus instead of max). It also calculates the max number of nodes.
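
Typical use, as seen in the later patches: build the map once at startup,
then translate a cpu to its node per sample:

    if (cpu_map__setup_cpunode_map() < 0)
            return -1;

    int node  = cpu_map__get_node(sample->cpu); /* -1 if unknown or not set up */
    int nodes = cpu_map__max_node();            /* for sizing per-node arrays */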

Signed-off-by: Don Zickus <[email protected]>
---
tools/perf/util/cpumap.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/cpumap.h | 35 +++++++++++
2 files changed, 185 insertions(+)

diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 7fe4994..d19d5b3 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -317,3 +317,153 @@ int cpu_map__build_core_map(struct cpu_map *cpus, struct cpu_map **corep)
{
return cpu_map__build_map(cpus, corep, cpu_map__get_core);
}
+
+/* setup simple routines to easily access node numbers given a cpu number */
+
+
+#define PATH_SYS_NODE "/sys/devices/system/node"
+
+/* Determine highest possible cpu in the system for sparse allocation */
+static void set_max_cpu_num(void)
+{
+ FILE *fp;
+ char buf[256];
+ int num;
+
+ /* set up default */
+ max_cpu_num = 4096;
+
+ /* get the highest possible cpu number for a sparse allocation */
+ fp = fopen("/sys/devices/system/cpu/possible", "r");
+ if (!fp)
+ goto out;
+
+ num = fread(&buf, 1, sizeof(buf), fp);
+ if (!num)
+ goto out_close;
+ buf[num] = '\0';
+
+ /* start on the right, to find highest cpu num */
+ while (--num) {
+ if ((buf[num] == ',') || (buf[num] == '-')) {
+ num++;
+ break;
+ }
+ }
+ if (sscanf(&buf[num], "%d", &max_cpu_num) < 1)
+ goto out_close;
+
+ max_cpu_num++;
+
+ fclose(fp);
+ return;
+
+out_close:
+ fclose(fp);
+out:
+ pr_err("Failed to read max cpus, using default of %d\n",
+ max_cpu_num);
+ return;
+}
+
+/* Determine highest possible node in the system for sparse allocation */
+static void set_max_node_num(void)
+{
+ FILE *fp;
+ char buf[256];
+ int num;
+
+ /* set up default */
+ max_node_num = 8;
+
+ /* get the highest possible cpu number for a sparse allocation */
+ fp = fopen("/sys/devices/system/node/possible", "r");
+ if (!fp)
+ goto out;
+
+ num = fread(&buf, 1, sizeof(buf), fp);
+ if (!num)
+ goto out_close;
+ buf[num] = '\0';
+
+ /* start on the right, to find highest node num */
+ while (--num) {
+ if ((buf[num] == ',') || (buf[num] == '-')) {
+ num++;
+ break;
+ }
+ }
+ if (sscanf(&buf[num], "%d", &max_node_num) < 1)
+ goto out_close;
+
+ max_node_num++;
+
+ fclose(fp);
+ return;
+
+out_close:
+ fclose(fp);
+out:
+ pr_err("Failed to read max nodes, using default of %d\n",
+ max_node_num);
+ return;
+}
+
+static int init_cpunode_map(void)
+{
+ int i;
+
+ set_max_cpu_num();
+ set_max_node_num();
+
+ cpunode_map = calloc(max_cpu_num, sizeof(int));
+ if (!cpunode_map) {
+ pr_err("%s: calloc failed\n", __func__);
+ goto out;
+ }
+
+ for (i = 0; i < max_cpu_num; i++)
+ cpunode_map[i] = -1;
+
+ return 0;
+out:
+ return -1;
+}
+
+int cpu_map__setup_cpunode_map(void)
+{
+ struct dirent *dent1, *dent2;
+ DIR *dir1, *dir2;
+ unsigned int cpu, mem;
+ char buf[PATH_MAX];
+
+ /* initialize globals */
+ if (init_cpunode_map())
+ return -1;
+
+ dir1 = opendir(PATH_SYS_NODE);
+ if (!dir1)
+ return 0;
+
+ /* walk tree and setup map */
+ while ((dent1 = readdir(dir1)) != NULL) {
+ if (dent1->d_type != DT_DIR ||
+ sscanf(dent1->d_name, "node%u", &mem) < 1)
+ continue;
+
+ snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
+ dir2 = opendir(buf);
+ if (!dir2)
+ continue;
+ while ((dent2 = readdir(dir2)) != NULL) {
+ if (dent2->d_type != DT_LNK ||
+ sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
+ continue;
+ cpunode_map[cpu] = mem;
+ }
+ closedir(dir2);
+ }
+ closedir(dir1);
+ return 0;
+}
+
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index b123bb9..d6fde2b 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -4,6 +4,9 @@
#include <stdio.h>
#include <stdbool.h>

+#include "perf.h"
+#include "util/debug.h"
+
struct cpu_map {
int nr;
int map[];
@@ -46,4 +49,36 @@ static inline bool cpu_map__empty(const struct cpu_map *map)
return map ? map->map[0] == -1 : true;
}

+int max_cpu_num;
+int max_node_num;
+int *cpunode_map;
+
+int cpu_map__setup_cpunode_map(void);
+
+static inline int cpu_map__max_node(void)
+{
+ if (unlikely(!max_node_num))
+ pr_debug("cpu_map not initiailzed\n");
+
+ return max_node_num;
+}
+
+static inline int cpu_map__max_cpu(void)
+{
+ if (unlikely(!max_cpu_num))
+ pr_debug("cpu_map not initiailzed\n");
+
+ return max_cpu_num;
+}
+
+static inline int cpu_map__get_node(int cpu)
+{
+ if (unlikely(cpunode_map == NULL)) {
+ pr_debug("cpu_map not initialized\n");
+ return -1;
+ }
+
+ return cpunode_map[cpu];
+}
+
#endif /* __PERF_CPUMAP_H */
--
1.7.11.7

2014-02-28 17:43:34

by Don Zickus

[permalink] [raw]
Subject: [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support"

This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.

Conflicts:
tools/perf/util/event.c
---
kernel/events/core.c | 4 ----
tools/perf/util/event.c | 36 +++++++++++++++++++-----------------
tools/perf/util/evsel.c | 1 +
3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 45e5543..a4ab184 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6851,10 +6851,6 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (ret)
return -EFAULT;

- /* disabled for now */
- if (attr->mmap2)
- return -EINVAL;
-
if (attr->__reserved_1)
return -EINVAL;

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 55eebe9..82fb890 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -155,13 +155,14 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
return -1;
}

- event->header.type = PERF_RECORD_MMAP;
+ event->header.type = PERF_RECORD_MMAP2;

while (1) {
char bf[BUFSIZ];
char prot[5];
char execname[PATH_MAX];
char anonstr[] = "//anon";
+ unsigned int ino;
size_t size;
ssize_t n;

@@ -172,14 +173,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
strcpy(execname, "");

/* 00400000-0040c000 r-xp 00000000 fd:01 41038 /bin/cat */
- n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %*x:%*x %*u %s\n",
- &event->mmap.start, &event->mmap.len, prot,
- &event->mmap.pgoff,
- execname);
- /*
- * Anon maps don't have the execname.
- */
- if (n < 4)
+ n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %x:%x %u %s\n",
+ &event->mmap2.start, &event->mmap2.len, prot,
+ &event->mmap2.pgoff, &event->mmap2.maj,
+ &event->mmap2.min,
+ &ino, execname);
+
+ event->mmap2.ino = (u64)ino;
+
+ if (n < 7)
continue;
/*
* Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
@@ -200,15 +202,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
strcpy(execname, anonstr);

size = strlen(execname) + 1;
- memcpy(event->mmap.filename, execname, size);
+ memcpy(event->mmap2.filename, execname, size);
size = PERF_ALIGN(size, sizeof(u64));
- event->mmap.len -= event->mmap.start;
- event->mmap.header.size = (sizeof(event->mmap) -
- (sizeof(event->mmap.filename) - size));
- memset(event->mmap.filename + size, 0, machine->id_hdr_size);
- event->mmap.header.size += machine->id_hdr_size;
- event->mmap.pid = tgid;
- event->mmap.tid = pid;
+ event->mmap2.len -= event->mmap.start;
+ event->mmap2.header.size = (sizeof(event->mmap2) -
+ (sizeof(event->mmap2.filename) - size));
+ memset(event->mmap2.filename + size, 0, machine->id_hdr_size);
+ event->mmap2.header.size += machine->id_hdr_size;
+ event->mmap2.pid = tgid;
+ event->mmap2.tid = pid;

if (process(tool, event, &synth_sample, machine) != 0) {
rc = -1;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index adc94dd..f2ab0b3 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -640,6 +640,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
perf_evsel__set_sample_bit(evsel, WEIGHT);

attr->mmap = track;
+ attr->mmap2 = track && !perf_missing_features.mmap2;
attr->comm = track;

if (opts->sample_transaction)
--
1.7.11.7

2014-02-28 18:57:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems

Don Zickus <[email protected]> writes:
>
> A handful of patches include re-enabling MMAP2 support and some fixes
> to perf itself.

I would suggest pursuing the lone kernel patch separately. Hopefully
that can be merged soon, once the remaining problems with it are
addressed.

>
> Comemnts, feedback, anything else welcomed.

As a high level comment, can you add support for CSV mode?
(-x, in other tools)

I assume most people would data mine the output in some form,
and that's much easier with a more machine oriented format.

Also you should probably follow the standard perf conventions
for output fds (--log-fd, --output, defaulting to stderr); otherwise
the output may often be lost.
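
Routing all the prints through one helper that knows about a separator
and an output stream would get most of the way there; a rough sketch
(csv_sep, output and print_field() are illustrative names, not existing
c2c code):

    #include <stdio.h>
    #include <stdarg.h>

    static const char *csv_sep;     /* set by -x <sep>; NULL means human-readable */
    static FILE *output;            /* set by --output/--log-fd; NULL means stderr */

    static void print_field(const char *fmt, ...)
    {
            FILE *fp = output ? output : stderr;
            va_list ap;

            va_start(ap, fmt);
            vfprintf(fp, fmt, ap);
            va_end(ap);

            /* in CSV mode emit the separator, otherwise just pad for humans */
            fputs(csv_sep ? csv_sep : "  ", fp);
    }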

-Andi

--
[email protected] -- Speaking for myself only

2014-02-28 18:59:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 11/19] perf, c2c: Add in sort on physid

Don Zickus <[email protected]> writes:
> +
> + /*
> + * must pass period=weight in order to get the correct
> + * sorting from hists__collapse_resort() which is solely
> + * based on periods. We want sorting be done on nr_events * weight
> + * and this is indirectly achieved by passing period=weight here
> + * and the he_stat__add_period() function.
> + */

sort.c really must be fixed to support multiple sort keys properly.
Lots of other code has workarounds for this too (e.g. it is a big
problem for TSX abort weights).

I would suggest fixing that instead of hacking around it.

-Andi


--
[email protected] -- Speaking for myself only

2014-02-28 19:09:03

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 08/19] perf c2c: Shared data analyser

Don Zickus <[email protected]> writes:
> +
> +static const struct perf_evsel_str_handler handlers[] = {
> + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> + { "cpu/mem-stores/pp", perf_c2c__process_store, },

The 30 magic number should probably be configurable.

Using load-latency here rules out Atom, so at some point
you would need to get rid of that.

I suspect on most systems you should rather use p
instead of pp to get the overhead down (before Haswell pp
is expensive)

> +static int perf_c2c__record(int argc, const char **argv)
> +{
> + unsigned int rec_argc, i, j;
> + const char **rec_argv;
> + const char * const record_args[] = {
> + "record",
> + /* "--phys-addr", */

So is that needed or not?

-Andi
--
[email protected] -- Speaking for myself only

2014-02-28 19:42:53

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems

On Fri, Feb 28, 2014 at 10:57:48AM -0800, Andi Kleen wrote:
> Don Zickus <[email protected]> writes:
> >
> > A handful of patches include re-enabling MMAP2 support and some fixes
> > to perf itself.
>
> I would suggest pursuing the lone kernel patch separately. Hopefully
> that can be merged soon, once the remaining problems with it are
> addressed.

Good point.

>
> >
> > Comemnts, feedback, anything else welcomed.
>
> As a high level comment, can you add support for CSV mode?
> (-x, in other tools)

It's there. :-)

perf c2c -r -x, report

will spit out raw records in CSV format.

>
> I assume most people would data mine the output in some form,
> and that's much easier with a more machine oriented format.
>
> Also you should probably follow the standard perf conventions
> for output fds (--log-fd, --output, defaulting to stderr); otherwise
> the output may often be lost.

Ok, I'll look into what that entails.

Thanks!

Cheers,
Don

2014-02-28 19:44:34

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 11/19] perf, c2c: Add in sort on physid

On Fri, Feb 28, 2014 at 10:59:18AM -0800, Andi Kleen wrote:
> Don Zickus <[email protected]> writes:
> > +
> > + /*
> > + * must pass period=weight in order to get the correct
> > + * sorting from hists__collapse_resort() which is solely
> > + * based on periods. We want sorting be done on nr_events * weight
> > + * and this is indirectly achieved by passing period=weight here
> > + * and the he_stat__add_period() function.
> > + */
>
> sort.c really must be fixed to support multiple sort keys properly.
> Lots of other code has workarounds for this too (e.g. it is a big
> problem for TSX abort weights).
>
> I would suggest fixing that instead of hacking around it.

I don't think I understand the problem enough to know what to fix. I just
copied this piece of code from builtin-report.c and things seemed to work.

Mind giving me some details and I can look at fixing it. :-)

Cheers,
Don

2014-02-28 19:47:35

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 08/19] perf c2c: Shared data analyser

On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> Don Zickus <[email protected]> writes:
> > +
> > +static const struct perf_evsel_str_handler handlers[] = {
> > + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > + { "cpu/mem-stores/pp", perf_c2c__process_store, },
>
> The 30 magic number should probably be configurable.

Yeah, I just haven't figured out how to make it configurable within
this string yet.
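
One way that comes to mind is to stop keeping the string const and
format it at record time from an option; a rough sketch (the ldlat
variable would be fed by a new --ldlat option, which doesn't exist yet):

    #include <stdio.h>

    static int ldlat = 30;          /* default; would be settable via --ldlat */

    /* append the load event to the record argv, with ldlat filled in */
    static int add_mem_loads_event(const char **rec_argv, int i)
    {
            static char event[64];

            snprintf(event, sizeof(event), "cpu/mem-loads,ldlat=%d/pp", ldlat);

            rec_argv[i++] = "-e";
            rec_argv[i++] = event;
            return i;
    }

The report side would then need to build the same string before matching
it against the handlers[] table.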

>
> Using load-latency here rules out Atom, so at some point
> you would need to get rid of that.

Oh. How do you get load-latency for Atom then?

>
> I suspect on most systems you should rather use p
> instead of pp to get the overhead down (before Haswell pp
> is expensive)

Ok. Good to know.

>
> > +static int perf_c2c__record(int argc, const char **argv)
> > +{
> > + unsigned int rec_argc, i, j;
> > + const char **rec_argv;
> > + const char * const record_args[] = {
> > + "record",
> > + /* "--phys-addr", */
>
> So is that needed or not?

No. It's legacy code from before we had MMAP2 support. I'll remove it in the next respin.

Cheers,
Don

2014-02-28 21:54:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems

> It's there. :-)
>
> perf c2c -r -x, report
>
> will spit out raw records in CSV format.

But most of the cooked data doesn't seem to be? I saw
lots of printfs without separator support.

Sorry, I meant data mining the cooked data,
e.g. plotting it.

-Andi

2014-02-28 22:03:08

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 08/19] perf c2c: Shared data analyser

On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
> On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> > Don Zickus <[email protected]> writes:
> > > +
> > > +static const struct perf_evsel_str_handler handlers[] = {
> > > + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > > + { "cpu/mem-stores/pp", perf_c2c__process_store, },
> >

Hmm I'm getting this when running a simple record command.

invalid or unsupported event: 'cpu/mem-loads/pp'

This only occurs with c2c, other subcommands work normally. It's as if
it were an old kernel, but it's Linus' latest. Is this an issue with the
patch or something I'm missing?

Furthermore, I see:
ls /sys/bus/event_source/devices/cpu/events
branch-instructions branch-misses cache-misses cache-references cpu-cycles instructions mem-loads

Thanks!


2014-02-28 22:29:47

by Joe Mario

[permalink] [raw]
Subject: Re: [PATCH 08/19] perf c2c: Shared data analyser

Apologies for the resend. My first msg contained html in it.

On 02/28/2014 04:03 PM, Davidlohr Bueso wrote:
> On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
>> On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
>>> Don Zickus <[email protected]> writes:
>>>> +
>>>> +static const struct perf_evsel_str_handler handlers[] = {
>>>> + { "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
>>>> + { "cpu/mem-stores/pp", perf_c2c__process_store, },
>>>
>
> Hmm I'm getting this when running a simple record command.
>
> invalid or unsupported event: 'cpu/mem-loads/pp'
>
> This only occurs with c2c, other subcommands work normally. It's as if
> it were an old kernel, but it's Linus' latest. Is this an issue with the
> patch or something I'm missing?
>
> Furthermore, I see:
> ls /sys/bus/event_source/devices/cpu/events
> branch-instructions branch-misses cache-misses cache-references cpu-cycles instructions mem-loads

David:
It looks like you're running on an older Intel processor, which is missing necessary events for C2C to work.
As Don noted in his patch 00/19, this was primarily developed and tested on Intel's Ivy Bridge platform.
If you rerun this on an Ivy Bridge, it should work fine.
We should add a runtime check for supported platforms.
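
A minimal sketch of such a check, just probing for the sysfs event
aliases c2c records (the function name is made up; on the box above it
would fail because mem-stores is missing from the events directory):

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    #define EVENTS_DIR "/sys/bus/event_source/devices/cpu/events"

    /* return 0 only if the kernel advertises both events c2c needs */
    static int c2c_events_supported(void)
    {
            const char *events[] = { "mem-loads", "mem-stores" };
            char path[PATH_MAX];
            unsigned int i;

            for (i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
                    snprintf(path, sizeof(path), "%s/%s", EVENTS_DIR, events[i]);
                    if (access(path, R_OK))
                            return -1;
            }
            return 0;
    }
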
Joe

> Thanks!
>
>
>

2014-03-01 00:50:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 08/19] perf c2c: Shared data analyser

> David:
> It looks like you're running on an older Intel processor, which is missing necessary events for C2C to work.

mem-loads should be supported on Nehalem and up.
mem-stores is Sandy Bridge and up.

You can check in perf list.

-Andi

2014-03-01 01:07:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 11/19] perf, c2c: Add in sort on physid

> I don't think I understand the problem enough to know what to fix. I just
> copied this piece of code from builtin-report.c and things seemed to work.
>
> Mind giving me some details and I can look at fixing it. :-)

sort.c, even though it has all these sort keys, only sorts by period.
It should instead sort by all the specified keys in order.

Namhyung looked at it at some point.
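
The gist is a comparator that walks the requested keys in order and lets
the first key that differs decide, with period/weight at most a final
tie-breaker; a sketch of the idea (struct sort_key here is illustrative,
not the existing sort_entry):

    #include <stdint.h>

    struct hist_entry;              /* perf's histogram entry */

    struct sort_key {
            int64_t (*cmp)(struct hist_entry *a, struct hist_entry *b);
    };

    static int64_t multi_key_cmp(const struct sort_key *keys, int nr_keys,
                                 struct hist_entry *a, struct hist_entry *b)
    {
            int64_t ret = 0;
            int i;

            for (i = 0; i < nr_keys; i++) {
                    ret = keys[i].cmp(a, b);
                    if (ret)        /* first key that differs decides */
                            break;
            }
            return ret;
    }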

-Andi

--
[email protected] -- Speaking for myself only.

2014-03-01 01:27:48

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 11/19] perf, c2c: Add in sort on physid

Hi Andi,

On Sat, Mar 1, 2014 at 1:07 AM, Andi Kleen <[email protected]> wrote:
>> I don't think I understand the problem enough to know what to fix. I just
>> copied this piece of code from builtin-report.c and things seemed to work.
>>
>> Mind giving me some details and I can look at fixing it. :-)
>
> sort.c, even though it has all these sort keys, only sorts by period.
> It should instead sort by all the specified keys in order.
>
> Namhyung looked at it at some point.

Yes, I'm working on it now, but I only have a little time. Hopefully
I can send an RFC version next week.

Thanks,
Namhyung