2012-11-05 13:53:07

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 00/16] perf: add memory access sampling support

This patch series had a new feature to the kernel perf_events
interface and corresponding user level tool, perf.

With this patch, it is possible to sample (not trace) memory
accesses (load, store). For loads, the instruction and data
addresses are captured along with the latency and data source.
For stores, the instruction and data addresses are capture
along with limited cache and TLB information.

For load data source, the memory hierarchy level, the tlb, snoop
and lock information is captured.

Although the perf_event interface is extended in a generic manner,
sampling memory accesses requires HW support. The current patches
implement the feature on Intel processors starting with Nehalem.
The patches leverage the PEBS Load Latency and Precise Store
mechanisms. Precise Store is present only on Sandy Bridge and
Ivy Bridge based processors.

The perf tool is extended to make capturing and analyzing the
data easier with a new command: perf mem.

$ perf mem -t load rec triad
$ perf mem -t load rep --stdio
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked
#
# Overhead Samples Cost Memory access Symbol Shared Obj Data Symbol Data Object Snoop TLB access Locked
# ........ ........... ....... ............. .......... ........... ...................... ........... ...... ............ ......
#
0.10% 1 986 LFB hit [.] triad triad [.] 0x00007f67dffe8038 [unknown] None L1 or L2 hit No
0.09% 1 890 LFB hit [.] triad triad [.] 0x00007f67df91a750 [unknown] None L1 or L2 hit No
0.08% 1 826 LFB hit [.] triad triad [.] 0x00007f67e288fba8 [unknown] None L1 or L2 hit No
0.08% 1 825 LFB hit [.] triad triad [.] 0x00007f67dea28c80 [unknown] None L1 or L2 hit No
0.08% 1 787 LFB hit [.] triad triad [.] 0x00007f67df055a60 [unknown] None L1 or L2 hit No

The perf mem command is a wrapper around perf record/report. It passes the
right options to the report and record commands. Note that the TUI mode is
supported.

One powerful feature of perf is that users can toy with sort order to display
the information in different format or from a different angle. This is particularly
useful with memory sampling:

$ perf mem -t load rep --sort=mem
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : mem
#
# Overhead Samples Memory access
# ........ ........... ........................
#
85.26% 10633 LFB hit
7.35% 8151 L1 hit
3.13% 383 L3 hit
3.09% 195 Local RAM hit
1.16% 259 L2 hit
0.00% 4 Uncached hit

Or if one is interested in the data view:
$ perf mem -t load rep --sort=symbol_daddr,cost
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : symbol_daddr,cost
#
# Overhead Samples Data Symbol Cost
# ........ ........... ...................... .......
#
0.10% 1 [.] 0x00007f67dffe8038 986
0.09% 1 [.] 0x00007f67df91a750 890
0.08% 1 [.] 0x00007f67e288fba8 826


One note on the cost displayed: On Intel processors with PEBS Load Latency, as described
in the SDM, the cost encompasses the number of cycles from dispatch to Globally Observable
(GO) state. That means, that it includes OOO execution. It is not usual to see L1D Hits
with a cost of > 100 cycles. Always look at the memory level for an approximation of the
access penalty, then interpret the cost value accordingly.

There is no cost associated with stores.

CAVEAT: Note that the data addresses are not resolved correctly currently due to a
problem in perf data symbol resolution code which I have not been able to
uncover so far.

In v2, we leverage some of Andi Kleen's Haswell patches, namely the weighted
samples and perf tool event parser fixes. We also introduce PERF_RECORD_MISC_DATA_MMAP
to tag mmaps for data vs. code. This helps the perf tool distinguish data. vs. code
mmaps (and therefore symbols). We have also integrated the feedback from v1. Note that in
v2 data symbol resolution is not yet fully operational, but there is a slight improvement.


Enjoy,

Signed-off-by: Stephane Eranian <[email protected]>


Andi Kleen (2):
perf, core: Add a concept of a weightened sample
perf, tools: Add arbitary aliases and support names with -

Stephane Eranian (14):
perf/x86: improve sysfs event mapping with event string
perf/x86: add flags to event constraints
perf: add minimal support for PERF_SAMPLE_WEIGHT
perf: add support for PERF_SAMPLE_ADDR in dump_sampple()
perf: add generic memory sampling interface
perf/x86: add memory profiling via PEBS Load Latency
perf/x86: export PEBS load latency threshold register to sysfs
perf/x86: add support for PEBS Precise Store
perf tools: add mem access sampling core support
perf report: add support for mem access profiling
perf record: add support for mem access profiling
perf tools: add new mem command for memory access profiling
perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP
perf tools: detect data vs. text mappings

arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/perf_event.c | 33 +--
arch/x86/kernel/cpu/perf_event.h | 61 ++++-
arch/x86/kernel/cpu/perf_event_intel.c | 66 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 182 +++++++++++++-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
include/linux/perf_event.h | 5 +
include/uapi/linux/perf_event.h | 74 +++++-
kernel/events/core.c | 15 ++
tools/perf/Documentation/perf-mem.txt | 48 ++++
tools/perf/Makefile | 1 +
tools/perf/builtin-mem.c | 238 ++++++++++++++++++
tools/perf/builtin-record.c | 2 +
tools/perf/builtin-report.c | 131 +++++++++-
tools/perf/builtin.h | 1 +
tools/perf/command-list.txt | 1 +
tools/perf/perf.c | 1 +
tools/perf/perf.h | 1 +
tools/perf/util/event.h | 2 +
tools/perf/util/evsel.c | 15 ++
tools/perf/util/hist.c | 66 ++++-
tools/perf/util/hist.h | 13 +
tools/perf/util/machine.c | 10 +-
tools/perf/util/parse-events.l | 2 +
tools/perf/util/session.c | 44 ++++
tools/perf/util/session.h | 4 +
tools/perf/util/sort.c | 324 ++++++++++++++++++++++++-
tools/perf/util/sort.h | 10 +-
tools/perf/util/symbol.h | 7 +
29 files changed, 1312 insertions(+), 48 deletions(-)
create mode 100644 tools/perf/Documentation/perf-mem.txt
create mode 100644 tools/perf/builtin-mem.c

--
1.7.9.5


2012-11-05 13:53:12

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 02/16] perf/x86: add flags to event constraints

This patch adds a flags field to each event constraint.
It can be used to store event specific features which can
then later be used by scheduling code or low-level x86 code.

The flags are propagated into event->hw.flags during the
get_event_constraint() call. They are cleared during the
put_event_constraint() call.

This mechanism is going to be used by the PEBS-LL patches.
It avoids defining yet another table to hold event specific
information.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 2 +-
arch/x86/kernel/cpu/perf_event.h | 8 +++++---
arch/x86/kernel/cpu/perf_event_intel.c | 5 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 4 +++-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
include/linux/perf_event.h | 1 +
6 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index aeca46d..80dd390 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1470,7 +1470,7 @@ static int __init init_hw_perf_events(void)

unconstrained = (struct event_constraint)
__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
- 0, x86_pmu.num_counters, 0);
+ 0, x86_pmu.num_counters, 0, 0);

x86_pmu.attr_rdpmc = 1; /* enable userspace RDPMC usage by default */
x86_pmu_format_group.attrs = x86_pmu.format_attrs;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 41479da..f78d6e8 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -59,6 +59,7 @@ struct event_constraint {
u64 cmask;
int weight;
int overlap;
+ int flags;
};

struct amd_nb {
@@ -170,16 +171,17 @@ struct cpu_hw_events {
void *kfree_on_online;
};

-#define __EVENT_CONSTRAINT(c, n, m, w, o) {\
+#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
{ .idxmsk64 = (n) }, \
.code = (c), \
.cmask = (m), \
.weight = (w), \
.overlap = (o), \
+ .flags = f, \
}

#define EVENT_CONSTRAINT(c, n, m) \
- __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0)
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)

/*
* The overlap flag marks event constraints with overlapping counter
@@ -203,7 +205,7 @@ struct cpu_hw_events {
* and its counter masks must be kept at a minimum.
*/
#define EVENT_CONSTRAINT_OVERLAP(c, n, m) \
- __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1)
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1, 0)

/*
* Constraint on the Event code.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 93b9e11..57d6527 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1367,8 +1367,10 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)

if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if ((event->hw.config & c->cmask) == c->code)
+ if ((event->hw.config & c->cmask) == c->code) {
+ event->hw.flags |= c->flags;
return c;
+ }
}
}

@@ -1413,6 +1415,7 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
+ event->hw.flags = 0;
intel_put_shared_regs_event_constraints(cpuc, event);
}

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 826054a..f30d85b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -430,8 +430,10 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)

if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
- if ((event->hw.config & c->cmask) == c->code)
+ if ((event->hw.config & c->cmask) == c->code) {
+ event->hw.flags |= c->flags;
return c;
+ }
}
}

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 3cf3d97..cc39e64 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -2438,7 +2438,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type)

type->unconstrainted = (struct event_constraint)
__EVENT_CONSTRAINT(0, (1ULL << type->num_counters) - 1,
- 0, type->num_counters, 0);
+ 0, type->num_counters, 0, 0);

for (i = 0; i < type->num_boxes; i++) {
pmus[i].func_id = -1;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2faa..484cfbc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,6 +128,7 @@ struct hw_perf_event {
int event_base_rdpmc;
int idx;
int last_cpu;
+ int flags;

struct hw_perf_event_extra extra_reg;
struct hw_perf_event_extra branch_reg;
--
1.7.9.5

2012-11-05 13:53:21

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 06/16] perf: add support for PERF_SAMPLE_ADDR in dump_sampple()

Was missing from current code yet PERF_SAMPLE_ADDR has
been present for a long time. Needed for PEBS-LL mode.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/session.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 962c04f..b6d7cad 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1009,6 +1009,10 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,

if (sample_type & PERF_SAMPLE_WEIGHT)
printf(" ... weight: %"PRIu64"\n", sample->weight);
+
+ if (sample_type & PERF_SAMPLE_ADDR)
+ printf(" ..... data: 0x%"PRIx64"\n", sample->addr);
+
}

static struct machine *
--
1.7.9.5

2012-11-05 13:53:16

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 04/16] perf: add minimal support for PERF_SAMPLE_WEIGHT

Ensure we grab the weight from raw sample struct
and that we can dump it via perf report -D.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 5 +++++
tools/perf/util/session.c | 3 +++
3 files changed, 9 insertions(+)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 0d573ff..cf52977 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -90,6 +90,7 @@ struct perf_sample {
u64 period;
u32 cpu;
u32 raw_size;
+ u64 weight;
void *raw_data;
struct ip_callchain *callchain;
struct branch_stack *branch_stack;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 618d411..b6f3697 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1020,6 +1020,11 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
}
}

+ if (type & PERF_SAMPLE_WEIGHT) {
+ data->weight= *array;
+ array++;
+ }
+
return 0;
}

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 15abe40..962c04f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1006,6 +1006,9 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,

if (sample_type & PERF_SAMPLE_STACK_USER)
stack_user__printf(&sample->user_stack);
+
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ printf(" ... weight: %"PRIu64"\n", sample->weight);
}

static struct machine *
--
1.7.9.5

2012-11-05 13:53:27

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

This patch adds support for memory profiling using the
PEBS Load Latency facility.

Load accesses are sampled by HW and the instruction
address, data address, load latency, data source, tlb,
locked information can be saved in the sampling buffer
if using the PERF_SAMPLE_COST (for latency),
PERF_SAMPLE_ADDR, PERF_SAMPLE_DSRC types.

To enable PEBS Load Latency, users have to use the
model specific event:
- on NHM/WSM: MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD
- on SNB/IVB: MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD

To make things easier, this patch also exports a generic
alias via sysfs: mem-loads. It export the right event
encoding based on the host CPU and can be used directly
by the perf tool.

Loosely based on Intel's Lin Ming patch posted on LKML
in July 2011.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/perf_event.c | 3 +
arch/x86/kernel/cpu/perf_event.h | 22 ++++-
arch/x86/kernel/cpu/perf_event_intel.c | 56 ++++++++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 133 +++++++++++++++++++++++++++--
5 files changed, 206 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..5edc9c0 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -66,6 +66,7 @@
#define MSR_IA32_PEBS_ENABLE 0x000003f1
#define MSR_IA32_DS_AREA 0x00000600
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
+#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6

#define MSR_MTRRfix64K_00000 0x00000250
#define MSR_MTRRfix16K_80000 0x00000258
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 80dd390..01b4717 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1475,6 +1475,9 @@ static int __init init_hw_perf_events(void)
x86_pmu.attr_rdpmc = 1; /* enable userspace RDPMC usage by default */
x86_pmu_format_group.attrs = x86_pmu.format_attrs;

+ if (x86_pmu.event_attrs)
+ x86_pmu_events_group.attrs = x86_pmu.event_attrs;
+
if (!x86_pmu.events_sysfs_show)
x86_pmu_events_group.attrs = &empty_attrs;
else
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index f78d6e8..3c5aa72 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -46,6 +46,7 @@ enum extra_reg_type {
EXTRA_REG_RSP_0 = 0, /* offcore_response_0 */
EXTRA_REG_RSP_1 = 1, /* offcore_response_1 */
EXTRA_REG_LBR = 2, /* lbr_select */
+ EXTRA_REG_LDLAT = 3, /* ld_lat_threshold */

EXTRA_REG_MAX /* number of entries needed */
};
@@ -61,6 +62,10 @@ struct event_constraint {
int overlap;
int flags;
};
+/*
+ * struct event_constraint flags
+ */
+#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -233,6 +238,10 @@ struct cpu_hw_events {
#define INTEL_UEVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)

+#define INTEL_PLD_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
+
#define EVENT_CONSTRAINT_END \
EVENT_CONSTRAINT(0, 0, 0)

@@ -262,12 +271,22 @@ struct extra_reg {
.msr = (ms), \
.config_mask = (m), \
.valid_mask = (vm), \
- .idx = EXTRA_REG_##i \
+ .idx = EXTRA_REG_##i, \
}

#define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx) \
EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx)

+#define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx) \
+ EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
+ ARCH_PERFMON_EVENTSEL_UMASK, vm, idx)
+
+#define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \
+ INTEL_UEVENT_EXTRA_REG(c, \
+ MSR_PEBS_LD_LAT_THRESHOLD, \
+ 0xffff, \
+ LDLAT)
+
#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)

union perf_capabilities {
@@ -355,6 +374,7 @@ struct x86_pmu {
*/
int attr_rdpmc;
struct attribute **format_attrs;
+ struct attribute **event_attrs;

ssize_t (*events_sysfs_show)(char *page, u64 config);

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 57d6527..a279d01 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -81,6 +81,7 @@ static struct event_constraint intel_nehalem_event_constraints[] __read_mostly =
static struct extra_reg intel_nehalem_extra_regs[] __read_mostly =
{
INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0xffff, RSP_0),
+ INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x100b),
EVENT_EXTRA_END
};

@@ -111,6 +112,7 @@ static struct extra_reg intel_westmere_extra_regs[] __read_mostly =
{
INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0xffff, RSP_0),
INTEL_EVENT_EXTRA_REG(0xbb, MSR_OFFCORE_RSP_1, 0xffff, RSP_1),
+ INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x100b),
EVENT_EXTRA_END
};

@@ -130,9 +132,55 @@ static struct event_constraint intel_gen_event_constraints[] __read_mostly =
static struct extra_reg intel_snb_extra_regs[] __read_mostly = {
INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0x3fffffffffull, RSP_0),
INTEL_EVENT_EXTRA_REG(0xbb, MSR_OFFCORE_RSP_1, 0x3fffffffffull, RSP_1),
+ INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
EVENT_EXTRA_END
};

+
+EVENT_ATTR(cpu-cycles, CPU_CYCLES );
+EVENT_ATTR(instructions, INSTRUCTIONS );
+EVENT_ATTR(cache-references, CACHE_REFERENCES );
+EVENT_ATTR(cache-misses, CACHE_MISSES );
+EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
+EVENT_ATTR(branch-misses, BRANCH_MISSES );
+EVENT_ATTR(bus-cycles, BUS_CYCLES );
+EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
+EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
+EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );
+
+EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
+EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
+
+struct attribute *nhm_events_attrs[] = {
+ EVENT_PTR(CPU_CYCLES),
+ EVENT_PTR(INSTRUCTIONS),
+ EVENT_PTR(CACHE_REFERENCES),
+ EVENT_PTR(CACHE_MISSES),
+ EVENT_PTR(BRANCH_INSTRUCTIONS),
+ EVENT_PTR(BRANCH_MISSES),
+ EVENT_PTR(BUS_CYCLES),
+ EVENT_PTR(STALLED_CYCLES_FRONTEND),
+ EVENT_PTR(STALLED_CYCLES_BACKEND),
+ EVENT_PTR(REF_CPU_CYCLES),
+ EVENT_PTR(mem_ld_nhm),
+ NULL,
+};
+
+struct attribute *snb_events_attrs[] = {
+ EVENT_PTR(CPU_CYCLES),
+ EVENT_PTR(INSTRUCTIONS),
+ EVENT_PTR(CACHE_REFERENCES),
+ EVENT_PTR(CACHE_MISSES),
+ EVENT_PTR(BRANCH_INSTRUCTIONS),
+ EVENT_PTR(BRANCH_MISSES),
+ EVENT_PTR(BUS_CYCLES),
+ EVENT_PTR(STALLED_CYCLES_FRONTEND),
+ EVENT_PTR(STALLED_CYCLES_BACKEND),
+ EVENT_PTR(REF_CPU_CYCLES),
+ EVENT_PTR(mem_ld_snb),
+ NULL,
+};
+
static u64 intel_pmu_event_map(int hw_event)
{
return intel_perfmon_event_map[hw_event];
@@ -2009,6 +2057,8 @@ __init int intel_pmu_init(void)
x86_pmu.enable_all = intel_pmu_nhm_enable_all;
x86_pmu.extra_regs = intel_nehalem_extra_regs;

+ x86_pmu.event_attrs = nhm_events_attrs;
+
/* UOPS_ISSUED.STALLED_CYCLES */
intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2049,6 +2099,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_westmere_extra_regs;
x86_pmu.er_flags |= ERF_HAS_RSP_1;

+ x86_pmu.event_attrs = nhm_events_attrs;
+
/* UOPS_ISSUED.STALLED_CYCLES */
intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2077,6 +2129,8 @@ __init int intel_pmu_init(void)
x86_pmu.er_flags |= ERF_HAS_RSP_1;
x86_pmu.er_flags |= ERF_NO_HT_SHARING;

+ x86_pmu.event_attrs = snb_events_attrs;
+
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2102,6 +2156,8 @@ __init int intel_pmu_init(void)
x86_pmu.er_flags |= ERF_HAS_RSP_1;
x86_pmu.er_flags |= ERF_NO_HT_SHARING;

+ x86_pmu.event_attrs = snb_events_attrs;
+
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index f30d85b..759b0c0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -24,6 +24,92 @@ struct pebs_record_32 {

*/

+union intel_x86_pebs_dse {
+ u64 val;
+ struct {
+ unsigned int ld_dse:4;
+ unsigned int ld_stlb_miss:1;
+ unsigned int ld_locked:1;
+ unsigned int ld_reserved:26;
+ };
+ struct {
+ unsigned int st_l1d_hit:1;
+ unsigned int st_reserved1:3;
+ unsigned int st_stlb_miss:1;
+ unsigned int st_locked:1;
+ unsigned int st_reserved2:26;
+ };
+};
+
+
+/*
+ * Map PEBS Load Latency Data Source encodings to generic
+ * memory data source information
+ */
+#define P(a, b) PERF_MEM_S(a, b)
+#define OP_LH (P(OP, LOAD) | P(LVL, HIT))
+#define SNOOP_NONE_MISS (P(SNOOP, NONE) | P(SNOOP, MISS))
+
+static const u64 pebs_data_source[] = {
+ P(OP, LOAD) | P(LVL, MISS) | P(LVL, L3) | P(SNOOP, NA),/* 0x00:ukn L3 */
+ OP_LH | P(LVL, L1) | P(SNOOP, NONE), /* 0x01: L1 local */
+ OP_LH | P(LVL, LFB)| P(SNOOP, NONE), /* 0x02: LFB hit */
+ OP_LH | P(LVL, L2) | P(SNOOP, NONE), /* 0x03: L2 hit */
+ OP_LH | P(LVL, L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
+ OP_LH | P(LVL, L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
+ OP_LH | P(LVL, L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
+ OP_LH | P(LVL, L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
+ OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
+ OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
+ OP_LH | P(LVL, LOC_RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
+ OP_LH | P(LVL, REM_RAM1) | P(SNOOP, HIT), /* 0x0b: L3 miss, shared */
+ OP_LH | P(LVL, LOC_RAM) | SNOOP_NONE_MISS,/* 0x0c: L3 miss, excl */
+ OP_LH | P(LVL, REM_RAM1) | SNOOP_NONE_MISS,/* 0x0d: L3 miss, excl */
+ OP_LH | P(LVL, IO) | P(SNOOP, NONE), /* 0x0e: I/O */
+ OP_LH | P(LVL,UNC) | P(SNOOP, NONE), /* 0x0f: uncached */
+};
+
+static u64 load_latency_data(u64 status)
+{
+ union intel_x86_pebs_dse dse;
+ u64 val;
+ int model = boot_cpu_data.x86_model;
+ int fam = boot_cpu_data.x86;
+
+ dse.val = status;
+
+ /*
+ * use the mapping table for bit 0-3
+ */
+ val = pebs_data_source[dse.ld_dse];
+
+ /*
+ * Nehalem models do not support TLB, Lock infos
+ */
+ if (fam == 0x6 && (model == 26 || model == 30
+ || model == 31 || model == 46)) {
+ val |= P(TLB, NA) | P(LOCK, NA);
+ return val;
+ }
+ /*
+ * bit 4: TLB access
+ * 0 = did not miss 2nd level TLB
+ * 1 = missed 2nd level TLB
+ */
+ if (dse.ld_stlb_miss)
+ val |= P(TLB, MISS) | P(TLB, L2);
+ else
+ val |= P(TLB, HIT) | P(TLB,L1) | P(TLB, L2);
+
+ /*
+ * bit 5: locked prefix
+ */
+ if (dse.ld_locked)
+ val |= P(LOCK, LOCKED);
+
+ return val;
+}
+
struct pebs_record_core {
u64 flags, ip;
u64 ax, bx, cx, dx;
@@ -364,7 +450,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] = {
};

struct event_constraint intel_nehalem_pebs_event_constraints[] = {
- INTEL_EVENT_CONSTRAINT(0x0b, 0xf), /* MEM_INST_RETIRED.* */
+ INTEL_PLD_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0x0f, 0xf), /* MEM_UNCORE_RETIRED.* */
INTEL_UEVENT_CONSTRAINT(0x010c, 0xf), /* MEM_STORE_RETIRED.DTLB_MISS */
INTEL_EVENT_CONSTRAINT(0xc0, 0xf), /* INST_RETIRED.ANY */
@@ -379,7 +465,7 @@ struct event_constraint intel_nehalem_pebs_event_constraints[] = {
};

struct event_constraint intel_westmere_pebs_event_constraints[] = {
- INTEL_EVENT_CONSTRAINT(0x0b, 0xf), /* MEM_INST_RETIRED.* */
+ INTEL_PLD_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0x0f, 0xf), /* MEM_UNCORE_RETIRED.* */
INTEL_UEVENT_CONSTRAINT(0x010c, 0xf), /* MEM_STORE_RETIRED.DTLB_MISS */
INTEL_EVENT_CONSTRAINT(0xc0, 0xf), /* INSTR_RETIRED.* */
@@ -399,7 +485,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
INTEL_EVENT_CONSTRAINT(0xc4, 0xf), /* BR_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.* */
+ INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -413,7 +499,7 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
INTEL_EVENT_CONSTRAINT(0xc4, 0xf), /* BR_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.* */
+ INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -448,6 +534,9 @@ void intel_pmu_pebs_enable(struct perf_event *event)
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

cpuc->pebs_enabled |= 1ULL << hwc->idx;
+
+ if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+ cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
}

void intel_pmu_pebs_disable(struct perf_event *event)
@@ -560,20 +649,48 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs, void *__pebs)
{
/*
- * We cast to pebs_record_core since that is a subset of
- * both formats and we don't use the other fields in this
- * routine.
+ * We cast to pebs_record_nhm to get the load latency data
+ * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
*/
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct pebs_record_core *pebs = __pebs;
+ struct pebs_record_nhm *pebs = __pebs;
struct perf_sample_data data;
struct pt_regs regs;
+ u64 sample_type;
+ int fll;

if (!intel_pmu_save_and_restart(event))
return;

+ fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
+
perf_sample_data_init(&data, 0, event->hw.last_period);

+ data.period = event->hw.last_period;
+ sample_type = event->attr.sample_type;
+
+ /*
+ * if PEBS-LL or PreciseStore
+ */
+ if (fll) {
+ if (sample_type & PERF_SAMPLE_ADDR)
+ data.addr = pebs->dla;
+
+ /*
+ * Use latency for weight (only avail with PEBS-LL)
+ */
+ if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
+ data.weight = pebs->lat;
+
+ /*
+ * data.dsrc encodes the data source
+ */
+ if (sample_type & PERF_SAMPLE_DSRC) {
+ if (fll)
+ data.dsrc.val = load_latency_data(pebs->dse);
+ }
+ }
+
/*
* We use the interrupt regs as a base because the PEBS record
* does not contain a full regs set, specifically it seems to
--
1.7.9.5

2012-11-05 13:53:35

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 10/16] perf/x86: add support for PEBS Precise Store

This patch adds support for PEBS Precise Store
which is available on Intel Sandy Bridge and
Ivy Bridge processors.

To use Precise store, the proper PEBS event
must be used: mem_trans_retired:precise_stores.
For the perf tool, the generic mem-stores event
exported via sysfs can be used directly.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 5 +++
arch/x86/kernel/cpu/perf_event_intel.c | 2 ++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 49 +++++++++++++++++++++++++++--
3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3c5aa72..4e95c90 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -66,6 +66,7 @@ struct event_constraint {
* struct event_constraint flags
*/
#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -242,6 +243,10 @@ struct cpu_hw_events {
__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)

+#define INTEL_PST_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_ST)
+
#define EVENT_CONSTRAINT_END \
EVENT_CONSTRAINT(0, 0, 0)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index cbfb252..4176be3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -150,6 +150,7 @@ EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );

EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
+EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");

struct attribute *nhm_events_attrs[] = {
EVENT_PTR(CPU_CYCLES),
@@ -178,6 +179,7 @@ struct attribute *snb_events_attrs[] = {
EVENT_PTR(STALLED_CYCLES_BACKEND),
EVENT_PTR(REF_CPU_CYCLES),
EVENT_PTR(mem_ld_snb),
+ EVENT_PTR(mem_st_snb),
NULL,
};

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 759b0c0..ed0ca3e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -69,6 +69,44 @@ static const u64 pebs_data_source[] = {
OP_LH | P(LVL,UNC) | P(SNOOP, NONE), /* 0x0f: uncached */
};

+static u64 precise_store_data(u64 status)
+{
+ union intel_x86_pebs_dse dse;
+ u64 val = P(OP, STORE) | P(SNOOP, NA) | P(LVL, L1) | P(TLB, L2);
+
+ dse.val = status;
+
+ /*
+ * bit 4: TLB access
+ * 1 = stored missed 2nd level TLB
+ *
+ * so it either hit the walker or the OS
+ * otherwise hit 2nd level TLB
+ */
+ if (dse.st_stlb_miss)
+ val |= P(TLB, MISS);
+ else
+ val |= P(TLB, HIT);
+
+ /*
+ * bit 0: hit L1 data cache
+ * if not set, then all we know is that
+ * it missed L1D
+ */
+ if (dse.st_l1d_hit)
+ val |= P(LVL, HIT);
+ else
+ val |= P(LVL, MISS);
+
+ /*
+ * bit 5: Locked prefix
+ */
+ if (dse.st_locked)
+ val |= P(LOCK, LOCKED);
+
+ return val;
+}
+
static u64 load_latency_data(u64 status)
{
union intel_x86_pebs_dse dse;
@@ -486,6 +524,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0xc4, 0xf), /* BR_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
+ INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -500,6 +539,7 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0xc4, 0xf), /* BR_INST_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
+ INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -537,6 +577,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
+ else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
+ cpuc->pebs_enabled |= 1ULL << 63;
}

void intel_pmu_pebs_disable(struct perf_event *event)
@@ -657,12 +699,13 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct perf_sample_data data;
struct pt_regs regs;
u64 sample_type;
- int fll;
+ int fll, fst;

if (!intel_pmu_save_and_restart(event))
return;

fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
+ fst = event->hw.flags & PERF_X86_EVENT_PEBS_ST;

perf_sample_data_init(&data, 0, event->hw.last_period);

@@ -672,7 +715,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
/*
* if PEBS-LL or PreciseStore
*/
- if (fll) {
+ if (fll || fst) {
if (sample_type & PERF_SAMPLE_ADDR)
data.addr = pebs->dla;

@@ -688,6 +731,8 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
if (sample_type & PERF_SAMPLE_DSRC) {
if (fll)
data.dsrc.val = load_latency_data(pebs->dse);
+ else
+ data.dsrc.val = precise_store_data(pebs->dse);
}
}

--
1.7.9.5

2012-11-05 13:53:40

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 13/16] perf record: add support for mem access profiling

Add the -l option to perf record to enable sampling
access cost sampling.

Data address sampling is obtained via the -d option.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/builtin-record.c | 2 ++
tools/perf/perf.h | 1 +
tools/perf/util/evsel.c | 6 ++++++
3 files changed, 9 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 5783c32..8243405 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1041,6 +1041,8 @@ const struct option record_options[] = {
OPT_CALLBACK('j', "branch-filter", &record.opts.branch_stack,
"branch filter mask", "branch stack filter modes",
parse_branch_stack),
+ OPT_BOOLEAN('l', "cost", &record.opts.weight,
+ "event cost"),
OPT_END()
};

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 469fbf2..8a39117 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -237,6 +237,7 @@ struct perf_record_opts {
bool sample_id_all_missing;
bool exclude_guest_missing;
bool period;
+ bool weight;
unsigned int freq;
unsigned int mmap_pages;
unsigned int user_freq;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 951d4d1..65ca661 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -474,6 +474,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts,
attr->sample_type |= PERF_SAMPLE_CPU;
}

+ if (opts->weight)
+ attr->sample_type |= PERF_SAMPLE_WEIGHT;
+
+ if (opts->sample_address)
+ attr->sample_type |= PERF_SAMPLE_DSRC;
+
if (opts->no_delay) {
attr->watermark = 0;
attr->wakeup_events = 1;
--
1.7.9.5

2012-11-05 13:53:53

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 16/16] perf tools: detect data vs. text mappings

Leverages the PERF_RECORD_MISC_MMAP_DATA bit in
the RECORD_MMAP record header. When the bit is set
then the mapping type is set to MAP__VARIABLE.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/machine.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 502eec0..1a88b9c 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -192,6 +192,7 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
struct thread *thread;
struct map *map;
+ enum map_type type;
int ret = 0;

if (dump_trace)
@@ -208,10 +209,17 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
thread = machine__findnew_thread(machine, event->mmap.pid);
if (thread == NULL)
goto out_problem;
+
+ if (event->header.misc & PERF_RECORD_MISC_MMAP_DATA)
+ type = MAP__VARIABLE;
+ else
+ type = MAP__FUNCTION;
+
map = map__new(&machine->user_dsos, event->mmap.start,
event->mmap.len, event->mmap.pgoff,
event->mmap.pid, event->mmap.filename,
- MAP__FUNCTION);
+ type);
+
if (map == NULL)
goto out_problem;

--
1.7.9.5

2012-11-05 13:53:46

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 15/16] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP

Type of mapping was lost and made it hard for a tool
to distinguish code vs. data mmaps. Perf has the ability
to distinguish the two.

Use a bit in the header->misc bitmask to keep track of
the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
the mapping is not executable (!VM_EXEC). If not set, then
the mapping is executable.

Signed-off-by: Stephane Eranian <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b2a78bb..0fa3feb 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -446,6 +446,7 @@ struct perf_event_mmap_page {
#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0)
#define PERF_RECORD_MISC_GUEST_USER (5 << 0)

+#define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
/*
* Indicates that the content of PERF_SAMPLE_IP points to
* the actual instruction that triggered the event. See also
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 94b2c70..e7318a7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4764,6 +4764,9 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
mmap_event->file_name = name;
mmap_event->file_size = size;

+ if (!(vma->vm_flags & VM_EXEC))
+ mmap_event->event_id.header.misc |= PERF_RECORD_MISC_MMAP_DATA;
+
mmap_event->event_id.header.size = sizeof(mmap_event->event_id) + size;

rcu_read_lock();
--
1.7.9.5

2012-11-05 13:54:27

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

This new command is a wrapper on top of perf record and
perf report to make it easier to configure for memory
access profiling.

To record loads:
$ perf mem -t load rec .....

To record stores:
$ perf mem -t store rec .....

To get the report:
$ perf mem -t load rep

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/Documentation/perf-mem.txt | 48 +++++++
tools/perf/Makefile | 1 +
tools/perf/builtin-mem.c | 238 +++++++++++++++++++++++++++++++++
tools/perf/builtin.h | 1 +
tools/perf/command-list.txt | 1 +
tools/perf/perf.c | 1 +
6 files changed, 290 insertions(+)
create mode 100644 tools/perf/Documentation/perf-mem.txt
create mode 100644 tools/perf/builtin-mem.c

diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt
new file mode 100644
index 0000000..888d511
--- /dev/null
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -0,0 +1,48 @@
+perf-mem(1)
+===========
+
+NAME
+----
+perf-mem - Profile memory accesses
+
+SYNOPSIS
+--------
+[verse]
+'perf mem' [<options>] (record [<command>] | report)
+
+DESCRIPTION
+-----------
+"perf mem -t <TYPE> record" runs a command and gathers memory operation data
+from it, into perf.data. Perf record options are accepted and are passed through.
+
+"perf mem -t <TYPE> report" displays the result. It invokes perf report with the
+right set of options to display a memory access profile.
+
+OPTIONS
+-------
+<command>...::
+ Any command you can specify in a shell.
+
+-t::
+--type=::
+ Select the memory operation type: load or store (default: load)
+
+-D::
+--dump-raw-samples=::
+ Dump the raw decoded samples on the screen in a format that is easy to parse with
+ one sample per line.
+
+-x::
+--field-separator::
+ Specify the field separator used when dump raw samples (-D option). By default,
+ The separator is the space character.
+
+-C::
+--cpu-list::
+ Restrict dump of raw samples to those provided via this option. Note that the same
+ option can be passed in record mode. It will be interpreted the same way as perf
+ record.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-report[1]
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 7e25f59..c42b862 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -461,6 +461,7 @@ BUILTIN_OBJS += $(OUTPUT)builtin-lock.o
BUILTIN_OBJS += $(OUTPUT)builtin-kvm.o
BUILTIN_OBJS += $(OUTPUT)builtin-test.o
BUILTIN_OBJS += $(OUTPUT)builtin-inject.o
+BUILTIN_OBJS += $(OUTPUT)builtin-mem.o

PERFLIBS = $(LIB_FILE) $(LIBTRACEEVENT)

diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
new file mode 100644
index 0000000..c2980b8
--- /dev/null
+++ b/tools/perf/builtin-mem.c
@@ -0,0 +1,238 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+#include "util/tool.h"
+#include "util/session.h"
+
+#define MEM_OPERATION_LOAD "load"
+#define MEM_OPERATION_STORE "store"
+
+static const char *mem_operation = MEM_OPERATION_LOAD;
+
+struct perf_mem {
+ struct perf_tool tool;
+ char const *input_name;
+ symbol_filter_t annotate_init;
+ bool hide_unresolved;
+ bool dump_raw;
+ const char *cpu_list;
+ DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+};
+
+static const char * const mem_usage[] = {
+ "perf mem [<options>] {record <command> |report}",
+ NULL
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+ char event[64];
+ int ret;
+
+ rec_argc = argc + 4;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ if (!rec_argv)
+ return -1;
+
+ rec_argv[i++] = strdup("record");
+ if (!strcmp(mem_operation, MEM_OPERATION_LOAD))
+ rec_argv[i++] = strdup("-l");
+ rec_argv[i++] = strdup("-d");
+ rec_argv[i++] = strdup("-e");
+
+ if (strcmp(mem_operation, MEM_OPERATION_LOAD))
+ sprintf(event, "cpu/mem-stores/pp");
+ else
+ sprintf(event, "cpu/mem-loads/pp");
+
+ rec_argv[i++] = strdup(event);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ ret = cmd_record(i, rec_argv, NULL);
+ free(rec_argv);
+ return ret;
+}
+
+static int
+dump_raw_samples(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel __maybe_unused,
+ struct machine *machine)
+{
+ struct perf_mem *mem = container_of(tool, struct perf_mem, tool);
+ struct addr_location al;
+ const char *fmt;
+
+ if (perf_event__preprocess_sample(event, machine, &al, sample,
+ mem->annotate_init) < 0) {
+ fprintf(stderr, "problem processing %d event, skipping it.\n",
+ event->header.type);
+ return -1;
+ }
+
+ if (al.filtered || (mem->hide_unresolved && al.sym == NULL))
+ return 0;
+
+ if (al.map != NULL)
+ al.map->dso->hit = 1;
+
+ if (symbol_conf.field_sep) {
+ fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s%"PRIu64
+ "%s0x%"PRIx64"%s%s:%s\n";
+ } else {
+ fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64
+ "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
+ symbol_conf.field_sep = " ";
+ }
+
+ printf( fmt,
+ sample->pid,
+ symbol_conf.field_sep,
+ sample->tid,
+ symbol_conf.field_sep,
+ event->ip.ip,
+ symbol_conf.field_sep,
+ sample->addr,
+ symbol_conf.field_sep,
+ sample->weight,
+ symbol_conf.field_sep,
+ sample->dsrc,
+ symbol_conf.field_sep,
+ al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???",
+ al.sym ? al.sym->name : "???");
+
+ return 0;
+}
+
+static int process_sample_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct machine *machine)
+{
+ return dump_raw_samples(tool, event, sample, evsel, machine);
+}
+
+static int report_raw_events(struct perf_mem *mem)
+{
+ int err = -EINVAL;
+ int ret;
+ struct perf_session *session = perf_session__new(input_name, O_RDONLY,
+ 0, false, &mem->tool);
+
+ if (session == NULL)
+ return -ENOMEM;
+
+ if (mem->cpu_list) {
+ ret = perf_session__cpu_bitmap(session, mem->cpu_list,
+ mem->cpu_bitmap);
+ if (ret)
+ goto out_delete;
+ }
+
+ if (symbol__init() < 0)
+ return -1;
+
+ printf("# PID, TID, IP, ADDR, COST, DSRC, SYMBOL\n");
+
+ err = perf_session__process_events(session, &mem->tool);
+ if (err)
+ return err;
+
+ return 0;
+
+out_delete:
+ perf_session__delete(session);
+ return err;
+}
+
+static int report_events(int argc, const char **argv, struct perf_mem *mem)
+{
+ const char **rep_argv;
+ int ret, i = 0, j, rep_argc;
+
+ if (mem->dump_raw)
+ return report_raw_events(mem);
+
+ rep_argc = argc + 3;
+ rep_argv = calloc(rep_argc + 1, sizeof(char *));
+ if (!rep_argv)
+ return -1;
+
+ rep_argv[i++] = strdup("report");
+ rep_argv[i++] = strdup("--mem-mode");
+ rep_argv[i++] = strdup("-n"); /* display number of samples */
+
+ /*
+ * there is no cost associated with stores, so don't print
+ * the column
+ */
+ if (strcmp(mem_operation, MEM_OPERATION_LOAD))
+ rep_argv[i++] = strdup("--sort=mem,sym,dso,symbol_daddr,dso_daddr,tlb,locked");
+
+ for (j = 1; j < argc; j++, i++)
+ rep_argv[i] = argv[j];
+
+ ret = cmd_report(i, rep_argv, NULL);
+ free(rep_argv);
+ return ret;
+}
+
+int cmd_mem(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+ struct stat st;
+ struct perf_mem mem = {
+ .tool = {
+ .sample = process_sample_event,
+ .mmap = perf_event__process_mmap,
+ .comm = perf_event__process_comm,
+ .lost = perf_event__process_lost,
+ .fork = perf_event__process_fork,
+ .build_id = perf_event__process_build_id,
+ .ordered_samples= true,
+ },
+ .input_name = "perf.data",
+ };
+ const struct option mem_options[] = {
+ OPT_STRING('t', "type", &mem_operation,
+ "type", "memory operations(load/store)"),
+ OPT_BOOLEAN('D', "dump-raw-samples", &mem.dump_raw,
+ "dump raw samples in ASCII"),
+ OPT_BOOLEAN('U', "hide-unresolved", &mem.hide_unresolved,
+ "Only display entries resolved to a symbol"),
+ OPT_STRING('i', "input", &input_name, "file", "input file name"),
+ OPT_STRING('C', "cpu", &mem.cpu_list, "cpu", "list of cpus to profile"),
+ OPT_STRING('x', "field-separator", &symbol_conf.field_sep, "separator",
+ "separator for columns, no spaces will be added between "
+ "columns '.' is reserved."),
+ OPT_END()
+ };
+
+ argc = parse_options(argc, argv, mem_options, mem_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+
+ if (!argc || !(strncmp(argv[0], "rec", 3) || mem_operation))
+ usage_with_options(mem_usage, mem_options);
+
+ if (!mem.input_name || !strlen(mem.input_name)) {
+ if (!fstat(STDIN_FILENO, &st) && S_ISFIFO(st.st_mode))
+ mem.input_name = "-";
+ else
+ mem.input_name = "perf.data";
+ }
+
+ if (!strncmp(argv[0], "rec", 3))
+ return __cmd_record(argc, argv);
+ else if (!strncmp(argv[0], "rep", 3))
+ return report_events(argc, argv, &mem);
+ else
+ usage_with_options(mem_usage, mem_options);
+
+ return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index 08143bd..b210d62 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -36,6 +36,7 @@ extern int cmd_kvm(int argc, const char **argv, const char *prefix);
extern int cmd_test(int argc, const char **argv, const char *prefix);
extern int cmd_trace(int argc, const char **argv, const char *prefix);
extern int cmd_inject(int argc, const char **argv, const char *prefix);
+extern int cmd_mem(int argc, const char **argv, const char *prefix);

extern int find_scripts(char **scripts_array, char **scripts_path_array);
#endif
diff --git a/tools/perf/command-list.txt b/tools/perf/command-list.txt
index 3e86bbd..2c5b621 100644
--- a/tools/perf/command-list.txt
+++ b/tools/perf/command-list.txt
@@ -24,3 +24,4 @@ perf-kmem mainporcelain common
perf-lock mainporcelain common
perf-kvm mainporcelain common
perf-test mainporcelain common
+perf-mem mainporcelain common
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index e968373..ccdf32d 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -60,6 +60,7 @@ static struct cmd_struct commands[] = {
{ "trace", cmd_trace, 0 },
#endif
{ "inject", cmd_inject, 0 },
+ { "mem", cmd_mem, 0 },
};

struct pager_config {
--
1.7.9.5

2012-11-05 13:54:45

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 12/16] perf report: add support for mem access profiling

This patch adds the --mem-mode option to perf report.

This mode requires a perf.data file created with memory
access samples.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/builtin-report.c | 131 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 127 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f07eae7..14cb7c1 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -46,6 +46,7 @@ struct perf_report {
bool show_full_info;
bool show_threads;
bool inverted_callchain;
+ bool mem_mode;
struct perf_read_values show_threads_values;
const char *pretty_printing_style;
symbol_filter_t annotate_init;
@@ -54,6 +55,96 @@ struct perf_report {
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
};

+static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
+ struct addr_location *al,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel,
+ struct machine *machine,
+ union perf_event *event)
+{
+ struct perf_report *rep = container_of(tool, struct perf_report, tool);
+ struct symbol *parent = NULL;
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ int err = 0;
+ struct hist_entry *he;
+ struct mem_info *mi, *mx;
+ uint64_t cost;
+
+ if ((sort__has_parent || symbol_conf.use_callchain)
+ && sample->callchain) {
+ err = machine__resolve_callchain(machine, evsel, al->thread,
+ sample, &parent);
+ if (err)
+ return err;
+ }
+
+ mi = machine__resolve_mem(machine, al->thread, sample, cpumode);
+ if (!mi)
+ return -ENOMEM;
+
+ if (rep->hide_unresolved && !al->sym)
+ return 0;
+
+ cost = mi->cost;
+ if (!cost)
+ cost = 1;
+
+ /*
+ * The report shows the percentage of total branches captured
+ * and not events sampled. Thus we use a pseudo period of 1.
+ * Only in the newt browser we are doing integrated annotation,
+ * so we don't allocated the extra space needed because the stdio
+ * code will not use it.
+ */
+ he = __hists__add_mem_entry(&evsel->hists, al, parent, mi,
+ cost);
+ if (!he)
+ return -ENOMEM;
+
+ if (sort__has_sym && he->ms.sym && use_browser > 0) {
+ struct annotation *notes = symbol__annotation(he->ms.sym);
+
+ assert(evsel != NULL);
+
+ if (notes->src == NULL && symbol__alloc_hist(he->ms.sym) < 0)
+ goto out;
+
+ err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
+ if (err)
+ goto out;
+ }
+
+ if (sort__has_sym && he->mem_info->daddr.sym && use_browser > 0) {
+ struct annotation *notes;
+
+ mx = he->mem_info;
+
+ notes = symbol__annotation(mx->daddr.sym);
+ if (!notes->src
+ && symbol__alloc_hist(mx->daddr.sym) < 0)
+ goto out;
+
+ err = symbol__inc_addr_samples(mx->daddr.sym,
+ mx->daddr.map,
+ evsel->idx,
+ mx->daddr.al_addr);
+ if (err)
+ goto out;
+ }
+
+ evsel->hists.stats.total_period += sample->period;
+ hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+ err = 0;
+
+ if (symbol_conf.use_callchain) {
+ err = callchain_append(he->callchain,
+ &callchain_cursor,
+ sample->period);
+ }
+out:
+ return err;
+}
+
static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
struct addr_location *al,
struct perf_sample *sample,
@@ -209,6 +300,12 @@ static int process_sample_event(struct perf_tool *tool,
pr_debug("problem adding lbr entry, skipping event\n");
return -1;
}
+ } else if (rep->mem_mode == 1) {
+ if (perf_report__add_mem_hist_entry(tool, &al, sample,
+ evsel, machine, event)) {
+ pr_debug("problem adding mem entry, skipping event\n");
+ return -1;
+ }
} else {
if (al.map != NULL)
al.map->dso->hit = 1;
@@ -292,7 +389,8 @@ static void sig_handler(int sig __maybe_unused)
session_done = 1;
}

-static size_t hists__fprintf_nr_sample_events(struct hists *self,
+static size_t hists__fprintf_nr_sample_events(struct perf_report *rep,
+ struct hists *self,
const char *evname, FILE *fp)
{
size_t ret;
@@ -305,7 +403,11 @@ static size_t hists__fprintf_nr_sample_events(struct hists *self,
if (evname != NULL)
ret += fprintf(fp, " of event '%s'", evname);

- ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);
+ if (rep->mem_mode) {
+ ret += fprintf(fp, "\n# Total cost : %" PRIu64, nr_events);
+ ret += fprintf(fp, "\n# Sort order : %s", sort_order);
+ } else
+ ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);
return ret + fprintf(fp, "\n#\n");
}

@@ -319,7 +421,7 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
struct hists *hists = &pos->hists;
const char *evname = perf_evsel__name(pos);

- hists__fprintf_nr_sample_events(hists, evname, stdout);
+ hists__fprintf_nr_sample_events(rep, hists, evname, stdout);
hists__fprintf(hists, true, 0, 0, stdout);
fprintf(stdout, "\n\n");
}
@@ -595,7 +697,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"Use the stdio interface"),
OPT_STRING('s', "sort", &sort_order, "key[,key2...]",
"sort by key(s): pid, comm, dso, symbol, parent, dso_to,"
- " dso_from, symbol_to, symbol_from, mispredict"),
+ " dso_from, symbol_to, symbol_from, mispredict, mem, cost, symbol_daddr, dso_daddr, tlb, snoop, locked"),
OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
"Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
@@ -641,6 +743,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"use branch records for histogram filling", parse_branch_mode),
OPT_STRING(0, "objdump", &objdump_path, "path",
"objdump binary to use for disassembly and annotations"),
+ OPT_BOOLEAN(0, "mem-mode", &report.mem_mode, "mem access profile"),
OPT_END()
};

@@ -692,6 +795,18 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"dso_to,symbol_to";

}
+ if (report.mem_mode) {
+ if (sort__branch_mode == 1) {
+ fprintf(stderr, "branch and mem mode incompatible\n");
+ goto error;
+ }
+ /*
+ * if no sort_order is provided, then specify
+ * branch-mode specific order
+ */
+ if (sort_order == default_sort_order)
+ sort_order = "cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+ }

if (strcmp(input_name, "-") != 0)
setup_browser(true);
@@ -763,6 +878,14 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
sort_entry__setup_elide(&sort_sym_from, symbol_conf.sym_from_list, "sym_from", stdout);
sort_entry__setup_elide(&sort_sym_to, symbol_conf.sym_to_list, "sym_to", stdout);
} else {
+ if (report.mem_mode) {
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "symbol_daddr", stdout);
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "dso_daddr", stdout);
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "mem", stdout);
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "cost", stdout);
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "tlb", stdout);
+ sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "snoop", stdout);
+ }
sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "dso", stdout);
sort_entry__setup_elide(&sort_sym, symbol_conf.sym_list, "symbol", stdout);
}
--
1.7.9.5

2012-11-05 13:55:10

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 11/16] perf tools: add mem access sampling core support

This patch adds the sorting and histogram support
functions to enable profiling of memory accesses.

The following sorting orders are added:
- symbol_daddr: data address symbol (or raw address)
- dso_daddr: data address shared object
- cost: access cost
- locked: access uses locked transaction
- tlb : TLB access
- mem : memory level of the access (L1, L2, L3, RAM, ...)
- snoop: access snoop mode

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 4 +
tools/perf/util/hist.c | 66 ++++++++-
tools/perf/util/hist.h | 13 ++
tools/perf/util/session.c | 37 ++++++
tools/perf/util/session.h | 4 +
tools/perf/util/sort.c | 324 ++++++++++++++++++++++++++++++++++++++++++++-
tools/perf/util/sort.h | 10 +-
tools/perf/util/symbol.h | 7 +
9 files changed, 456 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index cf52977..ad66b44 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -91,6 +91,7 @@ struct perf_sample {
u32 cpu;
u32 raw_size;
u64 weight;
+ u64 dsrc;
void *raw_data;
struct ip_callchain *callchain;
struct branch_stack *branch_stack;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b6f3697..951d4d1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1025,6 +1025,10 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
array++;
}

+ if (type & PERF_SAMPLE_DSRC) {
+ data->dsrc = *array;
+ array++;
+ }
return 0;
}

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 277947a..52fe4e2 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -66,12 +66,16 @@ static void hists__set_unres_dso_col_len(struct hists *hists, int dso)
void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
{
const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
+ int symlen;
u16 len;

if (h->ms.sym)
hists__new_col_len(hists, HISTC_SYMBOL, h->ms.sym->namelen + 4);
- else
+ else {
+ symlen = unresolved_col_width + 4 + 2;
+ hists__new_col_len(hists, HISTC_SYMBOL, symlen);
hists__set_unres_dso_col_len(hists, HISTC_DSO);
+ }

len = thread__comm_len(h->thread);
if (hists__new_col_len(hists, HISTC_COMM, len))
@@ -83,7 +87,6 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
}

if (h->branch_info) {
- int symlen;
/*
* +4 accounts for '[x] ' priv level info
* +2 account of 0x prefix on raw addresses
@@ -111,7 +114,32 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
}
+ } else if (h->mem_info) {
+ /*
+ * +4 accounts for '[x] ' priv level info
+ * +2 account of 0x prefix on raw addresses
+ */
+ if (h->mem_info->daddr.sym) {
+ symlen = (int)h->mem_info->daddr.sym->namelen + 4 + unresolved_col_width + 2;
+ hists__new_col_len(hists, HISTC_MEM_DADDR_SYMBOL, symlen);
+ } else {
+ symlen = unresolved_col_width + 4 + 2;
+ hists__new_col_len(hists, HISTC_MEM_DADDR_SYMBOL, symlen);
+ }
+ if (h->mem_info->daddr.map) {
+ symlen = dso__name_len(h->mem_info->daddr.map->dso);
+ hists__new_col_len(hists, HISTC_MEM_DADDR_DSO, symlen);
+ } else {
+ symlen = unresolved_col_width + 4 + 2;
+ hists__set_unres_dso_col_len(hists, HISTC_MEM_DADDR_DSO);
+ }
+ hists__new_col_len(hists, HISTC_MEM_COST, 7);
+ hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
+ hists__new_col_len(hists, HISTC_MEM_TLB, 22);
+ hists__new_col_len(hists, HISTC_MEM_SNOOP, 12);
+ hists__new_col_len(hists, HISTC_MEM_LVL, 21+3);
}
+
}

void hists__output_recalc_col_len(struct hists *hists, int max_rows)
@@ -235,7 +263,7 @@ void hists__decay_entries_threaded(struct hists *hists,
static struct hist_entry *hist_entry__new(struct hist_entry *template)
{
size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
- struct hist_entry *he = malloc(sizeof(*he) + callchain_size);
+ struct hist_entry *he = calloc(1, sizeof(*he) + callchain_size);

if (he != NULL) {
*he = *template;
@@ -321,6 +349,35 @@ static struct hist_entry *add_hist_entry(struct hists *hists,
return he;
}

+struct hist_entry *__hists__add_mem_entry(struct hists *self,
+ struct addr_location *al,
+ struct symbol *sym_parent,
+ struct mem_info *mi,
+ u64 weight)
+{
+ struct hist_entry entry = {
+ .thread = al->thread,
+ .ms = {
+ .map = al->map,
+ .sym = al->sym,
+ },
+ .cpu = al->cpu,
+ .ip = al->addr,
+ .level = al->level,
+ .stat = {
+ .period = weight,
+ .nr_events = 1,
+ },
+ .parent = sym_parent,
+ .filtered = symbol__parent_filter(sym_parent),
+ .hists = self,
+ .mem_info = mi,
+ .branch_info = NULL,
+ };
+
+ return add_hist_entry(self, &entry, al, weight);
+}
+
struct hist_entry *__hists__add_branch_entry(struct hists *self,
struct addr_location *al,
struct symbol *sym_parent,
@@ -344,6 +401,7 @@ struct hist_entry *__hists__add_branch_entry(struct hists *self,
.filtered = symbol__parent_filter(sym_parent),
.branch_info = bi,
.hists = self,
+ .mem_info = NULL,
};

return add_hist_entry(self, &entry, al, period);
@@ -369,6 +427,8 @@ struct hist_entry *__hists__add_entry(struct hists *self,
.parent = sym_parent,
.filtered = symbol__parent_filter(sym_parent),
.hists = self,
+ .branch_info = NULL,
+ .mem_info = NULL,
};

return add_hist_entry(self, &entry, al, period);
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index b874609..a77d102 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -48,6 +48,13 @@ enum hist_column {
HISTC_DSO_FROM,
HISTC_DSO_TO,
HISTC_SRCLINE,
+ HISTC_MEM_DADDR_SYMBOL,
+ HISTC_MEM_DADDR_DSO,
+ HISTC_MEM_COST,
+ HISTC_MEM_LOCKED,
+ HISTC_MEM_TLB,
+ HISTC_MEM_LVL,
+ HISTC_MEM_SNOOP,
HISTC_NR_COLS, /* Last entry */
};

@@ -85,6 +92,12 @@ struct hist_entry *__hists__add_branch_entry(struct hists *self,
struct branch_info *bi,
u64 period);

+struct hist_entry *__hists__add_mem_entry(struct hists *self,
+ struct addr_location *al,
+ struct symbol *sym_parent,
+ struct mem_info *mi,
+ u64 period);
+
void hists__output_resort(struct hists *self);
void hists__output_resort_threaded(struct hists *hists);
void hists__collapse_resort(struct hists *self);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b6d7cad..58ca636 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -273,6 +273,41 @@ static void ip__resolve_ams(struct machine *self, struct thread *thread,
ams->map = al.map;
}

+static void ip__resolve_data(struct machine *self, struct thread *thread,
+ u8 m,
+ struct addr_map_symbol *ams,
+ u64 addr)
+{
+ struct addr_location al;
+
+ memset(&al, 0, sizeof(al));
+
+ thread__find_addr_location(thread, self, m, MAP__VARIABLE, addr, &al, NULL);
+ ams->addr = addr;
+ ams->al_addr = al.addr;
+ ams->sym = al.sym;
+ ams->map = al.map;
+}
+
+struct mem_info *machine__resolve_mem(struct machine *self,
+ struct thread *thr,
+ struct perf_sample *sample,
+ u8 cpumode)
+{
+ struct mem_info *mi;
+
+ mi = calloc(1, sizeof(struct mem_info));
+ if (!mi)
+ return NULL;
+
+ ip__resolve_ams(self, thr, &mi->iaddr, sample->ip);
+ ip__resolve_data(self, thr, cpumode, &mi->daddr, sample->addr);
+ mi->cost = sample->weight;
+ mi->dsrc.val = sample->dsrc;
+
+ return mi;
+}
+
struct branch_info *machine__resolve_bstack(struct machine *self,
struct thread *thr,
struct branch_stack *bs)
@@ -1013,6 +1048,8 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
if (sample_type & PERF_SAMPLE_ADDR)
printf(" ..... data: 0x%"PRIx64"\n", sample->addr);

+ if (sample_type & PERF_SAMPLE_DSRC)
+ printf(" . data_src: 0x%"PRIx64"\n", sample->dsrc);
}

static struct machine *
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index dd64261..37ab693 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -72,6 +72,10 @@ struct branch_info *machine__resolve_bstack(struct machine *self,
struct thread *thread,
struct branch_stack *bs);

+struct mem_info *machine__resolve_mem(struct machine *self,
+ struct thread *thread,
+ struct perf_sample *sample, u8 cpumode);
+
bool perf_session__has_traces(struct perf_session *self, const char *msg);

void mem_bswap_64(void *src, int byte_size);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index cfd1c0f..e2e466d 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -182,11 +182,19 @@ static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
}

ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
- if (sym)
- ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
- width - ret,
- sym->name);
- else {
+ if (sym) {
+ if (map->type == MAP__VARIABLE) {
+ ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
+ ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
+ ip - sym->start);
+ ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+ width - ret, "");
+ } else {
+ ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+ width - ret,
+ sym->name);
+ }
+ } else {
size_t len = BITS_PER_LONG / 4;
ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
len, ip);
@@ -469,6 +477,238 @@ static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
return repsep_snprintf(bf, size, "%-*s", width, out);
}

+/* --sort daddr_sym */
+static int64_t
+sort__daddr_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ struct addr_map_symbol *l = &left->mem_info->daddr;
+ struct addr_map_symbol *r = &right->mem_info->daddr;
+
+ return (int64_t)(r->addr - l->addr);
+}
+
+static int hist_entry__daddr_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ return _hist_entry__sym_snprintf(self->mem_info->daddr.map,
+ self->mem_info->daddr.sym,
+ self->mem_info->daddr.addr,
+ self->level, bf, size, width);
+}
+
+static int64_t
+sort__dso_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return _sort__dso_cmp(left->mem_info->daddr.map, right->mem_info->daddr.map);
+}
+
+static int hist_entry__dso_daddr_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ return _hist_entry__dso_snprintf(self->mem_info->daddr.map, bf, size, width);
+}
+
+static int64_t
+sort__cost_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ u64 cost_l = left->mem_info->cost;
+ u64 cost_r = right->mem_info->cost;
+
+ return (int64_t)(cost_r - cost_l);
+}
+
+static int hist_entry__cost_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ if (self->mem_info->cost == 0)
+ return repsep_snprintf(bf, size, "%*s", width, "N/A");
+ return repsep_snprintf(bf, size, "%*"PRIu64, width, self->mem_info->cost);
+}
+
+static int64_t
+sort__locked_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+ union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+ return (int64_t)(dsrc_r.mem_lock - dsrc_l.mem_lock);
+}
+
+static int hist_entry__locked_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *out = "??";
+ u64 mask = self->mem_info->dsrc.mem_lock;
+
+ if (mask & PERF_MEM_LOCK_NA)
+ out = "N/A";
+ else if (mask & PERF_MEM_LOCK_LOCKED)
+ out = "Yes";
+ else
+ out = "No";
+
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__tlb_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+ union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+ return (int64_t)(dsrc_r.mem_dtlb - dsrc_l.mem_dtlb);
+}
+
+static const char *tlb_access[] = {
+ "N/A",
+ "HIT",
+ "MISS",
+ "L1",
+ "L2",
+ "Walker",
+ "Fault",
+};
+#define NUM_TLB_ACCESS (sizeof(tlb_access)/sizeof(const char *))
+
+static int hist_entry__tlb_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ char out[64];
+ size_t sz = sizeof(out) - 1; /* -1 for null termination */
+ size_t l = 0, i;
+ u64 m = self->mem_info->dsrc.mem_dtlb;
+ u64 hit, miss;
+
+ out[0] = '\0';
+
+ hit = m & PERF_MEM_TLB_HIT;
+ miss = m & PERF_MEM_TLB_MISS;
+
+ /* already taken care of */
+ m &= ~(PERF_MEM_TLB_HIT|PERF_MEM_TLB_MISS);
+
+ for (i = 0; m && i < NUM_TLB_ACCESS; i++, m >>= 1) {
+ if (!(m & 0x1))
+ continue;
+ if (l) {
+ strcat(out, " or ");
+ l += 4;
+ }
+ strncat(out, tlb_access[i], sz - l);
+ l += strlen(tlb_access[i]);
+ }
+ if (hit)
+ strncat(out, " hit", sz - l);
+ if (miss)
+ strncat(out, " miss", sz - l);
+
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__lvl_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+ union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+ return (int64_t)(dsrc_r.mem_lvl - dsrc_l.mem_lvl);
+}
+
+static const char *mem_lvl[] = {
+ "N/A",
+ "HIT",
+ "MISS",
+ "L1",
+ "LFB",
+ "L2",
+ "L3",
+ "Local RAM",
+ "Remote RAM (1 hop)",
+ "Remote RAM (2 hops)",
+ "Remote Cache (1 hop)",
+ "Remote Cache (2 hops)",
+ "I/O",
+ "Uncached",
+};
+#define NUM_MEM_LVL (sizeof(mem_lvl)/sizeof(const char *))
+
+static int hist_entry__lvl_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ char out[64];
+ size_t sz = sizeof(out) - 1; /* -1 for null termination */
+ size_t i, l = 0;
+ u64 m = self->mem_info->dsrc.mem_lvl;
+ u64 hit, miss;
+
+ out[0] = '\0';
+
+ hit = m & PERF_MEM_LVL_HIT;
+ miss = m & PERF_MEM_LVL_MISS;
+
+ /* already taken care of */
+ m &= ~(PERF_MEM_LVL_HIT|PERF_MEM_LVL_MISS);
+
+ for (i = 0; m && i < NUM_MEM_LVL; i++, m >>= 1) {
+ if (!(m & 0x1))
+ continue;
+ if (l) {
+ strcat(out, " or ");
+ l += 4;
+ }
+ strncat(out, mem_lvl[i], sz - l);
+ l += strlen(mem_lvl[i]);
+ }
+ if (hit)
+ strncat(out, " hit", sz - l);
+ if (miss)
+ strncat(out, " miss", sz - l);
+
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__snoop_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+ union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+ return (int64_t)(dsrc_r.mem_snoop - dsrc_l.mem_snoop);
+}
+
+static const char *snoop_access[] = {
+ "N/A",
+ "None",
+ "Miss",
+ "Hit",
+ "HitM",
+};
+#define NUM_SNOOP_ACCESS (sizeof(snoop_access)/sizeof(const char *))
+
+static int hist_entry__snoop_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ char out[64];
+ size_t sz = sizeof(out) - 1; /* -1 for null termination */
+ size_t i, l = 0;
+ u64 m = self->mem_info->dsrc.mem_snoop;
+
+ out[0] = '\0';
+
+ for (i = 0; m && i < NUM_SNOOP_ACCESS; i++, m >>= 1) {
+ if (!(m & 0x1))
+ continue;
+ if (l) {
+ strcat(out, " or ");
+ l += 4;
+ }
+ strncat(out, snoop_access[i], sz - l);
+ l += strlen(snoop_access[i]);
+ }
+
+ return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
struct sort_entry sort_mispredict = {
.se_header = "Branch Mispredicted",
.se_cmp = sort__mispredict_cmp,
@@ -476,6 +716,56 @@ struct sort_entry sort_mispredict = {
.se_width_idx = HISTC_MISPREDICT,
};

+struct sort_entry sort_mem_daddr_sym = {
+ .se_header = "Data Symbol",
+ .se_cmp = sort__daddr_cmp,
+ .se_snprintf = hist_entry__daddr_snprintf,
+ .se_width_idx = HISTC_MEM_DADDR_SYMBOL,
+};
+
+struct sort_entry sort_mem_daddr_dso = {
+ .se_header = "Data Object",
+ .se_cmp = sort__dso_daddr_cmp,
+ .se_snprintf = hist_entry__dso_daddr_snprintf,
+ .se_width_idx = HISTC_MEM_DADDR_SYMBOL,
+};
+
+struct sort_entry sort_mem_cost = {
+ .se_header = "Cost",
+ .se_cmp = sort__cost_cmp,
+ .se_snprintf = hist_entry__cost_snprintf,
+ .se_width_idx = HISTC_MEM_COST,
+};
+
+struct sort_entry sort_mem_locked = {
+ .se_header = "Locked",
+ .se_cmp = sort__locked_cmp,
+ .se_snprintf = hist_entry__locked_snprintf,
+ .se_width_idx = HISTC_MEM_LOCKED,
+};
+
+struct sort_entry sort_mem_tlb = {
+ .se_header = "TLB access",
+ .se_cmp = sort__tlb_cmp,
+ .se_snprintf = hist_entry__tlb_snprintf,
+ .se_width_idx = HISTC_MEM_TLB,
+};
+
+struct sort_entry sort_mem_lvl = {
+ .se_header = "Memory access",
+ .se_cmp = sort__lvl_cmp,
+ .se_snprintf = hist_entry__lvl_snprintf,
+ .se_width_idx = HISTC_MEM_LVL,
+};
+
+struct sort_entry sort_mem_snoop = {
+ .se_header = "Snoop",
+ .se_cmp = sort__snoop_cmp,
+ .se_snprintf = hist_entry__snoop_snprintf,
+ .se_width_idx = HISTC_MEM_SNOOP,
+};
+
+
struct sort_dimension {
const char *name;
struct sort_entry *entry;
@@ -497,6 +787,13 @@ static struct sort_dimension sort_dimensions[] = {
DIM(SORT_CPU, "cpu", sort_cpu),
DIM(SORT_MISPREDICT, "mispredict", sort_mispredict),
DIM(SORT_SRCLINE, "srcline", sort_srcline),
+ DIM(SORT_MEM_DADDR_SYMBOL, "symbol_daddr", sort_mem_daddr_sym),
+ DIM(SORT_MEM_DADDR_DSO, "dso_daddr", sort_mem_daddr_dso),
+ DIM(SORT_MEM_COST, "cost", sort_mem_cost),
+ DIM(SORT_MEM_LOCKED, "locked", sort_mem_locked),
+ DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
+ DIM(SORT_MEM_LVL, "mem", sort_mem_lvl),
+ DIM(SORT_MEM_SNOOP, "snoop", sort_mem_snoop),
};

int sort_dimension__add(const char *tok)
@@ -520,7 +817,8 @@ int sort_dimension__add(const char *tok)
sort__has_parent = 1;
} else if (sd->entry == &sort_sym ||
sd->entry == &sort_sym_from ||
- sd->entry == &sort_sym_to) {
+ sd->entry == &sort_sym_to ||
+ sd->entry == &sort_mem_daddr_sym) {
sort__has_sym = 1;
}

@@ -553,6 +851,20 @@ int sort_dimension__add(const char *tok)
sort__first_dimension = SORT_DSO_TO;
else if (!strcmp(sd->name, "mispredict"))
sort__first_dimension = SORT_MISPREDICT;
+ else if (!strcmp(sd->name, "symbol_daddr"))
+ sort__first_dimension = SORT_MEM_DADDR_SYMBOL;
+ else if (!strcmp(sd->name, "dso_daddr"))
+ sort__first_dimension = SORT_MEM_DADDR_DSO;
+ else if (!strcmp(sd->name, "cost"))
+ sort__first_dimension = SORT_MEM_COST;
+ else if (!strcmp(sd->name, "locked"))
+ sort__first_dimension = SORT_MEM_LOCKED;
+ else if (!strcmp(sd->name, "tlb"))
+ sort__first_dimension = SORT_MEM_TLB;
+ else if (!strcmp(sd->name, "mem_lvl"))
+ sort__first_dimension = SORT_MEM_LVL;
+ else if (!strcmp(sd->name, "snoop"))
+ sort__first_dimension = SORT_MEM_SNOOP;
}

list_add_tail(&sd->entry->list, &hist_entry__sort_list);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 13761d8..175eb4c 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -102,7 +102,8 @@ struct hist_entry {
};
struct branch_info *branch_info;
struct hists *hists;
- struct callchain_root callchain[0];
+ struct mem_info *mem_info;
+ struct callchain_root callchain[0]; /* must be last member */
};

enum sort_type {
@@ -118,6 +119,13 @@ enum sort_type {
SORT_SYM_TO,
SORT_MISPREDICT,
SORT_SRCLINE,
+ SORT_MEM_DADDR_SYMBOL,
+ SORT_MEM_DADDR_DSO,
+ SORT_MEM_COST,
+ SORT_MEM_LOCKED,
+ SORT_MEM_TLB,
+ SORT_MEM_LVL,
+ SORT_MEM_SNOOP,
};

/*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 863b05b..ee719eb 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -152,6 +152,13 @@ struct branch_info {
struct branch_flags flags;
};

+struct mem_info {
+ struct addr_map_symbol iaddr;
+ struct addr_map_symbol daddr;
+ u64 cost;
+ union perf_mem_dsrc dsrc;
+};
+
struct addr_location {
struct thread *thread;
struct map *map;
--
1.7.9.5

2012-11-05 13:55:44

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 09/16] perf/x86: export PEBS load latency threshold register to sysfs

Make the PEBS Load Latency threshold register layout
and encoding visible to user level tools.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a279d01..cbfb252 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1787,6 +1787,8 @@ static void intel_pmu_flush_branch_stack(void)

PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

+PMU_FORMAT_ATTR(ldlat, "config1:0-15");
+
static struct attribute *intel_arch3_formats_attr[] = {
&format_attr_event.attr,
&format_attr_umask.attr,
@@ -1797,6 +1799,7 @@ static struct attribute *intel_arch3_formats_attr[] = {
&format_attr_cmask.attr,

&format_attr_offcore_rsp.attr, /* XXX do NHM/WSM + SNB breakout */
+ &format_attr_ldlat.attr, /* PEBS load latency */
NULL,
};

--
1.7.9.5

2012-11-05 13:57:42

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 07/16] perf: add generic memory sampling interface

This patch adds PERF_SAMPLE_COST and PERF_SAMPLE_DSRC.
The first collects a cost associated with the sampled
event. In case of memory access, the cost would be
the latency of the load, otherwise it defaults to
the sampling period.

PERF_SAMPLE_DSRC collects the data source, i.e., where
did the data associated with the sampled instruction
come from. Information is stored in a perf_mem_dsrc
structure. It contains opcode, mem level, tlb, snoop,
lock information, subject to availability in hardware.

Signed-off-by: Stephane Eranian <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 69 +++++++++++++++++++++++++++++++++++++--
kernel/events/core.c | 6 ++++
3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bb2429d..8fe4610 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -579,6 +579,7 @@ struct perf_sample_data {
u32 reserved;
} cpu_entry;
u64 period;
+ union perf_mem_dsrc dsrc;
struct perf_callchain_entry *callchain;
struct perf_raw_record *raw;
struct perf_branch_stack *br_stack;
@@ -599,6 +600,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->regs_user.regs = NULL;
data->stack_user_size = 0;
data->weight = 0;
+ data->dsrc.val = 0;
}

extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c52caab..b2a78bb 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -133,9 +133,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_REGS_USER = 1U << 12,
PERF_SAMPLE_STACK_USER = 1U << 13,
PERF_SAMPLE_WEIGHT = 1U << 14,
+ PERF_SAMPLE_DSRC = 1U << 15,

- PERF_SAMPLE_MAX = 1U << 15, /* non-ABI */
-
+ PERF_SAMPLE_MAX = 1U << 16, /* non-ABI */
};

/*
@@ -591,6 +591,8 @@ enum perf_event_type {
* { u64 size;
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
+ *
+ * { u64 dsrc; } && PERF_SAMPLE_DSRC
* };
*/
PERF_RECORD_SAMPLE = 9,
@@ -616,4 +618,67 @@ enum perf_callchain_context {
#define PERF_FLAG_FD_OUTPUT (1U << 1)
#define PERF_FLAG_PID_CGROUP (1U << 2) /* pid=cgroup id, per-cpu mode only */

+union perf_mem_dsrc {
+ __u64 val;
+ struct {
+ __u64 mem_op:5, /* type of opcode */
+ mem_lvl:14, /* memory hierarchy level */
+ mem_snoop:5, /* snoop mode */
+ mem_lock:2, /* lock instr */
+ mem_dtlb:7, /* tlb access */
+ mem_rsvd:31;
+ };
+};
+
+/* type of opcode (load/store/prefetch,code) */
+#define PERF_MEM_OP_NA 0x01 /* not available */
+#define PERF_MEM_OP_LOAD 0x02 /* load instruction */
+#define PERF_MEM_OP_STORE 0x04 /* store instruction */
+#define PERF_MEM_OP_PFETCH 0x08 /* prefetch */
+#define PERF_MEM_OP_EXEC 0x10 /* code (execution) */
+#define PERF_MEM_OP_SHIFT 0
+
+/* memory hierarchy (memory level, hit or miss) */
+#define PERF_MEM_LVL_NA 0x01 /* not available */
+#define PERF_MEM_LVL_HIT 0x02 /* hit level */
+#define PERF_MEM_LVL_MISS 0x04 /* miss level */
+#define PERF_MEM_LVL_L1 0x08 /* L1 */
+#define PERF_MEM_LVL_LFB 0x10 /* Line Fill Buffer */
+#define PERF_MEM_LVL_L2 0x20 /* L2 hit */
+#define PERF_MEM_LVL_L3 0x40 /* L3 hit */
+#define PERF_MEM_LVL_LOC_RAM 0x80 /* Local DRAM */
+#define PERF_MEM_LVL_REM_RAM1 0x100 /* Remote DRAM (1 hop) */
+#define PERF_MEM_LVL_REM_RAM2 0x200 /* Remote DRAM (2 hops) */
+#define PERF_MEM_LVL_REM_CCE1 0x400 /* Remote Cache (1 hop) */
+#define PERF_MEM_LVL_REM_CCE2 0x800 /* Remote Cache (2 hops) */
+#define PERF_MEM_LVL_IO 0x1000 /* I/O memory */
+#define PERF_MEM_LVL_UNC 0x2000 /* Uncached memory */
+#define PERF_MEM_LVL_SHIFT 5
+
+/* snoop mode */
+#define PERF_MEM_SNOOP_NA 0x01 /* not available */
+#define PERF_MEM_SNOOP_NONE 0x02 /* no snoop */
+#define PERF_MEM_SNOOP_HIT 0x04 /* snoop hit */
+#define PERF_MEM_SNOOP_MISS 0x08 /* snoop miss */
+#define PERF_MEM_SNOOP_HITM 0x10 /* snoop hit modified */
+#define PERF_MEM_SNOOP_SHIFT 19
+
+/* locked instruction */
+#define PERF_MEM_LOCK_NA 0x01 /* not available */
+#define PERF_MEM_LOCK_LOCKED 0x02 /* locked transaction */
+#define PERF_MEM_LOCK_SHIFT 24
+
+/* TLB access */
+#define PERF_MEM_TLB_NA 0x01 /* not available */
+#define PERF_MEM_TLB_HIT 0x02 /* hit level */
+#define PERF_MEM_TLB_MISS 0x04 /* miss level */
+#define PERF_MEM_TLB_L1 0x08 /* L1 */
+#define PERF_MEM_TLB_L2 0x10 /* L2 */
+#define PERF_MEM_TLB_WK 0x20 /* Hardware Walker*/
+#define PERF_MEM_TLB_OS 0x40 /* OS fault handler */
+#define PERF_MEM_TLB_SHIFT 26
+
+#define PERF_MEM_S(a, s) \
+ (((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d633581..94b2c70 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -958,6 +958,9 @@ static void perf_event__header_size(struct perf_event *event)
if (sample_type & PERF_SAMPLE_READ)
size += event->read_size;

+ if (sample_type & PERF_SAMPLE_DSRC)
+ size += sizeof(data->dsrc.val);
+
event->header_size = size;
}

@@ -4175,6 +4178,9 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_sample_ustack(handle,
data->stack_user_size,
data->regs_user.regs);
+
+ if (sample_type & PERF_SAMPLE_DSRC)
+ perf_output_put(handle, data->dsrc.val);
}

void perf_prepare_sample(struct perf_event_header *header,
--
1.7.9.5

2012-11-05 13:58:05

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 03/16] perf, core: Add a concept of a weightened sample

From: Andi Kleen <[email protected]>

For some events it's useful to weight sample with a hardware
provided number. This expresses how expensive the action the
sample represent was. This allows the profiler to scale
the samples to be more informative to the programmer.

There is already the period which is used similarly, but it means
something different, so I chose to not overload it. Instead
a new sample type for WEIGHT is added.

Can be used for multiple things. Initially it is used for TSX abort costs
and profiling by memory latencies (so to make expensive load appear higher
up in the histograms) The concept is quite generic and can be extended
to many other kinds of events or architectures, as long as the hardware
provides suitable auxillary values. In principle it could be also
used for software tracpoints.

This adds the generic glue. A new optional sample format for a 64bit
weight value.

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 8 ++++++--
kernel/events/core.c | 6 ++++++
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 484cfbc..bb2429d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -584,6 +584,7 @@ struct perf_sample_data {
struct perf_branch_stack *br_stack;
struct perf_regs_user regs_user;
u64 stack_user_size;
+ u64 weight;
};

static inline void perf_sample_data_init(struct perf_sample_data *data,
@@ -597,6 +598,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
data->regs_user.regs = NULL;
data->stack_user_size = 0;
+ data->weight = 0;
}

extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4f63c05..c52caab 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -132,8 +132,10 @@ enum perf_event_sample_format {
PERF_SAMPLE_BRANCH_STACK = 1U << 11,
PERF_SAMPLE_REGS_USER = 1U << 12,
PERF_SAMPLE_STACK_USER = 1U << 13,
+ PERF_SAMPLE_WEIGHT = 1U << 14,
+
+ PERF_SAMPLE_MAX = 1U << 15, /* non-ABI */

- PERF_SAMPLE_MAX = 1U << 14, /* non-ABI */
};

/*
@@ -198,8 +200,9 @@ enum perf_event_read_format {
PERF_FORMAT_TOTAL_TIME_RUNNING = 1U << 1,
PERF_FORMAT_ID = 1U << 2,
PERF_FORMAT_GROUP = 1U << 3,
+ PERF_FORMAT_WEIGHT = 1U << 4,

- PERF_FORMAT_MAX = 1U << 4, /* non-ABI */
+ PERF_FORMAT_MAX = 1U << 5, /* non-ABI */
};

#define PERF_ATTR_SIZE_VER0 64 /* sizeof first published struct */
@@ -559,6 +562,7 @@ enum perf_event_type {
* { u64 stream_id;} && PERF_SAMPLE_STREAM_ID
* { u32 cpu, res; } && PERF_SAMPLE_CPU
* { u64 period; } && PERF_SAMPLE_PERIOD
+ * { u64 weight; } && PERF_SAMPLE_WEIGHT
*
* { struct read_format values; } && PERF_SAMPLE_READ
*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dbccf83..d633581 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -952,6 +952,9 @@ static void perf_event__header_size(struct perf_event *event)
if (sample_type & PERF_SAMPLE_PERIOD)
size += sizeof(data->period);

+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ size += sizeof(data->weight);
+
if (sample_type & PERF_SAMPLE_READ)
size += event->read_size;

@@ -4080,6 +4083,9 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_PERIOD)
perf_output_put(handle, data->period);

+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ perf_output_put(handle, data->weight);
+
if (sample_type & PERF_SAMPLE_READ)
perf_output_read(handle, event);

--
1.7.9.5

2012-11-05 13:58:03

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 05/16] perf, tools: Add arbitary aliases and support names with -

From: Andi Kleen <[email protected]>

- Add missing scanner symbol for arbitrary aliases inside the config
region.
- looks nicer than _, so allow - in the event names. Used for various
of the arch perfmon and Haswell events.

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/util/parse-events.l | 2 ++
1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index c87efc1..66959fa 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -81,6 +81,7 @@ num_dec [0-9]+
num_hex 0x[a-fA-F0-9]+
num_raw_hex [a-fA-F0-9]+
name [a-zA-Z_*?][a-zA-Z0-9_*?]*
+name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?]*
modifier_event [ukhpGH]{1,8}
modifier_bp [rwx]{1,3}

@@ -168,6 +169,7 @@ period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD); }
branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE); }
, { return ','; }
"/" { BEGIN(INITIAL); return '/'; }
+{name_minus} { return str(yyscanner, PE_NAME); }
}

mem: { BEGIN(mem); return PE_PREFIX_MEM; }
--
1.7.9.5

2012-11-05 13:58:51

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 01/16] perf/x86: improve sysfs event mapping with event string

This patch extends Jiri's changes to make generic
events mapping visible via sysfs. The patch extends
the mechanism to non-generic events by allowing
the mappings to be hardcoded in strings.

This mechanism will be used by the PEBS-LL patch
later on.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 28 +++++++++++++---------------
arch/x86/kernel/cpu/perf_event.h | 26 ++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..aeca46d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,20 +1316,22 @@ static struct attribute_group x86_pmu_format_group = {
.attrs = NULL,
};

-struct perf_pmu_events_attr {
- struct device_attribute attr;
- u64 id;
-};
-
/*
* Remove all undefined events (x86_pmu.event_map(id) == 0)
* out of events_attr attributes.
*/
static void __init filter_events(struct attribute **attrs)
{
+ struct device_attribute *d;
+ struct perf_pmu_events_attr *pmu_attr;
int i, j;

for (i = 0; attrs[i]; i++) {
+ d = (struct device_attribute *)attrs[i];
+ pmu_attr = container_of(d, struct perf_pmu_events_attr, attr);
+ /* str trumps id */
+ if (pmu_attr->event_str)
+ continue;
if (x86_pmu.event_map(i))
continue;

@@ -1341,24 +1343,20 @@ static void __init filter_events(struct attribute **attrs)
}
}

-static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page)
{
struct perf_pmu_events_attr *pmu_attr = \
container_of(attr, struct perf_pmu_events_attr, attr);

u64 config = x86_pmu.event_map(pmu_attr->id);
- return x86_pmu.events_sysfs_show(page, config);
-}

-#define EVENT_VAR(_id) event_attr_##_id
-#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
+ /* string trumps id */
+ if (pmu_attr->event_str)
+ return sprintf(page, "%s", pmu_attr->event_str);

-#define EVENT_ATTR(_name, _id) \
-static struct perf_pmu_events_attr EVENT_VAR(_id) = { \
- .attr = __ATTR(_name, 0444, events_sysfs_show, NULL), \
- .id = PERF_COUNT_HW_##_id, \
-};
+ return x86_pmu.events_sysfs_show(page, config);
+}

EVENT_ATTR(cpu-cycles, CPU_CYCLES );
EVENT_ATTR(instructions, INSTRUCTIONS );
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 115c1ea..41479da 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -419,6 +419,29 @@ do { \
#define ERF_NO_HT_SHARING 1
#define ERF_HAS_RSP_1 2

+#define EVENT_VAR(_id) event_attr_##_id
+#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
+
+#define EVENT_ATTR(_name, _id) \
+static struct perf_pmu_events_attr EVENT_VAR(_id) = { \
+ .attr = __ATTR(_name, 0444, events_sysfs_show, NULL), \
+ .id = PERF_COUNT_HW_##_id, \
+ .event_str = NULL, \
+};
+
+#define EVENT_ATTR_STR(_name, v, str) \
+static struct perf_pmu_events_attr event_attr_##v = { \
+ .attr = __ATTR(_name, 0444, events_sysfs_show, NULL),\
+ .id = 0, \
+ .event_str = str, \
+};
+
+struct perf_pmu_events_attr {
+ struct device_attribute attr;
+ u64 id;
+ const char *event_str;
+};
+
extern struct x86_pmu x86_pmu __read_mostly;

DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
@@ -633,6 +656,9 @@ int p6_pmu_init(void);

int knc_pmu_init(void);

+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
+ char *page);
+
#else /* CONFIG_CPU_SUP_INTEL */

static inline void reserve_ds_buffers(void)
--
1.7.9.5

2012-11-05 20:08:06

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 03/16] perf, core: Add a concept of a weightened sample

Em Mon, Nov 05, 2012 at 05:01:47PM -0300, Arnaldo Carvalho de Melo escreveu:
> > +
> > if (sample_type & PERF_SAMPLE_READ)
> > size += event->read_size;
> >
> > @@ -4080,6 +4083,9 @@ void perf_output_sample(struct perf_output_handle *handle,
> > if (sample_type & PERF_SAMPLE_PERIOD)
> > perf_output_put(handle, data->period);
> >
> > + if (sample_type & PERF_SAMPLE_WEIGHT)
> > + perf_output_put(handle, data->weight);
> > +
>
> Yeap, it should go after PERF_SAMPLE_STACK_USER
>
> > if (sample_type & PERF_SAMPLE_READ)
> > perf_output_read(handle, event);
> >

I mean this, on top of the above patch:

diff -u b/kernel/events/core.c b/kernel/events/core.c
--- b/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4083,9 +4083,6 @@
if (sample_type & PERF_SAMPLE_PERIOD)
perf_output_put(handle, data->period);

- if (sample_type & PERF_SAMPLE_WEIGHT)
- perf_output_put(handle, data->weight);
-
if (sample_type & PERF_SAMPLE_READ)
perf_output_read(handle, event);

@@ -4175,6 +4172,9 @@
perf_output_sample_ustack(handle,
data->stack_user_size,
data->regs_user.regs);
+
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ perf_output_put(handle, data->weight);
}

void perf_prepare_sample(struct perf_event_header *header,

------------

To match what this patch series will do on a later patch, i.e.
parse the weight at the end of perf_evsel__parse_sample, after
it parses PERF_SAMPLE_READ, PERF_SAMPLE_CALLCHAIN, PERF_SAMPLE_RAW,
PERF_SAMPLE_BRANCH_STACK, PERF_SAMPLE_REGS_USER and
PERF_SAMPLE_STACK_USER:

+++ b/tools/perf/util/evsel.c
@@ -1020,6 +1020,11 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
}
}

+ if (type & PERF_SAMPLE_WEIGHT) {
+ data->weight= *array;
+ array++;
+ }
+
return 0;
}

- Arnaldo

2012-11-05 20:29:20

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 03/16] perf, core: Add a concept of a weightened sample

Em Mon, Nov 05, 2012 at 02:50:50PM +0100, Stephane Eranian escreveu:
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> #define PERF_ATTR_SIZE_VER0 64 /* sizeof first published struct */
> @@ -559,6 +562,7 @@ enum perf_event_type {
> * { u64 stream_id;} && PERF_SAMPLE_STREAM_ID
> * { u32 cpu, res; } && PERF_SAMPLE_CPU
> * { u64 period; } && PERF_SAMPLE_PERIOD
> + * { u64 weight; } && PERF_SAMPLE_WEIGHT
> *
> * { struct read_format values; } && PERF_SAMPLE_READ

This comes after PERF_SAMPLE_STACK_USER so that perf_evsel__parse_sample
can get it right, isn't it? Or is just the comment wrong?

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dbccf83..d633581 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -952,6 +952,9 @@ static void perf_event__header_size(struct perf_event *event)
> if (sample_type & PERF_SAMPLE_PERIOD)
> size += sizeof(data->period);
>
> + if (sample_type & PERF_SAMPLE_WEIGHT)
> + size += sizeof(data->weight);

Here it doesn't matter, we're just computing the sample size, but would
be nice to add this just after the (sample_type &
PERF_SAMPLE_STACK_USER) branch.

> +
> if (sample_type & PERF_SAMPLE_READ)
> size += event->read_size;
>
> @@ -4080,6 +4083,9 @@ void perf_output_sample(struct perf_output_handle *handle,
> if (sample_type & PERF_SAMPLE_PERIOD)
> perf_output_put(handle, data->period);
>
> + if (sample_type & PERF_SAMPLE_WEIGHT)
> + perf_output_put(handle, data->weight);
> +

Yeap, it should go after PERF_SAMPLE_STACK_USER

> if (sample_type & PERF_SAMPLE_READ)
> perf_output_read(handle, event);
>
> --
> 1.7.9.5

2012-11-05 22:51:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 03/16] perf, core: Add a concept of a weightened sample

> > +
> > if (sample_type & PERF_SAMPLE_READ)
> > size += event->read_size;
> >
> > @@ -4080,6 +4083,9 @@ void perf_output_sample(struct perf_output_handle *handle,
> > if (sample_type & PERF_SAMPLE_PERIOD)
> > perf_output_put(handle, data->period);
> >
> > + if (sample_type & PERF_SAMPLE_WEIGHT)
> > + perf_output_put(handle, data->weight);
> > +
>
> Yeap, it should go after PERF_SAMPLE_STACK_USER

There was a bug I recently noticed but haven't investigated so far.

When I do -b -W --transaction the flags/weights get corrupted.

May be related to that.

-Andi

--
[email protected] -- Speaking for myself only

2012-11-06 13:31:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

> +EVENT_ATTR(cpu-cycles, CPU_CYCLES );
> +EVENT_ATTR(instructions, INSTRUCTIONS );
> +EVENT_ATTR(cache-references, CACHE_REFERENCES );
> +EVENT_ATTR(cache-misses, CACHE_MISSES );
> +EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
> +EVENT_ATTR(branch-misses, BRANCH_MISSES );
> +EVENT_ATTR(bus-cycles, BUS_CYCLES );
> +EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
> +EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
> +EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );

The merge_events() approach from the Haswell patches should be far cleaner

-Andi

--
[email protected] -- Speaking for myself only

2012-11-06 14:29:05

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

On Tue, Nov 6, 2012 at 2:31 PM, Andi Kleen <[email protected]> wrote:
>> +EVENT_ATTR(cpu-cycles, CPU_CYCLES );
>> +EVENT_ATTR(instructions, INSTRUCTIONS );
>> +EVENT_ATTR(cache-references, CACHE_REFERENCES );
>> +EVENT_ATTR(cache-misses, CACHE_MISSES );
>> +EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
>> +EVENT_ATTR(branch-misses, BRANCH_MISSES );
>> +EVENT_ATTR(bus-cycles, BUS_CYCLES );
>> +EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
>> +EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
>> +EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );
>
> The merge_events() approach from the Haswell patches should be far cleaner
>
And which patch in your HSW series implements this?

> -Andi
>
> --
> [email protected] -- Speaking for myself only

2012-11-06 15:45:01

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

Em Mon, Nov 05, 2012 at 02:51:01PM +0100, Stephane Eranian escreveu:
> +
> + if (strcmp(mem_operation, MEM_OPERATION_LOAD))
> + sprintf(event, "cpu/mem-stores/pp");
> + else
> + sprintf(event, "cpu/mem-loads/pp");
> +

Fails for me on a Sandy Bridge notebook:

[root@sandy ~]# perf mem -t store record -a
Error:
'precise' request may not be supported. Try removing 'p' modifier

So if I manually fall back to a less precise mode:

[root@sandy ~]# perf record -g -a -e cpu/mem-stores/p
Error:
'precise' request may not be supported. Try removing 'p' modifier
[root@sandy ~]#

Still can't, manually fall back a one more level:

[root@sandy ~]# perf record -g -a -e cpu/mem-stores/
^C[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 7.419 MB perf.data (~324160 samples) ]

Yay, got some numbers.

smpboot: CPU0: Intel(R) Core(TM) i7-2920XM CPU @ 2.50GHz (fam: 06, model: 2a, stepping: 07)
Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, Intel PMU driver.
perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode
... version: 3
... bit width: 48
... generic registers: 4
... value mask: 0000ffffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 000000070000000f
Disabled fast string operations
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

Do you have any test program you use to test if the results make sense that you
could share?

I could then try to shape it into a new 'perf test' entry that would test
if requesting this event and then doing synthetic testing would produce
the experted numbers or in a acceptable range.

One suggestion for such new features, examples of use with the resulting output
looks great on changeset logs :-)

Ah, I recall seeing some, but that was on the patchset cover letter,
that doesn't get inserted in the git changelog history, will look at those
now.

- Arnaldo

2012-11-06 15:49:29

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

On Tue, Nov 6, 2012 at 4:44 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Mon, Nov 05, 2012 at 02:51:01PM +0100, Stephane Eranian escreveu:
>> +
>> + if (strcmp(mem_operation, MEM_OPERATION_LOAD))
>> + sprintf(event, "cpu/mem-stores/pp");
>> + else
>> + sprintf(event, "cpu/mem-loads/pp");
>> +
>
> Fails for me on a Sandy Bridge notebook:
>
That's because you need to update the microcode.

$ sudo modprobe microcode
$ cat /proc/cpuinfo

Line with microcode needs to say:
microcode : 0x28

update the microcode package or install directly from
intel download center.

> [root@sandy ~]# perf mem -t store record -a
> Error:
> 'precise' request may not be supported. Try removing 'p' modifier
>
> So if I manually fall back to a less precise mode:
>
> [root@sandy ~]# perf record -g -a -e cpu/mem-stores/p
> Error:
> 'precise' request may not be supported. Try removing 'p' modifier
> [root@sandy ~]#
>
> Still can't, manually fall back a one more level:
>
> [root@sandy ~]# perf record -g -a -e cpu/mem-stores/
> ^C[ perf record: Woken up 25 times to write data ]
> [ perf record: Captured and wrote 7.419 MB perf.data (~324160 samples) ]
>
> Yay, got some numbers.
>
> smpboot: CPU0: Intel(R) Core(TM) i7-2920XM CPU @ 2.50GHz (fam: 06, model: 2a, stepping: 07)
> Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, Intel PMU driver.
> perf_event_intel: PEBS disabled due to CPU errata, please upgrade microcode
> ... version: 3
> ... bit width: 48
> ... generic registers: 4
> ... value mask: 0000ffffffffffff
> ... max period: 000000007fffffff
> ... fixed-purpose events: 3
> ... event mask: 000000070000000f
> Disabled fast string operations
> NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
>
> Do you have any test program you use to test if the results make sense that you
> could share?
>
> I could then try to shape it into a new 'perf test' entry that would test
> if requesting this event and then doing synthetic testing would produce
> the experted numbers or in a acceptable range.
>
> One suggestion for such new features, examples of use with the resulting output
> looks great on changeset logs :-)
>
> Ah, I recall seeing some, but that was on the patchset cover letter,
> that doesn't get inserted in the git changelog history, will look at those
> now.
>
> - Arnaldo

2012-11-06 15:50:37

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

Em Tue, Nov 06, 2012 at 12:44:46PM -0300, Arnaldo Carvalho de Melo escreveu:
> [root@sandy ~]# perf record -g -a -e cpu/mem-stores/
> ^C[ perf record: Woken up 25 times to write data ]
> [ perf record: Captured and wrote 7.419 MB perf.data (~324160 samples) ]
>
> Yay, got some numbers.

But then the results out of:

$ perf mem -t load rep --stdio

Are bogus at least on the callchains:

# ========
# captured on: Tue Nov 6 12:46:21 2012
# hostname : sandy.ghostprotocols.net
# os release : 3.7.0-rc2+
# perf version : 3.7.rc4.gfaa41f
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-2920XM CPU @ 2.50GHz
# cpuid : GenuineIntel,6,42,7
# total memory : 16220228 kB
# cmdline : /home/acme/bin/perf record -g -a -e cpu/mem-stores/
# event : name = cpu/mem-stores/, type = 4, config = 0x2cd, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_ho
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, uncore_cbox_2 = 8, uncore_cbox_3
# ========
#
# Samples: 98 of event 'cpu/mem-stores/'
# Total cost : 98
# Sort order : cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked
#
# Overhead Samples Cost Memory access Symbol Share
# ........ ........... ....... ........................ .............................................. ..................
#
19.39% 19 N/A [k] csd_unlock [kernel.kallsyms]
|
--- csd_unlock
|
|--6242.11%-- generic_smp_call_function_single_interrupt
| smp_call_function_single_interrupt
| call_function_single_interrupt
| cpuidle_enter
| cpuidle_enter_state
| cpuidle_idle_call
| cpu_idle
| |
| |--85.08%-- start_secondary
| |
| --14.92%-- rest_init
| start_kernel
| x86_64_start_reservations
| x86_64_start_kernel
|
|--100.00%-- smp_call_function_single_interrupt
| call_function_single_interrupt
| cpuidle_enter
| cpuidle_enter_state
| cpuidle_idle_call
| cpu_idle
| start_secondary
--97088126703734472704.00%-- [...]

5.10% 5 N/A [k] _raw_spin_lock_irqsave [kernel.kallsyms]
|



Ideas?

- Arnaldo

2012-11-06 15:58:04

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

On Tue, Nov 6, 2012 at 4:50 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Tue, Nov 06, 2012 at 12:44:46PM -0300, Arnaldo Carvalho de Melo escreveu:
>> [root@sandy ~]# perf record -g -a -e cpu/mem-stores/
>> ^C[ perf record: Woken up 25 times to write data ]
>> [ perf record: Captured and wrote 7.419 MB perf.data (~324160 samples) ]
>>
>> Yay, got some numbers.
>
> But then the results out of:
>
> $ perf mem -t load rep --stdio
>
> Are bogus at least on the callchains:
>
I think it's because we are not using the period in the hist_entry but
the cost and there
is no cost for stores.

I think you should start with loads and no callchain. I am interested
in your toughts on how to
fix the data symbol resolution problem in perf. In V2, I modified the
kernel to at least
help perf distinguish MAP__FUNCTION from MAP__VARIABLE. But there is
still something
wrong.


> # ========
> # captured on: Tue Nov 6 12:46:21 2012
> # hostname : sandy.ghostprotocols.net
> # os release : 3.7.0-rc2+
> # perf version : 3.7.rc4.gfaa41f
> # arch : x86_64
> # nrcpus online : 8
> # nrcpus avail : 8
> # cpudesc : Intel(R) Core(TM) i7-2920XM CPU @ 2.50GHz
> # cpuid : GenuineIntel,6,42,7
> # total memory : 16220228 kB
> # cmdline : /home/acme/bin/perf record -g -a -e cpu/mem-stores/
> # event : name = cpu/mem-stores/, type = 4, config = 0x2cd, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_ho
> # HEADER_CPU_TOPOLOGY info available, use -I to display
> # HEADER_NUMA_TOPOLOGY info available, use -I to display
> # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, uncore_cbox_2 = 8, uncore_cbox_3
> # ========
> #
> # Samples: 98 of event 'cpu/mem-stores/'
> # Total cost : 98
> # Sort order : cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked
> #
> # Overhead Samples Cost Memory access Symbol Share
> # ........ ........... ....... ........................ .............................................. ..................
> #
> 19.39% 19 N/A [k] csd_unlock [kernel.kallsyms]
> |
> --- csd_unlock
> |
> |--6242.11%-- generic_smp_call_function_single_interrupt
> | smp_call_function_single_interrupt
> | call_function_single_interrupt
> | cpuidle_enter
> | cpuidle_enter_state
> | cpuidle_idle_call
> | cpu_idle
> | |
> | |--85.08%-- start_secondary
> | |
> | --14.92%-- rest_init
> | start_kernel
> | x86_64_start_reservations
> | x86_64_start_kernel
> |
> |--100.00%-- smp_call_function_single_interrupt
> | call_function_single_interrupt
> | cpuidle_enter
> | cpuidle_enter_state
> | cpuidle_idle_call
> | cpu_idle
> | start_secondary
> --97088126703734472704.00%-- [...]
>
> 5.10% 5 N/A [k] _raw_spin_lock_irqsave [kernel.kallsyms]
> |
>
>
>
> Ideas?
>
> - Arnaldo

2012-11-06 16:51:43

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

Em Tue, Nov 06, 2012 at 04:49:26PM +0100, Stephane Eranian escreveu:
> On Tue, Nov 6, 2012 at 4:44 PM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Em Mon, Nov 05, 2012 at 02:51:01PM +0100, Stephane Eranian escreveu:
> >> +
> >> + if (strcmp(mem_operation, MEM_OPERATION_LOAD))
> >> + sprintf(event, "cpu/mem-stores/pp");
> >> + else
> >> + sprintf(event, "cpu/mem-loads/pp");

> > Fails for me on a Sandy Bridge notebook:

> That's because you need to update the microcode.

> $ sudo modprobe microcode
> $ cat /proc/cpuinfo
>
> Line with microcode needs to say:
> microcode : 0x28

Here it says:

microcode : 0x25

> update the microcode package or install directly from
> intel download center.

Thanks for the info, will do. And they will add "microcode hints for
selected processors should give that info to the user" to my TODO list
:-)

- Arnaldo

2012-11-06 17:05:21

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

Em Tue, Nov 06, 2012 at 01:51:19PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Nov 06, 2012 at 04:49:26PM +0100, Stephane Eranian escreveu:
> > On Tue, Nov 6, 2012 at 4:44 PM, Arnaldo Carvalho de Melo
> > <[email protected]> wrote:
> > > Em Mon, Nov 05, 2012 at 02:51:01PM +0100, Stephane Eranian escreveu:
> > >> + sprintf(event, "cpu/mem-loads/pp");

> > > Fails for me on a Sandy Bridge notebook:

> > That's because you need to update the microcode.
>
> > $ sudo modprobe microcode
> > $ cat /proc/cpuinfo

> > Line with microcode needs to say:
> > microcode : 0x28

> Here it says:

> microcode : 0x25

Updated it and now I have 0x28, better:

Samples: 8K of event 'cpu/mem-loads/pp', Event count (approx.): 89255
0.75% 1 665 L3 hit [.] arena_run_tree_remove
0.67% 1 602 LFB hit [.] nsSMILAnimationController::DoSample(bool)
0.63% 1 563 Local RAM hit [.] 0x00000000004620c4
0.58% 1 515 Local RAM hit [.] 0x00007f6b278aa7bc
0.54% 1 479 L2 hit [.] XPending
0.33% 1 292 Local RAM hit [.] nsSVGFEMorphologyElement::Filter(nsSVGFilterInstance*, nsTArray<

Will do more tests after lunch,

- Arnaldo

2012-11-06 17:07:49

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] perf tools: add new mem command for memory access profiling

Em Tue, Nov 06, 2012 at 04:57:59PM +0100, Stephane Eranian escreveu:
> On Tue, Nov 6, 2012 at 4:50 PM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Em Tue, Nov 06, 2012 at 12:44:46PM -0300, Arnaldo Carvalho de Melo escreveu:
> >> [root@sandy ~]# perf record -g -a -e cpu/mem-stores/
> >> ^C[ perf record: Woken up 25 times to write data ]
> >> [ perf record: Captured and wrote 7.419 MB perf.data (~324160 samples) ]
> >>
> >> Yay, got some numbers.
> >
> > But then the results out of:
> >
> > $ perf mem -t load rep --stdio
> >
> > Are bogus at least on the callchains:
> >
> I think it's because we are not using the period in the hist_entry but
> the cost and there
> is no cost for stores.
>
> I think you should start with loads and no callchain. I am interested
> in your toughts on how to
> fix the data symbol resolution problem in perf. In V2, I modified the
> kernel to at least
> help perf distinguish MAP__FUNCTION from MAP__VARIABLE. But there is
> still something
> wrong.

Yeah, I saw that, now that I'm playing with this patchset I'll work on
that, after lunch.

>
> > # ========
> > # captured on: Tue Nov 6 12:46:21 2012
> > # hostname : sandy.ghostprotocols.net
> > # os release : 3.7.0-rc2+
> > # perf version : 3.7.rc4.gfaa41f
> > # arch : x86_64
> > # nrcpus online : 8
> > # nrcpus avail : 8
> > # cpudesc : Intel(R) Core(TM) i7-2920XM CPU @ 2.50GHz
> > # cpuid : GenuineIntel,6,42,7
> > # total memory : 16220228 kB
> > # cmdline : /home/acme/bin/perf record -g -a -e cpu/mem-stores/
> > # event : name = cpu/mem-stores/, type = 4, config = 0x2cd, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_ho
> > # HEADER_CPU_TOPOLOGY info available, use -I to display
> > # HEADER_NUMA_TOPOLOGY info available, use -I to display
> > # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, uncore_cbox_2 = 8, uncore_cbox_3
> > # ========
> > #
> > # Samples: 98 of event 'cpu/mem-stores/'
> > # Total cost : 98
> > # Sort order : cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked
> > #
> > # Overhead Samples Cost Memory access Symbol Share
> > # ........ ........... ....... ........................ .............................................. ..................
> > #
> > 19.39% 19 N/A [k] csd_unlock [kernel.kallsyms]
> > |
> > --- csd_unlock
> > |
> > |--6242.11%-- generic_smp_call_function_single_interrupt
> > | smp_call_function_single_interrupt
> > | call_function_single_interrupt
> > | cpuidle_enter
> > | cpuidle_enter_state
> > | cpuidle_idle_call
> > | cpu_idle
> > | |
> > | |--85.08%-- start_secondary
> > | |
> > | --14.92%-- rest_init
> > | start_kernel
> > | x86_64_start_reservations
> > | x86_64_start_kernel
> > |
> > |--100.00%-- smp_call_function_single_interrupt
> > | call_function_single_interrupt
> > | cpuidle_enter
> > | cpuidle_enter_state
> > | cpuidle_idle_call
> > | cpu_idle
> > | start_secondary
> > --97088126703734472704.00%-- [...]
> >
> > 5.10% 5 N/A [k] _raw_spin_lock_irqsave [kernel.kallsyms]
> > |
> >
> >
> >
> > Ideas?
> >
> > - Arnaldo

2012-11-06 18:51:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

On Tue, Nov 06, 2012 at 03:29:01PM +0100, Stephane Eranian wrote:
> On Tue, Nov 6, 2012 at 2:31 PM, Andi Kleen <[email protected]> wrote:
> >> +EVENT_ATTR(cpu-cycles, CPU_CYCLES );
> >> +EVENT_ATTR(instructions, INSTRUCTIONS );
> >> +EVENT_ATTR(cache-references, CACHE_REFERENCES );
> >> +EVENT_ATTR(cache-misses, CACHE_MISSES );
> >> +EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
> >> +EVENT_ATTR(branch-misses, BRANCH_MISSES );
> >> +EVENT_ATTR(bus-cycles, BUS_CYCLES );
> >> +EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
> >> +EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
> >> +EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );
> >
> > The merge_events() approach from the Haswell patches should be far cleaner
> >
> And which patch in your HSW series implements this?

[PATCH 27/32] perf, x86: Support CPU specific sysfs events

-Andi

--
[email protected] -- Speaking for myself only

2012-11-06 19:37:52

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

On Tue, Nov 6, 2012 at 7:50 PM, Andi Kleen <[email protected]> wrote:
> On Tue, Nov 06, 2012 at 03:29:01PM +0100, Stephane Eranian wrote:
>> On Tue, Nov 6, 2012 at 2:31 PM, Andi Kleen <[email protected]> wrote:
>> >> +EVENT_ATTR(cpu-cycles, CPU_CYCLES );
>> >> +EVENT_ATTR(instructions, INSTRUCTIONS );
>> >> +EVENT_ATTR(cache-references, CACHE_REFERENCES );
>> >> +EVENT_ATTR(cache-misses, CACHE_MISSES );
>> >> +EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
>> >> +EVENT_ATTR(branch-misses, BRANCH_MISSES );
>> >> +EVENT_ATTR(bus-cycles, BUS_CYCLES );
>> >> +EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
>> >> +EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
>> >> +EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );
>> >
>> > The merge_events() approach from the Haswell patches should be far cleaner
>> >
>> And which patch in your HSW series implements this?
>
> [PATCH 27/32] perf, x86: Support CPU specific sysfs events
>
Will try using that one.
Thanks.

2012-11-06 20:52:30

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] perf: add memory access sampling support

Em Mon, Nov 05, 2012 at 02:50:47PM +0100, Stephane Eranian escreveu:
> Or if one is interested in the data view:
> $ perf mem -t load rep --sort=symbol_daddr,cost
> # Samples: 19K of event 'cpu/mem-loads/pp'
> # Total cost : 1013994
> # Sort order : symbol_daddr,cost
> #
> # Overhead Samples Data Symbol Cost
> # ........ ........... ...................... .......
> #
> 0.10% 1 [.] 0x00007f67dffe8038 986
> 0.09% 1 [.] 0x00007f67df91a750 890
> 0.08% 1 [.] 0x00007f67e288fba8 826
>

> CAVEAT: Note that the data addresses are not resolved correctly currently due to a
> problem in perf data symbol resolution code which I have not been able to
> uncover so far.

Stephane,

Those data addresses mostly are on the stack, we need reverse
resolution using DWARF location expressions to figure out what is the
name of a variable that is on a particular address, etc.

Masami, have you played with this already? I mean:

[root@sandy acme]# perf mem -t load rep --stdio --sort=symbol,symbol_daddr,cost
# Samples: 30 of event 'cpu/mem-loads/pp'
# Total cost : 640
# Sort order : symbol,symbol_daddr,cost
#
# Overhead Samples Symbol Data Symbol Cost
# ........ ........... ...................... ...................... .......
#
55.00% 1 [k] lookup_fast [k] 0xffff8803b7521bd4 352
5.47% 1 [k] cache_alloc_refill [k] 0xffff880407705024 35
3.44% 1 [k] cache_alloc_refill [k] 0xffff88041d8527d8 22
3.28% 1 [k] run_timer_softirq [k] 0xffff88041e2c3e90 21
2.50% 1 [k] __list_add [k] 0xffff8803b7521d68 16
2.19% 1 [.] __strcoll_l [.] 0x00007fffa8d44080 14
1.88% 1 [.] __strcoll_l [.] 0x00007fffa8d44104 12

If we go to the annotation browser to see where is that lookup_fast hitting we get:

100.00 │ mov -0x34(%rbp),%eax

How to map 0xffff8803b7521bd4 to a stack variable, struct members and all?

Humm, for userspace we have PERF_SAMPLE_REGS_USER for the dwarf unwinder we
need for userspace, but what about reverse mapping of kernel variables? Jiri?

- Arnaldo

2012-11-07 07:38:41

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] perf: add memory access sampling support

Hi Arnaldo,

On Tue, 6 Nov 2012 17:52:21 -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Nov 05, 2012 at 02:50:47PM +0100, Stephane Eranian escreveu:
> [root@sandy acme]# perf mem -t load rep --stdio --sort=symbol,symbol_daddr,cost
> # Samples: 30 of event 'cpu/mem-loads/pp'
> # Total cost : 640
> # Sort order : symbol,symbol_daddr,cost
> #
> # Overhead Samples Symbol Data Symbol Cost
> # ........ ........... ...................... ...................... .......
> #
> 55.00% 1 [k] lookup_fast [k] 0xffff8803b7521bd4 352
> 5.47% 1 [k] cache_alloc_refill [k] 0xffff880407705024 35
> 3.44% 1 [k] cache_alloc_refill [k] 0xffff88041d8527d8 22
> 3.28% 1 [k] run_timer_softirq [k] 0xffff88041e2c3e90 21
> 2.50% 1 [k] __list_add [k] 0xffff8803b7521d68 16
> 2.19% 1 [.] __strcoll_l [.] 0x00007fffa8d44080 14
> 1.88% 1 [.] __strcoll_l [.] 0x00007fffa8d44104 12
>
> If we go to the annotation browser to see where is that lookup_fast hitting we get:
>
> 100.00 │ mov -0x34(%rbp),%eax
>
> How to map 0xffff8803b7521bd4 to a stack variable, struct members and all?
>
> Humm, for userspace we have PERF_SAMPLE_REGS_USER for the dwarf unwinder we
> need for userspace, but what about reverse mapping of kernel variables? Jiri?

I suspect there aren't much thing we can do on stack. One thing we can
do is adding dso_daddr to sort key and seeing it's a [stack] (or
[stack:tid]) or not. But not sure it can be done for kernel stacks too.

Thanks,
Namhyung

2012-11-07 10:02:16

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] perf: add memory access sampling support

On Wed, Nov 7, 2012 at 8:38 AM, Namhyung Kim <[email protected]> wrote:
> Hi Arnaldo,
>
> On Tue, 6 Nov 2012 17:52:21 -0300, Arnaldo Carvalho de Melo wrote:
>> Em Mon, Nov 05, 2012 at 02:50:47PM +0100, Stephane Eranian escreveu:
>> [root@sandy acme]# perf mem -t load rep --stdio --sort=symbol,symbol_daddr,cost
>> # Samples: 30 of event 'cpu/mem-loads/pp'
>> # Total cost : 640
>> # Sort order : symbol,symbol_daddr,cost
>> #
>> # Overhead Samples Symbol Data Symbol Cost
>> # ........ ........... ...................... ...................... .......
>> #
>> 55.00% 1 [k] lookup_fast [k] 0xffff8803b7521bd4 352
>> 5.47% 1 [k] cache_alloc_refill [k] 0xffff880407705024 35
>> 3.44% 1 [k] cache_alloc_refill [k] 0xffff88041d8527d8 22
>> 3.28% 1 [k] run_timer_softirq [k] 0xffff88041e2c3e90 21
>> 2.50% 1 [k] __list_add [k] 0xffff8803b7521d68 16
>> 2.19% 1 [.] __strcoll_l [.] 0x00007fffa8d44080 14
>> 1.88% 1 [.] __strcoll_l [.] 0x00007fffa8d44104 12
>>
>> If we go to the annotation browser to see where is that lookup_fast hitting we get:
>>
>> 100.00 │ mov -0x34(%rbp),%eax
>>
>> How to map 0xffff8803b7521bd4 to a stack variable, struct members and all?
>>

Yes, you need dwarf to achieve this. But I think we should first fix the problem
with global variables at the user and kernel levels and there all you
need is the
data symbol table (which we load because of the -d option) and correct
base+offset
calculations. For now, I am getting perf-xxxx.map for global variables.

For instance, I use the following simple test program with:
$ perf mem rec mcol 10000

I expect to see symbol aa+offset in the data symbol column and mcol in the
data object column. But instance, I get raw hex and perf-xxxx.map.

I think that's what we need to fix first.

/* mcol.c */
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

typedef struct {
long real;
long img;
} elem_t;

#define NB 100000ULL
static elem_t aa[NB * 2];

uint64_t
mat_column(elem_t *m, size_t nr, size_t nc, elem_t *n)
{
uint64_t real = 0 , img = 0;
elem_t *p;
size_t i, j;

/* scan by columns */
for(i=0; i < nc; i++) {
for(j=0; j < nr; j++) {
real += (m+i+nc*j)->real;
img += (m+i+nc*j)->img;
}
}
return (real+img) > 0 ? 0 : 1;
}

int main(int argc, char **argv)
{
size_t rows, cols;
unsigned long long nloop, nl;
elem_t number;

number.real = 1;
number.img = 5;

nloop = nl = argc > 1 ? strtoul(argv[1], NULL, 0) : 10000;
rows = NB;
cols = 2;
printf("mat[%p:%p] size=%zuMiB\n", aa, &aa[-1+NB*2], sizeof(aa) >> 20);
while(nl--) {
number.img++;
number.real += mat_column(aa, rows, cols, &number);
}
return 0;
}

2012-11-07 14:39:50

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 08/16] perf/x86: add memory profiling via PEBS Load Latency

On Tue, Nov 6, 2012 at 8:37 PM, Stephane Eranian <[email protected]> wrote:
> On Tue, Nov 6, 2012 at 7:50 PM, Andi Kleen <[email protected]> wrote:
>> On Tue, Nov 06, 2012 at 03:29:01PM +0100, Stephane Eranian wrote:
>>> On Tue, Nov 6, 2012 at 2:31 PM, Andi Kleen <[email protected]> wrote:
>>> >> +EVENT_ATTR(cpu-cycles, CPU_CYCLES );
>>> >> +EVENT_ATTR(instructions, INSTRUCTIONS );
>>> >> +EVENT_ATTR(cache-references, CACHE_REFERENCES );
>>> >> +EVENT_ATTR(cache-misses, CACHE_MISSES );
>>> >> +EVENT_ATTR(branch-instructions, BRANCH_INSTRUCTIONS );
>>> >> +EVENT_ATTR(branch-misses, BRANCH_MISSES );
>>> >> +EVENT_ATTR(bus-cycles, BUS_CYCLES );
>>> >> +EVENT_ATTR(stalled-cycles-frontend, STALLED_CYCLES_FRONTEND );
>>> >> +EVENT_ATTR(stalled-cycles-backend, STALLED_CYCLES_BACKEND );
>>> >> +EVENT_ATTR(ref-cycles, REF_CPU_CYCLES );
>>> >
>>> > The merge_events() approach from the Haswell patches should be far cleaner
>>> >

>>> And which patch in your HSW series implements this?
>>
>> [PATCH 27/32] perf, x86: Support CPU specific sysfs events
>>
Ok, integrated that patch in my series and changed my code to use it.
It works.

Subject: Re: Re: [PATCH v2 00/16] perf: add memory access sampling support

(2012/11/07 5:52), Arnaldo Carvalho de Melo wrote:
> Em Mon, Nov 05, 2012 at 02:50:47PM +0100, Stephane Eranian escreveu:
>> Or if one is interested in the data view:
>> $ perf mem -t load rep --sort=symbol_daddr,cost
>> # Samples: 19K of event 'cpu/mem-loads/pp'
>> # Total cost : 1013994
>> # Sort order : symbol_daddr,cost
>> #
>> # Overhead Samples Data Symbol Cost
>> # ........ ........... ...................... .......
>> #
>> 0.10% 1 [.] 0x00007f67dffe8038 986
>> 0.09% 1 [.] 0x00007f67df91a750 890
>> 0.08% 1 [.] 0x00007f67e288fba8 826
>>
>
>> CAVEAT: Note that the data addresses are not resolved correctly currently due to a
>> problem in perf data symbol resolution code which I have not been able to
>> uncover so far.
>
> Stephane,
>
> Those data addresses mostly are on the stack, we need reverse
> resolution using DWARF location expressions to figure out what is the
> name of a variable that is on a particular address, etc.
>
> Masami, have you played with this already? I mean:

No, but it looks interesting. I'll try :)

>
> [root@sandy acme]# perf mem -t load rep --stdio --sort=symbol,symbol_daddr,cost
> # Samples: 30 of event 'cpu/mem-loads/pp'
> # Total cost : 640
> # Sort order : symbol,symbol_daddr,cost
> #
> # Overhead Samples Symbol Data Symbol Cost
> # ........ ........... ...................... ...................... .......
> #
> 55.00% 1 [k] lookup_fast [k] 0xffff8803b7521bd4 352
> 5.47% 1 [k] cache_alloc_refill [k] 0xffff880407705024 35
> 3.44% 1 [k] cache_alloc_refill [k] 0xffff88041d8527d8 22
> 3.28% 1 [k] run_timer_softirq [k] 0xffff88041e2c3e90 21
> 2.50% 1 [k] __list_add [k] 0xffff8803b7521d68 16
> 2.19% 1 [.] __strcoll_l [.] 0x00007fffa8d44080 14
> 1.88% 1 [.] __strcoll_l [.] 0x00007fffa8d44104 12
>
> If we go to the annotation browser to see where is that lookup_fast hitting we get:
>
> 100.00 │ mov -0x34(%rbp),%eax
>
> How to map 0xffff8803b7521bd4 to a stack variable, struct members and all?

If perf stores %rbp value, we can do forward searching from local variables
in this scope block. In some cases, the memory is dereferenced by another
pointer. In that case, it is hard to do that (perhaps, we need to disassemble
it and reverse execution). But in most case of the memory address on a stack,
it will work, I think.

Thank you,

>
> Humm, for userspace we have PERF_SAMPLE_REGS_USER for the dwarf unwinder we
> need for userspace, but what about reverse mapping of kernel variables? Jiri?
>
> - Arnaldo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2012-11-07 14:56:20

by Stephane Eranian

[permalink] [raw]
Subject: Re: Re: [PATCH v2 00/16] perf: add memory access sampling support

On Wed, Nov 7, 2012 at 3:53 PM, Masami Hiramatsu
<[email protected]> wrote:
> (2012/11/07 5:52), Arnaldo Carvalho de Melo wrote:
>> Em Mon, Nov 05, 2012 at 02:50:47PM +0100, Stephane Eranian escreveu:
>>> Or if one is interested in the data view:
>>> $ perf mem -t load rep --sort=symbol_daddr,cost
>>> # Samples: 19K of event 'cpu/mem-loads/pp'
>>> # Total cost : 1013994
>>> # Sort order : symbol_daddr,cost
>>> #
>>> # Overhead Samples Data Symbol Cost
>>> # ........ ........... ...................... .......
>>> #
>>> 0.10% 1 [.] 0x00007f67dffe8038 986
>>> 0.09% 1 [.] 0x00007f67df91a750 890
>>> 0.08% 1 [.] 0x00007f67e288fba8 826
>>>
>>
>>> CAVEAT: Note that the data addresses are not resolved correctly currently due to a
>>> problem in perf data symbol resolution code which I have not been able to
>>> uncover so far.
>>
>> Stephane,
>>
>> Those data addresses mostly are on the stack, we need reverse
>> resolution using DWARF location expressions to figure out what is the
>> name of a variable that is on a particular address, etc.
>>
>> Masami, have you played with this already? I mean:
>
> No, but it looks interesting. I'll try :)
>
>>
>> [root@sandy acme]# perf mem -t load rep --stdio --sort=symbol,symbol_daddr,cost
>> # Samples: 30 of event 'cpu/mem-loads/pp'
>> # Total cost : 640
>> # Sort order : symbol,symbol_daddr,cost
>> #
>> # Overhead Samples Symbol Data Symbol Cost
>> # ........ ........... ...................... ...................... .......
>> #
>> 55.00% 1 [k] lookup_fast [k] 0xffff8803b7521bd4 352
>> 5.47% 1 [k] cache_alloc_refill [k] 0xffff880407705024 35
>> 3.44% 1 [k] cache_alloc_refill [k] 0xffff88041d8527d8 22
>> 3.28% 1 [k] run_timer_softirq [k] 0xffff88041e2c3e90 21
>> 2.50% 1 [k] __list_add [k] 0xffff8803b7521d68 16
>> 2.19% 1 [.] __strcoll_l [.] 0x00007fffa8d44080 14
>> 1.88% 1 [.] __strcoll_l [.] 0x00007fffa8d44104 12
>>
>> If we go to the annotation browser to see where is that lookup_fast hitting we get:
>>
>> 100.00 │ mov -0x34(%rbp),%eax
>>
>> How to map 0xffff8803b7521bd4 to a stack variable, struct members and all?
>
> If perf stores %rbp value, we can do forward searching from local variables
> in this scope block. In some cases, the memory is dereferenced by another
> pointer. In that case, it is hard to do that (perhaps, we need to disassemble
> it and reverse execution). But in most case of the memory address on a stack,
> it will work, I think.
>
PEBS-LL does store rbp but it is not yet exposed. That will be in a follow-up
patch.

> Thank you,
>
>>
>> Humm, for userspace we have PERF_SAMPLE_REGS_USER for the dwarf unwinder we
>> need for userspace, but what about reverse mapping of kernel variables? Jiri?
>>
>> - Arnaldo
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
>
> --
> Masami HIRAMATSU
> IT Management Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: [email protected]
>
>

2012-11-14 07:34:49

by Andi Kleen

[permalink] [raw]
Subject: [tip:perf/core] perf tools: Add arbitary aliases and support names with -

Commit-ID: eef9ba98b9ee96f52e544b09f581c397d8cc8265
Gitweb: http://git.kernel.org/tip/eef9ba98b9ee96f52e544b09f581c397d8cc8265
Author: Andi Kleen <[email protected]>
AuthorDate: Mon, 5 Nov 2012 14:50:52 +0100
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Thu, 8 Nov 2012 16:12:03 -0300

perf tools: Add arbitary aliases and support names with -

- Add missing scanner symbol for arbitrary aliases inside the config
region.

- looks nicer than _, so allow - in the event names. Used for various of
the arch perfmon and Haswell events.

Signed-off-by: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/parse-events.l | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index c87efc1..66959fa 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -81,6 +81,7 @@ num_dec [0-9]+
num_hex 0x[a-fA-F0-9]+
num_raw_hex [a-fA-F0-9]+
name [a-zA-Z_*?][a-zA-Z0-9_*?]*
+name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?]*
modifier_event [ukhpGH]{1,8}
modifier_bp [rwx]{1,3}

@@ -168,6 +169,7 @@ period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD); }
branch_type { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BRANCH_SAMPLE_TYPE); }
, { return ','; }
"/" { BEGIN(INITIAL); return '/'; }
+{name_minus} { return str(yyscanner, PE_NAME); }
}

mem: { BEGIN(mem); return PE_PREFIX_MEM; }