This patch series extends perf_events with the ability to sample
physical data addresses. This is useful with the memory access
sampling mode added just recently. In particular, it helps
disambiguate data addresses between two processes, such as
in the case of a shared memory segment mapped at different
addresses in different processes.
The patch adds the PERF_SAMPLE_PHYS_ADDR sample_type.
A 64-bit address is added to the sample record for
the corresponding event.
On Intel X86, it is used with the PEBS Load Latency
support. On other architectures, zero is returned.
The patch series also demonstrates the use of this
new feature by extending perf report, mem, record
with a --phys-addr option. When enabled, it will
capture the physical data address and display it.
This is implemented as a new sort_order (symbol_paddr).
$ perf mem --phys-addr -t load rec ...
$ perf mem --phys-addr -t load rep ...
Note that for now, on x86, only user addresses are converted
to physical addresses whenever possible. Kernel addresses will
show up as -1.
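For tools driving the syscall directly rather than going through
perf record, a minimal sketch of requesting the new sample type
(the raw event encoding is illustrative, the load-latency threshold
in config1 is omitted, and the headers are assumed to carry this
series' new bit):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_mem_load_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	attr.config = 0x1cd;	/* illustrative: a PEBS load-latency event */
	attr.sample_period = 1000;
	attr.precise_ip = 2;	/* request PEBS */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
			   PERF_SAMPLE_PHYS_ADDR;

	/* measure the calling thread on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}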
Thanks to Hugh Dickins for the uvirt_to_phys_nmi() code.
The series contains a couple of patches related to
PEBS in general:
- Patch 1 contains a small bug fix to clear the PEBS-LL
bits in pebs_enable in intel_pmu_pebs_disable().
- Patch 2 removes event->flags because it is not used
anymore. The code now uses event->hw.constraints->flags.
Signed-off-by: Stephane Eranian <[email protected]>
---
Stephane Eranian (8):
perf,x86: disable PEBS-LL in intel_pmu_pebs_disable()
perf,x86: drop event->flags and use hw.constraint->flags
perf,x86: add uvirt_to_phys_nmi helper function
perf: add PERF_SAMPLE_PHYS_ADDR sample type
perf,x86: add support for PERF_SAMPLE_PHYS_ADDR for PEBS-LL
perf tools: add infrastructure to handle PERF_SAMPLE_PHYS_ADDR
perf record: add option to sample physical load/store addresses
perf mem: add physical addr sampling support
arch/x86/include/asm/uaccess.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 6 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 28 +++++---
arch/x86/lib/usercopy.c | 43 +++++++++++
include/linux/perf_event.h | 3 +-
include/uapi/linux/perf_event.h | 3 +-
kernel/events/core.c | 6 ++
tools/perf/Documentation/perf-record.txt | 4 ++
tools/perf/builtin-mem.c | 116 +++++++++++++++++++++++-------
tools/perf/builtin-record.c | 2 +
tools/perf/builtin-report.c | 2 +-
tools/perf/perf.h | 1 +
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 16 ++++-
tools/perf/util/hist.c | 4 +-
tools/perf/util/hist.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/session.c | 6 ++
tools/perf/util/sort.c | 42 +++++++++++
tools/perf/util/sort.h | 1 +
tools/perf/util/symbol.h | 1 +
21 files changed, 243 insertions(+), 45 deletions(-)
--
1.8.1.2
Add a function to convert a user-level virtual address to its
physical address, to be used by the perf_events memory access
sampling feature.
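In sketch form, the calling convention (variable names illustrative):

	phys_addr_t phys = uvirt_to_phys_nmi((void __user *)uaddr);

	/* (phys_addr_t)-1 means no mm, a non-user address, or a
	 * GUP-fast miss; the series records that -1 as-is, which is
	 * why unresolved samples show a physical address of -1. */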
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/include/asm/uaccess.h | 1 +
arch/x86/lib/usercopy.c | 43 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 5ee2687..4c9a102 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -513,6 +513,7 @@ struct __large_struct { unsigned long buf[100]; };
extern unsigned long
copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
+extern phys_addr_t uvirt_to_phys_nmi(const void __user *address);
extern __must_check long
strncpy_from_user(char *dst, const char __user *src, long count);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 4f74d94..3f19023 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -47,3 +47,46 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
return len;
}
EXPORT_SYMBOL_GPL(copy_from_user_nmi);
+
+/*
+ * Best effort, NMI-safe GUP-fast-based user-virtual to physical translation.
+ *
+ * Does not really belong in "usercopy.c", but kept here for comparison with
+ * copy_from_user_nmi() above.
+ *
+ * __get_user_pages_fast() may fail at awkward moments e.g. while transparent
+ * hugepage is being split. And at present it happens to SetPageReferenced():
+ * not really a problem when this is used for profiling pages which are being
+ * referenced, but should be fixed if this were to be used any more widely.
+ *
+ * At time of writing, __get_user_pages_fast() is supported by mips, s390, sh
+ * and x86 (with a weak fallback returning 0 on other architectures): we have
+ * not established whether it is NMI-safe on any other architecture than x86.
+ */
+phys_addr_t uvirt_to_phys_nmi(const void __user *address)
+{
+ unsigned long vaddr = (unsigned long)address;
+ phys_addr_t paddr = vaddr & ~PAGE_MASK;
+ struct page *page;
+
+ if (!current->mm)
+ return -1;
+
+ if (__range_not_ok(address, 1, TASK_SIZE))
+ return -1;
+
+ if (!__get_user_pages_fast(vaddr, 1, 0, &page))
+ return -1;
+
+ paddr += (phys_addr_t)page_to_pfn(page) << PAGE_SHIFT;
+
+ /*
+ * If called under NMI, this put_page(page) cannot be its final
+ * put_page (which would indeed be problematic): a racing munmap
+ * on another CPU cannot free the page until it has flushed TLB
+ * on our CPU, and that must wait for us to leave NMI.
+ */
+ put_page(page);
+
+ return paddr;
+}
--
1.8.1.2
This patch adds support for sampling the physical data address
of Loads/Stores captured by PEBS Load Latency/Precise Store
facility.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 06d02fa..e32a5ab 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -3,6 +3,7 @@
#include <linux/slab.h>
#include <asm/perf_event.h>
+#include <asm/uaccess.h>
#include <asm/insn.h>
#include "perf_event.h"
@@ -835,10 +836,14 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
else
regs.flags &= ~PERF_EFLAGS_EXACT;
- if ((event->attr.sample_type & PERF_SAMPLE_ADDR) &&
+ if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
data.addr = pebs->dla;
+ if ((fll || fst) && (sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+ x86_pmu.intel_cap.pebs_format >= 1)
+ data.phys_addr = uvirt_to_phys_nmi((void *)(pebs->dla));
+
if (has_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;
--
1.8.1.2
This patch adds the --phys-addr option to perf record.
This is used with memory access sampling to capture physical
addresses. The option may be used in conjunction with -d, thereby
providing virtual and physical addresses for a memory access. This
is useful to disambiguate shared memory accesses between processes.
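A possible invocation; the event string below mirrors what perf mem
passes for loads in this series, and the target command is a placeholder:

$ perf record --phys-addr -W -d -e cpu/mem-loads/pp -- ./myapp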
Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 4 ++++
tools/perf/builtin-record.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d4da111..ebf98c5 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -188,6 +188,10 @@ Enable weightened sampling. An additional weight is recorded per sample and can
displayed with the weight and local_weight sort keys. This currently works for TSX
abort events and some memory events in precise mode on modern Intel CPUs.
+--phys-addr::
+Sample physical addresses for memory loads and stores. May be used in conjunction with
+the -d option when using memory access sampling (via perf mem).
+
SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index fff985c..40bcede 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -932,6 +932,8 @@ const struct option record_options[] = {
"per thread counts"),
OPT_BOOLEAN('d', "data", &record.opts.sample_address,
"Sample addresses"),
+ OPT_BOOLEAN(0, "phys-addr", &record.opts.sample_phys_address,
+ "Sample physical addresses"),
OPT_BOOLEAN('T', "timestamp", &record.opts.sample_time, "Sample timestamps"),
OPT_BOOLEAN('P', "period", &record.opts.period, "Sample period"),
OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
--
1.8.1.2
This patch adds support for PERF_SAMPLE_PHYS_ADDR to perf mem
and supporting code in the infrastructure.
Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/builtin-mem.c | 116 +++++++++++++++++++++++++++++++++-----------
tools/perf/builtin-report.c | 2 +-
tools/perf/util/hist.c | 4 +-
tools/perf/util/hist.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/sort.c | 42 ++++++++++++++++
tools/perf/util/sort.h | 1 +
tools/perf/util/symbol.h | 1 +
8 files changed, 139 insertions(+), 29 deletions(-)
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index a8ff6d2..5946f6a 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -17,6 +17,7 @@ struct perf_mem {
symbol_filter_t annotate_init;
bool hide_unresolved;
bool dump_raw;
+ bool phys_addr;
const char *cpu_list;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
};
@@ -26,14 +27,14 @@ static const char * const mem_usage[] = {
NULL
};
-static int __cmd_record(int argc, const char **argv)
+static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)
{
int rec_argc, i = 0, j;
const char **rec_argv;
char event[64];
int ret;
- rec_argc = argc + 4;
+ rec_argc = argc + 5;
rec_argv = calloc(rec_argc + 1, sizeof(char *));
if (!rec_argv)
return -1;
@@ -42,6 +43,10 @@ static int __cmd_record(int argc, const char **argv)
if (!strcmp(mem_operation, MEM_OPERATION_LOAD))
rec_argv[i++] = strdup("-W");
rec_argv[i++] = strdup("-d");
+
+ if (mem->phys_addr)
+ rec_argv[i++] = strdup("--phys-addr");
+
rec_argv[i++] = strdup("-e");
if (strcmp(mem_operation, MEM_OPERATION_LOAD))
@@ -83,29 +88,62 @@ dump_raw_samples(struct perf_tool *tool,
al.map->dso->hit = 1;
if (symbol_conf.field_sep) {
- fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s%"PRIu64
- "%s0x%"PRIx64"%s%s:%s\n";
+ if (mem->phys_addr)
+ fmt = "%d%s%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s0x%"PRIx64
+ "%s%"PRIu64"%s0x%"PRIx64"%s%s:%s\n";
+ else
+ fmt = "%d%s%d%s%d%s0x%"PRIx64"%s0x%"PRIx64
+ "%s%"PRIu64"%s0x%"PRIx64"%s%s:%s\n";
} else {
- fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64
- "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
+ if (mem->phys_addr)
+ fmt = "%5d%s%5d%s%5d%s0x%016"PRIx64"%s0x%016"PRIx64
+ "%s0x016%"PRIx64"%s%5"PRIu64"%s0x%06"
+ PRIx64"%s%s:%s\n";
+ else
+ fmt = "%5d%s%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64
+ "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
symbol_conf.field_sep = " ";
}
-
- printf(fmt,
- sample->pid,
- symbol_conf.field_sep,
- sample->tid,
- symbol_conf.field_sep,
- event->ip.ip,
- symbol_conf.field_sep,
- sample->addr,
- symbol_conf.field_sep,
- sample->weight,
- symbol_conf.field_sep,
- sample->data_src,
- symbol_conf.field_sep,
- al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???",
- al.sym ? al.sym->name : "???");
+ if (mem->phys_addr)
+ printf(fmt,
+ sample->pid,
+ symbol_conf.field_sep,
+ sample->tid,
+ symbol_conf.field_sep,
+ sample->cpu,
+ symbol_conf.field_sep,
+ event->ip.ip,
+ symbol_conf.field_sep,
+ sample->addr,
+ symbol_conf.field_sep,
+ sample->paddr,
+ symbol_conf.field_sep,
+ sample->weight,
+ symbol_conf.field_sep,
+ sample->data_src,
+ symbol_conf.field_sep,
+ al.map ? (al.map->dso ?
+ al.map->dso->long_name : "???") : "???",
+ al.sym ? al.sym->name : "???");
+ else
+ printf(fmt,
+ sample->pid,
+ symbol_conf.field_sep,
+ sample->tid,
+ symbol_conf.field_sep,
+ sample->cpu,
+ symbol_conf.field_sep,
+ event->ip.ip,
+ symbol_conf.field_sep,
+ sample->addr,
+ symbol_conf.field_sep,
+ sample->weight,
+ symbol_conf.field_sep,
+ sample->data_src,
+ symbol_conf.field_sep,
+ al.map ? (al.map->dso ?
+ al.map->dso->long_name : "???") : "???",
+ al.sym ? al.sym->name : "???");
return 0;
}
@@ -139,7 +177,10 @@ static int report_raw_events(struct perf_mem *mem)
if (symbol__init() < 0)
return -1;
- printf("# PID, TID, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL\n");
+ if (mem->phys_addr)
+ printf("# PID, TID, CPU, IP, ADDR, PADDR, LOCAL WEIGHT, DSRC, SYMBOL\n");
+ else
+ printf("# PID, TID, CPU, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL\n");
err = perf_session__process_events(session, &mem->tool);
if (err)
@@ -152,15 +193,33 @@ out_delete:
return err;
}
+static const char *ld_sort_order = "--sort=mem,sym,dso,symbol_daddr,dso_daddr,"
+ "tlb,locked";
+static const char *st_sort_order = "--sort=local_weight,mem,sym,dso,"
+ "symbol_daddr,dso_daddr,snoop,tlb,locked";
+static const char *ld_sort_order_p = "--sort=mem,sym,dso,symbol_daddr,"
+ "symbol_paddr,dso_daddr,tlb,locked";
+static const char *st_sort_order_p = "--sort=local_weight,mem,sym,dso,"
+ "symbol_daddr,symbol_paddr,dso_daddr,snoop,tlb,locked";
+
static int report_events(int argc, const char **argv, struct perf_mem *mem)
{
const char **rep_argv;
+ const char *ld_sort, *st_sort;
int ret, i = 0, j, rep_argc;
if (mem->dump_raw)
return report_raw_events(mem);
- rep_argc = argc + 3;
+ if (mem->phys_addr) {
+ ld_sort = ld_sort_order_p;
+ st_sort = st_sort_order_p;
+ } else {
+ ld_sort = ld_sort_order;
+ st_sort = st_sort_order;
+ }
+
+ rep_argc = argc + 4;
rep_argv = calloc(rep_argc + 1, sizeof(char *));
if (!rep_argv)
return -1;
@@ -174,8 +233,9 @@ static int report_events(int argc, const char **argv, struct perf_mem *mem)
* the column
*/
if (strcmp(mem_operation, MEM_OPERATION_LOAD))
- rep_argv[i++] = strdup("--sort=mem,sym,dso,symbol_daddr,"
- "dso_daddr,tlb,locked");
+ rep_argv[i++] = strdup(st_sort);
+ else
+ rep_argv[i++] = strdup(ld_sort);
for (j = 1; j < argc; j++, i++)
rep_argv[i] = argv[j];
@@ -207,6 +267,8 @@ int cmd_mem(int argc, const char **argv, const char *prefix __maybe_unused)
"dump raw samples in ASCII"),
OPT_BOOLEAN('U', "hide-unresolved", &mem.hide_unresolved,
"Only display entries resolved to a symbol"),
+ OPT_BOOLEAN(0, "phys-addr", &mem.phys_addr,
+ "sample physical data addr"),
OPT_STRING('i', "input", &input_name, "file",
"input file name"),
OPT_STRING('C', "cpu", &mem.cpu_list, "cpu",
@@ -232,7 +294,7 @@ int cmd_mem(int argc, const char **argv, const char *prefix __maybe_unused)
}
if (!strncmp(argv[0], "rec", 3))
- return __cmd_record(argc, argv);
+ return __cmd_record(argc, argv, &mem);
else if (!strncmp(argv[0], "rep", 3))
return report_events(argc, argv, &mem);
else
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index ca98d34..6b2531f 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -894,7 +894,7 @@ repeat:
* branch-mode specific order
*/
if (sort_order == default_sort_order)
- sort_order = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+ sort_order = "local_weight,mem,sym,dso,symbol_daddr,symbol_paddr,dso_daddr,snoop,tlb,locked";
}
if (setup_sorting() < 0)
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index b11a6cf..136a851 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -148,10 +148,12 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__set_unres_dso_col_len(hists, HISTC_MEM_DADDR_DSO);
}
} else {
- symlen = unresolved_col_width + 4 + 2;
+ symlen = unresolved_col_width + 2 + 4;
hists__new_col_len(hists, HISTC_MEM_DADDR_SYMBOL, symlen);
hists__set_unres_dso_col_len(hists, HISTC_MEM_DADDR_DSO);
}
+ symlen = unresolved_col_width + 2;
+ hists__new_col_len(hists, HISTC_MEM_PADDR_SYMBOL, symlen);
hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
hists__new_col_len(hists, HISTC_MEM_TLB, 22);
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 2d3790f..9add9ae 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -52,6 +52,7 @@ enum hist_column {
HISTC_LOCAL_WEIGHT,
HISTC_GLOBAL_WEIGHT,
HISTC_MEM_DADDR_SYMBOL,
+ HISTC_MEM_PADDR_SYMBOL,
HISTC_MEM_DADDR_DSO,
HISTC_MEM_LOCKED,
HISTC_MEM_TLB,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index b2ecad6..7823460 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1133,6 +1133,7 @@ struct mem_info *machine__resolve_mem(struct machine *machine,
ip__resolve_ams(machine, thr, &mi->iaddr, sample->ip);
ip__resolve_data(machine, thr, cpumode, &mi->daddr, sample->addr);
mi->data_src.val = sample->data_src;
+ mi->paddr = sample->paddr;
return mi;
}
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 313a5a7..1a0ea58 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -497,6 +497,39 @@ static int hist_entry__daddr_snprintf(struct hist_entry *self, char *bf,
}
static int64_t
+sort__paddr_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ uint64_t l = 0, r = 0;
+
+ if (left->mem_info)
+ l = left->mem_info->paddr;
+ if (right->mem_info)
+ r = right->mem_info->paddr;
+
+ return (int64_t)(r - l);
+}
+
+static int hist_entry__paddr_snprintf(struct hist_entry *self, char *bf,
+ size_t size, unsigned int width)
+{
+ uint64_t addr = 0;
+ size_t len = BITS_PER_LONG / 4;
+ int ret = 0;
+
+ if (!self->mem_info)
+ return 0;
+
+ addr = self->mem_info->paddr;
+
+ ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+ len, addr);
+ ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+ width - ret, "");
+ return ret;
+}
+
+
+static int64_t
sort__dso_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
{
struct map *map_l = NULL;
@@ -821,6 +854,14 @@ struct sort_entry sort_mem_daddr_sym = {
.se_width_idx = HISTC_MEM_DADDR_SYMBOL,
};
+struct sort_entry sort_mem_paddr_sym = {
+ .se_header = "Data Physical Addr",
+ .se_cmp = sort__paddr_cmp,
+ .se_snprintf = hist_entry__paddr_snprintf,
+ .se_width_idx = HISTC_MEM_PADDR_SYMBOL,
+};
+
+
struct sort_entry sort_mem_daddr_dso = {
.se_header = "Data Object",
.se_cmp = sort__dso_daddr_cmp,
@@ -894,6 +935,7 @@ static struct sort_dimension memory_sort_dimensions[] = {
DIM(SORT_LOCAL_WEIGHT, "local_weight", sort_local_weight),
DIM(SORT_GLOBAL_WEIGHT, "weight", sort_global_weight),
DIM(SORT_MEM_DADDR_SYMBOL, "symbol_daddr", sort_mem_daddr_sym),
+ DIM(SORT_MEM_PADDR_SYMBOL, "symbol_paddr", sort_mem_paddr_sym),
DIM(SORT_MEM_DADDR_DSO, "dso_daddr", sort_mem_daddr_dso),
DIM(SORT_MEM_LOCKED, "locked", sort_mem_locked),
DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 45ac84c..4d4bdce 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -152,6 +152,7 @@ enum sort_type {
SORT_LOCAL_WEIGHT = __SORT_MEMORY_MODE,
SORT_GLOBAL_WEIGHT,
SORT_MEM_DADDR_SYMBOL,
+ SORT_MEM_PADDR_SYMBOL,
SORT_MEM_DADDR_DSO,
SORT_MEM_LOCKED,
SORT_MEM_TLB,
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 5f720dc..c2e933c 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -159,6 +159,7 @@ struct branch_info {
struct mem_info {
struct addr_map_symbol iaddr;
struct addr_map_symbol daddr;
+ u64 paddr;
union perf_mem_data_src data_src;
};
--
1.8.1.2
Add the ability to sample physical data addresses.
Useful when used in combination with PERF_SAMPLE_ADDR
and memory access sampling. Physical addresses help
disambiguate sharing between processes.
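In the sample record, the new field is appended right after DATA_SRC,
following the bit ordering of enum perf_event_sample_format; a sketch
in the style of the uapi header's layout comment:

	{ u64  addr;      } && PERF_SAMPLE_ADDR
	...
	{ u64  weight;    } && PERF_SAMPLE_WEIGHT
	{ u64  data_src;  } && PERF_SAMPLE_DATA_SRC
	{ u64  phys_addr; } && PERF_SAMPLE_PHYS_ADDR	/* new */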
Signed-off-by: Stephane Eranian <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 6 ++++++
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 535d89c..4fe89e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -570,6 +570,7 @@ struct perf_sample_data {
} tid_entry;
u64 time;
u64 addr;
+ u64 phys_addr;
u64 id;
u64 stream_id;
struct {
@@ -599,6 +600,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->stack_user_size = 0;
data->weight = 0;
data->data_src.val = 0;
+ data->phys_addr = 0;
}
extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..a1b4fad 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -134,8 +134,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_STACK_USER = 1U << 13,
PERF_SAMPLE_WEIGHT = 1U << 14,
PERF_SAMPLE_DATA_SRC = 1U << 15,
+ PERF_SAMPLE_PHYS_ADDR = 1U << 16,
- PERF_SAMPLE_MAX = 1U << 16, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 17, /* non-ABI */
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9c89207..294b439 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1104,6 +1104,9 @@ static void perf_event__header_size(struct perf_event *event)
if (sample_type & PERF_SAMPLE_DATA_SRC)
size += sizeof(data->data_src.val);
+ if (sample_type & PERF_SAMPLE_PHYS_ADDR)
+ size += sizeof(data->phys_addr);
+
event->header_size = size;
}
@@ -4415,6 +4418,9 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_DATA_SRC)
perf_output_put(handle, data->data_src.val);
+
+ if (sample_type & PERF_SAMPLE_PHYS_ADDR)
+ perf_output_put(handle, data->phys_addr);
}
void perf_prepare_sample(struct perf_event_header *header,
--
1.8.1.2
Provide infrastructure to request and dump samples with
PERF_SAMPLE_PHYS_ADDR sample type, i.e., physical address
sampling. This is useful for memory access sampling support.
Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/perf.h | 1 +
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 16 +++++++++++++++-
tools/perf/util/session.c | 6 ++++++
4 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 32bd102..17f2c11 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -218,6 +218,7 @@ struct perf_record_opts {
bool pipe_output;
bool raw_samples;
bool sample_address;
+ bool sample_phys_address;
bool sample_weight;
bool sample_time;
bool period;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 1813895..373f0eb 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -85,6 +85,7 @@ struct perf_sample {
u32 pid, tid;
u64 time;
u64 addr;
+ u64 paddr;
u64 id;
u64 stream_id;
u64 period;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 63b6f8c..15e0cca 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -524,7 +524,11 @@ void perf_evsel__config(struct perf_evsel *evsel,
perf_evsel__set_sample_bit(evsel, ADDR);
attr->mmap_data = track;
}
-
+ if (opts->sample_phys_address) {
+ perf_evsel__set_sample_bit(evsel, PHYS_ADDR);
+ attr->mmap_data = track;
+ }
+
if (opts->call_graph) {
perf_evsel__set_sample_bit(evsel, CALLCHAIN);
@@ -1186,6 +1190,11 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
array++;
}
+ data->paddr = 0;
+ if (type & PERF_SAMPLE_PHYS_ADDR) {
+ data->paddr = *array;
+ array++;
+ }
return 0;
}
@@ -1262,6 +1271,11 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
array++;
}
+ if (type & PERF_SAMPLE_PHYS_ADDR) {
+ *array = sample->paddr;
+ array++;
+ }
+
return 0;
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index cf1fe01..fa9bf46 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -804,6 +804,12 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
if (sample_type & PERF_SAMPLE_DATA_SRC)
printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
+
+ if (sample_type & PERF_SAMPLE_ADDR)
+ printf(" . addr: 0x%"PRIx64"\n", sample->addr);
+
+ if (sample_type & PERF_SAMPLE_PHYS_ADDR)
+ printf(" . phys_addr: 0x%"PRIx64"\n", sample->paddr);
}
static struct machine *
--
1.8.1.2
Now that we use the constraints directly from the event, we
do not need the event->flags field, so drop it.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 6 +-----
arch/x86/kernel/cpu/perf_event_intel_ds.c | 19 ++++++++++---------
include/linux/perf_event.h | 1 -
3 files changed, 11 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a6eccf1..d04be6c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1449,11 +1449,8 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if ((event->hw.config & c->cmask) == c->code) {
- /* hw.flags zeroed at initialization */
- event->hw.flags |= c->flags;
+ if ((event->hw.config & c->cmask) == c->code)
return c;
- }
}
}
@@ -1498,7 +1495,6 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
- event->hw.flags = 0;
intel_put_shared_regs_event_constraints(cpuc, event);
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3065c57..06d02fa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -623,7 +623,6 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
if ((event->hw.config & c->cmask) == c->code) {
- event->hw.flags |= c->flags;
return c;
}
}
@@ -636,14 +635,15 @@ void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ int flags = event->hw.constraint->flags;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if (flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
- else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
+ else if (flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
}
@@ -651,12 +651,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ int flags = event->hw.constraint->flags;
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
- if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if (flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
- else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
+ else if (flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);
if (cpuc->enabled)
@@ -772,14 +773,14 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct perf_sample_data data;
struct pt_regs regs;
u64 sample_type;
+ int flags = event->hw.constraint->flags;
int fll, fst;
if (!intel_pmu_save_and_restart(event))
return;
- fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
- fst = event->hw.flags & (PERF_X86_EVENT_PEBS_ST |
- PERF_X86_EVENT_PEBS_ST_HSW);
+ fll = flags & PERF_X86_EVENT_PEBS_LDLAT;
+ fst = flags & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_ST_HSW);
perf_sample_data_init(&data, 0, event->hw.last_period);
@@ -802,7 +803,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
if (sample_type & PERF_SAMPLE_DATA_SRC) {
if (fll)
data.data_src.val = load_latency_data(pebs->dse);
- else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST_HSW)
+ else if (flags & PERF_X86_EVENT_PEBS_ST_HSW)
data.data_src.val =
precise_store_data_hsw(pebs->dse);
else
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4ccf846..535d89c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -134,7 +134,6 @@ struct hw_perf_event {
int event_base_rdpmc;
int idx;
int last_cpu;
- int flags;
struct hw_perf_event_extra extra_reg;
struct hw_perf_event_extra branch_reg;
--
1.8.1.2
Make sure intel_pmu_pebs_disable() and intel_pmu_pebs_enable()
are symmetrical w.r.t. PEBS-LL and precise store.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ed3e553..3065c57 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -653,6 +653,12 @@ void intel_pmu_pebs_disable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+
+ if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
+ cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
+ else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
+ cpuc->pebs_enabled &= ~(1ULL << 63);
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
--
1.8.1.2
On Fri, Jun 21, 2013 at 04:20:40PM +0200, Stephane Eranian wrote:
> This patch series extends perf_events with the ability to sample
> physical data addresses. This is useful with the memory access
> sampling mode added just recently. In particular, it helps
> disambiguate data addresses between two processes, such as
> in the case of a shared memory segment mapped at different
> addresses in different processes.
hi,
what is this patchset based on?
I cannot apply this on tip nor on acme's tree.
thanks,
jirka
Hi,
It applies and compiles for me on tip.git:
4ce9207 Merge branch 'x86/platform'
On Sun, Jun 23, 2013 at 11:58 PM, Jiri Olsa <[email protected]> wrote:
> On Fri, Jun 21, 2013 at 04:20:40PM +0200, Stephane Eranian wrote:
>> This patch series extends perf_events with the ability to sample
>> physical data addresses. This is useful with the memory access
>> sampling mode added just recently. In particular, it helps
>> disambiguate data addresses between two processes, such as
>> in the case of a shared memory segment mapped at different
>> addresses in different processes.
>
> hi,
> what is this patchset based on?
> I cannot apply this on tip nor on acme's tree.
>
> thanks,
> jirka
On Fri, Jun 21, 2013 at 04:20:40PM +0200, Stephane Eranian wrote:
> This patch series extends perf_events with the ability to sample
> physical data addresses. This is useful with the memory access
> sampling mode added just recently. In particular, it helps
> disambiguate data addresses between two processes, such as
> in the case of a shared memory segment mapped at different
> addresses in different processes.
>
> The patch adds the PERF_SAMPLE_PHYS_ADDR sample_type.
> A 64-bit address is added to the sample record for
> the corresponding event.
>
> On Intel X86, it is used with the PEBS Load Latency
> support. On other architectures, zero is returned.
>
> The patch series also demonstrates the use of this
> new feature by extending perf report, mem, record
> with a --phys-addr option. When enable, it will
> capture physical data address and display it.
> This is implemented as a new sort_order (symbol_paddr).
>
So I'm a bit puzzled by this thing...
What exact problem are we trying to solve here? Only the shared memory
mapped at different addresses between processes thing mentioned earlier?
The big problem I see with all this is that typically memory is subject
to being moved about at random; be it either from paging, compaction,
NUMA policies or explicit page migration.
Such would completely shatter physical page relations.
If the shared memory thing is really the issue, doesn't perf already
have the process memory layout (/proc/$PID/maps and aux stream mmap
updates) with which it can compute map relative offsets and compare
thusly?
On Fri, Jun 21, 2013 at 04:20:41PM +0200, Stephane Eranian wrote:
> Make sure intel_pmu_pebs_disable() and intel_pmu_pebs_enable()
> are symmetrical w.r.t. PEBS-LL and precise store.
>
> Signed-off-by: Stephane Eranian <[email protected]>
This seems unrelated to the actual patch series and should still go in.
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index ed3e553..3065c57 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -653,6 +653,12 @@ void intel_pmu_pebs_disable(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
>
> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
> +
> + if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
> + cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
> + else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
> + cpuc->pebs_enabled &= ~(1ULL << 63);
> +
> if (cpuc->enabled)
> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>
> --
> 1.8.1.2
>
On Fri, Jun 21, 2013 at 04:20:42PM +0200, Stephane Eranian wrote:
> Now that we use the contraints directly from the event, we
> do not need the event->flags field so drop it.
>
> Signed-off-by: Stephane Eranian <[email protected]>
Same for this one I suppose; its not really related to the series at
hand.
On Mon, Jun 24, 2013 at 10:16:23AM +0200, Stephane Eranian wrote:
> Hi,
>
> It applies and compiles for me on tip.git:
>
> 4ce9207 Merge branch 'x86/platform'
ok, now it works.. ;) sry for noise
jirka
On Mon, Jun 24, 2013 at 10:43 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 21, 2013 at 04:20:40PM +0200, Stephane Eranian wrote:
>> This patch series extends perf_events with the ability to sample
>> physical data addresses. This is useful with the memory access
>> sampling mode added just recently. In particular, it helps
>> disambiguate data addresses between two processes, such as
>> in the case of a shared memory segment mapped at different
>> addresses in different processes.
>>
>> The patch adds the PERF_SAMPLE_PHYS_ADDR sample_type.
>> A 64-bit address is added to the sample record for
>> the corresponding event.
>>
>> On Intel X86, it is used with the PEBS Load Latency
>> support. On other architectures, zero is returned.
>>
>> The patch series also demonstrates the use of this
>> new feature by extending perf report, mem, record
>> with a --phys-addr option. When enable, it will
>> capture physical data address and display it.
>> This is implemented as a new sort_order (symbol_paddr).
>>
>
> So I'm a bit puzzled by this thing...
>
> What exact problem are we trying to solve here? Only the shared memory
> mapped at different addresses between processes thing mentioned earlier?
>
That is indeed one problem I am trying to address here, based on actual
feedback from people building tools on top of PEBS-LL.
> The big problem I see with all this is that typically memory is subject
> to being moved about at random; be it either from paging, compaction,
> NUMA policies or explicit page migration.
>
One guarantee we have is that the physical address does correspond to
the virtual address at the time of the interrupt.
But yeah, if physical pages are swapped during the run, then things become
a lot more complicated. I am not trying to address this.
Can pages move for shared memory segments?
> Such would completely shatter physical page relations.
>
> If the shared memory thing is really the issue, doesn't perf already
> have the process memory layout (/proc/$PID/maps and aux stream mmap
> updates) with which it can compute map relative offsets and compare
> thusly?
Not sure I understand this.
Suppose the same shared memory segment is mapped at two different
addresses by shmat(). First, I don't know if those show up in /proc/maps.
Second, what offset are you talking about here?
On Tue, Jun 25, 2013 at 11:59:09AM +0200, Stephane Eranian wrote:
> One guarantee we have is that the physical does correspond to the virtual
> address at the time of the interrupt.
That might not be much of a guarantee depending on the circumstances.
> But yeah, if physical pages are swapped during the run, then things become
> a lot more complicated. I am not trying to address this.
>
> Can page move for shared memory segments?
Yep..
> > Such would completely shatter physical page relations.
> >
> > If the shared memory thing is really the issue, doesn't perf already
> > have the process memory layout (/proc/$PID/maps and aux stream mmap
> > updates) with which it can compute map relative offsets and compare
> > thusly?
>
> Not sure I understand this.
> suppose the same shared memory segment is mapped at two different
> addresses by shmat(). First, I don't know if those show up in /proc/maps.
They should; IIRC maps is a full vma list.. /me prods about in
fs/proc/task_mmu.c.. yes it prints all vmas.
> Second, what offset are you talking about here?
The offset to the start of the vma; this should be the same for both
maps, irrespective of where they're mapped.
You could then match shared memory segments on inode:offset.
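A sketch of that normalization, assuming the tool can recover
(start, pgoff, dev, inode) for each map -- the struct and field names
below are illustrative, not the actual perf tooling types:

#include <linux/types.h>

/* illustrative view of one map, as recovered from /proc/$PID/maps
 * or the mmap event stream; pgoff is the file offset in bytes */
struct map_info {
	__u64 start;
	__u64 pgoff;
	__u64 dev;
	__u64 inode;
};

/*
 * Reduce a sampled virtual address to a placement-invariant key:
 * two mappings of the same shared segment yield the same
 * (dev, inode, offset) triple no matter where they are mapped.
 */
struct map_key {
	__u64 dev;
	__u64 inode;
	__u64 offset;
};

static struct map_key addr_to_key(const struct map_info *m, __u64 addr)
{
	struct map_key k = {
		.dev    = m->dev,
		.inode  = m->inode,
		.offset = (addr - m->start) + m->pgoff,
	};
	return k;
}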
* Peter Zijlstra <[email protected]> wrote:
> On Tue, Jun 25, 2013 at 11:59:09AM +0200, Stephane Eranian wrote:
> > One guarantee we have is that the physical does correspond to the virtual
> > address at the time of the interrupt.
>
> That might not be much of a guarantee depending on the circumstances.
>
> > But yeah, if physical pages are swapped during the run, then things become
> > a lot more complicated. I am not trying to address this.
> >
> > Can page move for shared memory segments?
>
> Yep..
>
> > > Such would completely shatter physical page relations.
> > >
> > > If the shared memory thing is really the issue, doesn't perf already
> > > have the process memory layout (/proc/$PID/maps and aux stream mmap
> > > updates) with which it can compute map relative offsets and compare
> > > thusly?
> >
> > Not sure I understand this.
> > suppose the same shared memory segment is mapped at two different
> > addresses by shmat(). First, I don't know if those show up in /proc/maps.
>
> They should; IIRC maps is a full vma list.. /me prods about in
> fs/proc/task_mmu.c.. yes it prints all vmas.
A syscall (ioctl?) to dump all current vmas into the mmap update stream
(to form a starting point) might be handy - that would remove the
fragility and overhead of parsing /proc/ details.
Thanks,
Ingo
On Mon, Jun 24, 2013 at 10:44 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 21, 2013 at 04:20:41PM +0200, Stephane Eranian wrote:
>> Make sure intel_pmu_pebs_disable() and intel_pmu_pebs_enable()
>> are symmetrical w.r.t. PEBS-LL and precise store.
>>
>> Signed-off-by: Stephane Eranian <[email protected]>
>
> This seems unrelated to the actual patch series and should still go in.
>
Yes, it is unrelated. I just ran into it while testing the patch.
It has been missing since the beginning, it seems.
>
>> ---
>> arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> index ed3e553..3065c57 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> @@ -653,6 +653,12 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>> struct hw_perf_event *hwc = &event->hw;
>>
>> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
>> +
>> + if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
>> + cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
>> + else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
>> + cpuc->pebs_enabled &= ~(1ULL << 63);
>> +
>> if (cpuc->enabled)
>> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>>
>> --
>> 1.8.1.2
>>
On Mon, Jun 24, 2013 at 10:45 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 21, 2013 at 04:20:42PM +0200, Stephane Eranian wrote:
>> Now that we use the contraints directly from the event, we
>> do not need the event->flags field so drop it.
>>
>> Signed-off-by: Stephane Eranian <[email protected]>
>
> Same for this one I suppose; its not really related to the series at
> hand.
Correct. I can submit separately if needed.
* Peter Zijlstra <[email protected]> wrote:
> > > Such would completely shatter physical page relations.
> > >
> > > If the shared memory thing is really the issue, doesn't perf already
> > > have the process memory layout (/proc/$PID/maps and aux stream mmap
> > > updates) with which it can compute map relative offsets and compare
> > > thusly?
> >
> > Not sure I understand this.
> > suppose the same shared memory segment is mapped at two different
> > addresses by shmat(). First, I don't know if those show up in /proc/maps.
>
> They should; IIRC maps is a full vma list.. /me prods about in
> fs/proc/task_mmu.c.. yes it prints all vmas.
Btw., as a related matter, it would be nice to add an ioctl() that would
trigger a regular MMAP event for all current vmas.
PERF_EVENT_IOC_DUMP_MMAPS or so?
This would be easier to process than the somewhat fragile parsing of
/proc/*/maps text files, and it would more consistently fit into the perf
tracing model.
tracing model.
I suspect it would/could also be less racy than /proc/*/maps, which is
restart based due to the 4K procfs limit, while the ioctl() could dump
everything into the trace buffer, as long as the buffer is large enough.
Thanks,
Ingo
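If such an ioctl existed, tool-side usage might look like this
(hypothetical sketch: PERF_EVENT_IOC_DUMP_MMAPS is only the name
proposed above; no kernel implements it):

	/* hypothetical: ask the kernel to synthesize MMAP events for
	 * all current vmas of the monitored task into the ring buffer */
	if (ioctl(perf_fd, PERF_EVENT_IOC_DUMP_MMAPS, 0) < 0)
		perror("PERF_EVENT_IOC_DUMP_MMAPS");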
On Tue, Jun 25, 2013 at 12:51:23PM +0200, Ingo Molnar wrote:
> A syscall (ioctl?) to dump all current vmas into the mmap update stream
> (to form a starting point) might be handy - that would remove the
> fragility and overhead of parsing /proc/ details.
It's more difficult than that though :/ Suppose not all mmap events fit
in the output buffer and you're not able to read from the buffer because
you're stuck in the ioctl().
This means we need to either force userspace to use threads to reliably
use the feature; or complicate the ioctl() to allow vma ranges -- which
introduces an inherent race window etc..
Neither option is really pretty, and we already have the maps parse code
-- also it's not _that_ hard to parse either.
On Wed, Jun 26, 2013 at 09:36:39AM +0200, Stephane Eranian wrote:
> On Mon, Jun 24, 2013 at 10:45 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Jun 21, 2013 at 04:20:42PM +0200, Stephane Eranian wrote:
> >> Now that we use the contraints directly from the event, we
> >> do not need the event->flags field so drop it.
> >>
> >> Signed-off-by: Stephane Eranian <[email protected]>
> >
> > Same for this one I suppose; its not really related to the series at
> > hand.
>
> Correct. I can submit separately if needed.
I applied the first patch; this one failed to apply.. Remind me again if
I fail to remember looking at the rejects once I've collected all other
patches I planned to ship to mingo today.
On Wed, Jun 26, 2013 at 12:33 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jun 25, 2013 at 12:51:23PM +0200, Ingo Molnar wrote:
>> A syscall (ioctl?) to dump all current vmas into the mmap update stream
>> (to form a starting point) might be handy - that would remove the
>> fragility and overhead of parsing /proc/ details.
>
> Its more difficult than that though :/ Suppose not all mmap events fit
> in the output buffer and you're not able to read from the buffer because
> you're stuck in the ioctl().
>
> This means we need to either force userspace to use threads to reliably
> use the feature; or complicate the ioctl() to allow vma ranges -- which
> introduces an inherent race window etc..
>
> Neither option are really pretty and we already have the maps parse code
> -- also its not _that_ hard to parse either.
After more investigation with the author of the false sharing detection
tool, I think that if the mapping changes, it is okay. The tool can detect
this and drop the analysis at that address. So as long as we can flag the
mapping change, we are okay. Hopefully, it does not occur frequently. If it
does, then I think there are bigger issues to fix on the system than false
sharing.
Commit-ID: 983433b5812c5cf33a9008fa38c6f9b407fedb76
Gitweb: http://git.kernel.org/tip/983433b5812c5cf33a9008fa38c6f9b407fedb76
Author: Stephane Eranian <[email protected]>
AuthorDate: Fri, 21 Jun 2013 16:20:41 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 26 Jun 2013 21:58:51 +0200
perf/x86: Disable PEBS-LL in intel_pmu_pebs_disable()
Make sure intel_pmu_pebs_disable() and intel_pmu_pebs_enable()
are symmetrical w.r.t. PEBS-LL and precise store.
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ed3e553..3065c57 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -653,6 +653,12 @@ void intel_pmu_pebs_disable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+
+ if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
+ cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
+ else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
+ cpuc->pebs_enabled &= ~(1ULL << 63);
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
Peter,
I don't see this patch in tip.git yet.
On Wed, Jun 26, 2013 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 26, 2013 at 09:36:39AM +0200, Stephane Eranian wrote:
>> On Mon, Jun 24, 2013 at 10:45 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, Jun 21, 2013 at 04:20:42PM +0200, Stephane Eranian wrote:
>> >> Now that we use the contraints directly from the event, we
>> >> do not need the event->flags field so drop it.
>> >>
>> >> Signed-off-by: Stephane Eranian <[email protected]>
>> >
>> > Same for this one I suppose; its not really related to the series at
>> > hand.
>>
>> Correct. I can submit separately if needed.
>
> I applied the first patch; this one failed to apply.. Remind me again if
> I fail to remember looking at the rejects once I've collected all other
> patches I planned to ship to mingo today.
On Thu, Jun 27, 2013 at 12:23:00PM +0200, Stephane Eranian wrote:
> Peter,
>
> I don't this see patch in tip.git yet.
Didn't apply and I didn't get around to beating it into shape; could you resend
against tip/master. Ingo just pushed out all the other bits.
On Thu, Jun 27, 2013 at 12:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 27, 2013 at 12:23:00PM +0200, Stephane Eranian wrote:
>> Peter,
>>
>> I don't this see patch in tip.git yet.
>
> Didn't apply and I didn't get around to beating it into shape; could you resend
> against tip/master. Ingo just pushed out all the other bits.
Ok, the chunks that do not apply can be ignored for perf_event_intel.c.
They were already included in my offcore-response sched fix patch.
But I can repost if you prefer.
On Thu, Jun 27, 2013 at 01:01:02PM +0200, Stephane Eranian wrote:
> Ok, the chunks that do not apply can be ignored for perf_event_intel.c.
Just to verify: I did a force apply ignoring the 2 hunks in perf_event_intel.c.
The result is the below patch, which gives:
# nice make O=defconfig-build/ -j16 -s
/usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c: In function ‘x86_schedule_events’:
/usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:780:9: error: ‘struct hw_perf_event’ has no member named ‘flags’
/usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:794:14: error: ‘struct hw_perf_event’ has no member named ‘flags’
/usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c: In function ‘x86_pmu_del’:
/usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:1180:11: error: ‘struct hw_perf_event’ has no member named ‘flags’
make[4]: *** [arch/x86/kernel/cpu/perf_event.o] Error 1
Please resend one that builds.
---
Subject: perf,x86: Drop event->flags and use hw.constraint->flags
From: Stephane Eranian <[email protected]>
Date: Fri, 21 Jun 2013 16:20:42 +0200
Now that we use the constraints directly from the event, we
do not need the event->flags field so drop it.
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 19 ++++++++++---------
include/linux/perf_event.h | 1 -
2 files changed, 10 insertions(+), 10 deletions(-)
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -623,7 +623,6 @@ struct event_constraint *intel_pebs_cons
if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
if ((event->hw.config & c->cmask) == c->code) {
- event->hw.flags |= c->flags;
return c;
}
}
@@ -636,14 +635,15 @@ void intel_pmu_pebs_enable(struct perf_e
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ int flags = event->hw.constraint->flags;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if (flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
- else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
+ else if (flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
}
@@ -651,12 +651,13 @@ void intel_pmu_pebs_disable(struct perf_
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ int flags = event->hw.constraint->flags;
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
- if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if (flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
- else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
+ else if (flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);
if (cpuc->enabled)
@@ -772,14 +773,14 @@ static void __intel_pmu_pebs_event(struc
struct perf_sample_data data;
struct pt_regs regs;
u64 sample_type;
+ int flags = event->hw.constraint->flags;
int fll, fst;
if (!intel_pmu_save_and_restart(event))
return;
- fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
- fst = event->hw.flags & (PERF_X86_EVENT_PEBS_ST |
- PERF_X86_EVENT_PEBS_ST_HSW);
+ fll = flags & PERF_X86_EVENT_PEBS_LDLAT;
+ fst = flags & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_ST_HSW);
perf_sample_data_init(&data, 0, event->hw.last_period);
@@ -802,7 +803,7 @@ static void __intel_pmu_pebs_event(struc
if (sample_type & PERF_SAMPLE_DATA_SRC) {
if (fll)
data.data_src.val = load_latency_data(pebs->dse);
- else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST_HSW)
+ else if (flags & PERF_X86_EVENT_PEBS_ST_HSW)
data.data_src.val =
precise_store_data_hsw(pebs->dse);
else
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -134,7 +134,6 @@ struct hw_perf_event {
int event_base_rdpmc;
int idx;
int last_cpu;
- int flags;
struct hw_perf_event_extra extra_reg;
struct hw_perf_event_extra branch_reg;
Peter,
Let's drop this patch.
It does not work anymore with the offcore-response
scheduling fix. The constraint->flags are shared, so cannot
use the flags there to store X86_COMMITTED. So we have
to keep a copy in the event->hw.flags.
Sorry about the confusion.
On Thu, Jun 27, 2013 at 1:14 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 27, 2013 at 01:01:02PM +0200, Stephane Eranian wrote:
>> Ok, the chunks that do not apply can be ignored for perf_event_intel.c.
>
> Just to verify; I did a force apply ignoring the 2 hunks in perf_event_intel.c.
> The result is the below patch; which gives:
>
> # nice make O=defconfig-build/ -j16 -s
> /usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c: In function ‘x86_schedule_events’:
> /usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:780:9: error: ‘struct hw_perf_event’ has no member named ‘flags’
> /usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:794:14: error: ‘struct hw_perf_event’ has no member named ‘flags’
> /usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c: In function ‘x86_pmu_del’:
> /usr/src/linux-2.6/arch/x86/kernel/cpu/perf_event.c:1180:11: error: ‘struct hw_perf_event’ has no member named ‘flags’
> make[4]: *** [arch/x86/kernel/cpu/perf_event.o] Error 1
>
>
> Please resend one that builds.
>
> ---
> Subject: perf,x86: Drop event->flags and use hw.constraint->flags
> From: Stephane Eranian <[email protected]>
> Date: Fri, 21 Jun 2013 16:20:42 +0200
>
> Now that we use the constraints directly from the event, we
> do not need the event->flags field so drop it.
>
> Signed-off-by: Stephane Eranian <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 19 ++++++++++---------
> include/linux/perf_event.h | 1 -
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -623,7 +623,6 @@ struct event_constraint *intel_pebs_cons
> if (x86_pmu.pebs_constraints) {
> for_each_event_constraint(c, x86_pmu.pebs_constraints) {
> if ((event->hw.config & c->cmask) == c->code) {
> - event->hw.flags |= c->flags;
> return c;
> }
> }
> @@ -636,14 +635,15 @@ void intel_pmu_pebs_enable(struct perf_e
> {
> struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> struct hw_perf_event *hwc = &event->hw;
> + int flags = event->hw.constraint->flags;
>
> hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>
> cpuc->pebs_enabled |= 1ULL << hwc->idx;
>
> - if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> + if (flags & PERF_X86_EVENT_PEBS_LDLAT)
> cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
> - else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> + else if (flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled |= 1ULL << 63;
> }
>
> @@ -651,12 +651,13 @@ void intel_pmu_pebs_disable(struct perf_
> {
> struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> struct hw_perf_event *hwc = &event->hw;
> + int flags = event->hw.constraint->flags;
>
> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
>
> - if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
> + if (flags & PERF_X86_EVENT_PEBS_LDLAT)
> cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
> - else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
> + else if (flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled &= ~(1ULL << 63);
>
> if (cpuc->enabled)
> @@ -772,14 +773,14 @@ static void __intel_pmu_pebs_event(struc
> struct perf_sample_data data;
> struct pt_regs regs;
> u64 sample_type;
> + int flags = event->hw.constraint->flags;
> int fll, fst;
>
> if (!intel_pmu_save_and_restart(event))
> return;
>
> - fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
> - fst = event->hw.flags & (PERF_X86_EVENT_PEBS_ST |
> - PERF_X86_EVENT_PEBS_ST_HSW);
> + fll = flags & PERF_X86_EVENT_PEBS_LDLAT;
> + fst = flags & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_ST_HSW);
>
> perf_sample_data_init(&data, 0, event->hw.last_period);
>
> @@ -802,7 +803,7 @@ static void __intel_pmu_pebs_event(struc
> if (sample_type & PERF_SAMPLE_DATA_SRC) {
> if (fll)
> data.data_src.val = load_latency_data(pebs->dse);
> - else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST_HSW)
> + else if (flags & PERF_X86_EVENT_PEBS_ST_HSW)
> data.data_src.val =
> precise_store_data_hsw(pebs->dse);
> else
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -134,7 +134,6 @@ struct hw_perf_event {
> int event_base_rdpmc;
> int idx;
> int last_cpu;
> - int flags;
>
> struct hw_perf_event_extra extra_reg;
> struct hw_perf_event_extra branch_reg;
>
On Thu, Jun 27, 2013 at 02:33:58PM +0200, Stephane Eranian wrote:
> Peter,
>
> Let's drop this patch.
>
> It does not work anymore with the offcore-response
> scheduling fix. The constraint->flags are shared, so cannot
> use the flags there to store X86_COMMITTED. So we have
> to keep a copy in the event->hw.flags.
>
> Sorry about the confusion.
n/p I've already forgotten all about it ;-)
On Wed, Jun 26, 2013 at 09:10:50PM +0200, Stephane Eranian wrote:
> After more investigation with the author of the false sharing
> detection tool, I think
> that if the mapping changes, it is okay. The tool can detect this and
> drop the analysis
> at that address. So as long as we can flag the mapping change, we are
> okay. Hopefully,
> it does not occur frequently. If so, then I think there are bigger
> issues to fix on the system
> than false sharing.
But if you index everything using dev:inode:offset it doesn't matter if the
mapping changes; you're invariant to map placement.
And the thing is, we already compute most of that anyway in order to find the
code in DSOs, except there we use the filename instead of dev:inode. However,
if there were means to open files using dev:inode, that might actually be more
reliable than using the filename.
Peter,
On Fri, Jun 28, 2013 at 11:58 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 26, 2013 at 09:10:50PM +0200, Stephane Eranian wrote:
>> After more investigation with the author of the false sharing
>> detection tool, I think
>> that if the mapping changes, it is okay. The tool can detect this and
>> drop the analysis
>> at that address. So as long as we can flag the mapping change, we are
>> okay. Hopefully,
>> it does not occur frequently. If so, then I think there are bigger
>> issues to fix on the system
>> than false sharing.
>
> But if you index everything using dev:inode:offset it doesn't matter if the
> mapping changes; you're invariant to map placement.
>
> And the thing is; we already compute most of that anyway in order to find the
> code in DSOs, except there we use filename instead of dev:inode. However if
> there were means to open files using dev:inode that might actually be more
> reliable than using the filename.
So, I tried on an example using shmat(). I map the same shared segment twice
in the same process. Then I fork(): I see this in /proc/PID/maps:
7f80fce28000-7f80fce29000 rw-s 00000000 00:04 1376262    /SYSV00000000 (deleted)
7f80fce29000-7f80fce2a000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
7f80fce2a000-7f80fce2b000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
The segment at 1343491 is the one mapped twice. So that number (shmid)
can be used to identify identical mappings. It appears the same way in both
processes. The other 1376262 mapping is just to verify that each segment
gets a different number.
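For reference, the test was essentially this (a minimal sketch, error
handling omitted):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <unistd.h>

    int main(void)
    {
            /* two segments; the second one is attached twice */
            int id1 = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
            int id2 = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
            void *a = shmat(id1, NULL, 0);
            void *b = shmat(id2, NULL, 0);
            void *c = shmat(id2, NULL, 0); /* same segment, 2nd mapping */

            /* mark for destruction: shows up as "(deleted)" in maps */
            shmctl(id1, IPC_RMID, NULL);
            shmctl(id2, IPC_RMID, NULL);

            if (fork() == 0)
                    pause(); /* child inherits all three mappings */

            printf("%p %p %p\n", a, b, c);
            pause(); /* now inspect /proc/PID/maps */
            return 0;
    }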
So it looks possible to use this approach across processes to identify
identical physical mappings. However, this is not very practical.
The first reason is that perf_event does not capture shmat() mappings
in MMAP records.
The second is that if you rely on /proc/PID/maps, the tool has to
constantly poll that file for new shared mappings. This is not how perf
works today, not even in system-wide mode: /proc/PID/maps is swept only
once, when perf record -a is started.
Ingo is proposing an ioctl() to flush the mappings. But then, when is a
good time for the tool to do this?
So my approach with PERF_SAMPLE_PHYS_ADDR looks easier on the tools,
which, if I recall correctly, is the philosophy behind perf_events.
Any more comments?
On Sat, Jul 06, 2013 at 12:48:48AM +0200, Stephane Eranian wrote:
> So, I tried on an example using shmat(). I map the same shared segment twice
> in the same process. Then I fork(): I see this in /proc/PID/maps:
>
> 7f80fce28000-7f80fce29000 rw-s 00000000 00:04 1376262    /SYSV00000000 (deleted)
> 7f80fce29000-7f80fce2a000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
> 7f80fce2a000-7f80fce2b000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
>
> The segment at 1343491 is the one mapped twice. So that number (shmid)
> can be used to identify identical mappings. It appears the same way in both
> processes. The other 1376262 mapping is just to verify that each segment
> gets a different number.
Right, that's the inode number; I think you also need to add the
blockdev id (00:04) in this case as inode numbers are per device, not
global.
> So it looks possible to use this approach across processes to identify identical
> physical mappings. However, this is not very practical.
>
> The first reason is that perf_event does not capture shmat() mappings in MMAP
> records.
oops, that would be something we'd definitely need to fix.
ipc/shm.c:SYSCALL_DEFINE3(shmat)
do_shmat()
do_mmap_pgoff()
mmap_region()
perf_event_mmap()
So why isn't it logging them? If it's a non-exec map we need
attr::mmap_data, but I suppose you have that enabled?
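(For reference, that just means setting the mmap_data bit in the attr
when opening the event; a minimal sketch:)

    #include <string.h>
    #include <linux/perf_event.h>

    /* sketch: attr setup so data mmaps (e.g. shmat) get reported */
    static void setup_attr(struct perf_event_attr *attr)
    {
            memset(attr, 0, sizeof(*attr));
            attr->size          = sizeof(*attr);
            attr->type          = PERF_TYPE_HARDWARE;
            attr->config        = PERF_COUNT_HW_CPU_CYCLES;
            attr->sample_period = 100000;
            attr->mmap          = 1; /* MMAP records for exec mappings */
            attr->mmap_data     = 1; /* ... and for non-exec (data) ones */
    }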
> The second is that if you rely on /proc/PID/maps, you will have to
> have the tool
> constantly poll that file for new shared mappings. This is not how
> perf works today,
> not even in system-wide mode. /proc/PID/maps is swept only once when perf
> record -a is started.
Ahh. We don't put the useful bits in the mmap event; we'll need to fix
that too then ;-)
Doing so is going to be a bit of a bother since we use the tail of
PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
This would mean doing something like PERF_RECORD_MMAP2 and some means
for userspace to request the new events instead of the old one.
Didn't you already have patches to change the event layout? Can this
piggyback on that?
> Ingo is proposing an ioctl() to flush the mappings. But then, when is a
> good time to do this
> from the tool?
Yeah, that's not going to help with this; that's only to get rid of the
initial /proc/$pid/maps reading. Not to keep up with changes.
> So my approach with PERF_SAMPLE_PHYS_ADDR looks easier on the tools which
> if I recall is the philosophy behind perf_events.
Physical addresses are always going to cause problems, don't ever use
them.
On Mon, Jul 8, 2013 at 10:19 AM, Peter Zijlstra <[email protected]> wrote:
> On Sat, Jul 06, 2013 at 12:48:48AM +0200, Stephane Eranian wrote:
>> So, I tried on an example using shmat(). I map the same shared segment twice
>> in the same process. Then I fork(): I see this in /proc/PID/maps:
>>
>> 7f80fce28000-7f80fce29000 rw-s 00000000 00:04 1376262    /SYSV00000000 (deleted)
>> 7f80fce29000-7f80fce2a000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
>> 7f80fce2a000-7f80fce2b000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
>>
>> The segment at 1343491 is the one mapped twice. So that number (shmid)
>> can be used to identify identical mappings. It appears the same way in both
>> processes. The other 1376262 mapping is just to verify that each segment
>> gets a different number.
>
> Right, that's the inode number; I think you also need to add the
> blockdev id (00:04) in this case as inode numbers are per device, not
> global.
>
Well, in the case of an SHM segment, this is the shmid:
7fb92d4c8000-7fb92d4c9000 rw-s 00000000 00:04 294913    /SYSV00000000 (deleted)
7fb92d4c9000-7fb92d4ca000 rw-s 00000000 00:04 262144    /SYSV00000000 (deleted)
7fb92d4ca000-7fb92d4cb000 rw-s 00000000 00:04 262144    /SYSV00000000 (deleted)
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 262144     eranian    600        256        6          dest
0x00000000 294913     eranian    600        256        3          dest
>> So it looks possible to use this approach across processes to identify identical
>> physical mappings. However, this is not very practical.
>>
>> The first reason is that perf_event does not capture shmat() mappings in MMAP
>> records.
>
> oops, that would be something we'd definitely need to fix.
>
> ipc/shm.c:SYSCALL_DEFINE3(shmat)
> do_shmat()
> do_mmap_pgoff()
> mmap_region()
> perf_event_mmap()
>
> So why isn't it logging them? If it's a non-exec map we need
> attr::mmap_data, but I suppose you have that enabled?
>
My bad, I was not using load/store sampling mode, just regular sampling,
so mmap_data was not set and data mappings were not captured.
I see the SHM segment with perf mem:
# Overhead  Samples  Local Weight  Memory access  Symbol    Data Symbol             Data Object
# ........  .......  ............  .............  ........  ......................  ......................
#
  45.70%       4572             6  L1 hit         [.] spin  [.] 0x00007f8d6252b080  SYSV00000000 (deleted)
  45.49%       4551             6  L1 hit         [.] spin  [.] 0x00007f8d6252c000  SYSV00000000 (deleted)
So I think what we need in the MMAP record in this case is the shmid to uniquely
identify segments.
>> The second is that if you rely on /proc/PID/maps, you will have to
>> have the tool
>> constantly poll that file for new shared mappings. This is not how
>> perf works today,
>> not even in system-wide mode. /proc/PID/maps is swept only once when perf
>> record -a is started.
>
> Ahh. We don't put the useful bits in the mmap event; we'll need to fix
> that too then ;-)
>
Yes, I think with the shmid (or inode number) we could have this fixed.
> Doing so is going to be a bit of a bother since we use the tail of
> PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
>
> This would mean doing something like PERF_RECORD_MMAP2 and some means
> for userspace to request the new events instead of the old one.
>
Unfortunately, yes, because the length of the filename string is
calculated by:

    f_len = ehdr->size - sizeof(struct mmap_event) - sizeof(*ehdr) -
            sizeof(sample_id_all);

Given that sample_id_all is already optional, we are screwed. It would
have been much more flexible to have the string length preceding the
string itself, as is done in the perf.data file. But it's too late now.
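Something like this, with the length up front, would have left room for
extensions (an illustrative layout only, not an actual proposal):

    #include <stdint.h>

    /* with an explicit length, fields could follow the filename */
    struct mmap_event_v2 {
            uint32_t pid, tid;
            uint64_t start, len, pgoff;
            uint16_t filename_len; /* explicit, as in perf.data headers */
            char     filename[];   /* filename_len bytes, extensions after */
    };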
> Didn't you already have patches to change the event layout? Can this
> piggyback on that?
>
I think the proposal from Hunter with PERF_SAMPLE_IDENTIFIER is simpler
than my solution, which was to commandeer a bit in the ehdr->misc
bitmask to indicate that there is a record extension at the end of the
sample and that it contains the eventID.
>> Ingo is proposing an ioctl() to flush the mappings. But then, when is a
>> good time to do this
>> from the tool?
>
> Yeah, that's not going to help with this; that's only to get rid of the
> initial /proc/$pid/maps reading. Not to keep up with changes.
>
>> So my approach with PERF_SAMPLE_PHYS_ADDR looks easier on the tools which
>> if I recall is the philosophy behind perf_events.
>
> Physical addresses are always going to cause problems, don't ever use
> them.
>
Okay, all we need is a way to detect that virtual addresses correspond
to the same actual data. Knowing that the virtual-to-physical mapping
has changed or is different is good enough. We don't need the actual
physical address. So yes, any kind of flag or unique value would do.
On Mon, Jul 8, 2013 at 10:19 AM, Peter Zijlstra <[email protected]> wrote:
> On Sat, Jul 06, 2013 at 12:48:48AM +0200, Stephane Eranian wrote:
>> So, I tried on an example using shmat(). I map the same shared segment twice
>> in the same process. Then I fork(): I see this in /proc/PID/maps:
>>
>> 7f80fce28000-7f80fce29000 rw-s 00000000 00:04 1376262    /SYSV00000000 (deleted)
>> 7f80fce29000-7f80fce2a000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
>> 7f80fce2a000-7f80fce2b000 rw-s 00000000 00:04 1343491    /SYSV00000000 (deleted)
>>
>> The segment at 1343491 is the one mapped twice. So that number (shmid)
>> can be used to identify identical mappings. It appears the same way in both
>> processes. The other 1376262 mapping is just to verify that each segment
>> gets a different number.
>
> Right, that's the inode number; I think you also need to add the
> blockdev id (00:04) in this case as inode numbers are per device, not
> global.
>
>> So it looks possible to use this approach across processes to identify identical
>> physical mappings. However, this is not very practical.
>>
>> The first reason is that perf_event does not capture shmat() mappings in MMAP
>> records.
>
> oops, that would be something we'd definitely need to fix.
>
> ipc/shm.c:SYSCALL_DEFINE3(shmat)
> do_shmat()
> do_mmap_pgoff()
> mmap_region()
> perf_event_mmap()
>
> So why isn't it logging them? If it's a non-exec map we need
> attr::mmap_data, but I suppose you have that enabled?
>
>> The second is that if you rely on /proc/PID/maps, you will have to
>> have the tool
>> constantly poll that file for new shared mappings. This is not how
>> perf works today,
>> not even in system-wide mode. /proc/PID/maps is swept only once when perf
>> record -a is started.
>
> Ahh. We don't put the useful bits in the mmap event; we'll need to fix
> that too then ;-)
>
> Doing so is going to be a bit of a bother since we use the tail of
> PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
>
> This would mean doing something like PERF_RECORD_MMAP2 and some means
> for userspace to request the new events instead of the old one.
>
Tracking mmaps even for shmat() won't cover the paging cases. When a
page is paged back in, it most likely gets a different physical page.
How would we track that case with the same approach?
On Tue, Jul 30, 2013 at 10:02:01AM +0200, Stephane Eranian wrote:
> > Ahh. We don't put the useful bits in the mmap event; we'll need to fix
> > that too then ;-)
> >
> > Doing so is going to be a bit of a bother since we use the tail of
> > PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
> >
> > This would mean doing something like PERF_RECORD_MMAP2 and some means
> > for userspace to request the new events instead of the old one.
> >
> Tracking mmaps even for shmat() won't cover the paging cases. When you page a
> page back in, it most likely gets a different physical page. How would
> we track that
> case too using the same approach?
It doesn't matter. Even if a page ends up being a different physical
page, it will always be the same sb:inode:pgoffset. You should be able
to always uniquely identify a (shared) page by that triplet.
So if we create a new MMAP record that includes the device (substitute
for the superblock) and inode information, we should be good.
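Roughly, such a record could carry (a sketch only; the field names are
illustrative, not a committed layout):

    #include <stdint.h>
    #include <linux/perf_event.h> /* struct perf_event_header */

    /* MMAP2-style record carrying the identifying triplet */
    struct mmap2_event {
            struct perf_event_header header;
            uint32_t pid, tid;
            uint64_t start, len;
            uint64_t pgoff;    /* locates the page within the file */
            uint32_t maj, min; /* device id, standing in for the sb */
            uint64_t ino;      /* inode number */
            char     filename[];
    };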
On Tue, Jul 30, 2013 at 10:37 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 10:02:01AM +0200, Stephane Eranian wrote:
>> > Ahh. We don't put the useful bits in the mmap event; we'll need to fix
>> > that too then ;-)
>> >
>> > Doing so is going to be a bit of a bother since we use the tail of
>> > PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
>> >
>> > This would mean doing something like PERF_RECORD_MMAP2 and some means
>> > for userspace to request the new events instead of the old one.
>> >
>> Tracking mmaps even for shmat() won't cover the paging cases. When you page a
>> page back in, it most likely gets a different physical page. How would
>> we track that
>> case too using the same approach?
>
> It doesn't matter. Even if a page ends up being a different physical
> page, it will always be the same sb:inode:pgoffset. You should be able
> to always uniquely identify a (shared) page by that triplet.
>
Ok, so you're saying that the triplet uniquely identifies a virtual page
regardless of the physical page it is mapped onto. If the physical page
changes because of paging, we keep the same triplet and therefore we can
still detect the false sharing.
> So if we create a new MMAP record that includes the device (substitute
> for the superblock) and inode information, we should be good.
I will try that. I am not familiar with mm, so where do we find the
device? Inside the vma?
On Tue, Jul 30, 2013 at 10:51:46AM +0200, Stephane Eranian wrote:
> On Tue, Jul 30, 2013 at 10:37 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Jul 30, 2013 at 10:02:01AM +0200, Stephane Eranian wrote:
> >> > Ahh. We don't put the useful bits in the mmap event; we'll need to fix
> >> > that too then ;-)
> >> >
> >> > Doing so is going to be a bit of a bother since we use the tail of
> >> > PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
> >> >
> >> > This would mean doing something like PERF_RECORD_MMAP2 and some means
> >> > for userspace to request the new events instead of the old one.
> >> >
> >> Tracking mmaps even for shmat() won't cover the paging cases. When you page a
> >> page back in, it most likely gets a different physical page. How would
> >> we track that
> >> case too using the same approach?
> >
> > It doesn't matter. Even if a page ends up being a different physical
> > page, it will always be the same sb:inode:pgoffset. You should be able
> > to always uniquely identify a (shared) page by that triplet.
> >
> Ok, so you're saying that triplet uniquely identifies a virtual page
> regardless of
> the physical page it is mapped onto. If the physical page changes because
> of paging, we keep the same triplet and therefore we can still detect the false
> sharing.
Exactly.
> > So if we create a new MMAP record that includes the device (substitute
> > for the superblock) and inode information, we should be good.
>
> I will try that. I am not familiar with mm, so where do we find the
> device? Inside
> the vma?
Take a peek at fs/proc/task_mmu.c:show_map_vma(); it's the code used to
print /proc/$PID/maps, and it displays all the stuff we want.
On Tue, Jul 30, 2013 at 11:02 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 10:51:46AM +0200, Stephane Eranian wrote:
>> On Tue, Jul 30, 2013 at 10:37 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Jul 30, 2013 at 10:02:01AM +0200, Stephane Eranian wrote:
>> >> > Ahh. We don't put the useful bits in the mmap event; we'll need to fix
>> >> > that too then ;-)
>> >> >
>> >> > Doing so is going to be a bit of a bother since we use the tail of
>> >> > PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
>> >> >
>> >> > This would mean doing something like PERF_RECORD_MMAP2 and some means
>> >> > for userspace to request the new events instead of the old one.
>> >> >
>> >> Tracking mmaps even for shmat() won't cover the paging cases. When you page a
>> >> page back in, it most likely gets a different physical page. How would
>> >> we track that
>> >> case too using the same approach?
>> >
>> > It doesn't matter. Even if a page ends up being a different physical
>> > page, it will always be the same sb:inode:pgoffset. You should be able
>> > to always uniquely identify a (shared) page by that triplet.
>> >
>> Ok, so you're saying that triplet uniquely identifies a virtual page
>> regardless of
>> the physical page it is mapped onto. If the physical page changes because
>> of paging, we keep the same triplet and therefore we can still detect the false
>> sharing.
>
> Exactly.
>
I see this for my program:
7f0a59cbe000-7f0a59cc1000 rw-p 00000000 00:00 0
7f0a59cd3000-7f0a59cd4000 rw-p 00000000 00:00 0
7f0a59cd4000-7f0a59cd5000 rw-s 00000000 00:04 458753    /SYSV00000000 (deleted)
7f0a59cd5000-7f0a59cd6000 rw-s 00000000 00:04 425984    /SYSV00000000 (deleted)
7f0a59cd6000-7f0a59cd7000 rw-s 00000000 00:04 425984    /SYSV00000000 (deleted)
The first 2 lines are heap; there is nothing useful coming out of
maj:min:ino for those. However, for shared segments we can use the ino
number. Shared memory segments appear as files in the vma; therefore,
the kernel does fill in the ino, maj, min numbers. And in my program I
map the same segment twice, and we see that the last two mappings are
identical.
In the case of regular paging, there is no useful info there. But then I
suspect that for a private heap page we only care about the
multi-threaded case, and there the physical page is irrelevant.
So it seems all we care about is covering the shared segment case, and
we can get the info from the vma and create an MMAP2 record for it.
Do we agree?
>> > So if we create a new MMAP record that includes the device (substitute
>> > for the superblock) and inode information, we should be good.
>>
>> I will try that. I am not familiar with mm, so where do we find the
>> device? Inside
>> the vma?
>
> Take a peek at fs/proc/task_mmu.c:show_map_vma(); it's the code used to
> print /proc/$PID/maps and displays all stuff we want.
That is what I see in that function:
if (file) {
        struct inode *inode = file_inode(vma->vm_file);

        dev = inode->i_sb->s_dev;   /* device id of the superblock */
        ino = inode->i_ino;         /* inode number */
        pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
}
It works for anything associated with a file.
Peter,
One thing that bothers me with the MMAP2 approach is that
it forces integration into perf. Now, you will need to analyze
the MMAP2 records. With my sample_type approach, you
simply needed a cmdline option on perf record, and then
you could dump the samples using perf report -D and feed
them into a post-processing script. But now, the analysis
needs to be integrated into perf or the tool needs to parse
the full perf.data file.
On Tue, Jul 30, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 11:02 AM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Jul 30, 2013 at 10:51:46AM +0200, Stephane Eranian wrote:
>>> On Tue, Jul 30, 2013 at 10:37 AM, Peter Zijlstra <[email protected]> wrote:
>>> > On Tue, Jul 30, 2013 at 10:02:01AM +0200, Stephane Eranian wrote:
>>> >> > Ahh. We don't put the useful bits in the mmap event; we'll need to fix
>>> >> > that too then ;-)
>>> >> >
>>> >> > Doing so is going to be a bit of a bother since we use the tail of
>>> >> > PERF_RECORD_MMAP for filenames and thus aren't particularly extensible.
>>> >> >
>>> >> > This would mean doing something like PERF_RECORD_MMAP2 and some means
>>> >> > for userspace to request the new events instead of the old one.
>>> >> >
>>> >> Tracking mmaps even for shmat() won't cover the paging cases. When you page a
>>> >> page back in, it most likely gets a different physical page. How would
>>> >> we track that
>>> >> case too using the same approach?
>>> >
>>> > It doesn't matter. Even if a page ends up being a different physical
>>> > page, it will always be the same sb:inode:pgoffset. You should be able
>>> > to always uniquely identify a (shared) page by that triplet.
>>> >
>>> Ok, so you're saying that triplet uniquely identifies a virtual page
>>> regardless of
>>> the physical page it is mapped onto. If the physical page changes because
>>> of paging, we keep the same triplet and therefore we can still detect the false
>>> sharing.
>>
>> Exactly.
>>
> I see this for my program:
>
> 7f0a59cbe000-7f0a59cc1000 rw-p 00000000 00:00 0
> 7f0a59cd3000-7f0a59cd4000 rw-p 00000000 00:00 0
> 7f0a59cd4000-7f0a59cd5000 rw-s 00000000 00:04 458753    /SYSV00000000 (deleted)
> 7f0a59cd5000-7f0a59cd6000 rw-s 00000000 00:04 425984    /SYSV00000000 (deleted)
> 7f0a59cd6000-7f0a59cd7000 rw-s 00000000 00:04 425984    /SYSV00000000 (deleted)
>
> The first 2 lines are heap. There is nothing useful coming out of maj:min:ino.
> However, for shared segments we can use the ino number. Shared memory segments
> appear as files in the vma; therefore, the kernel does fill in the ino, maj,
> min numbers.
> And in my program I map the same segment twice, and we see the last two mappings
> are identical.
>
> But in the case of regular paging, there is no useful info there. But
> then I suspect for a private
> heap page we only care about multi-threaded and there the physical
> page is irrelevant.
> So it seems all we care about is to cover the shared segment case and
> we can get the
> info from the vma and create an MMAP2 record for it.
>
> Do we agree?
>
>
>>> > So if we create a new MMAP record that includes the device (substitute
>>> > for the superblock) and inode information, we should be good.
>>>
>>> I will try that. I am not familiar with mm, so where do we find the
>>> device? Inside
>>> the vma?
>>
>> Take a peek at fs/proc/task_mmu.c:show_map_vma(); it's the code used to
>> print /proc/$PID/maps and displays all stuff we want.
>
> That is what I see in that function:
>
> if (file) {
> struct inode *inode = file_inode(vma->vm_file);
> dev = inode->i_sb->s_dev;
> ino = inode->i_ino;
> pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
> }
>
> It works for anything associated with a file.
On 7/30/13 8:21 AM, Stephane Eranian wrote:
> One thing that bothers me with the MMAP2 approach is that
> it forces integration into perf. Now, you will need to analyze
> the MMAP2 records. With my sample_type approach, you
> simply needed a cmdline option on perf record, and then
> you could dump the samples using perf report -D and feed
what do you want that is not coming out via perf-script?
> them into a post-processing script. But now, the analysis
> needs to be integrated into perf or the tool needs to parse
> the full perf.data file.
you mean parsed during perf-record?
David
On Tue, Jul 30, 2013 at 4:50 PM, David Ahern <[email protected]> wrote:
> On 7/30/13 8:21 AM, Stephane Eranian wrote:
>>
>> One thing that bothers me with the MMAP2 approach is that
>> it forces integration into perf. Now, you will need to analyze
>> the MMAP2 records. With my sample_type approach, you
>> simply needed a cmdline option on perf record, and then
>> you could dump the samples using perf report -D and feed
>
>
> what do you want that is not coming out via perf-script?
>
mmap records. Are they passed to scripts?
>
>> them into a post-processing script. But now, the analysis
>> needs to be integrated into perf or the tool needs to parse
>> the full perf.data file.
>
>
> you mean parsed during perf-record?
>
No, I mean for a standalone tool on top of perf record.
With my approach, you can simply write a wrapper script
around perf report and do the analysis.
On 7/30/13 8:53 AM, Stephane Eranian wrote:
> On Tue, Jul 30, 2013 at 4:50 PM, David Ahern <[email protected]> wrote:
>> On 7/30/13 8:21 AM, Stephane Eranian wrote:
>>>
>>> One thing that bothers me with the MMAP2 approach is that
>>> it forces integration into perf. Now, you will need to analyze
>>> the MMAP2 records. With my sample_type approach, you
>>> simply needed a cmdline option on perf record, and then
>>> you could dump the samples using perf report -D and feed
>>
>>
>> what do you want that is not coming out via perf-script?
>>
> mmap records. Are they passed to scripts?
I don't do much with external scripts, but I do not believe they are
passed. I use 'perf script -f <opts>' a lot to see the individual
samples (e.g., tracepoints and software events).
It could easily be adapted to dump mmap events, but that does not sound
like it improves your workflow.
David
On Tue, Jul 30, 2013 at 04:21:41PM +0200, Stephane Eranian wrote:
> Peter,
>
> One thing that bothers me with the MMAP2 approach is that
> it forces integration into perf.
This is a good (TM) thing, yes? ;-)
> Now, you will need to analyze
> the MMAP2 records. With my sample_type approach, you
> simply needed a cmdline option on perf record, and then
> you could dump the samples using perf report -D and feed
> them into a post-processing script. But now, the analysis
> needs to be integrated into perf or the tool needs to parse
> the full perf.data file.
So the disadvantage of the sample_type approach is that it generates
more data and bloats the fast path.
If it's useful it shouldn't live in a script anyway ;-) Also, if the
script muck can't deal with the side-band information, it's a worse
broken piece of crap than I thought it was.
On Tue, Jul 30, 2013 at 5:52 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 04:21:41PM +0200, Stephane Eranian wrote:
>> Peter,
>>
>> One thing that bothers me with the MMAP2 approach is that
>> it forces integration into perf.
>
> This is a good (TM) thing, yes? ;-)
>
Well, that's one way to look at it. But I'd like the API to
be able to support more than one tool easily. If the barrier
for any advanced analysis is too high from the raw kernel
API, then we will only have one tool. I don't think that's sane.
>> Now, you will need to analyze
>> the MMAP2 records. With my sample_type approach, you
>> simply needed a cmdline option on perf record, and then
>> you could dump the samples using perf report -D and feed
>> them into a post-processing script. But now, the analysis
>> needs to be integrated into perf or the tool needs to parse
>> the full perf.data file.
>
> So the disadvantage of the sample_type approach is that it generates
> more data and bloats the fast path.
>
> If it's useful it shouldn't live in a script anyway ;-) Also, if the
> script muck can't deal with the side-band information, it's a worse
> broken piece of crap than I thought it was.
I never use the scripting support. Don't know its current shape.
On Tue, Jul 30, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
> On Tue, Jul 30, 2013 at 5:52 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Jul 30, 2013 at 04:21:41PM +0200, Stephane Eranian wrote:
> >> Peter,
> >>
> >> One thing that bothers me with the MMAP2 approach is that
> >> it forces integration into perf.
> >
> > This is a good (TM) thing, yes? ;-)
> >
>
> Well, that's one way to look at it. But I'd like the API to
> be able to support more than one tool easily. If the barrier
> for any advanced analysis is too high from the raw kernel
> API, then we will only have one tool. I don't think that's sane.
Sure, but any half-way usable profiling will need the existing sideband
data. So I don't think that relying on it for this particular feature is
too much.