2022-10-06 16:08:28

by Ravi Bangoria

Subject: [PATCH v4 0/8] perf mem/c2c: Add support for AMD (tools changes)

The kernel-side changes are already present in tip/perf/core, except for
one patch that renames PERF_MEM_LVLNUM_EXTN_MEM to PERF_MEM_LVLNUM_CXL[1].

Original description:

Perf mem and c2c tools are wrappers around perf record with mem
load/store events. An IBS-tagged load/store sample provides most of the
information these tools need. Enable support for these tools on AMD Zen
processors, based on the IBS Op PMU.

There are some limitations, though. Only load/store micro-ops provide
mem/c2c information, but IBS has no way to select a particular type of
micro-op to tag. As a result, many non-LS micro-ops are tagged, and they
appear as N/A in the perf report. Also, IBS is an uncore PMU from the
kernel's point of view[2], so it does not support per-process
monitoring. Thus, perf mem/c2c on AMD is currently supported in per-cpu
mode only.

Example:
$ sudo ./perf mem record -- -c 10000
^C[ perf record: Woken up 227 times to write data ]
[ perf record: Captured and wrote 58.760 MB perf.data (836978 samples) ]

$ sudo ./perf mem report -F mem,sample,snoop
Samples: 836K of event 'ibs_op//', Event count (approx.): 8418762
Memory access              Samples  Snoop
N/A                         700620  N/A
L1 hit                      126675  N/A
L2 hit                         424  N/A
L3 hit                         664  HitM
L3 hit                          10  N/A
Local RAM hit                    2  N/A
Remote RAM (1 hop) hit        8558  N/A
Remote Cache (1 hop) hit         3  N/A
Remote Cache (1 hop) hit         2  HitM
Remote Cache (2 hops) hit       10  HitM
Remote Cache (2 hops) hit        6  N/A
Uncached hit                     4  N/A

Prepared on top of acme/perf/core (3b1913adb188)

v3: https://lore.kernel.org/lkml/[email protected]
v3->v4:
- Rename PERF_MEM_LVLNUM_EXTN_MEM to PERF_MEM_LVLNUM_CXL for tools part.

[1]: https://lore.kernel.org/lkml/[email protected]
[2]: https://lore.kernel.org/lkml/[email protected]


Ravi Bangoria (8):
perf tool: Sync include/uapi/linux/perf_event.h header
perf tool: Sync arch/x86/include/asm/amd-ibs.h header
perf mem: Add support for printing PERF_MEM_LVLNUM_{CXL|IO}
perf mem/c2c: Set PERF_SAMPLE_WEIGHT for LOAD_STORE events
perf mem/c2c: Add load store event mappings for AMD
perf mem/c2c: Avoid printing empty lines for unsupported events
perf mem: Print "LFB/MAB" for PERF_MEM_LVLNUM_LFB
perf script: Add missing fields in usage hint

tools/arch/x86/include/asm/amd-ibs.h | 16 ++++++++++++
tools/include/uapi/linux/perf_event.h | 4 ++-
tools/perf/Documentation/perf-c2c.txt | 14 ++++++++---
tools/perf/Documentation/perf-mem.txt | 3 ++-
tools/perf/Documentation/perf-record.txt | 1 +
tools/perf/arch/x86/util/mem-events.c | 31 ++++++++++++++++++++++--
tools/perf/builtin-c2c.c | 1 +
tools/perf/builtin-mem.c | 1 +
tools/perf/builtin-script.c | 7 +++---
tools/perf/util/mem-events.c | 17 +++++++------
10 files changed, 77 insertions(+), 18 deletions(-)

--
2.37.3


2022-10-06 16:09:39

by Ravi Bangoria

Subject: [PATCH v4 5/8] perf mem/c2c: Add load store event mappings for AMD

Perf mem and c2c tools are wrappers around perf record with mem
load/store events. An IBS-tagged load/store sample provides most of the
information these tools need. Wire in the ibs_op// event as the mem-ldst
event for AMD.

There are some limitations, though. Only load/store micro-ops provide
mem/c2c information, but IBS has no way to select a particular type of
micro-op to tag. As a result, many non-LS micro-ops are tagged, and they
appear as N/A in the perf report. Also, IBS is an uncore PMU from the
kernel's point of view[1], so it does not support per-process
monitoring. Thus, perf mem/c2c on AMD is currently supported in per-cpu
mode only.

Example:
$ sudo ./perf mem record -- -c 10000
^C[ perf record: Woken up 227 times to write data ]
[ perf record: Captured and wrote 58.760 MB perf.data (836978 samples) ]

$ sudo ./perf mem report -F mem,sample,snoop
Samples: 836K of event 'ibs_op//', Event count (approx.): 8418762
Memory access              Samples  Snoop
N/A                         700620  N/A
L1 hit                      126675  N/A
L2 hit                         424  N/A
L3 hit                         664  HitM
L3 hit                          10  N/A
Local RAM hit                    2  N/A
Remote RAM (1 hop) hit        8558  N/A
Remote Cache (1 hop) hit         3  N/A
Remote Cache (1 hop) hit         2  HitM
Remote Cache (2 hops) hit       10  HitM
Remote Cache (2 hops) hit        6  N/A
Uncached hit                     4  N/A

[1]: https://lore.kernel.org/lkml/[email protected]

Signed-off-by: Ravi Bangoria <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
---
tools/perf/Documentation/perf-c2c.txt | 14 ++++++++----
tools/perf/Documentation/perf-mem.txt | 3 ++-
tools/perf/arch/x86/util/mem-events.c | 31 +++++++++++++++++++++++++--
3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index f1f7ae6b08d1..5c5eb2def83e 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -19,9 +19,10 @@ C2C stands for Cache To Cache.
The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down the cacheline contentions.

-On x86, the tool is based on load latency and precise store facility events
+On Intel, the tool is based on load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
-with thresholding feature.
+with thresholding feature. On AMD, the tool uses IBS op pmu (due to hardware
+limitations, perf c2c is not supported on Zen3 cpus).

These events provide:
- memory address of the access
@@ -49,7 +50,8 @@ RECORD OPTIONS

-l::
--ldlat::
- Configure mem-loads latency. (x86 only)
+ Configure mem-loads latency. Supported on Intel and Arm64 processors
+ only. Ignored on other archs.

-k::
--all-kernel::
@@ -135,11 +137,15 @@ Following perf record options are configured by default:
-W,-d,--phys-data,--sample-cpu

Unless specified otherwise with '-e' option, following events are monitored by
-default on x86:
+default on Intel:

cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P

+following on AMD:
+
+ ibs_op//
+
and following on PowerPC:

cpu/mem-loads/
diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt
index 66177511c5c4..005c95580b1e 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -85,7 +85,8 @@ RECORD OPTIONS
Be more verbose (show counter open errors, etc)

--ldlat <n>::
- Specify desired latency for loads event. (x86 only)
+ Specify desired latency for loads event. Supported on Intel and Arm64
+ processors only. Ignored on other archs.

In addition, for report all perf report options are valid, and for record
all perf record options.
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 5214370ca4e4..f683ac702247 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -1,7 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
#include "util/pmu.h"
+#include "util/env.h"
#include "map_symbol.h"
#include "mem-events.h"
+#include "linux/string.h"

static char mem_loads_name[100];
static bool mem_loads_name__init;
@@ -12,18 +14,43 @@ static char mem_stores_name[100];

#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }

-static struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
+static struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX] = {
E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "%s/events/mem-loads"),
E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores"),
E(NULL, NULL, NULL),
};

+static struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
+ E(NULL, NULL, NULL),
+ E(NULL, NULL, NULL),
+ E("mem-ldst", "ibs_op//", "ibs_op"),
+};
+
+static int perf_mem_is_amd_cpu(void)
+{
+ struct perf_env env = { .total_mem = 0, };
+
+ perf_env__cpuid(&env);
+ if (env.cpuid && strstarts(env.cpuid, "AuthenticAMD"))
+ return 1;
+ return -1;
+}
+
struct perf_mem_event *perf_mem_events__ptr(int i)
{
+ /* 0: Uninitialized, 1: Yes, -1: No */
+ static int is_amd;
+
if (i >= PERF_MEM_EVENTS__MAX)
return NULL;

- return &perf_mem_events[i];
+ if (!is_amd)
+ is_amd = perf_mem_is_amd_cpu();
+
+ if (is_amd == 1)
+ return &perf_mem_events_amd[i];
+
+ return &perf_mem_events_intel[i];
}

bool is_mem_loads_aux_event(struct evsel *leader)
--
2.37.3
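
Editorial sketch (not part of the patch): the selection logic in perf_mem_events__ptr() above can be illustrated in isolation. The version below is a minimal, self-contained approximation. The names mem_event, events_intel, events_amd, fake_cpuid, probe_is_amd and mem_event_ptr are hypothetical stand-ins, and the vendor string comes from a test variable rather than from perf_env__cpuid(). It only shows the tri-state cache (0 = not probed, 1 = AMD, -1 = not AMD) that lets the vendor probe run once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for perf's internals; not the real perf API. */
enum { EV_LOAD, EV_STORE, EV_LOAD_STORE, EV_MAX };

struct mem_event {
	const char *tag;   /* NULL: event not provided for this vendor */
	const char *name;
};

static struct mem_event events_intel[EV_MAX] = {
	{ "ldlat-loads",  "cpu/mem-loads,ldlat=30/P" },
	{ "ldlat-stores", "cpu/mem-stores/P" },
	{ NULL, NULL },
};

static struct mem_event events_amd[EV_MAX] = {
	{ NULL, NULL },
	{ NULL, NULL },
	{ "mem-ldst", "ibs_op//" },
};

/* Assumption: a real tool would read this string from CPUID. */
static const char *fake_cpuid = "GenuineIntel";
static int probe_count;

/* Returns 1 for AMD and -1 otherwise; never 0, so 0 can mean "not probed". */
static int probe_is_amd(void)
{
	static const char amd[] = "AuthenticAMD";

	probe_count++;
	return strncmp(fake_cpuid, amd, strlen(amd)) == 0 ? 1 : -1;
}

/* 0: uninitialized, 1: AMD, -1: not AMD */
static int is_amd_cache;

struct mem_event *mem_event_ptr(int i)
{
	assert(i >= 0);
	if (i >= EV_MAX)
		return NULL;

	/* Probe the vendor only on the first call. */
	if (!is_amd_cache)
		is_amd_cache = probe_is_amd();

	return is_amd_cache == 1 ? &events_amd[i] : &events_intel[i];
}
```

Because both tables keep the same index layout (load, store, load-store), callers can keep indexing by event type and only need to check whether the returned entry has a tag.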

2022-10-06 16:20:33

by Ravi Bangoria

Subject: [PATCH v4 4/8] perf mem/c2c: Set PERF_SAMPLE_WEIGHT for LOAD_STORE events

Currently, perf sets the PERF_SAMPLE_WEIGHT flag only for mem load
events. Set it for the combined load-store event as well, so that load
latency is recorded by default on architectures that do not support an
independent mem load event.

Also document the missing -W option in the perf-record man page.

Signed-off-by: Ravi Bangoria <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 1 +
tools/perf/builtin-c2c.c | 1 +
tools/perf/builtin-mem.c | 1 +
3 files changed, 3 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 378f497f4be3..e41ae950fdc3 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -411,6 +411,7 @@ is enabled for all the sampling events. The sampled branch type is the same for
The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k
Note that this feature may not be available on all processors.

+-W::
--weight::
Enable weightened sampling. An additional weight is recorded per sample and can be
displayed with the weight and local_weight sort keys. This currently works for TSX
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index f35a47b2dbe4..a9190458d2d5 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -3281,6 +3281,7 @@ static int perf_c2c__record(int argc, const char **argv)
*/
if (e->tag) {
e->record = true;
+ rec_argv[i++] = "-W";
} else {
e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
e->record = true;
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 9e435fd23503..f7dd8216de72 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -122,6 +122,7 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)
(mem->operation & MEM_OPERATION_LOAD) &&
(mem->operation & MEM_OPERATION_STORE)) {
e->record = true;
+ rec_argv[i++] = "-W";
} else {
if (mem->operation & MEM_OPERATION_LOAD) {
e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
--
2.37.3
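
Editorial sketch (not part of the patch): the effect of the two one-line changes above is that "-W" is passed to perf record whenever the combined load-store event is selected, not only for the separate load event. The helper below is hypothetical. perf builds rec_argv inline in perf_c2c__record() and __cmd_record(), not through a function like build_record_args(), and the real argument list is longer. It only illustrates the condition under which "-W" is appended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: fills rec_argv with a reduced "perf record"
 * argument list and returns the number of arguments written.
 */
size_t build_record_args(const char **rec_argv, size_t cap,
			 bool use_ldst_event, bool want_load)
{
	size_t i = 0;

	assert(rec_argv != NULL && cap >= 4);

	rec_argv[i++] = "record";

	/*
	 * -W asks for a per-sample weight, which carries the load latency.
	 * Before the patch only the load event added it; with the patch the
	 * combined load-store event adds it too.
	 */
	if (use_ldst_event || want_load)
		rec_argv[i++] = "-W";

	rec_argv[i++] = "-d";	/* sample data addresses */
	rec_argv[i] = NULL;

	return i;
}
```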

2022-10-06 16:22:02

by Ravi Bangoria

Subject: [PATCH v4 6/8] perf mem/c2c: Avoid printing empty lines for unsupported events

Perf mem and c2c can be used with three different events: load, store,
and combined load-store. Some architectures support only a subset of
these events, in which case perf prints an empty line for each
unsupported event. Avoid that.

For example, AMD Zen CPUs support only the combined load-store event and
do not support the individual load and store events.

Before patch:
$ ./perf mem record -e list


mem-ldst : available

After patch:
$ ./perf mem record -e list
mem-ldst : available

Signed-off-by: Ravi Bangoria <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
---
tools/perf/util/mem-events.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 8909dc7b14a7..6c7feecd2e04 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -156,11 +156,12 @@ void perf_mem_events__list(void)
for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
struct perf_mem_event *e = perf_mem_events__ptr(j);

- fprintf(stderr, "%-13s%-*s%s\n",
- e->tag ?: "",
- verbose > 0 ? 25 : 0,
- verbose > 0 ? perf_mem_events__name(j, NULL) : "",
- e->supported ? ": available" : "");
+ fprintf(stderr, "%-*s%-*s%s",
+ e->tag ? 13 : 0,
+ e->tag ? : "",
+ e->tag && verbose > 0 ? 25 : 0,
+ e->tag && verbose > 0 ? perf_mem_events__name(j, NULL) : "",
+ e->supported ? ": available\n" : "");
}
}

--
2.37.3
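
Editorial sketch (not part of the patch): the new format string relies on "%-*s" printing nothing when the width is 0 and the string is empty, and on moving the newline into the ": available" string. The standalone function below reproduces that behaviour with snprintf() so it can be checked without perf. The struct and function names are hypothetical, and the GNU "?:" shorthand used in the patch is spelled out in portable C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical reduced version of struct perf_mem_event. */
struct mem_event {
	const char *tag;	/* NULL: event not defined on this platform */
	bool supported;
};

/*
 * Formats one line of the event list into buf, mirroring the format string
 * introduced by the patch. Returns what snprintf() returns.
 */
int format_event_line(char *buf, size_t len, const struct mem_event *e,
		      bool verbose, const char *name)
{
	assert(buf != NULL && e != NULL && name != NULL);

	bool show_name = e->tag != NULL && verbose;

	return snprintf(buf, len, "%-*s%-*s%s",
			e->tag ? 13 : 0,
			e->tag ? e->tag : "",
			show_name ? 25 : 0,
			show_name ? name : "",
			e->supported ? ": available\n" : "");
}
```

An entry with no tag and no support formats to an empty string, so nothing is printed for it. Note that, as the format string is written, an entry that has a tag but is not supported is emitted without a trailing newline.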

2022-10-06 16:27:50

by Ravi Bangoria

Subject: [PATCH v4 8/8] perf script: Add missing fields in usage hint

A few fields are missing from the usage message printed when an invalid
field option is passed. Add them to the list.

Signed-off-by: Ravi Bangoria <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
---
tools/perf/builtin-script.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 7fa467ed91dc..7ca238277d83 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3846,9 +3846,10 @@ int cmd_script(int argc, const char **argv)
"Valid types: hw,sw,trace,raw,synth. "
"Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
"addr,symoff,srcline,period,iregs,uregs,brstack,"
- "brstacksym,flags,bpf-output,brstackinsn,brstackinsnlen,brstackoff,"
- "callindent,insn,insnlen,synth,phys_addr,metric,misc,ipc,tod,"
- "data_page_size,code_page_size,ins_lat",
+ "brstacksym,flags,data_src,weight,bpf-output,brstackinsn,"
+ "brstackinsnlen,brstackoff,callindent,insn,insnlen,synth,"
+ "phys_addr,metric,misc,srccode,ipc,tod,data_page_size,"
+ "code_page_size,ins_lat",
parse_output_fields),
OPT_BOOLEAN('a', "all-cpus", &system_wide,
"system-wide collection from all CPUs"),
--
2.37.3