2021-09-09 08:00:37

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

This patch series adds support for the AMD Fam19h 16-deep branch sampling
feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
This is a model specific extension. It is not an architected AMD feature.

The Branch Sampling Feature (BRS) provides the statistical taken branch information necessary
to enable autoFDO-style optimization by compilers, i.e., basic block execution counts.

BRS operates with a 16-deep saturating buffer in MSR registers. There is no
hardware branch type filtering. All control flow changes are captured. BRS
relies on specific programming of the core PMU of Fam19h. In particular,
the following requirements must be met:
- the sampling period be greater than 16 (BRS depth)
- the sampling period must use fixed and not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
it is only activated after P event occurrences, where P is the desired sampling period.
At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
Between the overflow and the time BRS activates more branches may be executed skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by reducing the sampling
period by 16.

BRS is integrated into perf_events seamlessly via the same PERF_RECORD_BRANCH_STACK sample
format. BRS generates branch perf_branch_entry records in the sampling buffer. There is
no prediction or latency information supported. The branches are stored in reverse order of
execution. The most recent branch is the first entry in each record.

Because BRS must be stopped when a CPU goes into low power mode, the series includes patches
to add callbacks on low power entry and exit (mwait and halt).

Given that there is no privilege filterting with BRS, the kernel implements filtering on
privlege level.

This version adds a few simple modifications to perf record and report.
1. add the branch-brs event as a builtin such as it can used directly: perf record -e branch-brs ...
2. improve error handling for AMD IBS and is contributed by Kim Phillips.
3. use the better error handling to improve error handling for branch sampling
4. add two new sort dimensions to help display the branch sampling information. Because there is no
latency information associated with the branch sampling feature perf report would collapse all
samples within a function into a single histogram entry. This is expected because the default
sort mode for PERF_SAMPLE_BRANCH_STACK is symbol_from/symbol_to. This propagates to the annotation.

For more detailed view of the branch samples, the new sort dimensions addr_from,addr_to can be used
instead as follows:

$ perf report --sort=overhead,comm,dso,addr_from,addr_to
# Overhead Command Shared Object Source Address Target Address
# ........ ............... ................ .............. ..............
#
4.21% test_prg test_prg [.] test_threa+0x3c [.] test_threa+0x4
4.14% test_prg test_prg [.] test_threa+0x3e [.] test_threa+0x2
4.10% test_prg test_prg [.] test_threa+0x4 [.] test_threa+0x3a
4.07% test_prg test_prg [.] test_threa+0x2 [.] test_threa+0x3c

Versus the default output:

$ perf report
# Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
# ........ ............... .................... ................................... ................................... ..................
#
99.52% test_prg test_prg [.] test_thread [.] test_thread -

BRS can be used with any sampling event. However, it is recommended to use the RETIRED_BRANCH
event because it matches what the BRS captures. For convenience, a pseudo event matching the
branches captured by BRS is exported by the kernel (branch-brs):

$ perf record -b -e cpu/branch-brs/ -c 1000037 test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
..... 0: 0000000000401d24 -> 0000000000401d5a 0 cycles 0
..... 1: 0000000000401d5c -> 0000000000401d24 0 cycles 0
..... 2: 0000000000401d22 -> 0000000000401d5c 0 cycles 0
..... 3: 0000000000401d5e -> 0000000000401d22 0 cycles 0
..... 4: 0000000000401d20 -> 0000000000401d5e 0 cycles 0
..... 5: 0000000000401d3e -> 0000000000401d20 0 cycles 0
..... 6: 0000000000401d42 -> 0000000000401d3e 0 cycles 0
..... 7: 0000000000401d3c -> 0000000000401d42 0 cycles 0
..... 8: 0000000000401d44 -> 0000000000401d3c 0 cycles 0
..... 9: 0000000000401d3a -> 0000000000401d44 0 cycles 0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles 0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles 0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles 0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles 0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles 0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles 0
... thread: test:18230
...... dso: test

Special thanks to Kim Phillips @ AMD for the testing, reviews and contributions.

Kim Phillips (1):
perf tools: improve IBS error handling

Stephane Eranian (12):
perf/core: add union to struct perf_branch_entry
x86/cpufeatures: add AMD Fam19h Branch Sampling feature
perf/x86/amd: add AMD Fam19h Branch Sampling support
perf/x86/amd: add branch-brs helper event for Fam19h BRS
perf/x86/amd: enable branch sampling priv level filtering
perf/x86/amd: add AMD branch sampling period adjustment
perf/core: add idle hooks
perf/x86/core: add idle hooks
perf/x86/amd: add idle hooks for branch sampling
perf tools: add branch-brs as a new event
perf tools: improve error handling of AMD Branch Sampling
perf report: add addr_from/addr_to sort dimensions

arch/x86/events/amd/Makefile | 2 +-
arch/x86/events/amd/brs.c | 342 +++++++++++++++++++++++++++++
arch/x86/events/amd/core.c | 222 ++++++++++++++++++-
arch/x86/events/core.c | 22 +-
arch/x86/events/intel/lbr.c | 13 +-
arch/x86/events/perf_event.h | 106 +++++++--
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/mwait.h | 6 +-
include/linux/perf_event.h | 8 +
include/uapi/linux/perf_event.h | 19 +-
kernel/events/core.c | 58 +++++
kernel/sched/idle.c | 19 +-
tools/perf/util/evsel.c | 50 +++++
tools/perf/util/hist.c | 2 +
tools/perf/util/hist.h | 2 +
tools/perf/util/parse-events.l | 1 +
tools/perf/util/sort.c | 128 +++++++++++
tools/perf/util/sort.h | 2 +
19 files changed, 964 insertions(+), 43 deletions(-)
create mode 100644 arch/x86/events/amd/brs.c

--
2.33.0.153.gba50c8fa24-goog


2021-09-09 08:01:05

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 05/13] perf/x86/amd: enable branch sampling priv level filtering

The AMD Branch Sampling features does not provide hardware filtering
by privilege level. The associated PMU counter does but not the branch
sampling by itself. Given how BRS operates there is a possibility that
BRS captures kernel level branches even though the event is programmed to
count only at the user level. This patch implements a workaround in software
by removing the branches which belong to the wrong privilege level. The privilege
level is evaluated on the target of the branch and not the source so as to be
compatible with other architectures. As a consequence of this patch, the number
of entries in the PERF_RECORD_BRANCH_STACK buffer may be less than the maximum (16).
It could even be zero. Another consequence is that consecutive entries in the
branch stack may not reflect actual code path and may have discontinuities,
in case kernel branches were suppressed. But this is no different than what
happens on other architectures.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/amd/brs.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 86dbc6d06815..23b969001857 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -92,10 +92,6 @@ int amd_brs_setup_filter(struct perf_event *event)
if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
return -EINVAL;

- /* can only capture at all priv levels due to the way BRS works */
- if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
- return -EINVAL;
-
return 0;
}

@@ -181,6 +177,21 @@ void amd_brs_disable(void)
}
}

+static bool amd_brs_match_plm(struct perf_event *event, u64 to)
+{
+ int type = event->attr.branch_sample_type;
+ int plm_k = PERF_SAMPLE_BRANCH_KERNEL | PERF_SAMPLE_BRANCH_HV;
+ int plm_u = PERF_SAMPLE_BRANCH_USER;
+
+ if (!(type & plm_k) && kernel_ip(to))
+ return 0;
+
+ if (!(type & plm_u) && !kernel_ip(to))
+ return 0;
+
+ return 1;
+}
+
/*
* Caller must ensure amd_brs_inuse() is true before calling
* return:
@@ -237,8 +248,6 @@ void amd_brs_drain(void)
if (to == BRS_POISON)
break;

- rdmsrl(brs_from(brs_idx), from);
-
/*
* Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
* Necessary to generate proper virtual addresses suitable for
@@ -246,6 +255,11 @@ void amd_brs_drain(void)
*/
to = (u64)(((s64)to << shift) >> shift);

+ if (!amd_brs_match_plm(event, to))
+ continue;
+
+ rdmsrl(brs_from(brs_idx), from);
+
cpuc->lbr_entries[nr].from = from;
cpuc->lbr_entries[nr].to = to;

--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:06

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support

This patch adds support for the AMD Fam19h 16-deep branch sampling
feature as described in the AMD PPR Fam19h Model 01h Revision B1.
This is a model specific extension. It is not an architected AMD feature.

The Branch Sampling (BRS) operates with a 16-deep saturating buffer in
MSR registers. There is no branch type filtering. All control flow changes
are captured. BRS relies on specific programming of the core PMU of Fam19h.
In particular, the following requirements must be met:
- the sampling period be greater than 16 (BRS depth)
- the sampling period must use a fixed and not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
it is only activated after P event occurrences, where P is the desired sampling period.
At P occurrences of the event, the counter overflows, the CPU catches the interrupt,
activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
Between the overflow and the time BRS activates more branches may be executed skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by reducing the sampling
period by 16 (subsequent patch).

BRS is integrated into perf_events seamlessly via the same PERF_RECORD_BRANCH_STACK sample
format. BRS generates perf_branch_entry records in the sampling buffer. No prediction
information is supported. The branches are stored in reverse order of execution.
The most recent branch is the first entry in each record.

No modification to the perf tool is necessary.

BRS can be used with any sampling event. However, it is recommended to use the
RETIRED_BRANCH_INSTRUCTIONS event because it matches what the BRS captures.

$ perf record -b -c 1000037 -e cpu/event=0xc2,name=retired_branch_instructions/ test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
..... 0: 0000000000401d24 -> 0000000000401d5a 0 cycles 0
..... 1: 0000000000401d5c -> 0000000000401d24 0 cycles 0
..... 2: 0000000000401d22 -> 0000000000401d5c 0 cycles 0
..... 3: 0000000000401d5e -> 0000000000401d22 0 cycles 0
..... 4: 0000000000401d20 -> 0000000000401d5e 0 cycles 0
..... 5: 0000000000401d3e -> 0000000000401d20 0 cycles 0
..... 6: 0000000000401d42 -> 0000000000401d3e 0 cycles 0
..... 7: 0000000000401d3c -> 0000000000401d42 0 cycles 0
..... 8: 0000000000401d44 -> 0000000000401d3c 0 cycles 0
..... 9: 0000000000401d3a -> 0000000000401d44 0 cycles 0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles 0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles 0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles 0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles 0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles 0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles 0
... thread: test:18230
...... dso: test

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/amd/Makefile | 2 +-
arch/x86/events/amd/brs.c | 303 +++++++++++++++++++++++++++++++
arch/x86/events/amd/core.c | 202 ++++++++++++++++++++-
arch/x86/events/core.c | 15 +-
arch/x86/events/perf_event.h | 93 ++++++++--
arch/x86/include/asm/msr-index.h | 4 +
6 files changed, 595 insertions(+), 24 deletions(-)
create mode 100644 arch/x86/events/amd/brs.c

diff --git a/arch/x86/events/amd/Makefile b/arch/x86/events/amd/Makefile
index 6cbe38d5fd9d..cf323ffab5cd 100644
--- a/arch/x86/events/amd/Makefile
+++ b/arch/x86/events/amd/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_CPU_SUP_AMD) += core.o
+obj-$(CONFIG_CPU_SUP_AMD) += core.o brs.o
obj-$(CONFIG_PERF_EVENTS_AMD_POWER) += power.o
obj-$(CONFIG_X86_LOCAL_APIC) += ibs.o
obj-$(CONFIG_PERF_EVENTS_AMD_UNCORE) += amd-uncore.o
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
new file mode 100644
index 000000000000..86dbc6d06815
--- /dev/null
+++ b/arch/x86/events/amd/brs.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Implement support for AMD Fam19h Branch Sampling feature
+ * Based on specifications published in AMD PPR Fam19 Model 01
+ *
+ * Copyright 2021 Google LLC
+ * Contributed by Stephane Eranian <[email protected]>
+ */
+#include <linux/kernel.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+
+#include "../perf_event.h"
+
+#define BRS_POISON 0xFFFFFFFFFFFFFFFEULL /* mark limit of valid entries */
+
+/* Debug Extension Configuration register layout */
+union amd_debug_extn_cfg {
+ __u64 val;
+ struct {
+ __u64 rsvd0:2, /* reserved */
+ brsmen:1, /* branch sample enable */
+ rsvd4_3:2,/* reserved - must be 0x3 */
+ vb:1, /* valid branches recorded */
+ rsvd2:10, /* reserved */
+ msroff:4, /* index of next entry to write */
+ rsvd3:4, /* reserved */
+ pmc:3, /* #PMC holding the sampling event */
+ rsvd4:37; /* reserved */
+ };
+};
+
+static inline unsigned int brs_from(int idx)
+{
+ return MSR_AMD_SAMP_BR_FROM + 2 * idx;
+}
+
+static inline unsigned int brs_to(int idx)
+{
+ return MSR_AMD_SAMP_BR_FROM + 2 * idx + 1;
+}
+
+static inline void set_debug_extn_cfg(u64 val)
+{
+ /* bits[4:3] must always be set to 11b */
+ wrmsrl(MSR_AMD_DBG_EXTN_CFG, val | 3ULL << 3);
+}
+
+static inline u64 get_debug_extn_cfg(void)
+{
+ u64 val;
+
+ rdmsrl(MSR_AMD_DBG_EXTN_CFG, val);
+ return val;
+}
+
+static bool __init amd_brs_detect(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_AMD_BRS))
+ return false;
+
+ switch (boot_cpu_data.x86) {
+ case 0x19: /* AMD Fam19h (Zen3) */
+ x86_pmu.lbr_nr = 16;
+
+ /* No hardware filtering supported */
+ x86_pmu.lbr_sel_map = NULL;
+ x86_pmu.lbr_sel_mask = 0;
+ break;
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Current BRS implementation does not support branch type or privilege level
+ * filtering. Therefore, this function simply enforces these limitations. No need for
+ * a br_sel_map. Software filtering is not supported because it would not correlate well
+ * with a sampling period.
+ */
+int amd_brs_setup_filter(struct perf_event *event)
+{
+ u64 type = event->attr.branch_sample_type;
+
+ /* No BRS support */
+ if (!x86_pmu.lbr_nr)
+ return -EOPNOTSUPP;
+
+ /* Can only capture all branches, i.e., no filtering */
+ if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
+ return -EINVAL;
+
+ /* can only capture at all priv levels due to the way BRS works */
+ if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
+ return -EINVAL;
+
+ return 0;
+}
+
+/* tos = top of stack, i.e., last valid entry written */
+static inline int amd_brs_get_tos(union amd_debug_extn_cfg *cfg)
+{
+ /*
+ * msroff: index of next entry to write so top-of-stack is one off
+ * if BRS is full then msroff is set back to 0.
+ */
+ return (cfg->msroff ? cfg->msroff : x86_pmu.lbr_nr) - 1;
+}
+
+/*
+ * make sure we have a sane BRS offset to begin with
+ * especially with kexec
+ */
+void amd_brs_reset(void)
+{
+ /*
+ * Reset config
+ */
+ set_debug_extn_cfg(0);
+
+ /*
+ * Mark first entry as poisoned
+ */
+ wrmsrl(brs_to(0), BRS_POISON);
+}
+
+int __init amd_brs_init(void)
+{
+ if (!amd_brs_detect())
+ return -EOPNOTSUPP;
+
+ pr_cont("%d-deep BRS, ", x86_pmu.lbr_nr);
+
+ return 0;
+}
+
+void amd_brs_enable(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ union amd_debug_extn_cfg cfg;
+
+ /* Activate only on first user */
+ if (++cpuc->brs_active > 1)
+ return;
+
+ cfg.val = 0; /* reset all fields */
+ cfg.brsmen = 1; /* enable branch sampling */
+
+ /* Set enable bit */
+ set_debug_extn_cfg(cfg.val);
+}
+
+void amd_brs_disable(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ union amd_debug_extn_cfg cfg;
+
+ /* Check if active (could be disabled via x86_pmu_disable_all()) */
+ if (!cpuc->brs_active)
+ return;
+
+ /* Only disable for last user */
+ if (--cpuc->brs_active)
+ return;
+
+ /*
+ * Clear the brsmen bit but preserve the others as they contain
+ * useful state such as vb and msroff
+ */
+ cfg.val = get_debug_extn_cfg();
+
+ /*
+ * When coming in on interrupt and BRS is full, then hw will have
+ * already stopped BRS, no need to issue wrmsr again
+ */
+ if (cfg.brsmen) {
+ cfg.brsmen = 0;
+ set_debug_extn_cfg(cfg.val);
+ }
+}
+
+/*
+ * Caller must ensure amd_brs_inuse() is true before calling
+ * return:
+ */
+void amd_brs_drain(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event = cpuc->events[0];
+ union amd_debug_extn_cfg cfg;
+ u32 i, nr = 0, num, tos, start;
+ u32 shift = 64 - boot_cpu_data.x86_virt_bits;
+
+ /*
+ * BRS event forced on PMC0,
+ * so check if there is an event.
+ * It is possible to have lbr_users > 0 but the event
+ * not yet scheduled due to long latency PMU irq
+ */
+ if (!event)
+ goto empty;
+
+ cfg.val = get_debug_extn_cfg();
+
+ /* Sanity check [0-x86_pmu.lbr_nr] */
+ if (WARN_ON_ONCE(cfg.msroff >= x86_pmu.lbr_nr))
+ goto empty;
+
+ /* No valid branch */
+ if (cfg.vb == 0)
+ goto empty;
+
+ /*
+ * msr.off points to next entry to be written
+ * tos = most recent entry index = msr.off - 1
+ * BRS register buffer saturates, so we know we have
+ * start < tos and that we have to read from start to tos
+ */
+ start = 0;
+ tos = amd_brs_get_tos(&cfg);
+
+ num = tos - start + 1;
+
+ /*
+ * BRS is only one pass (saturation) from MSROFF to depth-1
+ * MSROFF wraps to zero when buffer is full
+ */
+ for (i = 0; i < num; i++) {
+ u32 brs_idx = tos - i;
+ u64 from, to;
+
+ rdmsrl(brs_to(brs_idx), to);
+
+ /* Entry does not belong to us (as marked by kernel) */
+ if (to == BRS_POISON)
+ break;
+
+ rdmsrl(brs_from(brs_idx), from);
+
+ /*
+ * Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
+ * Necessary to generate proper virtual addresses suitable for
+ * symbolization
+ */
+ to = (u64)(((s64)to << shift) >> shift);
+
+ cpuc->lbr_entries[nr].from = from;
+ cpuc->lbr_entries[nr].to = to;
+
+ /* Other lbr_entries fields are unsupported, set to zero */
+ cpuc->lbr_entries[nr].val = 0;
+
+ nr++;
+ }
+empty:
+ /* Record number of sampled branches */
+ cpuc->lbr_stack.nr = nr;
+}
+
+/*
+ * Poison most recent entry to prevent reuse by next task
+ * required because BRS entry are not tagged by PID
+ */
+static void amd_brs_poison_buffer(void)
+{
+ union amd_debug_extn_cfg cfg;
+ unsigned int idx;
+
+ /* Get current state */
+ cfg.val = get_debug_extn_cfg();
+
+ /* idx is most recently written entry */
+ idx = amd_brs_get_tos(&cfg);
+
+ /* Poison target of entry */
+ wrmsrl(brs_to(idx), BRS_POISON);
+}
+
+/*
+ * On context switch in, we need to make sure no samples from previous user
+ * are left in the BRS.
+ *
+ * On ctxswin, sched_in = true, called after the PMU has started
+ * On ctxswout, sched_in = false, called before the PMU is stopped
+ */
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* no active users */
+ if (!cpuc->lbr_users)
+ return;
+
+ /*
+ * On context switch in, we need to ensure we do not use entries
+ * from previous BRS user on that CPU, so we poison the buffer as
+ * a faster way compared to resetting all entries.
+ */
+ if (sched_in)
+ amd_brs_poison_buffer();
+}
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 9687a8aef01c..86adb0e879c6 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -327,6 +327,8 @@ static inline bool amd_is_pair_event_code(struct hw_perf_event *hwc)

static int amd_core_hw_config(struct perf_event *event)
{
+ int ret = 0;
+
if (event->attr.exclude_host && event->attr.exclude_guest)
/*
* When HO == GO == 1 the hardware treats that as GO == HO == 0
@@ -343,7 +345,32 @@ static int amd_core_hw_config(struct perf_event *event)
if ((x86_pmu.flags & PMU_FL_PAIR) && amd_is_pair_event_code(&event->hw))
event->hw.flags |= PERF_X86_EVENT_PAIR;

- return 0;
+ /*
+ * if branch stack is requested
+ */
+ if (has_branch_stack(event) && is_sampling_event(event)) {
+ /*
+ * BRS implementation does not work with frequency mode
+ * reprogramming of the period.
+ */
+ if (event->attr.freq)
+ return -EINVAL;
+ /*
+ * The kernel subtracts BRS depth from period, so it must be big enough
+ */
+ if (event->attr.sample_period <= x86_pmu.lbr_nr)
+ return -EINVAL;
+
+ /*
+ * Check if we can allow PERF_SAMPLE_BRANCH_STACK
+ */
+ ret = amd_brs_setup_filter(event);
+
+ /* only set in case of success */
+ if (!ret)
+ event->hw.flags |= PERF_X86_EVENT_AMD_BRS;
+ }
+ return ret;
}

static inline int amd_is_nb_event(struct hw_perf_event *hwc)
@@ -366,9 +393,6 @@ static int amd_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip && get_ibs_caps())
return -ENOENT;

- if (has_branch_stack(event))
- return -EOPNOTSUPP;
-
ret = x86_pmu_hw_config(event);
if (ret)
return ret;
@@ -555,6 +579,8 @@ static void amd_pmu_cpu_starting(int cpu)

cpuc->amd_nb->nb_id = nb_id;
cpuc->amd_nb->refcnt++;
+
+ amd_brs_reset();
}

static void amd_pmu_cpu_dead(int cpu)
@@ -634,8 +660,38 @@ static void amd_pmu_disable_all(void)
}
}

+static void amd_pmu_enable_event(struct perf_event *event)
+{
+ x86_pmu_enable_event(event);
+
+ if (__this_cpu_read(cpu_hw_events.enabled)) {
+ if (is_amd_brs(&event->hw))
+ amd_brs_enable();
+ }
+}
+
+static void amd_pmu_enable_all(int added)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc;
+ int idx;
+
+ for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+ hwc = &cpuc->events[idx]->hw;
+
+ /* only activate events which are marked as active */
+ if (!test_bit(idx, cpuc->active_mask))
+ continue;
+
+ amd_pmu_enable_event(cpuc->events[idx]);
+ }
+}
+
static void amd_pmu_disable_event(struct perf_event *event)
{
+ if (is_amd_brs(&event->hw))
+ amd_brs_disable();
+
x86_pmu_disable_event(event);

/*
@@ -651,6 +707,18 @@ static void amd_pmu_disable_event(struct perf_event *event)
amd_pmu_wait_on_overflow(event->hw.idx);
}

+static void amd_pmu_add_event(struct perf_event *event)
+{
+ if (needs_branch_stack(event))
+ amd_pmu_brs_add(event);
+}
+
+static void amd_pmu_del_event(struct perf_event *event)
+{
+ if (needs_branch_stack(event))
+ amd_pmu_brs_del(event);
+}
+
/*
* Because of NMI latency, if multiple PMC counters are active or other sources
* of NMIs are received, the perf NMI handler can handle one or more overflowed
@@ -671,11 +739,31 @@ static void amd_pmu_disable_event(struct perf_event *event)
*/
static int amd_pmu_handle_irq(struct pt_regs *regs)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
int handled;
+ int pmu_enabled;
+
+ /*
+ * Save the PMU state.
+ * It needs to be restored when leaving the handler.
+ */
+ pmu_enabled = cpuc->enabled;
+ cpuc->enabled = 0;
+
+ /* stop everything (includes BRS) */
+ amd_pmu_disable_all();
+
+ /* Drain BRS is in use (could be inactive) */
+ if (cpuc->lbr_users)
+ amd_brs_drain();

/* Process any counter overflows */
handled = x86_pmu_handle_irq(regs);

+ cpuc->enabled = pmu_enabled;
+ if (pmu_enabled)
+ amd_pmu_enable_all(0);
+
/*
* If a counter was handled, record a timestamp such that un-handled
* NMIs will be claimed if arriving within that window.
@@ -897,6 +985,51 @@ static void amd_put_event_constraints_f17h(struct cpu_hw_events *cpuc,
--cpuc->n_pair;
}

+/*
+ * Because of the way BRS operates with an inactive and active phases, and
+ * the link to one counter, it is not possible to have two events using BRS
+ * scheduled at the same time. There would be an issue with enforcing the
+ * period of each one and given that the BRS saturates, it would not be possible
+ * to guarantee correlated content for all events. Therefore, in situations
+ * where multiple events want to use BRS, the kernel enforces mutual exclusion.
+ * Exclusion is enforced by chosing only one counter for events using BRS.
+ * The event scheduling logic will then automatically multiplex the
+ * events and ensure that at most one event is actively using BRS.
+ *
+ * The BRS counter could be any counter, but there is no constraint on Fam19h,
+ * therefore all counters are equal and thus we pick the first one: PMC0
+ */
+static struct event_constraint amd_fam19h_brs_cntr0_constraint =
+ EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK);
+
+static struct event_constraint amd_fam19h_brs_pair_cntr0_constraint =
+ __EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK, 1, 0, PERF_X86_EVENT_PAIR);
+
+static struct event_constraint *
+amd_get_event_constraints_f19h(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ bool has_brs = is_amd_brs(hwc);
+
+ /*
+ * In case BRS is used with an event requiring a counter pair,
+ * the kernel allows it but only on counter 0 & 1 to enforce
+ * multiplexing requiring to protect BRS in case of multiple
+ * BRS users
+ */
+ if (amd_is_pair_event_code(hwc)) {
+ return has_brs ? &amd_fam19h_brs_pair_cntr0_constraint
+ : &pair_constraint;
+ }
+
+ if (has_brs)
+ return &amd_fam19h_brs_cntr0_constraint;
+
+ return &unconstrained;
+}
+
+
static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -905,12 +1038,19 @@ static ssize_t amd_event_sysfs_show(char *page, u64 config)
return x86_event_sysfs_show(page, config, event);
}

+static void amd_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (sched_in && x86_pmu.lbr_nr)
+ amd_pmu_brs_sched_task(ctx, sched_in);
+}
+
static __initconst const struct x86_pmu amd_pmu = {
.name = "AMD",
.handle_irq = amd_pmu_handle_irq,
.disable_all = amd_pmu_disable_all,
- .enable_all = x86_pmu_enable_all,
- .enable = x86_pmu_enable_event,
+ .enable_all = amd_pmu_enable_all,
+ .enable = amd_pmu_enable_event,
.disable = amd_pmu_disable_event,
.hw_config = amd_pmu_hw_config,
.schedule_events = x86_schedule_events,
@@ -920,6 +1060,8 @@ static __initconst const struct x86_pmu amd_pmu = {
.event_map = amd_pmu_event_map,
.max_events = ARRAY_SIZE(amd_perfmon_event_map),
.num_counters = AMD64_NUM_COUNTERS,
+ .add = amd_pmu_add_event,
+ .del = amd_pmu_del_event,
.cntval_bits = 48,
.cntval_mask = (1ULL << 48) - 1,
.apic = 1,
@@ -938,6 +1080,37 @@ static __initconst const struct x86_pmu amd_pmu = {
.amd_nb_constraints = 1,
};

+static ssize_t branches_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu.lbr_nr);
+}
+
+static DEVICE_ATTR_RO(branches);
+
+static struct attribute *amd_pmu_brs_attrs[] = {
+ &dev_attr_branches.attr,
+ NULL,
+};
+
+static umode_t
+amd_brs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+ return x86_pmu.lbr_nr ? attr->mode : 0;
+}
+
+static struct attribute_group group_caps_amd_brs = {
+ .name = "caps",
+ .attrs = amd_pmu_brs_attrs,
+ .is_visible = amd_brs_is_visible,
+};
+
+static const struct attribute_group *amd_attr_update[] = {
+ &group_caps_amd_brs,
+ NULL,
+};
+
static int __init amd_core_pmu_init(void)
{
u64 even_ctr_mask = 0ULL;
@@ -989,6 +1162,23 @@ static int __init amd_core_pmu_init(void)
x86_pmu.flags |= PMU_FL_PAIR;
}

+ if (boot_cpu_data.x86 >= 0x19) {
+ /*
+ * On AMD, invoking pmu_disable_all() is very expensive and the function is
+ * invoked on context-switch in via sched_task_in(), so enable only when necessary
+ */
+ if (!amd_brs_init()) {
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f19h;
+ x86_pmu.sched_task = amd_pmu_sched_task;
+ /*
+ * The put_event_constraints callback is shared with
+ * Fam17h, set above
+ */
+ }
+ }
+
+ x86_pmu.attr_update = amd_attr_update;
+
pr_cont("core perfctr, ");
return 0;
}
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2a57dbed4894..66a073f21bb7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -681,11 +681,16 @@ void x86_pmu_disable_all(void)

if (!test_bit(idx, cpuc->active_mask))
continue;
+
+ if (is_amd_brs(hwc))
+ amd_brs_disable();
+
rdmsrl(x86_pmu_config_addr(idx), val);
if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
continue;
val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
wrmsrl(x86_pmu_config_addr(idx), val);
+
if (is_counter_pair(hwc))
wrmsrl(x86_pmu_config_addr(idx + 1), 0);
}
@@ -1334,6 +1339,10 @@ static void x86_pmu_enable(struct pmu *pmu)
if (hwc->state & PERF_HES_ARCH)
continue;

+ /*
+ * if cpuc->enabled = 0, then no wrmsr as
+ * per x86_pmu_enable_event()
+ */
x86_pmu_start(event, PERF_EF_RELOAD);
}
cpuc->n_added = 0;
@@ -1700,11 +1709,15 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
* event overflow
*/
handled++;
- perf_sample_data_init(&data, 0, event->hw.last_period);

if (!x86_perf_event_set_period(event))
continue;

+ perf_sample_data_init(&data, 0, event->hw.last_period);
+
+ if (has_branch_stack(event))
+ data.br_stack = &cpuc->lbr_stack;
+
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e3ac05c97b5e..12fb1e68f188 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -66,22 +66,23 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
/*
* struct hw_perf_event.flags flags
*/
-#define PERF_X86_EVENT_PEBS_LDLAT 0x0001 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST 0x0002 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW 0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_PEBS_LD_HSW 0x0008 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW 0x0010 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL 0x0020 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC 0x0040 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0080 /* grant rdpmc permission */
-#define PERF_X86_EVENT_EXCL_ACCT 0x0100 /* accounted EXCL event */
-#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
-#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
-#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
-#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
-#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
-#define PERF_X86_EVENT_TOPDOWN 0x4000 /* Count Topdown slots/metrics events */
-#define PERF_X86_EVENT_PEBS_STLAT 0x8000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_PEBS_LDLAT 0x00001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST 0x00002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW 0x00004 /* haswell style datala, store */
+#define PERF_X86_EVENT_PEBS_LD_HSW 0x00008 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW 0x00010 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL 0x00020 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x00040 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED 0x00080 /* grant rdpmc permission */
+#define PERF_X86_EVENT_EXCL_ACCT 0x00100 /* accounted EXCL event */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x00200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_LARGE_PEBS 0x00400 /* use large PEBS */
+#define PERF_X86_EVENT_PEBS_VIA_PT 0x00800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_PAIR 0x01000 /* Large Increment per Cycle */
+#define PERF_X86_EVENT_LBR_SELECT 0x02000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN 0x04000 /* Count Topdown slots/metrics events */
+#define PERF_X86_EVENT_PEBS_STLAT 0x08000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_AMD_BRS 0x10000 /* AMD Branch Sampling */

static inline bool is_topdown_count(struct perf_event *event)
{
@@ -323,6 +324,8 @@ struct cpu_hw_events {
* AMD specific bits
*/
struct amd_nb *amd_nb;
+ int brs_active; /* BRS is enabled */
+
/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
u64 perf_ctr_virt_mask;
int n_pair; /* Large increment events */
@@ -1097,6 +1100,11 @@ int x86_pmu_hw_config(struct perf_event *event);

void x86_pmu_disable_all(void);

+static inline bool is_amd_brs(struct hw_perf_event *hwc)
+{
+ return hwc->flags & PERF_X86_EVENT_AMD_BRS;
+}
+
static inline bool is_counter_pair(struct hw_perf_event *hwc)
{
return hwc->flags & PERF_X86_EVENT_PAIR;
@@ -1137,6 +1145,7 @@ static inline void x86_pmu_disable_event(struct perf_event *event)

if (is_counter_pair(hwc))
wrmsrl(x86_pmu_config_addr(hwc->idx + 1), 0);
+
}

void x86_pmu_enable_event(struct perf_event *event);
@@ -1202,6 +1211,48 @@ static inline bool fixed_counter_disabled(int i, struct pmu *pmu)
#ifdef CONFIG_CPU_SUP_AMD

int amd_pmu_init(void);
+int amd_brs_init(void);
+void amd_brs_disable(void);
+void amd_brs_enable(void);
+void amd_brs_drain(void);
+void amd_brs_disable_all(void);
+int amd_brs_setup_filter(struct perf_event *event);
+void amd_brs_reset(void);
+
+static inline void amd_pmu_brs_add(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /*
+ * No need to reset BRS because it is reset
+ * on brs_enable() and it is saturating
+ */
+ cpuc->lbr_users++;
+ perf_sched_cb_inc(event->ctx->pmu);
+}
+
+static inline void amd_pmu_brs_del(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ cpuc->lbr_users--;
+ WARN_ON_ONCE(cpuc->lbr_users < 0);
+
+ perf_sched_cb_dec(event->ctx->pmu);
+}
+
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
+/*
+ * check if BRS is activated on the CPU
+ * active defined as it has non-zero users and DBG_EXT_CFG.BRSEN=1
+ */
+static inline bool amd_brs_active(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ return cpuc->brs_active;
+}

#else /* CONFIG_CPU_SUP_AMD */

@@ -1210,6 +1261,16 @@ static inline int amd_pmu_init(void)
return 0;
}

+static inline int amd_brs_init(void)
+{
+ return 0;
+}
+
+static inline void amd_brs_drain(void)
+{
+ return 0;
+}
+
#endif /* CONFIG_CPU_SUP_AMD */

static inline int is_pebs_pt(struct perf_event *event)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..ed73325a8499 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -648,6 +648,10 @@
#define MSR_IA32_PERF_CTL 0x00000199
#define INTEL_PERF_CTL_MASK 0xffff

+/* AMD Branch Sampling configuration */
+#define MSR_AMD_DBG_EXTN_CFG 0xc000010f
+#define MSR_AMD_SAMP_BR_FROM 0xc0010300
+
#define MSR_IA32_MPERF 0x000000e7
#define MSR_IA32_APERF 0x000000e8

--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:09

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 06/13] perf/x86/amd: add AMD branch sampling period adjustment

This supplemental patch adds the code to adjust the sampling event period
when used with the Branch Sampling feature (BRS). Given the depth of the
BRS (16 on Fam19h Zen3), the period is reduced by that depth such that in
the best case scenario, BRS saturates at the desired sampling period.
In practice, though, the processor may execute more branches. Given a desired
period P and a depth D, the kernel programs the actual period at P - D. After
P occurrences of the sampling event, the counter overflows. It then may take
X branches (skid) before the NMI is caught and held by the hardware and BRS activates. Then,
after D branches, BRS saturates and the NMI is delivered. With no skid, the
effective period would be (P - D) + D = P. In practice, however, it will likely
be (P - D) + X + D. There is no way to eliminate X or predict X.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/core.c | 7 +++++++
arch/x86/events/perf_event.h | 12 ++++++++++++
2 files changed, 19 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 66a073f21bb7..e1fb0e5ceaa3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1375,6 +1375,13 @@ int x86_perf_event_set_period(struct perf_event *event)
x86_pmu.set_topdown_event_period)
return x86_pmu.set_topdown_event_period(event);

+ /*
+ * decrease period by the depth of the BRS feature to get
+ * the last N taken branches and approximate the desired period
+ */
+ if (has_branch_stack(event))
+ period = amd_brs_adjust_period(period);
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 12fb1e68f188..a275553e78b9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1254,6 +1254,14 @@ static inline bool amd_brs_active(void)
return cpuc->brs_active;
}

+static inline s64 amd_brs_adjust_period(s64 period)
+{
+ if (period > x86_pmu.lbr_nr)
+ return period - x86_pmu.lbr_nr;
+
+ return period;
+}
+
#else /* CONFIG_CPU_SUP_AMD */

static inline int amd_pmu_init(void)
@@ -1271,6 +1279,10 @@ static inline void amd_brs_drain(void)
return 0;
}

+static inline s64 amd_brs_adjust_period(s64 period)
+{
+ return period;
+}
#endif /* CONFIG_CPU_SUP_AMD */

static inline int is_pebs_pt(struct perf_event *event)
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:15

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 07/13] perf/core: add idle hooks

This patch adds a new set of hooks to connect perf_events with the
idle task. On some PMU models, it may be necessary to flush or stop
the PMU when going to low power. Upon return from low power, the opposite
action, i.e., re-enable the PMU, may be necessary. The patch adds
perf_pmu_register_lopwr_cb() to register a callback on entry or return
from low power. The callback is invoked with a boolean arg. If true,
then this is an entry. If false, this is a return.

The callback is invoked from the idle code with interrupts already
disabled.

Signed-off-by: Stephane Eranian <[email protected]>
---
include/linux/perf_event.h | 8 ++++++
kernel/events/core.c | 58 ++++++++++++++++++++++++++++++++++++++
kernel/sched/idle.c | 15 +++++++++-
3 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750ed..32ffc009b2ec 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -300,6 +300,7 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned int nr_addr_filters;

+ struct list_head lopwr_entry;
/*
* Fully disable/enable this PMU, can be used to protect from the PMI
* as well as for lazy/batch writing of the MSRs.
@@ -430,6 +431,8 @@ struct pmu {
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);

+ void (*lopwr_cb) (bool lopwr_in);
+
/*
* Kmem cache of PMU specific data
*/
@@ -1429,6 +1432,11 @@ extern void perf_event_task_tick(void);
extern int perf_event_account_interrupt(struct perf_event *event);
extern int perf_event_period(struct perf_event *event, u64 value);
extern u64 perf_event_pause(struct perf_event *event, bool reset);
+extern void perf_lopwr_cb(bool lopwr_in);
+extern void perf_lopwr_active_inc(void);
+extern void perf_lopwr_active_dec(void);
+extern void perf_register_lopwr_cb(struct pmu *pmu, void (*lowpwr_cb)(bool));
+
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e..f739fd92e74b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3521,6 +3521,64 @@ void perf_sched_cb_inc(struct pmu *pmu)
this_cpu_inc(perf_sched_cb_usages);
}

+/*
+ * The perf_lopwr_cb() is invoked from the idle task. As such it
+ * cannot grab a mutex that may end up sleeping. The idle task cannot
+ * sleep by construction. Therefore we create a spinlock and a new
+ * list of PMUs to invoke on idle. The list is protected by a spinlock
+ * Normally we would use the pmus_lock and iterate over each PMUs. But
+ * mutex is not possible and we need to iterate only over the PMU which
+ * do require a idle callback.
+ */
+static DEFINE_SPINLOCK(lopwr_cb_lock);
+static LIST_HEAD(lopwr_cb_pmus);
+static DEFINE_PER_CPU(int, lopwr_nr_active);
+
+void perf_lopwr_active_inc(void)
+{
+ __this_cpu_inc(lopwr_nr_active);
+}
+
+void perf_lopwr_active_dec(void)
+{
+ __this_cpu_dec(lopwr_nr_active);
+}
+
+/*
+ * lopwr_in = true means going to low power state
+ * lopwr_in = false means returning from low power state
+ */
+void perf_lopwr_cb(bool lopwr_in)
+{
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (!__this_cpu_read(lopwr_nr_active))
+ return;
+
+ spin_lock_irqsave(&lopwr_cb_lock, flags);
+
+ list_for_each_entry(pmu, &lopwr_cb_pmus, lopwr_entry) {
+ if (pmu->lopwr_cb)
+ pmu->lopwr_cb(lopwr_in);
+ }
+
+ spin_unlock_irqrestore(&lopwr_cb_lock, flags);
+}
+EXPORT_SYMBOL_GPL(perf_lopwr_cb);
+
+void perf_register_lopwr_cb(struct pmu *pmu, void (*func)(bool))
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&lopwr_cb_lock, flags);
+
+ pmu->lopwr_cb = func;
+ list_add_tail(&pmu->lopwr_entry, &lopwr_cb_pmus);
+
+ spin_unlock_irqrestore(&lopwr_cb_lock, flags);
+}
+
/*
* This function provides the context switch callback to the lower code
* layer. It is invoked ONLY when the context switch callback is enabled.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 912b47aa99d8..14ce130aee1b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -179,6 +179,7 @@ static void cpuidle_idle_call(void)
*/
if (need_resched()) {
local_irq_enable();
+ perf_lopwr_cb(false);
return;
}

@@ -191,7 +192,14 @@ static void cpuidle_idle_call(void)
if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();

+ if (!cpu_idle_force_poll)
+ perf_lopwr_cb(true);
+
default_idle_call();
+
+ if (!cpu_idle_force_poll)
+ perf_lopwr_cb(false);
+
goto exit_idle;
}

@@ -249,8 +257,10 @@ static void cpuidle_idle_call(void)
/*
* It is up to the idle functions to reenable local interrupts
*/
- if (WARN_ON_ONCE(irqs_disabled()))
+ if (WARN_ON_ONCE(irqs_disabled())) {
local_irq_enable();
+ perf_lopwr_cb(false);
+ }
}

/*
@@ -279,9 +289,12 @@ static void do_idle(void)
__current_set_polling();
tick_nohz_idle_enter();

+
while (!need_resched()) {
rmb();

+ perf_lopwr_cb(true);
+
local_irq_disable();

if (cpu_is_offline(cpu)) {
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:16

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 10/13] perf tools: add branch-brs as a new event

Although branch-brs is specific to AMD Zen3, we provide it in the list of events
recognized by perf as a convenience. It allows:
$ perf stat -e branch-brs
$ perf list

and enable some test suite to pass.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/parse-events.l | 1 +
1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 923849024b15..ae229f5c4758 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -361,6 +361,7 @@ mem-loads-aux |
mem-stores |
topdown-[a-z-]+ |
tx-capacity-[a-z-]+ |
+branch-brs |
el-capacity-[a-z-]+ { return str(yyscanner, PE_KERNEL_PMU_EVENT); }

L1-dcache|l1-d|l1d|L1-data |
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:17

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling

On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before entering low power
and re-enabled (if was active) when returning from low power. Otherwise, the NMI interrupt may
be held up for too long and cause problems. Stopping BRS will cause the NMI to be delivered if it
was held up.

This patch connects the branch sampling code to the perf_events idle callbacks.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/amd/brs.c | 25 +++++++++++++++++++++++++
arch/x86/events/amd/core.c | 4 ++++
arch/x86/events/perf_event.h | 1 +
3 files changed, 30 insertions(+)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 23b969001857..7d27591ba537 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -146,6 +146,7 @@ void amd_brs_enable(void)

/* Set enable bit */
set_debug_extn_cfg(cfg.val);
+ perf_lopwr_active_inc();
}

void amd_brs_disable(void)
@@ -175,6 +176,7 @@ void amd_brs_disable(void)
cfg.brsmen = 0;
set_debug_extn_cfg(cfg.val);
}
+ perf_lopwr_active_dec();
}

static bool amd_brs_match_plm(struct perf_event *event, u64 to)
@@ -292,6 +294,29 @@ static void amd_brs_poison_buffer(void)
wrmsrl(brs_to(idx), BRS_POISON);
}

+/*
+ * called indirectly with irqs masked from mwait_idle_*()
+ */
+void amd_pmu_brs_lopwr_cb(bool lopwr_in)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ union amd_debug_extn_cfg cfg;
+
+ /*
+ * on mwait in, we may end up in non C0 state.
+ * we must disable branch sampling to avoid holding the NMI
+ * for too long. We disable it in hardware but we
+ * keep the state in cpuc, so we can re-enable.
+ *
+ * The hardware will deliver the NMI if needed when brsmen cleared
+ */
+ if (cpuc->brs_active) {
+ cfg.val = get_debug_extn_cfg();
+ cfg.brsmen = !lopwr_in;
+ set_debug_extn_cfg(cfg.val);
+ }
+}
+
/*
* On context switch in, we need to make sure no samples from previous user
* are left in the BRS.
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index d6d5119260a9..3e1985cd414d 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1184,12 +1184,16 @@ static int __init amd_core_pmu_init(void)
* invoked on context-switch in via sched_task_in(), so enable only when necessary
*/
if (!amd_brs_init()) {
+ struct pmu *pmu = x86_get_pmu(smp_processor_id());
x86_pmu.get_event_constraints = amd_get_event_constraints_f19h;
x86_pmu.sched_task = amd_pmu_sched_task;
/*
* The put_event_constraints callback is shared with
* Fam17h, set above
*/
+
+ /* branch sampling must be stopped when entering low power */
+ perf_register_lopwr_cb(pmu, amd_pmu_brs_lopwr_cb);
}
}

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a275553e78b9..73eac9d34bd9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1242,6 +1242,7 @@ static inline void amd_pmu_brs_del(struct perf_event *event)
}

void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void amd_pmu_brs_lopwr_cb(bool lopwr_in);

/*
* check if BRS is activated on the CPU
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:53

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 08/13] perf/x86/core: add idle hooks

This patch adds the perf idle hooks x86 idle routines.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/include/asm/mwait.h | 6 +++++-
kernel/sched/idle.c | 12 ++++++++----
2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 29dd27b5a339..92080fdc1d2a 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -4,6 +4,7 @@

#include <linux/sched.h>
#include <linux/sched/idle.h>
+#include <linux/perf_event.h>

#include <asm/cpufeature.h>
#include <asm/nospec-branch.h>
@@ -114,8 +115,11 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
}

__monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
+ if (!need_resched()) {
+ perf_lopwr_cb(true);
__mwait(eax, ecx);
+ perf_lopwr_cb(false);
+ }
}
current_clr_polling();
}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ce130aee1b..c0ddc3c32a33 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -179,7 +179,6 @@ static void cpuidle_idle_call(void)
*/
if (need_resched()) {
local_irq_enable();
- perf_lopwr_cb(false);
return;
}

@@ -230,6 +229,9 @@ static void cpuidle_idle_call(void)
tick_nohz_idle_stop_tick();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
+ if (!cpu_idle_force_poll)
+ perf_lopwr_cb(true);
+
call_cpuidle(drv, dev, next_state);
} else {
bool stop_tick = true;
@@ -244,12 +246,17 @@ static void cpuidle_idle_call(void)
else
tick_nohz_idle_retain_tick();

+ if (!cpu_idle_force_poll)
+ perf_lopwr_cb(true);
+
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
*/
cpuidle_reflect(dev, entered_state);
}
+ if (!cpu_idle_force_poll)
+ perf_lopwr_cb(false);

exit_idle:
__current_set_polling();
@@ -259,7 +266,6 @@ static void cpuidle_idle_call(void)
*/
if (WARN_ON_ONCE(irqs_disabled())) {
local_irq_enable();
- perf_lopwr_cb(false);
}
}

@@ -293,8 +299,6 @@ static void do_idle(void)
while (!need_resched()) {
rmb();

- perf_lopwr_cb(true);
-
local_irq_disable();

if (cpu_is_offline(cpu)) {
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:01:56

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling

This patch improves the error message printed by perf when perf_event_open()
fails on AMD Zen3 when using the branch sampling feature. In the case of
EINVAL, there are two main reasons: frequency mode or period is smaller than
the depth of the branch sampling buffer (16). The patch checks the parameters
of the call and tries to print a relevant message to explain the error:

$ perf record -b -e cpu/branch-brs/ -c 10 ls
Error:
AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.

$ perf record -b -e cpu/branch-brs/ ls
Error:
AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/evsel.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f203f178fdb9..8945f08f41b2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2824,6 +2824,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
return scnprintf(msg, size, "AMD IBS can't exclude kernel events. Try running at a higher privilege level.");
+
+ if (is_amd && ((evsel->core.attr.config & 0xff) == 0xc4) && (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+ if (evsel->core.attr.freq)
+ return scnprintf(msg, size, "AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
+ /* another reason is that the period is too small */
+ return scnprintf(msg, size, "AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
+ }
+
break;
case ENODATA:
return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:02:42

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 13/13] perf report: add addr_from/addr_to sort dimensions

This patch adds a new pair of sorting dimensions used with branch sampling.
With the existing symbol_from/symbol_to, branches captured in the same function
would be collapsed into a single function if the latency (cycles) was all the same.
That is the case on Intel Broadwell for instance. Since Intel Skylake, the latency
is captured by hardware and therefore is used to disambiguate branches.

This patch adds addr_from/addr_to sort dimensions that will sort branches based on
their addresses and not the function there are in. The output is still the function
name but the offset within the function is provided to uniquely identify each branch.
These new sort dimensions also help with annotate because they create different entries
in the histogram which, in turn, generates proper branch annotation.

Here is an example using AMD's branch sampling:

$ perf record -a -b -c 1000037 -e cpu/branch-brs/ test_prg

$ perf report
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycle
99.65% test_prg test_prg [.] test_thread [.] test_thread -
0.02% test_prg [kernel.vmlinux] [k] asm_sysvec_apic_timer_interrupt [k] error_entry -

$ perf report -F overhead,comm,dso,addr_from,addr_to
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead Command Shared Object Source Address Target Address
4.22% test_prg test_prg [.] test_thread+0x3c [.] test_thread+0x4
4.13% test_prg test_prg [.] test_thread+0x4 [.] test_thread+0x3a
4.09% test_prg test_prg [.] test_thread+0x3a [.] test_thread+0x6
4.08% test_prg test_prg [.] test_thread+0x2 [.] test_thread+0x3c
4.06% test_prg test_prg [.] test_thread+0x3e [.] test_thread+0x2
3.87% test_prg test_prg [.] test_thread+0x6 [.] test_thread+0x38
3.84% test_prg test_prg [.] test_thread [.] test_thread+0x3e
3.76% test_prg test_prg [.] test_thread+0x1e [.] test_thread
3.76% test_prg test_prg [.] test_thread+0x38 [.] test_thread+0x8
3.56% test_prg test_prg [.] test_thread+0x22 [.] test_thread+0x1e
3.54% test_prg test_prg [.] test_thread+0x8 [.] test_thread+0x36
3.47% test_prg test_prg [.] test_thread+0x1c [.] test_thread+0x22
3.45% test_prg test_prg [.] test_thread+0x36 [.] test_thread+0xa
3.28% test_prg test_prg [.] test_thread+0x24 [.] test_thread+0x1c
3.25% test_prg test_prg [.] test_thread+0xa [.] test_thread+0x34
3.24% test_prg test_prg [.] test_thread+0x1a [.] test_thread+0x24
3.20% test_prg test_prg [.] test_thread+0x34 [.] test_thread+0xc
3.04% test_prg test_prg [.] test_thread+0x26 [.] test_thread+0x1a
3.01% test_prg test_prg [.] test_thread+0xc [.] test_thread+0x32
2.98% test_prg test_prg [.] test_thread+0x18 [.] test_thread+0x26
2.94% test_prg test_prg [.] test_thread+0x32 [.] test_thread+0xe
2.76% test_prg test_prg [.] test_thread+0x28 [.] test_thread+0x18
2.73% test_prg test_prg [.] test_thread+0xe [.] test_thread+0x30
2.67% test_prg test_prg [.] test_thread+0x30 [.] test_thread+0x10
2.67% test_prg test_prg [.] test_thread+0x16 [.] test_thread+0x28
2.46% test_prg test_prg [.] test_thread+0x10 [.] test_thread+0x2e
2.44% test_prg test_prg [.] test_thread+0x2a [.] test_thread+0x16
2.38% test_prg test_prg [.] test_thread+0x14 [.] test_thread+0x2a
2.32% test_prg test_prg [.] test_thread+0x2e [.] test_thread+0x12
2.28% test_prg test_prg [.] test_thread+0x12 [.] test_thread+0x2c
2.16% test_prg test_prg [.] test_thread+0x2c [.] test_thread+0x14
0.02% test_prg [kernel.vmlinux] [k] asm_sysvec_apic_ti+0x5 [k] error_entry

Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/util/hist.c | 2 +
tools/perf/util/hist.h | 2 +
tools/perf/util/sort.c | 128 +++++++++++++++++++++++++++++++++++++++++
tools/perf/util/sort.h | 2 +
4 files changed, 134 insertions(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 65fe65ba03c2..39924202568f 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -124,6 +124,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
} else {
symlen = unresolved_col_width + 4 + 2;
hists__new_col_len(hists, HISTC_SYMBOL_FROM, symlen);
+ hists__new_col_len(hists, HISTC_ADDR_FROM, symlen);
hists__set_unres_dso_col_len(hists, HISTC_DSO_FROM);
}

@@ -138,6 +139,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
} else {
symlen = unresolved_col_width + 4 + 2;
hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
+ hists__new_col_len(hists, HISTC_ADDR_TO, symlen);
hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
}

diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 5343b62476e6..9592853fab73 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -76,6 +76,8 @@ enum hist_column {
HISTC_LOCAL_INS_LAT,
HISTC_GLOBAL_INS_LAT,
HISTC_P_STAGE_CYC,
+ HISTC_ADDR_FROM,
+ HISTC_ADDR_TO,
HISTC_NR_COLS, /* Last entry */
};

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 568a88c001c6..e5be5e24f9c5 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -990,6 +990,128 @@ struct sort_entry sort_sym_to = {
.se_width_idx = HISTC_SYMBOL_TO,
};

+static int _hist_entry__addr_snprintf(struct map_symbol *ms,
+ u64 ip, char level, char *bf, size_t size,
+ unsigned int width)
+{
+ struct symbol *sym = ms->sym;
+ struct map *map = ms->map;
+ size_t ret = 0, offs;
+
+ ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+ if (sym && map) {
+ if (sym->type == STT_OBJECT) {
+ ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
+ ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
+ ip - map->unmap_ip(map, sym->start));
+ } else {
+ ret += repsep_snprintf(bf + ret, size - ret, "%.*s",
+ width - ret,
+ sym->name);
+ offs = ip - sym->start;
+ if (offs)
+ ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx", offs);
+ }
+ } else {
+ size_t len = BITS_PER_LONG / 4;
+ ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+ len, ip);
+ }
+
+ return ret;
+}
+
+static int hist_entry__addr_from_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ if (he->branch_info) {
+ struct addr_map_symbol *from = &he->branch_info->from;
+
+ return _hist_entry__addr_snprintf(&from->ms, from->al_addr,
+ he->level, bf, size, width);
+ }
+
+ return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int hist_entry__addr_to_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ if (he->branch_info) {
+ struct addr_map_symbol *to = &he->branch_info->to;
+
+ return _hist_entry__addr_snprintf(&to->ms, to->al_addr,
+ he->level, bf, size, width);
+ }
+
+ return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int64_t
+sort__addr_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ struct addr_map_symbol *from_l;
+ struct addr_map_symbol *from_r;
+ int64_t ret;
+
+ if (!left->branch_info || !right->branch_info)
+ return cmp_null(left->branch_info, right->branch_info);
+
+ from_l = &left->branch_info->from;
+ from_r = &right->branch_info->from;
+
+ /*
+ * comparing symbol address alone is not enough since it's a
+ * relative address within a dso.
+ */
+ ret = _sort__dso_cmp(from_l->ms.map, from_r->ms.map);
+ if (ret != 0)
+ return ret;
+
+ return _sort__addr_cmp(from_l->addr, from_r->addr);
+}
+
+static int64_t
+sort__addr_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ struct addr_map_symbol *to_l;
+ struct addr_map_symbol *to_r;
+ int64_t ret;
+
+ if (!left->branch_info || !right->branch_info)
+ return cmp_null(left->branch_info, right->branch_info);
+
+ to_l = &left->branch_info->to;
+ to_r = &right->branch_info->to;
+
+ /*
+ * comparing symbol address alone is not enough since it's a
+ * relative address within a dso.
+ */
+ ret = _sort__dso_cmp(to_l->ms.map, to_r->ms.map);
+ if (ret != 0)
+ return ret;
+
+ return _sort__addr_cmp(to_l->addr, to_r->addr);
+}
+
+struct sort_entry sort_addr_from = {
+ .se_header = "Source Address",
+ .se_cmp = sort__addr_from_cmp,
+ .se_snprintf = hist_entry__addr_from_snprintf,
+ .se_filter = hist_entry__sym_from_filter, /* shared with sym_from */
+ .se_width_idx = HISTC_ADDR_FROM,
+};
+
+struct sort_entry sort_addr_to = {
+ .se_header = "Target Address",
+ .se_cmp = sort__addr_to_cmp,
+ .se_snprintf = hist_entry__addr_to_snprintf,
+ .se_filter = hist_entry__sym_to_filter, /* shared with sym_to */
+ .se_width_idx = HISTC_ADDR_TO,
+};
+
+
static int64_t
sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
{
@@ -1897,6 +2019,8 @@ static struct sort_dimension bstack_sort_dimensions[] = {
DIM(SORT_SRCLINE_FROM, "srcline_from", sort_srcline_from),
DIM(SORT_SRCLINE_TO, "srcline_to", sort_srcline_to),
DIM(SORT_SYM_IPC, "ipc_lbr", sort_sym_ipc),
+ DIM(SORT_ADDR_FROM, "addr_from", sort_addr_from),
+ DIM(SORT_ADDR_TO, "addr_to", sort_addr_to),
};

#undef DIM
@@ -3128,6 +3252,10 @@ static bool get_elide(int idx, FILE *output)
return __get_elide(symbol_conf.dso_from_list, "dso_from", output);
case HISTC_DSO_TO:
return __get_elide(symbol_conf.dso_to_list, "dso_to", output);
+ case HISTC_ADDR_FROM:
+ return __get_elide(symbol_conf.sym_from_list, "addr_from", output);
+ case HISTC_ADDR_TO:
+ return __get_elide(symbol_conf.sym_to_list, "addr_to", output);
default:
break;
}
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index b67c469aba79..571f6745dced 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -250,6 +250,8 @@ enum sort_type {
SORT_SRCLINE_FROM,
SORT_SRCLINE_TO,
SORT_SYM_IPC,
+ SORT_ADDR_FROM,
+ SORT_ADDR_TO,

/* memory mode specific sort keys */
__SORT_MEMORY_MODE,
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:04:39

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 11/13] perf tools: improve IBS error handling

From: Kim Phillips <[email protected]>

This patch improves the error message returned on failed perf_event_open() on
AMD when using IBS.

Signed-off-by: Kim Phillips <[email protected]>
---
tools/perf/util/evsel.c | 42 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f61e5dd53f5d..f203f178fdb9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2684,12 +2684,52 @@ static bool find_process(const char *name)
return ret ? false : true;
}

+static bool is_amd;
+
+static char *fgrep(FILE *inf, const char *str)
+{
+ char line[256];
+ int slen = strlen(str);
+
+ while (!feof(inf)) {
+ if (!fgets(line, 256, inf))
+ break;
+ if (strncmp(line, str, slen))
+ continue;
+
+ return strdup(line);
+ }
+
+ return NULL;
+}
+
+static void detect_amd(void)
+{
+ FILE *inf = fopen("/proc/cpuinfo", "r");
+ char *res;
+
+ if (!inf)
+ return;
+
+ res = fgrep(inf, "vendor_id");
+
+ if (res) {
+ char *s = strchr(res, ':');
+
+ is_amd = s && !strcmp(s, ": AuthenticAMD\n");
+ free(res);
+ }
+ fclose(inf);
+}
+
int evsel__open_strerror(struct evsel *evsel, struct target *target,
int err, char *msg, size_t size)
{
char sbuf[STRERR_BUFSIZE];
int printed = 0, enforced = 0;

+ detect_amd();
+
switch (err) {
case EPERM:
case EACCES:
@@ -2782,6 +2822,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
return scnprintf(msg, size, "wrong clockid (%d).", clockid);
if (perf_missing_features.aux_output)
return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
+ if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
+ return scnprintf(msg, size, "AMD IBS can't exclude kernel events. Try running at a higher privilege level.");
break;
case ENODATA:
return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:21:34

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v1 02/13] x86/cpufeatures: add AMD Fam19h Branch Sampling feature

This patch adds a cpu feature for AMD Fam19h Branch Sampling feature as bit 31
of EBX on CPUID leaf function 0x80000008.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..74fd7ab7d74d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -313,6 +313,7 @@
#define X86_FEATURE_AMD_SSBD (13*32+24) /* "" Speculative Store Bypass Disable */
#define X86_FEATURE_VIRT_SSBD (13*32+25) /* Virtualized Speculative Store Bypass Disable */
#define X86_FEATURE_AMD_SSB_NO (13*32+26) /* "" Speculative Store Bypass is fixed in hardware. */
+#define X86_FEATURE_AMD_BRS (13*32+31) /* Branch Sampling available */

/* Thermal and Power Management Leaf, CPUID level 0x00000006 (EAX), word 14 */
#define X86_FEATURE_DTHERM (14*32+ 0) /* Digital Thermal Sensor */
--
2.33.0.153.gba50c8fa24-goog

2021-09-09 08:57:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> This patch series adds support for the AMD Fam19h 16-deep branch sampling
> feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.

Yay..

> BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> it is only activated after P event occurrences, where P is the desired sampling period.
> At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.

WTF... ?!? Srsly? You're joking right?

Also, can you please fix you MUA to wrap at 78 chars like normal people?

2021-09-09 09:17:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 07/13] perf/core: add idle hooks

On Thu, Sep 09, 2021 at 12:56:54AM -0700, Stephane Eranian wrote:
> This patch adds a new set of hooks to connect perf_events with the
> idle task. On some PMU models, it may be necessary to flush or stop
> the PMU when going to low power. Upon return from low power, the opposite
> action, i.e., re-enable the PMU, may be necessary. The patch adds
> perf_pmu_register_lopwr_cb() to register a callback on entry or return
> from low power. The callback is invoked with a boolean arg. If true,
> then this is an entry. If false, this is a return.
>
> The callback is invoked from the idle code with interrupts already
> disabled.

Urghh... we've had this before and always managed to refuse doing this
(in generic code).

Also, it's broken..

> +/*
> + * The perf_lopwr_cb() is invoked from the idle task. As such it
> + * cannot grab a mutex that may end up sleeping. The idle task cannot
> + * sleep by construction. Therefore we create a spinlock and a new
> + * list of PMUs to invoke on idle. The list is protected by a spinlock
> + * Normally we would use the pmus_lock and iterate over each PMUs. But
> + * mutex is not possible and we need to iterate only over the PMU which
> + * do require a idle callback.

That rationale is broken, pmus is also SRCU protected and SRCU iteration
would actually work from the idle context, unlike:

> + */
> +static DEFINE_SPINLOCK(lopwr_cb_lock);

which is not allowed from preempt disable context, if you really need
that lock, you need a raw_spinlock_t.

> +static LIST_HEAD(lopwr_cb_pmus);
> +static DEFINE_PER_CPU(int, lopwr_nr_active);
> +
> +void perf_lopwr_active_inc(void)
> +{
> + __this_cpu_inc(lopwr_nr_active);
> +}
> +
> +void perf_lopwr_active_dec(void)
> +{
> + __this_cpu_dec(lopwr_nr_active);
> +}
> +
> +/*
> + * lopwr_in = true means going to low power state
> + * lopwr_in = false means returning from low power state
> + */
> +void perf_lopwr_cb(bool lopwr_in)
> +{
> + struct pmu *pmu;
> + unsigned long flags;
> +
> + if (!__this_cpu_read(lopwr_nr_active))
> + return;

We have static_branch and static_call to deal with these things.

> +
> + spin_lock_irqsave(&lopwr_cb_lock, flags);
> +
> + list_for_each_entry(pmu, &lopwr_cb_pmus, lopwr_entry) {
> + if (pmu->lopwr_cb)
> + pmu->lopwr_cb(lopwr_in);
> + }

I *really* do not like unbound iteration in the idle path.

> +
> + spin_unlock_irqrestore(&lopwr_cb_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(perf_lopwr_cb);

Why?

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 912b47aa99d8..14ce130aee1b 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -179,6 +179,7 @@ static void cpuidle_idle_call(void)
> */
> if (need_resched()) {
> local_irq_enable();
> + perf_lopwr_cb(false);
> return;
> }
>
> @@ -191,7 +192,14 @@ static void cpuidle_idle_call(void)
> if (cpuidle_not_available(drv, dev)) {
> tick_nohz_idle_stop_tick();
>
> + if (!cpu_idle_force_poll)
> + perf_lopwr_cb(true);
> +
> default_idle_call();
> +
> + if (!cpu_idle_force_poll)
> + perf_lopwr_cb(false);
> +
> goto exit_idle;
> }
>
> @@ -249,8 +257,10 @@ static void cpuidle_idle_call(void)
> /*
> * It is up to the idle functions to reenable local interrupts
> */
> - if (WARN_ON_ONCE(irqs_disabled()))
> + if (WARN_ON_ONCE(irqs_disabled())) {
> local_irq_enable();
> + perf_lopwr_cb(false);
> + }
> }
>
> /*
> @@ -279,9 +289,12 @@ static void do_idle(void)
> __current_set_polling();
> tick_nohz_idle_enter();
>
> +
> while (!need_resched()) {
> rmb();
>
> + perf_lopwr_cb(true);
> +
> local_irq_disable();
>
> if (cpu_is_offline(cpu)) {

I so hate all of this... but mostly, this doesn't seem right, you're
unconditionally calling this, even for high power idle states.

*IFF* you really have to do this, stuff it in an AMD idle state driver
that knows about the relevant idle states. But really, why can't we just
take the his while the event is running?

2021-09-09 09:17:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 08/13] perf/x86/core: add idle hooks

On Thu, Sep 09, 2021 at 12:56:55AM -0700, Stephane Eranian wrote:

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ce130aee1b..c0ddc3c32a33 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -179,7 +179,6 @@ static void cpuidle_idle_call(void)
> */
> if (need_resched()) {
> local_irq_enable();
> - perf_lopwr_cb(false);
> return;
> }
>
> @@ -230,6 +229,9 @@ static void cpuidle_idle_call(void)
> tick_nohz_idle_stop_tick();
>
> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> + if (!cpu_idle_force_poll)
> + perf_lopwr_cb(true);
> +
> call_cpuidle(drv, dev, next_state);
> } else {
> bool stop_tick = true;
> @@ -244,12 +246,17 @@ static void cpuidle_idle_call(void)
> else
> tick_nohz_idle_retain_tick();
>
> + if (!cpu_idle_force_poll)
> + perf_lopwr_cb(true);
> +
> entered_state = call_cpuidle(drv, dev, next_state);
> /*
> * Give the governor an opportunity to reflect on the outcome
> */
> cpuidle_reflect(dev, entered_state);
> }
> + if (!cpu_idle_force_poll)
> + perf_lopwr_cb(false);
>
> exit_idle:
> __current_set_polling();
> @@ -259,7 +266,6 @@ static void cpuidle_idle_call(void)
> */
> if (WARN_ON_ONCE(irqs_disabled())) {
> local_irq_enable();
> - perf_lopwr_cb(false);
> }
> }
>
> @@ -293,8 +299,6 @@ static void do_idle(void)
> while (!need_resched()) {
> rmb();
>
> - perf_lopwr_cb(true);
> -
> local_irq_disable();
>
> if (cpu_is_offline(cpu)) {

Why are you sendimg me a patch that undoes most of what the previous
patch does? That's just bad form.

2021-09-09 09:21:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling

On Thu, Sep 09, 2021 at 12:56:56AM -0700, Stephane Eranian wrote:
> On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before entering low power
> and re-enabled (if was active) when returning from low power. Otherwise, the NMI interrupt may
> be held up for too long and cause problems. Stopping BRS will cause the NMI to be delivered if it
> was held up.

What you sent me ^^^, vs reflow at 72 chars vvv:

On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled
before entering low power and re-enabled (if was active) when returning
from low power. Otherwise, the NMI interrupt may be held up for too long
and cause problems. Stopping BRS will cause the NMI to be delivered if
it was held up.

(because git show has that stupid extra indentation for Changelogs)

2021-09-09 10:43:36

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 07/13] perf/core: add idle hooks

Hi Stephane,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: um-x86_64_defconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/35333fae36d4dee3e2225674fca8d745364482fa
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
git checkout 35333fae36d4dee3e2225674fca8d745364482fa
# save the attached .config to linux build tree
make W=1 ARCH=um SUBARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

kernel/sched/idle.c: In function 'cpuidle_idle_call':
>> kernel/sched/idle.c:182:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror=implicit-function-declaration]
182 | perf_lopwr_cb(false);
| ^~~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +/perf_lopwr_cb +182 kernel/sched/idle.c

160
161 /**
162 * cpuidle_idle_call - the main idle function
163 *
164 * NOTE: no locks or semaphores should be used here
165 *
166 * On architectures that support TIF_POLLING_NRFLAG, is called with polling
167 * set, and it returns with polling set. If it ever stops polling, it
168 * must clear the polling bit.
169 */
170 static void cpuidle_idle_call(void)
171 {
172 struct cpuidle_device *dev = cpuidle_get_device();
173 struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
174 int next_state, entered_state;
175
176 /*
177 * Check if the idle task must be rescheduled. If it is the
178 * case, exit the function after re-enabling the local irq.
179 */
180 if (need_resched()) {
181 local_irq_enable();
> 182 perf_lopwr_cb(false);
183 return;
184 }
185
186 /*
187 * The RCU framework needs to be told that we are entering an idle
188 * section, so no more rcu read side critical sections and one more
189 * step to the grace period
190 */
191
192 if (cpuidle_not_available(drv, dev)) {
193 tick_nohz_idle_stop_tick();
194
195 if (!cpu_idle_force_poll)
196 perf_lopwr_cb(true);
197
198 default_idle_call();
199
200 if (!cpu_idle_force_poll)
201 perf_lopwr_cb(false);
202
203 goto exit_idle;
204 }
205
206 /*
207 * Suspend-to-idle ("s2idle") is a system state in which all user space
208 * has been frozen, all I/O devices have been suspended and the only
209 * activity happens here and in interrupts (if any). In that case bypass
210 * the cpuidle governor and go straight for the deepest idle state
211 * available. Possibly also suspend the local tick and the entire
212 * timekeeping to prevent timer interrupts from kicking us out of idle
213 * until a proper wakeup interrupt happens.
214 */
215
216 if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) {
217 u64 max_latency_ns;
218
219 if (idle_should_enter_s2idle()) {
220
221 entered_state = call_cpuidle_s2idle(drv, dev);
222 if (entered_state > 0)
223 goto exit_idle;
224
225 max_latency_ns = U64_MAX;
226 } else {
227 max_latency_ns = dev->forced_idle_latency_limit_ns;
228 }
229
230 tick_nohz_idle_stop_tick();
231
232 next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
233 call_cpuidle(drv, dev, next_state);
234 } else {
235 bool stop_tick = true;
236
237 /*
238 * Ask the cpuidle framework to choose a convenient idle state.
239 */
240 next_state = cpuidle_select(drv, dev, &stop_tick);
241
242 if (stop_tick || tick_nohz_tick_stopped())
243 tick_nohz_idle_stop_tick();
244 else
245 tick_nohz_idle_retain_tick();
246
247 entered_state = call_cpuidle(drv, dev, next_state);
248 /*
249 * Give the governor an opportunity to reflect on the outcome
250 */
251 cpuidle_reflect(dev, entered_state);
252 }
253
254 exit_idle:
255 __current_set_polling();
256
257 /*
258 * It is up to the idle functions to reenable local interrupts
259 */
260 if (WARN_ON_ONCE(irqs_disabled())) {
261 local_irq_enable();
262 perf_lopwr_cb(false);
263 }
264 }
265

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (5.25 kB)
.config.gz (9.31 kB)
Download all attachments

2021-09-09 10:46:31

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support

Hi Stephane,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
git checkout 7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
# save the attached .config to linux build tree
make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

arch/x86/events/amd/core.c: In function 'amd_pmu_enable_all':
>> arch/x86/events/amd/core.c:676:24: warning: variable 'hwc' set but not used [-Wunused-but-set-variable]
676 | struct hw_perf_event *hwc;
| ^~~


vim +/hwc +676 arch/x86/events/amd/core.c

672
673 static void amd_pmu_enable_all(int added)
674 {
675 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> 676 struct hw_perf_event *hwc;
677 int idx;
678
679 for (idx = 0; idx < x86_pmu.num_counters; idx++) {
680 hwc = &cpuc->events[idx]->hw;
681
682 /* only activate events which are marked as active */
683 if (!test_bit(idx, cpuc->active_mask))
684 continue;
685
686 amd_pmu_enable_event(cpuc->events[idx]);
687 }
688 }
689

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (2.27 kB)
.config.gz (64.43 kB)
Download all attachments

2021-09-09 11:06:17

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 07/13] perf/core: add idle hooks

Hi Stephane,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: riscv-randconfig-r042-20210908 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/0day-ci/linux/commit/35333fae36d4dee3e2225674fca8d745364482fa
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
git checkout 35333fae36d4dee3e2225674fca8d745364482fa
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> kernel/sched/idle.c:182:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
perf_lopwr_cb(false);
^
kernel/sched/idle.c:196:4: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
perf_lopwr_cb(true);
^
kernel/sched/idle.c:262:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
perf_lopwr_cb(false);
^
kernel/sched/idle.c:296:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
perf_lopwr_cb(true);
^
4 errors generated.


vim +/perf_lopwr_cb +182 kernel/sched/idle.c

160
161 /**
162 * cpuidle_idle_call - the main idle function
163 *
164 * NOTE: no locks or semaphores should be used here
165 *
166 * On architectures that support TIF_POLLING_NRFLAG, is called with polling
167 * set, and it returns with polling set. If it ever stops polling, it
168 * must clear the polling bit.
169 */
170 static void cpuidle_idle_call(void)
171 {
172 struct cpuidle_device *dev = cpuidle_get_device();
173 struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
174 int next_state, entered_state;
175
176 /*
177 * Check if the idle task must be rescheduled. If it is the
178 * case, exit the function after re-enabling the local irq.
179 */
180 if (need_resched()) {
181 local_irq_enable();
> 182 perf_lopwr_cb(false);
183 return;
184 }
185
186 /*
187 * The RCU framework needs to be told that we are entering an idle
188 * section, so no more rcu read side critical sections and one more
189 * step to the grace period
190 */
191
192 if (cpuidle_not_available(drv, dev)) {
193 tick_nohz_idle_stop_tick();
194
195 if (!cpu_idle_force_poll)
196 perf_lopwr_cb(true);
197
198 default_idle_call();
199
200 if (!cpu_idle_force_poll)
201 perf_lopwr_cb(false);
202
203 goto exit_idle;
204 }
205
206 /*
207 * Suspend-to-idle ("s2idle") is a system state in which all user space
208 * has been frozen, all I/O devices have been suspended and the only
209 * activity happens here and in interrupts (if any). In that case bypass
210 * the cpuidle governor and go straight for the deepest idle state
211 * available. Possibly also suspend the local tick and the entire
212 * timekeeping to prevent timer interrupts from kicking us out of idle
213 * until a proper wakeup interrupt happens.
214 */
215
216 if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) {
217 u64 max_latency_ns;
218
219 if (idle_should_enter_s2idle()) {
220
221 entered_state = call_cpuidle_s2idle(drv, dev);
222 if (entered_state > 0)
223 goto exit_idle;
224
225 max_latency_ns = U64_MAX;
226 } else {
227 max_latency_ns = dev->forced_idle_latency_limit_ns;
228 }
229
230 tick_nohz_idle_stop_tick();
231
232 next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
233 call_cpuidle(drv, dev, next_state);
234 } else {
235 bool stop_tick = true;
236
237 /*
238 * Ask the cpuidle framework to choose a convenient idle state.
239 */
240 next_state = cpuidle_select(drv, dev, &stop_tick);
241
242 if (stop_tick || tick_nohz_tick_stopped())
243 tick_nohz_idle_stop_tick();
244 else
245 tick_nohz_idle_retain_tick();
246
247 entered_state = call_cpuidle(drv, dev, next_state);
248 /*
249 * Give the governor an opportunity to reflect on the outcome
250 */
251 cpuidle_reflect(dev, entered_state);
252 }
253
254 exit_idle:
255 __current_set_polling();
256
257 /*
258 * It is up to the idle functions to reenable local interrupts
259 */
260 if (WARN_ON_ONCE(irqs_disabled())) {
261 local_irq_enable();
262 perf_lopwr_cb(false);
263 }
264 }
265

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (6.13 kB)
.config.gz (25.59 kB)
Download all attachments

2021-09-09 15:38:33

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support

Hi Stephane,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: i386-randconfig-a012-20210908 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
git checkout 7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> arch/x86/events/amd/core.c:676:24: warning: variable 'hwc' set but not used [-Wunused-but-set-variable]
struct hw_perf_event *hwc;
^
1 warning generated.


vim +/hwc +676 arch/x86/events/amd/core.c

672
673 static void amd_pmu_enable_all(int added)
674 {
675 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> 676 struct hw_perf_event *hwc;
677 int idx;
678
679 for (idx = 0; idx < x86_pmu.num_counters; idx++) {
680 hwc = &cpuc->events[idx]->hw;
681
682 /* only activate events which are marked as active */
683 if (!test_bit(idx, cpuc->active_mask))
684 continue;
685
686 amd_pmu_enable_event(cpuc->events[idx]);
687 }
688 }
689

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (2.49 kB)
.config.gz (34.31 kB)
Download all attachments

2021-09-14 00:59:44

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v1 11/13] perf tools: improve IBS error handling

Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
> From: Kim Phillips <[email protected]>
>
> This patch improves the error message returned on failed perf_event_open() on
> AMD when using IBS.
>
> Signed-off-by: Kim Phillips <[email protected]>
> ---
> tools/perf/util/evsel.c | 42 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index f61e5dd53f5d..f203f178fdb9 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2684,12 +2684,52 @@ static bool find_process(const char *name)
> return ret ? false : true;
> }
>
> +static bool is_amd;
> +
> +static char *fgrep(FILE *inf, const char *str)
> +{
> + char line[256];
> + int slen = strlen(str);
> +
> + while (!feof(inf)) {
> + if (!fgets(line, 256, inf))
> + break;
> + if (strncmp(line, str, slen))
> + continue;
> +
> + return strdup(line);
> + }
> +
> + return NULL;
> +}
> +
> +static void detect_amd(void)
> +{
> + FILE *inf = fopen("/proc/cpuinfo", "r");
> + char *res;
> +
> + if (!inf)
> + return;
> +
> + res = fgrep(inf, "vendor_id");
> +
> + if (res) {
> + char *s = strchr(res, ':');
> +
> + is_amd = s && !strcmp(s, ": AuthenticAMD\n");
> + free(res);
> + }
> + fclose(inf);
> +}
> +

We have perf_env for such details, for instance in
tools/perf/util/sample-raw.c we have:o

const char *arch_pf = perf_env__arch(evlist->env);
const char *cpuid = perf_env__cpuid(evlist->env);

else if (arch_pf && !strcmp("x86", arch_pf) &&
cpuid && strstarts(cpuid, "AuthenticAMD") &&
evlist__has_amd_ibs(evlist)) {



> int evsel__open_strerror(struct evsel *evsel, struct target *target,
> int err, char *msg, size_t size)
> {
> char sbuf[STRERR_BUFSIZE];
> int printed = 0, enforced = 0;
>
> + detect_amd();
> +
> switch (err) {
> case EPERM:
> case EACCES:
> @@ -2782,6 +2822,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
> return scnprintf(msg, size, "wrong clockid (%d).", clockid);
> if (perf_missing_features.aux_output)
> return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
> + if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
> + return scnprintf(msg, size, "AMD IBS can't exclude kernel events. Try running at a higher privilege level.");
> break;
> case ENODATA:
> return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> --
> 2.33.0.153.gba50c8fa24-goog

--

- Arnaldo

2021-09-15 05:57:28

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
>
> Yay..
>
> > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > it is only activated after P event occurrences, where P is the desired sampling period.
> > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
>
> WTF... ?!? Srsly? You're joking right?
>

As I said, this is because of the cost of running BRS usually for
millions of branches to keep only the last 16.
Running branch sampling in general on any arch is never totally free.

>
> Also, can you please fix you MUA to wrap at 78 chars like normal people?

Ok, I fixed that now.

2021-09-15 09:08:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

On Tue, Sep 14, 2021 at 10:55:12PM -0700, Stephane Eranian wrote:
> On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> >
> > Yay..
> >
> > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> >
> > WTF... ?!? Srsly? You're joking right?
> >
>
> As I said, this is because of the cost of running BRS usually for
> millions of branches to keep only the last 16.
> Running branch sampling in general on any arch is never totally free.

Holding up the NMI will disrupt the sampling of the other events, which
is, IMO unacceptible and would require this event to be exclusive on the
whole PMU, simply because sharing it doesn't work.

(also, other NMI sources might object)

Also, by only having LBRs post overflow you can't apply LBR based
analysis to other events, which seems quite limiting.

This really seems like a very sub-optimal solution. I mean, it's awesome
AMD gets branch records, but this seems a very poor solution.

2021-09-27 20:19:03

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

Hi Stephane,

On Tue, Sep 14, 2021 at 10:57 PM Stephane Eranian <[email protected]> wrote:
>
> On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> >
> > Yay..
> >
> > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> >
> > WTF... ?!? Srsly? You're joking right?
> >
>
> As I said, this is because of the cost of running BRS usually for
> millions of branches to keep only the last 16.
> Running branch sampling in general on any arch is never totally free.

Could you please share some data on how expensive the BRS is? We are
hoping to use
BRS/LBR without PMI (bpf_get_branch_snapshot). If it is too expensive,
we may need
some heuristic to turn it on/off.

Thanks,
Song

2021-10-04 23:51:42

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling

On 9/9/21 2:56 AM, Stephane Eranian wrote:
> This patch improves the error message printed by perf when perf_event_open()
> fails on AMD Zen3 when using the branch sampling feature. In the case of
> EINVAL, there are two main reasons: frequency mode or period is smaller than
> the depth of the branch sampling buffer (16). The patch checks the parameters
> of the call and tries to print a relevant message to explain the error:
>
> $ perf record -b -e cpu/branch-brs/ -c 10 ls
> Error:
> AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.
>
> $ perf record -b -e cpu/branch-brs/ ls
> Error:
> AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.
>
> Signed-off-by: Stephane Eranian <[email protected]>
> ---

Hi Stephane,

I've rewritten this patch based on Arnaldo's comments to the previous
(11/13) patch. The new version attached depends on this 2-patch series:

https://lore.kernel.org/lkml/[email protected]/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84

Thanks,

Kim


Attachments:
0003-perf-tools-Improve-error-handling-of-AMD-Branch-Samp.patch (2.58 kB)

2021-10-04 23:52:25

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v1 11/13] perf tools: improve IBS error handling

On 9/13/21 2:34 PM, Arnaldo Carvalho de Melo wrote:
> Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
>> From: Kim Phillips <[email protected]>
>> +static void detect_amd(void)
>> +{
>> + FILE *inf = fopen("/proc/cpuinfo", "r");
>> + char *res;
>> +
>> + if (!inf)
>> + return;
>> +
>> + res = fgrep(inf, "vendor_id");
>> +
>> + if (res) {
>> + char *s = strchr(res, ':');
>> +
>> + is_amd = s && !strcmp(s, ": AuthenticAMD\n");
>> + free(res);
>> + }
>> + fclose(inf);
>> +}
>> +
>
> We have perf_env for such details, for instance in
> tools/perf/util/sample-raw.c we have:o
>
> const char *arch_pf = perf_env__arch(evlist->env);
> const char *cpuid = perf_env__cpuid(evlist->env);
>
> else if (arch_pf && !strcmp("x86", arch_pf) &&
> cpuid && strstarts(cpuid, "AuthenticAMD") &&
> evlist__has_amd_ibs(evlist)) {

OK, I've rebased this 11/13 patch onto the new perf_env__{arch,cpuid}
code, and posted it here:

https://lore.kernel.org/lkml/[email protected]/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84

The following 12/13 patch in this series changes, too, but since it
depends on prior patches in the series, I'll reply-all to 12/13 with its
new version.

Arnaldo, would it be ok to apply the two new patches that replace this
11/13? They don't depend on any others in this series, and it would
save Stephane from having to carry them.

Thanks,

Kim

2021-10-04 23:57:57

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v1 11/13] perf tools: improve IBS error handling



On October 4, 2021 6:57:09 PM GMT-03:00, Kim Phillips <[email protected]> wrote:
>On 9/13/21 2:34 PM, Arnaldo Carvalho de Melo wrote:
>> Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
>>> From: Kim Phillips <[email protected]>
>>> +static void detect_amd(void)
>>> +{
>>> + FILE *inf = fopen("/proc/cpuinfo", "r");
>>> + char *res;
>>> +
>>> + if (!inf)
>>> + return;
>>> +
>>> + res = fgrep(inf, "vendor_id");
>>> +
>>> + if (res) {
>>> + char *s = strchr(res, ':');
>>> +
>>> + is_amd = s && !strcmp(s, ": AuthenticAMD\n");
>>> + free(res);
>>> + }
>>> + fclose(inf);
>>> +}
>>> +
>>
>> We have perf_env for such details, for instance in
>> tools/perf/util/sample-raw.c we have:o
>>
>> const char *arch_pf = perf_env__arch(evlist->env);
>> const char *cpuid = perf_env__cpuid(evlist->env);
>>
>> else if (arch_pf && !strcmp("x86", arch_pf) &&
>> cpuid && strstarts(cpuid, "AuthenticAMD") &&
>> evlist__has_amd_ibs(evlist)) {
>
>OK, I've rebased this 11/13 patch onto the new perf_env__{arch,cpuid}
>code, and posted it here:
>
>https://lore.kernel.org/lkml/[email protected]/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84
>
>The following 12/13 patch in this series changes, too, but since it
>depends on prior patches in the series, I'll reply-all to 12/13 with its
>new version.
>
>Arnaldo, would it be ok to apply the two new patches that replace this
>11/13? They don't depend on any others in this series, and it would
>save Stephane from having to carry them.


I'll work in this tomorrow, thanks

- Arnaldo
>
>Thanks,
>
>Kim

2021-10-28 18:32:22

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support

On Wed, Sep 15, 2021 at 2:04 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 10:55:12PM -0700, Stephane Eranian wrote:
> > On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> > >
> > > Yay..
> > >
> > > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> > >
> > > WTF... ?!? Srsly? You're joking right?
> > >
> >
> > As I said, this is because of the cost of running BRS usually for
> > millions of branches to keep only the last 16.
> > Running branch sampling in general on any arch is never totally free.
>
> Holding up the NMI will disrupt the sampling of the other events, which
> is, IMO unacceptible and would require this event to be exclusive on the
> whole PMU, simply because sharing it doesn't work.
>
Sorry for the long delay, I have been very busy.

You are right on this. It would hold the NMI for 16 taken branches.
Making the event exclusive creates a problem with the NMI watchdog.
We can try to hack something in to allow NMI watchdog + the sampling
event and nothing else.

> (also, other NMI sources might object)
>
On AMD, there is also IBS op, IBS Fetch both firing on NMI. but that
is less of a concern because the instruction address is captured by IBS
and the interrupted IP is not useful. So the interrupt skid is not important.

> Also, by only having LBRs post overflow you can't apply LBR based
> analysis to other events, which seems quite limiting.
>
This is a very limited functionality designed to support basic sampling
primarily to support autoFDO where there is only one sampling event.

> This really seems like a very sub-optimal solution. I mean, it's awesome
> AMD gets branch records, but this seems a very poor solution.

For now, this is what we have. It is important to get some basic form of branch
sampling on Zen3 even if it is not perfect because it enables optimizations such
as autoFDO for compilers today. We have verified that autoFDO works well with
branch sampling on Zen3.

I hope it will improve in the future.