2022-02-09 10:05:10

by Stephane Eranian

Subject: [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support

Add support for the AMD Fam19h 16-deep branch sampling feature as described
in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13. This is a model
specific extension. It is not an architected AMD feature.

The Branch Sampling Feature (BRS) provides the statistical taken branch
information necessary to enable autoFDO-style optimization by compilers,
i.e., basic block execution counts.

BRS operates with a 16-deep saturating buffer in MSR registers. There is no
hardware branch type filtering. All control flow changes are captured. BRS
relies on specific programming of the core PMU of Fam19h. In particular,
the following requirements must be met:
- the sampling period must be greater than 16 (BRS depth)
- sampling must use a fixed period, not frequency mode
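The two requirements above can be sketched as a small userspace check. This is only an illustration: struct brs_request, BRS_DEPTH and brs_check_request() are hypothetical names standing in for the relevant perf_event_attr fields, and the depth is hard-coded to the 16 entries quoted above rather than read from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: the BRS depth quoted in the cover letter. */
#define BRS_DEPTH 16

/* Hypothetical, simplified stand-in for the relevant perf_event_attr fields. */
struct brs_request {
	bool     freq;          /* true when frequency mode is requested */
	uint64_t sample_period; /* fixed period, meaningful when !freq */
};

enum brs_check {
	BRS_OK = 0,
	BRS_ERR_FREQ_MODE,        /* frequency mode is not supported */
	BRS_ERR_PERIOD_TOO_SMALL, /* period must be greater than the depth */
};

/* Apply the two constraints listed above, in the order a tool might report them. */
static enum brs_check brs_check_request(const struct brs_request *req)
{
	if (req->freq)
		return BRS_ERR_FREQ_MODE;
	if (req->sample_period <= BRS_DEPTH)
		return BRS_ERR_PERIOD_TOO_SMALL;
	return BRS_OK;
}
```

A caller would fill in the request from its command-line options and refuse to open the event unless the result is BRS_OK.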

BRS interacts with the NMI interrupt as well. Because enabling BRS is
expensive, it is only activated after P event occurrences, where P is the
desired sampling period. At P occurrences of the event, the counter
overflows, the CPU catches the NMI interrupt, activates BRS for 16 branches
until it saturates, and then delivers the NMI to the kernel. Between the
overflow and the time BRS activates, more branches may be executed, skewing the
period. All along, the sampling event keeps counting. The skid may be
attenuated by reducing the sampling period by 16.
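The period adjustment can be sketched as follows. The logic mirrors amd_brs_adjust_period() from patch 07 in this thread, but this standalone version is only an illustration: it hard-codes the depth as BRS_DEPTH (an assumed constant) where the kernel code uses x86_pmu.lbr_nr.

```c
#include <stdint.h>

/* Assumed constant; the kernel helper reads x86_pmu.lbr_nr instead. */
#define BRS_DEPTH 16

/*
 * Shorten the programmed period by the BRS depth so that the 16 branches
 * captured after the counter overflows land close to the requested period.
 * Periods that are not larger than the depth are returned unchanged.
 */
static int64_t brs_adjust_period(int64_t period)
{
	if (period > BRS_DEPTH)
		return period - BRS_DEPTH;

	return period;
}
```

With the period used in the examples of this cover letter, 1000037, the counter would be programmed with 1000021.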

BRS is integrated into perf_events seamlessly via the same
PERF_SAMPLE_BRANCH_STACK sample format. BRS generates
perf_branch_entry records in the sampling buffer. There is no prediction or
latency information supported. The branches are stored in reverse order of
execution. The most recent branch is the first entry in each record.
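Since entry 0 is the most recent branch, a consumer that wants execution order has to walk each record backwards. A minimal sketch, assuming a hypothetical struct branch_rec that keeps only the from/to addresses (the real perf_branch_entry carries more fields):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal record: only the addresses that BRS fills in. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
};

/*
 * Copy a most-recent-first branch stack into oldest-first (execution) order.
 * At most cap entries are written, starting with the oldest branch.
 * Returns the number of entries written.
 */
static size_t branches_in_exec_order(const struct branch_rec *stack, size_t nr,
				     struct branch_rec *out, size_t cap)
{
	size_t n = nr < cap ? nr : cap;

	for (size_t i = 0; i < n; i++)
		out[i] = stack[nr - 1 - i];

	return n;
}
```

In the reordered output, the target of each branch is followed by the source of the next one, which makes basic block boundaries easier to derive.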

Because BRS must be stopped when a CPU goes into low power mode, the series
includes patches to add callbacks on ACPI low power entry and exit, which are
used on AMD processors.

Given that there is no privilege filtering with BRS, the kernel implements
filtering on privilege level.
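One plausible way to filter on privilege level in software is to classify each captured branch by its target address and drop the entries the user did not ask for. The sketch below is an assumption for illustration, not necessarily what the series does: struct branch_rec, is_kernel_addr() and filter_by_priv() are hypothetical names, and the kernel/user split assumes the x86-64 convention that kernel addresses have the top bit set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal record: only the addresses that BRS fills in. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
};

/* Assumption: x86-64 layout, where kernel addresses have bit 63 set. */
static bool is_kernel_addr(uint64_t addr)
{
	return (addr >> 63) != 0;
}

/*
 * Compact recs[] in place, keeping only branches whose target matches the
 * requested privilege levels. Returns the number of entries kept.
 */
static size_t filter_by_priv(struct branch_rec *recs, size_t nr,
			     bool want_user, bool want_kernel)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++) {
		bool kernel = is_kernel_addr(recs[i].to);

		if ((kernel && want_kernel) || (!kernel && want_user))
			recs[kept++] = recs[i];
	}

	return kept;
}
```

Because filtering happens after capture, a record may end up with fewer than 16 entries when some of the captured branches fall outside the requested privilege levels.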

This version adds a few simple modifications to perf record and report.
1. add the branch-brs event as a builtin so that it can be used directly:
perf record -e branch-brs ...
2. improve error handling for AMD IBS (contributed by Kim Phillips).
3. build on that improved error handling to report BRS errors.
4. add two new sort dimensions to help display the branch sampling
information. Because there is no latency information associated with the
branch sampling feature, perf report would collapse all samples within a
function into a single histogram entry. This is expected because the
default sort mode for PERF_SAMPLE_BRANCH_STACK is symbol_from/symbol_to.
This propagates to the annotation.

For a more detailed view of the branch samples, the new sort dimensions
addr_from,addr_to can be used instead as follows:

$ perf report --sort=overhead,comm,dso,addr_from,addr_to
# Overhead  Command   Shared Object  Source Address       Target Address
# ........  ........  .............  ...................  ...................
#
     4.21%  test_prg  test_prg       [.] test_threa+0x3c  [.] test_threa+0x4
     4.14%  test_prg  test_prg       [.] test_threa+0x3e  [.] test_threa+0x2
     4.10%  test_prg  test_prg       [.] test_threa+0x4   [.] test_threa+0x3a
     4.07%  test_prg  test_prg       [.] test_threa+0x2   [.] test_threa+0x3c

Versus the default output:

$ perf report
# Overhead  Command   Source Shared Object  Source Symbol    Target Symbol    Basic Block Cycles
# ........  ........  ....................  ...............  ...............  ..................
#
    99.52%  test_prg  test_prg              [.] test_thread  [.] test_thread  -

BRS can be used with any sampling event. However, it is recommended to use
the RETIRED_BRANCH event because it matches what the BRS captures. For
convenience, a pseudo event matching the branches captured by BRS is
exported by the kernel (branch-brs):

$ perf record -b -e cpu/branch-brs/ -c 1000037 test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
..... 0: 0000000000401d24 -> 0000000000401d5a 0 cycles 0
..... 1: 0000000000401d5c -> 0000000000401d24 0 cycles 0
..... 2: 0000000000401d22 -> 0000000000401d5c 0 cycles 0
..... 3: 0000000000401d5e -> 0000000000401d22 0 cycles 0
..... 4: 0000000000401d20 -> 0000000000401d5e 0 cycles 0
..... 5: 0000000000401d3e -> 0000000000401d20 0 cycles 0
..... 6: 0000000000401d42 -> 0000000000401d3e 0 cycles 0
..... 7: 0000000000401d3c -> 0000000000401d42 0 cycles 0
..... 8: 0000000000401d44 -> 0000000000401d3c 0 cycles 0
..... 9: 0000000000401d3a -> 0000000000401d44 0 cycles 0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles 0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles 0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles 0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles 0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles 0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles 0
... thread: test:18230
...... dso: test

Special thanks to Kim Phillips @ AMD for the testing, reviews and
contributions.

V2 makes the following changes:
- the low power callback code has been reworked completely. It no longer
impacts the generic perf_events code. This is all handled in x86 code
and only for the ACPI low power driver, which seems to be the default
on AMD. The changes in acpi_pad.c and processor_idle.c have no impact
on non-x86 architectures, on Intel x86, or on AMD without BRS: a jump
label is used to avoid the code unless necessary
- BRS is an opt-in compile time option for the kernel
- branch_stack bit clearing helper is introduced
- As for the fact that BRS holds the NMI and that it may conflict with
other sampling events and introduce skid, this is not really a problem
because AMD PMI skid is already very large, prompting special handling in
amd_pmu_wait_on_overflow(), so adding a few cycles while the CPU executes
at most 16 taken branches is not a problem.


V3 makes the following changes:
- simplifies the handling of BRS enable/disable to mimic the Intel LBR code
path more closely. That removes some callbacks in generic x86 code
- add config option to compile BRS as an opt-in (off by default)
- updated perf tool error reporting patch by Kim Phillips

V4 makes the following changes:
- rebase to latest tip.git (commit 6f5ac142e5df)
- integrate Kim Phillips' latest perf tool error handling patches


V5 makes the following changes:
- rebased to 5.17-rc1
- integrated feedback from PeterZ about AMD perf_events callbacks
- fix cpufeature macro name X86_FEATURE_BRS
- integrated all perf tool error handling from Kim Phillips

V6 makes the following changes:
- rebase to 5.17-rc3
- fix the typo in the Kconfig for BRS opt-in
- reword all changelogs to align better with the standard changelog style


Kim Phillips (1):
perf tools: Improve IBS error handling

Stephane Eranian (11):
perf/core: add perf_clear_branch_entry_bitfields() helper
x86/cpufeatures: add AMD Fam19h Branch Sampling feature
perf/x86/amd: add AMD Fam19h Branch Sampling support
perf/x86/amd: add branch-brs helper event for Fam19h BRS
perf/x86/amd: enable branch sampling priv level filtering
perf/x86/amd: add AMD branch sampling period adjustment
perf/x86/amd: make Zen3 branch sampling opt-in
ACPI: add perf low power callback
perf/x86/amd: add idle hooks for branch sampling
perf tools: Improve error handling of AMD Branch Sampling
perf report: add addr_from/addr_to sort dimensions

arch/x86/events/Kconfig | 8 +
arch/x86/events/amd/Makefile | 1 +
arch/x86/events/amd/brs.c | 363 +++++++++++++++++++++++++++++
arch/x86/events/amd/core.c | 217 ++++++++++++++++-
arch/x86/events/core.c | 17 +-
arch/x86/events/intel/lbr.c | 36 ++-
arch/x86/events/perf_event.h | 143 ++++++++++--
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/perf_event.h | 21 ++
drivers/acpi/acpi_pad.c | 6 +
drivers/acpi/processor_idle.c | 5 +
include/linux/perf_event.h | 22 ++
tools/perf/util/evsel.c | 38 +++
tools/perf/util/hist.c | 2 +
tools/perf/util/hist.h | 2 +
tools/perf/util/sort.c | 128 ++++++++++
tools/perf/util/sort.h | 2 +
18 files changed, 976 insertions(+), 40 deletions(-)
create mode 100644 arch/x86/events/amd/brs.c

--
2.35.0.263.gb82422642f-goog



2022-02-09 10:31:45

by Stephane Eranian

Subject: [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling

Improve the error message printed by perf when perf_event_open() fails on
AMD Zen3 when using the branch sampling feature. In the case of EINVAL, there
are two main reasons: frequency mode or period is smaller than the depth of
the branch sampling buffer (16). The patch checks the parameters of the call
and tries to print a relevant message to explain the error:

$ perf record -b -e cpu/branch-brs/ -c 10 ls
Error:
AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.

$ perf record -b -e cpu/branch-brs/ ls
Error:
AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.

Signed-off-by: Stephane Eranian <[email protected]>
[Rebased on commit 9fe8895a27a84 ("perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings")]
Signed-off-by: Kim Phillips <[email protected]>
---
tools/perf/util/evsel.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d42f63a484df..7311e7b4d34d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2857,6 +2857,12 @@ static bool is_amd_ibs(struct evsel *evsel)
return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
}

+static bool is_amd_brs(struct evsel *evsel)
+{
+ return ((evsel->core.attr.config & 0xff) == 0xc4) &&
+ (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK);
+}
+
int evsel__open_strerror(struct evsel *evsel, struct target *target,
int err, char *msg, size_t size)
{
@@ -2971,6 +2977,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
return scnprintf(msg, size,
"AMD IBS may only be available in system-wide/per-cpu mode. Try using -a, or -C and workload affinity");
}
+ if (is_amd_brs(evsel)) {
+ if (evsel->core.attr.freq)
+ return scnprintf(msg, size,
+ "AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
+ /* another reason is that the period is too small */
+ return scnprintf(msg, size,
+ "AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
+ }
}

break;
--
2.35.0.263.gb82422642f-goog


2022-02-09 11:35:51

by Stephane Eranian

Subject: [PATCH v6 07/12] perf/x86/amd: make Zen3 branch sampling opt-in

Add a kernel config option CONFIG_PERF_EVENTS_AMD_BRS
to make the support for AMD Zen3 Branch Sampling (BRS) an opt-in
compile time option.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/Kconfig | 8 ++++++
arch/x86/events/amd/Makefile | 3 ++-
arch/x86/events/perf_event.h | 49 ++++++++++++++++++++++++++++--------
3 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index d6cdfe631674..09c56965750a 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -44,4 +44,12 @@ config PERF_EVENTS_AMD_UNCORE

To compile this driver as a module, choose M here: the
module will be called 'amd-uncore'.
+
+config PERF_EVENTS_AMD_BRS
+ depends on PERF_EVENTS && CPU_SUP_AMD
+ bool "AMD Zen3 Branch Sampling support"
+ help
+ Enable AMD Zen3 branch sampling support (BRS) which samples up to
+ 16 consecutive taken branches in registers.
+
endmenu
diff --git a/arch/x86/events/amd/Makefile b/arch/x86/events/amd/Makefile
index cf323ffab5cd..b9f5d4610256 100644
--- a/arch/x86/events/amd/Makefile
+++ b/arch/x86/events/amd/Makefile
@@ -1,5 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_CPU_SUP_AMD) += core.o brs.o
+obj-$(CONFIG_CPU_SUP_AMD) += core.o
+obj-$(CONFIG_PERF_EVENTS_AMD_BRS) += brs.o
obj-$(CONFIG_PERF_EVENTS_AMD_POWER) += power.o
obj-$(CONFIG_X86_LOCAL_APIC) += ibs.o
obj-$(CONFIG_PERF_EVENTS_AMD_UNCORE) += amd-uncore.o
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 25b037b571e4..4d050579dcbd 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1218,6 +1218,8 @@ static inline bool fixed_counter_disabled(int i, struct pmu *pmu)
#ifdef CONFIG_CPU_SUP_AMD

int amd_pmu_init(void);
+
+#ifdef CONFIG_PERF_EVENTS_AMD_BRS
int amd_brs_init(void);
void amd_brs_disable(void);
void amd_brs_enable(void);
@@ -1252,25 +1254,52 @@ static inline void amd_pmu_brs_del(struct perf_event *event)

void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);

-/*
- * check if BRS is activated on the CPU
- * active defined as it has non-zero users and DBG_EXT_CFG.BRSEN=1
- */
-static inline bool amd_brs_active(void)
+static inline s64 amd_brs_adjust_period(s64 period)
{
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ if (period > x86_pmu.lbr_nr)
+ return period - x86_pmu.lbr_nr;

- return cpuc->brs_active;
+ return period;
+}
+#else
+static inline int amd_brs_init(void)
+{
+ return 0;
}
+static inline void amd_brs_disable(void) {}
+static inline void amd_brs_enable(void) {}
+static inline void amd_brs_drain(void) {}
+static inline void amd_brs_lopwr_init(void) {}
+static inline void amd_brs_disable_all(void) {}
+static inline int amd_brs_setup_filter(struct perf_event *event)
+{
+ return 0;
+}
+static inline void amd_brs_reset(void) {}

-static inline s64 amd_brs_adjust_period(s64 period)
+static inline void amd_pmu_brs_add(struct perf_event *event)
{
- if (period > x86_pmu.lbr_nr)
- return period - x86_pmu.lbr_nr;
+}
+
+static inline void amd_pmu_brs_del(struct perf_event *event)
+{
+}
+
+static inline void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+}

+static inline s64 amd_brs_adjust_period(s64 period)
+{
return period;
}

+static inline void amd_brs_enable_all(void)
+{
+}
+
+#endif
+
#else /* CONFIG_CPU_SUP_AMD */

static inline int amd_pmu_init(void)
--
2.35.0.263.gb82422642f-goog


2022-02-16 15:16:41

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling

Em Tue, Feb 08, 2022 at 01:16:36PM -0800, Stephane Eranian escreveu:
> Improve the error message printed by perf when perf_event_open() fails on
> AMD Zen3 when using the branch sampling feature. In the case of EINVAL, there
> are two main reasons: frequency mode or period is smaller than the depth of
> the branch sampling buffer (16). The patch checks the parameters of the call
> and tries to print a relevant message to explain the error:
>
> $ perf record -b -e cpu/branch-brs/ -c 10 ls
> Error:
> AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.
>
> $ perf record -b -e cpu/branch-brs/ ls
> Error:
> AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.
>
> Signed-off-by: Stephane Eranian <[email protected]>
> [Rebased on commit 9fe8895a27a84 ("perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings")]
> Signed-off-by: Kim Phillips <[email protected]>
> ---
> tools/perf/util/evsel.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d42f63a484df..7311e7b4d34d 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2857,6 +2857,12 @@ static bool is_amd_ibs(struct evsel *evsel)
> return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
> }
>
> +static bool is_amd_brs(struct evsel *evsel)
> +{
> + return ((evsel->core.attr.config & 0xff) == 0xc4) &&
> + (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK);
> +}
> +

Well, this assumes we're on x86_64, right? Shouldn't we have some extra
condition using perf_env routines to check that we're on x86_64?

Did a quick check and powerpc also supports PERF_SAMPLE_BRANCH_STACK:

⬢[acme@toolbox perf]$ find arch/ -name "*.c" | xargs grep PERF_SAMPLE_BRANCH_STACK
arch/powerpc/perf/core-book3s.c: if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
arch/x86/events/intel/ds.c: if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
arch/x86/events/intel/ds.c: PERF_SAMPLE_BRANCH_STACK |
arch/x86/events/intel/lbr.c: * in PERF_SAMPLE_BRANCH_STACK sample may vary.
arch/x86/kvm/vmx/pmu_intel.c: * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
arch/x86/kvm/vmx/pmu_intel.c: .sample_type = PERF_SAMPLE_BRANCH_STACK,
⬢[acme@toolbox perf]$

arch/powerpc/perf/core-book3s.c:

if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
struct cpu_hw_events *cpuhw;
cpuhw = this_cpu_ptr(&cpu_hw_events);
power_pmu_bhrb_read(event, cpuhw);
data.br_stack = &cpuhw->bhrb_stack;
}

> int evsel__open_strerror(struct evsel *evsel, struct target *target,
> int err, char *msg, size_t size)
> {
> @@ -2971,6 +2977,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
> return scnprintf(msg, size,
> "AMD IBS may only be available in system-wide/per-cpu mode. Try using -a, or -C and workload affinity");
> }
> + if (is_amd_brs(evsel)) {
> + if (evsel->core.attr.freq)
> + return scnprintf(msg, size,
> + "AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
> + /* another reason is that the period is too small */
> + return scnprintf(msg, size,
> + "AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
> + }
> }
>
> break;
> --
> 2.35.0.263.gb82422642f-goog

--

- Arnaldo