2022-11-07 06:38:55

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 0/7] arm64/perf: Enable branch stack sampling

This series enables perf branch stack sampling support on arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All
relevant register definitions could be accessed here.

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on v6.1-rc4.

Changes in V5:

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/[email protected]/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/[email protected]/

- Moved brbe_stack from the stack and now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branche records are now captured
only if the event has PERF_SAMPLE_BRANCH_KERNEL which would already
been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/[email protected]/

- Dropped branch sample filter helpers consolidation patch from this series
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/[email protected]/

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: James Clark <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Anshuman Khandual (7):
arm64/perf: Add BRBE registers and fields
arm64/perf: Update struct arm_pmu for BRBE
arm64/perf: Update struct pmu_hw_events for BRBE
driver/perf/arm_pmu_platform: Add support for BRBE attributes detection
arm64/perf: Drive BRBE from perf event states
arm64/perf: Add BRBE driver
arm64/perf: Enable branch stack sampling

arch/arm64/include/asm/sysreg.h | 103 ++++++++
arch/arm64/kernel/perf_event.c | 49 ++++
arch/arm64/tools/sysreg | 161 ++++++++++++
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_pmu.c | 66 ++++-
drivers/perf/arm_pmu_brbe.c | 441 ++++++++++++++++++++++++++++++++
drivers/perf/arm_pmu_brbe.h | 259 +++++++++++++++++++
drivers/perf/arm_pmu_platform.c | 34 +++
include/linux/perf/arm_pmu.h | 68 +++++
10 files changed, 1190 insertions(+), 3 deletions(-)
create mode 100644 drivers/perf/arm_pmu_brbe.c
create mode 100644 drivers/perf/arm_pmu_brbe.h

--
2.25.1



2022-11-07 06:42:55

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 4/7] driver/perf/arm_pmu_platform: Add support for BRBE attributes detection

This adds arm pmu infrastrure to probe BRBE implementation's attributes via
driver exported callbacks later. The actual BRBE feature detection will be
added by the driver itself.

CPU specific BRBE entries, cycle count, format support gets detected during
PMU init. This information gets saved in per-cpu struct pmu_hw_events which
later helps in operating BRBE during a perf event context.

Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/perf/arm_pmu_platform.c | 34 +++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/drivers/perf/arm_pmu_platform.c b/drivers/perf/arm_pmu_platform.c
index 933b96e243b8..acdc445081aa 100644
--- a/drivers/perf/arm_pmu_platform.c
+++ b/drivers/perf/arm_pmu_platform.c
@@ -172,6 +172,36 @@ static int armpmu_request_irqs(struct arm_pmu *armpmu)
return err;
}

+static void arm_brbe_probe_cpu(void *info)
+{
+ struct pmu_hw_events *hw_events;
+ struct arm_pmu *armpmu = info;
+
+ /*
+ * Return from here, if BRBE driver has not been
+ * implemented for this PMU. This helps prevent
+ * kernel crash later when brbe_probe() will be
+ * called on the PMU.
+ */
+ if (!armpmu->brbe_probe)
+ return;
+
+ hw_events = per_cpu_ptr(armpmu->hw_events, smp_processor_id());
+ armpmu->brbe_probe(hw_events);
+}
+
+static int armpmu_request_brbe(struct arm_pmu *armpmu)
+{
+ int cpu, err = 0;
+
+ for_each_cpu(cpu, &armpmu->supported_cpus) {
+ err = smp_call_function_single(cpu, arm_brbe_probe_cpu, armpmu, 1);
+ if (err)
+ return err;
+ }
+ return err;
+}
+
static void armpmu_free_irqs(struct arm_pmu *armpmu)
{
int cpu;
@@ -229,6 +259,10 @@ int arm_pmu_device_probe(struct platform_device *pdev,
if (ret)
goto out_free_irqs;

+ ret = armpmu_request_brbe(pmu);
+ if (ret)
+ goto out_free_irqs;
+
ret = armpmu_register(pmu);
if (ret) {
dev_err(dev, "failed to register PMU devices!\n");
--
2.25.1


2022-11-07 06:43:31

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 2/7] arm64/perf: Update struct arm_pmu for BRBE

Although BRBE is an armv8 speciifc HW feature, abstracting out its various
function callbacks at the struct arm_pmu level is preferred, as it cleaner
, easier to follow and maintain.

Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
might not fit seamlessly, when tried to be embedded via existing arm_pmu
helpers in the armv8 implementation.

Updates the struct arm_pmu to include all required helpers that will drive
BRBE functionality for a given PMU implementation. These are the following.

- brbe_filter : Convert perf event filters into BRBE HW filters
- brbe_probe : Probe BRBE HW and capture its attributes
- brbe_enable : Enable BRBE HW with a given config
- brbe_disable : Disable BRBE HW
- brbe_read : Read BRBE buffer for captured branch records
- brbe_reset : Reset BRBE buffer
- brbe_supported: Whether BRBE is supported or not

A BRBE driver implementation needs to provide these functionalities.

Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/kernel/perf_event.c | 36 ++++++++++++++++++++++++++++++++++
include/linux/perf/arm_pmu.h | 21 ++++++++++++++++++++
2 files changed, 57 insertions(+)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 7b0643fe2f13..c97377e28288 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -1025,6 +1025,35 @@ static int armv8pmu_filter_match(struct perf_event *event)
return evtype != ARMV8_PMUV3_PERFCTR_CHAIN;
}

+static void armv8pmu_brbe_filter(struct pmu_hw_events *hw_event, struct perf_event *event)
+{
+}
+
+static void armv8pmu_brbe_enable(struct pmu_hw_events *hw_event)
+{
+}
+
+static void armv8pmu_brbe_disable(struct pmu_hw_events *hw_event)
+{
+}
+
+static void armv8pmu_brbe_read(struct pmu_hw_events *hw_event, struct perf_event *event)
+{
+}
+
+static void armv8pmu_brbe_probe(struct pmu_hw_events *hw_event)
+{
+}
+
+static void armv8pmu_brbe_reset(struct pmu_hw_events *hw_event)
+{
+}
+
+static bool armv8pmu_brbe_supported(struct perf_event *event)
+{
+ return false;
+}
+
static void armv8pmu_reset(void *info)
{
struct arm_pmu *cpu_pmu = (struct arm_pmu *)info;
@@ -1257,6 +1286,13 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,

cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;

+ cpu_pmu->brbe_filter = armv8pmu_brbe_filter;
+ cpu_pmu->brbe_enable = armv8pmu_brbe_enable;
+ cpu_pmu->brbe_disable = armv8pmu_brbe_disable;
+ cpu_pmu->brbe_read = armv8pmu_brbe_read;
+ cpu_pmu->brbe_probe = armv8pmu_brbe_probe;
+ cpu_pmu->brbe_reset = armv8pmu_brbe_reset;
+ cpu_pmu->brbe_supported = armv8pmu_brbe_supported;
cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
cpu_pmu->attr_groups[ARMPMU_ATTR_GROUP_EVENTS] = events ?
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 0356cb6a215d..67a6d59786f2 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -101,6 +101,27 @@ struct arm_pmu {
void (*reset)(void *);
int (*map_event)(struct perf_event *event);
int (*filter_match)(struct perf_event *event);
+
+ /* Convert perf event filters into BRBE HW filters */
+ void (*brbe_filter)(struct pmu_hw_events *hw_events, struct perf_event *event);
+
+ /* Probe BRBE HW and capture its attributes */
+ void (*brbe_probe)(struct pmu_hw_events *hw_events);
+
+ /* Enable BRBE HW with a given config */
+ void (*brbe_enable)(struct pmu_hw_events *hw_events);
+
+ /* Disable BRBE HW */
+ void (*brbe_disable)(struct pmu_hw_events *hw_events);
+
+ /* Process BRBE buffer for captured branch records */
+ void (*brbe_read)(struct pmu_hw_events *hw_events, struct perf_event *event);
+
+ /* Reset BRBE buffer */
+ void (*brbe_reset)(struct pmu_hw_events *hw_events);
+
+ /* Check whether BRBE is supported */
+ bool (*brbe_supported)(struct perf_event *event);
int num_events;
bool secure_access; /* 32-bit ARM only */
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
--
2.25.1


2022-11-07 06:43:52

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 1/7] arm64/perf: Add BRBE registers and fields

This adds BRBE related register definitions and various other related field
macros there in. These will be used subsequently in a BRBE driver which is
being added later on.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/include/asm/sysreg.h | 103 ++++++++++++++++++++
arch/arm64/tools/sysreg | 161 ++++++++++++++++++++++++++++++++
2 files changed, 264 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 7d301700d1a9..78335c7807dc 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -161,6 +161,109 @@
#define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
#define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)

+#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
+#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
+#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
+
+#define SYS_BRBINF0_EL1 __SYS_BRBINFO(0)
+#define SYS_BRBINF1_EL1 __SYS_BRBINFO(1)
+#define SYS_BRBINF2_EL1 __SYS_BRBINFO(2)
+#define SYS_BRBINF3_EL1 __SYS_BRBINFO(3)
+#define SYS_BRBINF4_EL1 __SYS_BRBINFO(4)
+#define SYS_BRBINF5_EL1 __SYS_BRBINFO(5)
+#define SYS_BRBINF6_EL1 __SYS_BRBINFO(6)
+#define SYS_BRBINF7_EL1 __SYS_BRBINFO(7)
+#define SYS_BRBINF8_EL1 __SYS_BRBINFO(8)
+#define SYS_BRBINF9_EL1 __SYS_BRBINFO(9)
+#define SYS_BRBINF10_EL1 __SYS_BRBINFO(10)
+#define SYS_BRBINF11_EL1 __SYS_BRBINFO(11)
+#define SYS_BRBINF12_EL1 __SYS_BRBINFO(12)
+#define SYS_BRBINF13_EL1 __SYS_BRBINFO(13)
+#define SYS_BRBINF14_EL1 __SYS_BRBINFO(14)
+#define SYS_BRBINF15_EL1 __SYS_BRBINFO(15)
+#define SYS_BRBINF16_EL1 __SYS_BRBINFO(16)
+#define SYS_BRBINF17_EL1 __SYS_BRBINFO(17)
+#define SYS_BRBINF18_EL1 __SYS_BRBINFO(18)
+#define SYS_BRBINF19_EL1 __SYS_BRBINFO(19)
+#define SYS_BRBINF20_EL1 __SYS_BRBINFO(20)
+#define SYS_BRBINF21_EL1 __SYS_BRBINFO(21)
+#define SYS_BRBINF22_EL1 __SYS_BRBINFO(22)
+#define SYS_BRBINF23_EL1 __SYS_BRBINFO(23)
+#define SYS_BRBINF24_EL1 __SYS_BRBINFO(24)
+#define SYS_BRBINF25_EL1 __SYS_BRBINFO(25)
+#define SYS_BRBINF26_EL1 __SYS_BRBINFO(26)
+#define SYS_BRBINF27_EL1 __SYS_BRBINFO(27)
+#define SYS_BRBINF28_EL1 __SYS_BRBINFO(28)
+#define SYS_BRBINF29_EL1 __SYS_BRBINFO(29)
+#define SYS_BRBINF30_EL1 __SYS_BRBINFO(30)
+#define SYS_BRBINF31_EL1 __SYS_BRBINFO(31)
+
+#define SYS_BRBSRC0_EL1 __SYS_BRBSRC(0)
+#define SYS_BRBSRC1_EL1 __SYS_BRBSRC(1)
+#define SYS_BRBSRC2_EL1 __SYS_BRBSRC(2)
+#define SYS_BRBSRC3_EL1 __SYS_BRBSRC(3)
+#define SYS_BRBSRC4_EL1 __SYS_BRBSRC(4)
+#define SYS_BRBSRC5_EL1 __SYS_BRBSRC(5)
+#define SYS_BRBSRC6_EL1 __SYS_BRBSRC(6)
+#define SYS_BRBSRC7_EL1 __SYS_BRBSRC(7)
+#define SYS_BRBSRC8_EL1 __SYS_BRBSRC(8)
+#define SYS_BRBSRC9_EL1 __SYS_BRBSRC(9)
+#define SYS_BRBSRC10_EL1 __SYS_BRBSRC(10)
+#define SYS_BRBSRC11_EL1 __SYS_BRBSRC(11)
+#define SYS_BRBSRC12_EL1 __SYS_BRBSRC(12)
+#define SYS_BRBSRC13_EL1 __SYS_BRBSRC(13)
+#define SYS_BRBSRC14_EL1 __SYS_BRBSRC(14)
+#define SYS_BRBSRC15_EL1 __SYS_BRBSRC(15)
+#define SYS_BRBSRC16_EL1 __SYS_BRBSRC(16)
+#define SYS_BRBSRC17_EL1 __SYS_BRBSRC(17)
+#define SYS_BRBSRC18_EL1 __SYS_BRBSRC(18)
+#define SYS_BRBSRC19_EL1 __SYS_BRBSRC(19)
+#define SYS_BRBSRC20_EL1 __SYS_BRBSRC(20)
+#define SYS_BRBSRC21_EL1 __SYS_BRBSRC(21)
+#define SYS_BRBSRC22_EL1 __SYS_BRBSRC(22)
+#define SYS_BRBSRC23_EL1 __SYS_BRBSRC(23)
+#define SYS_BRBSRC24_EL1 __SYS_BRBSRC(24)
+#define SYS_BRBSRC25_EL1 __SYS_BRBSRC(25)
+#define SYS_BRBSRC26_EL1 __SYS_BRBSRC(26)
+#define SYS_BRBSRC27_EL1 __SYS_BRBSRC(27)
+#define SYS_BRBSRC28_EL1 __SYS_BRBSRC(28)
+#define SYS_BRBSRC29_EL1 __SYS_BRBSRC(29)
+#define SYS_BRBSRC30_EL1 __SYS_BRBSRC(30)
+#define SYS_BRBSRC31_EL1 __SYS_BRBSRC(31)
+
+#define SYS_BRBTGT0_EL1 __SYS_BRBTGT(0)
+#define SYS_BRBTGT1_EL1 __SYS_BRBTGT(1)
+#define SYS_BRBTGT2_EL1 __SYS_BRBTGT(2)
+#define SYS_BRBTGT3_EL1 __SYS_BRBTGT(3)
+#define SYS_BRBTGT4_EL1 __SYS_BRBTGT(4)
+#define SYS_BRBTGT5_EL1 __SYS_BRBTGT(5)
+#define SYS_BRBTGT6_EL1 __SYS_BRBTGT(6)
+#define SYS_BRBTGT7_EL1 __SYS_BRBTGT(7)
+#define SYS_BRBTGT8_EL1 __SYS_BRBTGT(8)
+#define SYS_BRBTGT9_EL1 __SYS_BRBTGT(9)
+#define SYS_BRBTGT10_EL1 __SYS_BRBTGT(10)
+#define SYS_BRBTGT11_EL1 __SYS_BRBTGT(11)
+#define SYS_BRBTGT12_EL1 __SYS_BRBTGT(12)
+#define SYS_BRBTGT13_EL1 __SYS_BRBTGT(13)
+#define SYS_BRBTGT14_EL1 __SYS_BRBTGT(14)
+#define SYS_BRBTGT15_EL1 __SYS_BRBTGT(15)
+#define SYS_BRBTGT16_EL1 __SYS_BRBTGT(16)
+#define SYS_BRBTGT17_EL1 __SYS_BRBTGT(17)
+#define SYS_BRBTGT18_EL1 __SYS_BRBTGT(18)
+#define SYS_BRBTGT19_EL1 __SYS_BRBTGT(19)
+#define SYS_BRBTGT20_EL1 __SYS_BRBTGT(20)
+#define SYS_BRBTGT21_EL1 __SYS_BRBTGT(21)
+#define SYS_BRBTGT22_EL1 __SYS_BRBTGT(22)
+#define SYS_BRBTGT23_EL1 __SYS_BRBTGT(23)
+#define SYS_BRBTGT24_EL1 __SYS_BRBTGT(24)
+#define SYS_BRBTGT25_EL1 __SYS_BRBTGT(25)
+#define SYS_BRBTGT26_EL1 __SYS_BRBTGT(26)
+#define SYS_BRBTGT27_EL1 __SYS_BRBTGT(27)
+#define SYS_BRBTGT28_EL1 __SYS_BRBTGT(28)
+#define SYS_BRBTGT29_EL1 __SYS_BRBTGT(29)
+#define SYS_BRBTGT30_EL1 __SYS_BRBTGT(30)
+#define SYS_BRBTGT31_EL1 __SYS_BRBTGT(31)
+
#define SYS_MIDR_EL1 sys_reg(3, 0, 0, 0, 0)
#define SYS_MPIDR_EL1 sys_reg(3, 0, 0, 0, 5)
#define SYS_REVIDR_EL1 sys_reg(3, 0, 0, 0, 6)
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 384757a7eda9..45b1834de1ae 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -167,6 +167,167 @@ Enum 3:0 BT
EndEnum
EndSysreg

+
+# This is just a dummy register declaration to get all common field masks and
+# shifts for accessing given BRBINF contents.
+Sysreg BRBINF_EL1 2 1 8 0 0
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIR
+ 0b000001 INDIR
+ 0b000010 DIR_LINK
+ 0b000011 INDIR_LINK
+ 0b000101 RET_SUB
+ 0b000111 RET_EXCPT
+ 0b001000 COND_DIR
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INST_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALGN_FAULT
+ 0b101011 INST_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysreg
+
+Sysreg BRBCR_EL1 2 1 9 0 0
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:9
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 E1BRE
+Field 0 E0BRE
+EndSysreg
+
+Sysreg BRBFCR_EL1 2 1 9 0 1
+Res0 63:30
+Enum 29:28 BANK
+ 0b0 FIRST
+ 0b1 SECOND
+EndEnum
+Res0 27:23
+Field 22 CONDDIR
+Field 21 DIRCALL
+Field 20 INDCALL
+Field 19 RTN
+Field 18 INDIRECT
+Field 17 DIRECT
+Field 16 EnI
+Res0 15:8
+Field 7 PAUSED
+Field 6 LASTFAILED
+Res0 5:0
+EndSysreg
+
+Sysreg BRBTS_EL1 2 1 9 0 2
+Field 63:0 TS
+EndSysreg
+
+Sysreg BRBINFINJ_EL1 2 1 9 1 0
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIR
+ 0b000001 INDIR
+ 0b000010 DIR_LINK
+ 0b000011 INDIR_LINK
+ 0b000100 RET_SUB
+ 0b000100 RET_SUB
+ 0b000111 RET_EXCPT
+ 0b001000 COND_DIR
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INST_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALGN_FAULT
+ 0b101011 INST_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b00 FULL
+EndEnum
+EndSysreg
+
+Sysreg BRBSRCINJ_EL1 2 1 9 1 1
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBTGTINJ_EL1 2 1 9 1 2
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBIDR0_EL1 2 1 9 2 0
+Res0 63:16
+Enum 15:12 CC
+ 0b101 20_BIT
+EndEnum
+Enum 11:8 FORMAT
+ 0b0 0
+EndEnum
+Enum 7:0 NUMREC
+ 0b1000 8
+ 0b10000 16
+ 0b100000 32
+ 0b1000000 64
+EndEnum
+EndSysreg
+
Sysreg ID_AA64ZFR0_EL1 3 0 0 4 4
Res0 63:60
Enum 59:56 F64MM
--
2.25.1


2022-11-07 06:45:15

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 5/7] arm64/perf: Drive BRBE from perf event states

Branch stack sampling rides along the normal perf event and all the branch
records get captured during the PMU interrupt. This just changes perf event
handling on the arm64 platform to accommodate required BRBE operations that
will enable branch stack sampling support.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/kernel/perf_event.c | 7 ++++++
drivers/perf/arm_pmu.c | 40 ++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index c97377e28288..97db333d1208 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -874,6 +874,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
if (!armpmu_event_set_period(event))
continue;

+ if (has_branch_stack(event)) {
+ cpu_pmu->brbe_read(cpuc, event);
+ data.br_stack = &cpuc->branches->brbe_stack;
+ data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
+ cpu_pmu->brbe_reset(cpuc);
+ }
+
/*
* Perf event overflow will queue the processing of the event as
* an irq_work which will be taken care of in the handling of
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 5048a500441e..1a8dca4e513e 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -271,12 +271,22 @@ armpmu_stop(struct perf_event *event, int flags)
{
struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);

/*
* ARM pmu always has to update the counter, so ignore
* PERF_EF_UPDATE, see comments in armpmu_start().
*/
if (!(hwc->state & PERF_HES_STOPPED)) {
+ if (has_branch_stack(event)) {
+ WARN_ON_ONCE(!hw_events->brbe_users);
+ hw_events->brbe_users--;
+ if (!hw_events->brbe_users) {
+ hw_events->brbe_context = NULL;
+ armpmu->brbe_disable(hw_events);
+ }
+ }
+
armpmu->disable(event);
armpmu_event_update(event);
hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
@@ -287,6 +297,7 @@ static void armpmu_start(struct perf_event *event, int flags)
{
struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);

/*
* ARM pmu always has to reprogram the period, so ignore
@@ -304,6 +315,14 @@ static void armpmu_start(struct perf_event *event, int flags)
* happened since disabling.
*/
armpmu_event_set_period(event);
+ if (has_branch_stack(event)) {
+ if (event->ctx->task && hw_events->brbe_context != event->ctx) {
+ armpmu->brbe_reset(hw_events);
+ hw_events->brbe_context = event->ctx;
+ }
+ armpmu->brbe_enable(hw_events);
+ hw_events->brbe_users++;
+ }
armpmu->enable(event);
}

@@ -349,6 +368,10 @@ armpmu_add(struct perf_event *event, int flags)
hw_events->events[idx] = event;

hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (has_branch_stack(event))
+ armpmu->brbe_filter(hw_events, event);
+
if (flags & PERF_EF_START)
armpmu_start(event, PERF_EF_RELOAD);

@@ -443,6 +466,7 @@ __hw_perf_event_init(struct perf_event *event)
{
struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
int mapping;

hwc->flags = 0;
@@ -492,6 +516,9 @@ __hw_perf_event_init(struct perf_event *event)
local64_set(&hwc->period_left, hwc->sample_period);
}

+ if (has_branch_stack(event))
+ armpmu->brbe_filter(hw_events, event);
+
return validate_group(event);
}

@@ -520,6 +547,18 @@ static int armpmu_event_init(struct perf_event *event)
return __hw_perf_event_init(event);
}

+static void armpmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(ctx->pmu);
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
+
+ if (!hw_events->brbe_users)
+ return;
+
+ if (sched_in)
+ armpmu->brbe_reset(hw_events);
+}
+
static void armpmu_enable(struct pmu *pmu)
{
struct arm_pmu *armpmu = to_arm_pmu(pmu);
@@ -877,6 +916,7 @@ static struct arm_pmu *__armpmu_alloc(gfp_t flags)
}

pmu->pmu = (struct pmu) {
+ .sched_task = armpmu_sched_task,
.pmu_enable = armpmu_enable,
.pmu_disable = armpmu_disable,
.event_init = armpmu_event_init,
--
2.25.1


2022-11-07 07:04:16

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 7/7] arm64/perf: Enable branch stack sampling

Now that all the required pieces are already in place, just enable the perf
branch stack sampling support on arm64 platform, by removing the gate which
blocks it in armpmu_event_init().

Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/perf/arm_pmu.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 1a8dca4e513e..dc5e4f9aca22 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -537,9 +537,28 @@ static int armpmu_event_init(struct perf_event *event)
!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
return -ENOENT;

- /* does not support taken branch sampling */
- if (has_branch_stack(event))
- return -EOPNOTSUPP;
+ if (has_branch_stack(event)) {
+ /*
+ * BRBE support is absent. Select CONFIG_ARM_BRBE_PMU
+ * in the config, before branch stack sampling events
+ * can be requested.
+ */
+ if (!IS_ENABLED(CONFIG_ARM_BRBE_PMU)) {
+ pr_info("BRBE is disabled, select CONFIG_ARM_BRBE_PMU\n");
+ return -EOPNOTSUPP;
+ }
+
+ /*
+ * Branch stack sampling event can not be supported in
+ * case either the required driver itself is absent or
+ * BRBE buffer, is not supported. Besides checking for
+ * the callback prevents a crash in case it's absent.
+ */
+ if (!armpmu->brbe_supported || !armpmu->brbe_supported(event)) {
+ pr_info("BRBE is not supported\n");
+ return -EOPNOTSUPP;
+ }
+ }

if (armpmu->map_event(event) == -ENOENT)
return -ENOENT;
--
2.25.1


2022-11-07 07:05:08

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V5 3/7] arm64/perf: Update struct pmu_hw_events for BRBE

A single perf event instance BRBE related contexts and data will be tracked
in struct pmu_hw_events. Hence update the structure to accommodate required
details related to BRBE.

Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/perf/arm_pmu.c | 1 +
include/linux/perf/arm_pmu.h | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 3f07df5a7e95..5048a500441e 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -905,6 +905,7 @@ static struct arm_pmu *__armpmu_alloc(gfp_t flags)

events = per_cpu_ptr(pmu->hw_events, cpu);
raw_spin_lock_init(&events->pmu_lock);
+ events->branches = kmalloc(sizeof(struct brbe_records), flags);
events->percpu_pmu = pmu;
}

diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 67a6d59786f2..bda0d9984a98 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -44,6 +44,16 @@ static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_47BIT) == ARMPMU_EVT_47BIT);
}, \
}

+/*
+ * Maximum branch records in BRBE
+ */
+#define BRBE_MAX_ENTRIES 64
+
+struct brbe_records {
+ struct perf_branch_stack brbe_stack;
+ struct perf_branch_entry brbe_entries[BRBE_MAX_ENTRIES];
+};
+
/* The events for a given PMU register set. */
struct pmu_hw_events {
/*
@@ -70,6 +80,23 @@ struct pmu_hw_events {
struct arm_pmu *percpu_pmu;

int irq;
+
+ /* Detected BRBE attributes */
+ bool brbe_v1p1;
+ int brbe_cc;
+ int brbe_nr;
+ int brbe_format;
+
+ /* Evaluated BRBE configuration */
+ u64 brbfcr;
+ u64 brbcr;
+
+ /* Tracked BRBE context */
+ unsigned int brbe_users;
+ void *brbe_context;
+
+ /* Captured BRBE buffer - copied as is into perf_sample_data */
+ struct brbe_records *branches;
};

enum armpmu_attr_groups {
--
2.25.1


2022-11-07 16:21:15

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH V5 1/7] arm64/perf: Add BRBE registers and fields

On Mon, Nov 07, 2022 at 11:55:08AM +0530, Anshuman Khandual wrote:
> This adds BRBE related register definitions and various other related field
> macros there in. These will be used subsequently in a BRBE driver which is
> being added later on.

Reviewed-by: Mark Brown <[email protected]>

Like I just said to Rob incremental differences are easier to review
when things are split up into a patch per register but this all looks
fine now so no worries.


Attachments:
(No filename) (466.00 B)
signature.asc (499.00 B)
Download all attachments

2022-11-09 11:51:45

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH V5 2/7] arm64/perf: Update struct arm_pmu for BRBE

On 07/11/2022 06:25, Anshuman Khandual wrote:
> Although BRBE is an armv8 speciifc HW feature, abstracting out its various
> function callbacks at the struct arm_pmu level is preferred, as it cleaner
> , easier to follow and maintain.
>
> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
> might not fit seamlessly, when tried to be embedded via existing arm_pmu
> helpers in the armv8 implementation.
>
> Updates the struct arm_pmu to include all required helpers that will drive
> BRBE functionality for a given PMU implementation. These are the following.
>
> - brbe_filter : Convert perf event filters into BRBE HW filters
> - brbe_probe : Probe BRBE HW and capture its attributes
> - brbe_enable : Enable BRBE HW with a given config
> - brbe_disable : Disable BRBE HW
> - brbe_read : Read BRBE buffer for captured branch records
> - brbe_reset : Reset BRBE buffer
> - brbe_supported: Whether BRBE is supported or not
>
> A BRBE driver implementation needs to provide these functionalities.

Could these not be hidden from the generic arm_pmu and kept in the
arm64 pmu backend ? It looks like they are quite easy to simply
move these to the corresponding hooks in arm64 pmu.

Suzuki


2022-11-18 07:14:19

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V5 2/7] arm64/perf: Update struct arm_pmu for BRBE



On 11/9/22 17:00, Suzuki K Poulose wrote:
> On 07/11/2022 06:25, Anshuman Khandual wrote:
>> Although BRBE is an armv8 speciifc HW feature, abstracting out its various
>> function callbacks at the struct arm_pmu level is preferred, as it cleaner
>> , easier to follow and maintain.
>>
>> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
>> might not fit seamlessly, when tried to be embedded via existing arm_pmu
>> helpers in the armv8 implementation.
>>
>> Updates the struct arm_pmu to include all required helpers that will drive
>> BRBE functionality for a given PMU implementation. These are the following.
>>
>> - brbe_filter    : Convert perf event filters into BRBE HW filters
>> - brbe_probe    : Probe BRBE HW and capture its attributes
>> - brbe_enable    : Enable BRBE HW with a given config
>> - brbe_disable    : Disable BRBE HW
>> - brbe_read    : Read BRBE buffer for captured branch records
>> - brbe_reset    : Reset BRBE buffer
>> - brbe_supported: Whether BRBE is supported or not
>>
>> A BRBE driver implementation needs to provide these functionalities.
>
> Could these not be hidden from the generic arm_pmu and kept in the
> arm64 pmu backend  ? It looks like they are quite easy to simply
> move these to the corresponding hooks in arm64 pmu.

We have had this discussion multiple times in the past [1], but I still
believe, keeping BRBE implementation hooks at the PMU level rather than
embedding them with other PMU events handling, is a much better logical
abstraction.

[1] https://lore.kernel.org/all/[email protected]/

--------------------------------------------------------------------------
>
> One thing to answer in the commit msg is why we need the hooks here.
> Have we concluded that adding BRBE hooks to struct arm_pmu for what is
> an armv8 specific feature is the right approach? I don't recall
> reaching that conclusion.

Although it might be possible to have this implementation embedded in
the existing armv8 PMU implementation, I still believe that the BRBE
functionalities abstracted out at the arm_pmu level with a separate
config option is cleaner, easier to follow and to maintain as well.

Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
might not fit seamlessly, when tried to be embedded via existing arm_pmu
helpers in the armv8 implementation.

Nonetheless if arm_pmu based additional BRBE helpers is absolutely a no
go for folks here in general, will explore arm64 based implementation.
----------------------------------------------------------------------------

I am still waiting for maintainer's take on this issue. I will be happy to
rework this series to move all these implementation inside arm64 callbacks
instead, if that is required or preferred by the maintainers. But according
to me, this current abstraction layout is much better.

- Anshuman

2022-11-18 18:12:39

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V5 2/7] arm64/perf: Update struct arm_pmu for BRBE


Hi Anshuman,

Apologies for the delayi n reviewing this.

On Fri, Nov 18, 2022 at 12:09:07PM +0530, Anshuman Khandual wrote:
> On 11/9/22 17:00, Suzuki K Poulose wrote:
> > On 07/11/2022 06:25, Anshuman Khandual wrote:
> >> Although BRBE is an armv8 speciifc HW feature, abstracting out its various
> >> function callbacks at the struct arm_pmu level is preferred, as it cleaner
> >> , easier to follow and maintain.
> >>
> >> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
> >> might not fit seamlessly, when tried to be embedded via existing arm_pmu
> >> helpers in the armv8 implementation.
> >>
> >> Updates the struct arm_pmu to include all required helpers that will drive
> >> BRBE functionality for a given PMU implementation. These are the following.
> >>
> >> - brbe_filter    : Convert perf event filters into BRBE HW filters
> >> - brbe_probe    : Probe BRBE HW and capture its attributes
> >> - brbe_enable    : Enable BRBE HW with a given config
> >> - brbe_disable    : Disable BRBE HW
> >> - brbe_read    : Read BRBE buffer for captured branch records
> >> - brbe_reset    : Reset BRBE buffer
> >> - brbe_supported: Whether BRBE is supported or not
> >>
> >> A BRBE driver implementation needs to provide these functionalities.
> >
> > Could these not be hidden from the generic arm_pmu and kept in the
> > arm64 pmu backend  ? It looks like they are quite easy to simply
> > move these to the corresponding hooks in arm64 pmu.
>
> We have had this discussion multiple times in the past [1], but I still
> believe, keeping BRBE implementation hooks at the PMU level rather than
> embedding them with other PMU events handling, is a much better logical
> abstraction.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> --------------------------------------------------------------------------
> >
> > One thing to answer in the commit msg is why we need the hooks here.
> > Have we concluded that adding BRBE hooks to struct arm_pmu for what is
> > an armv8 specific feature is the right approach? I don't recall
> > reaching that conclusion.
>
> Although it might be possible to have this implementation embedded in
> the existing armv8 PMU implementation, I still believe that the BRBE
> functionalities abstracted out at the arm_pmu level with a separate
> config option is cleaner, easier to follow and to maintain as well.
>
> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
> might not fit seamlessly, when tried to be embedded via existing arm_pmu
> helpers in the armv8 implementation.
>
> Nonetheless if arm_pmu based additional BRBE helpers is absolutely a no
> go for folks here in general, will explore arm64 based implementation.
> ----------------------------------------------------------------------------
>
> I am still waiting for maintainer's take on this issue. I will be happy to
> rework this series to move all these implementation inside arm64 callbacks
> instead, if that is required or preferred by the maintainers. But according
> to me, this current abstraction layout is much better.

To be honest, I'm not sure what's best right now; but at the moment it's not
clear to me why this couldn't fit within the existing hooks.

Above you say brbe_supported() / brbe_probe() / brbe_reset() didn't fit
seamlessly; can you give an example of problem? I think I'm missing something
obvious.

Thanks,
Mark.

2022-11-18 18:44:45

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V5 4/7] driver/perf/arm_pmu_platform: Add support for BRBE attributes detection

On Mon, Nov 07, 2022 at 11:55:11AM +0530, Anshuman Khandual wrote:
> This adds arm pmu infrastrure to probe BRBE implementation's attributes via
> driver exported callbacks later. The actual BRBE feature detection will be
> added by the driver itself.
>
> CPU specific BRBE entries, cycle count, format support gets detected during
> PMU init. This information gets saved in per-cpu struct pmu_hw_events which
> later helps in operating BRBE during a perf event context.

Do we expect this to vary between CPUs handled by the same struct arm_pmu ?

> Cc: Will Deacon <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> drivers/perf/arm_pmu_platform.c | 34 +++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
> diff --git a/drivers/perf/arm_pmu_platform.c b/drivers/perf/arm_pmu_platform.c
> index 933b96e243b8..acdc445081aa 100644
> --- a/drivers/perf/arm_pmu_platform.c
> +++ b/drivers/perf/arm_pmu_platform.c
> @@ -172,6 +172,36 @@ static int armpmu_request_irqs(struct arm_pmu *armpmu)
> return err;
> }
>
> +static void arm_brbe_probe_cpu(void *info)
> +{
> + struct pmu_hw_events *hw_events;
> + struct arm_pmu *armpmu = info;
> +
> + /*
> + * Return from here, if BRBE driver has not been
> + * implemented for this PMU. This helps prevent
> + * kernel crash later when brbe_probe() will be
> + * called on the PMU.
> + */
> + if (!armpmu->brbe_probe)
> + return;

Since this is a field on struct arm_pmu, why doesn't armpmu_request_brbe()
check this before calling smp_call_function_single(), to avoid the redundant
IPI?

> +
> + hw_events = per_cpu_ptr(armpmu->hw_events, smp_processor_id());
> + armpmu->brbe_probe(hw_events);
> +}
> +
> +static int armpmu_request_brbe(struct arm_pmu *armpmu)
> +{
> + int cpu, err = 0;
> +
> + for_each_cpu(cpu, &armpmu->supported_cpus) {
> + err = smp_call_function_single(cpu, arm_brbe_probe_cpu, armpmu, 1);

Why does this need to be called on each CPU in the supported_cpus mask?

I don't see anything here to handle late hotplug, so this looks suspicious.
Either we're missing something, or it's redundant at boot time.

Thanks,
Mark.

> + if (err)
> + return err;
> + }
> + return err;
> +}
> +
> static void armpmu_free_irqs(struct arm_pmu *armpmu)
> {
> int cpu;
> @@ -229,6 +259,10 @@ int arm_pmu_device_probe(struct platform_device *pdev,
> if (ret)
> goto out_free_irqs;
>
> + ret = armpmu_request_brbe(pmu);
> + if (ret)
> + goto out_free_irqs;
> +
> ret = armpmu_register(pmu);
> if (ret) {
> dev_err(dev, "failed to register PMU devices!\n");
> --
> 2.25.1
>

2022-11-18 18:53:29

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V5 5/7] arm64/perf: Drive BRBE from perf event states

On Mon, Nov 07, 2022 at 11:55:12AM +0530, Anshuman Khandual wrote:
> Branch stack sampling rides along the normal perf event and all the branch
> records get captured during the PMU interrupt. This just changes perf event
> handling on the arm64 platform to accommodate required BRBE operations that
> will enable branch stack sampling support.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> arch/arm64/kernel/perf_event.c | 7 ++++++
> drivers/perf/arm_pmu.c | 40 ++++++++++++++++++++++++++++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index c97377e28288..97db333d1208 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -874,6 +874,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> if (!armpmu_event_set_period(event))
> continue;
>
> + if (has_branch_stack(event)) {
> + cpu_pmu->brbe_read(cpuc, event);
> + data.br_stack = &cpuc->branches->brbe_stack;
> + data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> + cpu_pmu->brbe_reset(cpuc);
> + }
> +
> /*
> * Perf event overflow will queue the processing of the event as
> * an irq_work which will be taken care of in the handling of
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 5048a500441e..1a8dca4e513e 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -271,12 +271,22 @@ armpmu_stop(struct perf_event *event, int flags)
> {
> struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> struct hw_perf_event *hwc = &event->hw;
> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
>
> /*
> * ARM pmu always has to update the counter, so ignore
> * PERF_EF_UPDATE, see comments in armpmu_start().
> */
> if (!(hwc->state & PERF_HES_STOPPED)) {
> + if (has_branch_stack(event)) {
> + WARN_ON_ONCE(!hw_events->brbe_users);
> + hw_events->brbe_users--;
> + if (!hw_events->brbe_users) {
> + hw_events->brbe_context = NULL;
> + armpmu->brbe_disable(hw_events);
> + }
> + }

Can't we do the actual enable/disable we start/stop the PMU as a whole?

If we just counted the numberoof users here we could do the actual
enable/disable in armpmu_{enable,disable}() or armv8pmu_{start,stop}(), like we
do when checking hw_events->used_mask.

[...]

> @@ -349,6 +368,10 @@ armpmu_add(struct perf_event *event, int flags)
> hw_events->events[idx] = event;
>
> hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +
> + if (has_branch_stack(event))
> + armpmu->brbe_filter(hw_events, event);

What exactly do we need to do here? Since the BRBE is shared, I'm suprised that
there's any pwer-event configuration beyond "yes" or "no".

> +
> if (flags & PERF_EF_START)
> armpmu_start(event, PERF_EF_RELOAD);
>
> @@ -443,6 +466,7 @@ __hw_perf_event_init(struct perf_event *event)
> {
> struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> struct hw_perf_event *hwc = &event->hw;
> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
> int mapping;
>
> hwc->flags = 0;
> @@ -492,6 +516,9 @@ __hw_perf_event_init(struct perf_event *event)
> local64_set(&hwc->period_left, hwc->sample_period);
> }
>
> + if (has_branch_stack(event))
> + armpmu->brbe_filter(hw_events, event);

I do not understand why we would use hw_events here; at this point the event
has only been created, and not even added yet; it doesn't have a counter index.
isn't even being installed into HW.

What am I missing?

> +
> return validate_group(event);
> }
>
> @@ -520,6 +547,18 @@ static int armpmu_event_init(struct perf_event *event)
> return __hw_perf_event_init(event);
> }
>
> +static void armpmu_sched_task(struct perf_event_context *ctx, bool sched_in)
> +{
> + struct arm_pmu *armpmu = to_arm_pmu(ctx->pmu);
> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
> +
> + if (!hw_events->brbe_users)
> + return;
> +
> + if (sched_in)
> + armpmu->brbe_reset(hw_events);

I see that LBR does a save/restore, whereas IIUC here we discard without even
reading the old values. Is that the intent? Shouldn't we snapshot them into the
task context?

Thanks,
Mark.

> +}
> +
> static void armpmu_enable(struct pmu *pmu)
> {
> struct arm_pmu *armpmu = to_arm_pmu(pmu);
> @@ -877,6 +916,7 @@ static struct arm_pmu *__armpmu_alloc(gfp_t flags)
> }
>
> pmu->pmu = (struct pmu) {
> + .sched_task = armpmu_sched_task,
> .pmu_enable = armpmu_enable,
> .pmu_disable = armpmu_disable,
> .event_init = armpmu_event_init,
> --
> 2.25.1
>

2022-11-21 06:49:06

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V5 4/7] driver/perf/arm_pmu_platform: Add support for BRBE attributes detection



On 11/18/22 23:31, Mark Rutland wrote:
> On Mon, Nov 07, 2022 at 11:55:11AM +0530, Anshuman Khandual wrote:
>> This adds arm pmu infrastrure to probe BRBE implementation's attributes via
>> driver exported callbacks later. The actual BRBE feature detection will be
>> added by the driver itself.
>>
>> CPU specific BRBE entries, cycle count, format support gets detected during
>> PMU init. This information gets saved in per-cpu struct pmu_hw_events which
>> later helps in operating BRBE during a perf event context.
>
> Do we expect this to vary between CPUs handled by the same struct arm_pmu ?

BRBE registers are per CPU, and the spec does not assert about BRBE properties
being the same across the system, served via same the struct arm_pmu. Hence it
would be inaccurate to make that assumption, which might have just avoided all
these IPI based probes during boot.

>
>> Cc: Will Deacon <[email protected]>
>> Cc: Mark Rutland <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ---
>> drivers/perf/arm_pmu_platform.c | 34 +++++++++++++++++++++++++++++++++
>> 1 file changed, 34 insertions(+)
>>
>> diff --git a/drivers/perf/arm_pmu_platform.c b/drivers/perf/arm_pmu_platform.c
>> index 933b96e243b8..acdc445081aa 100644
>> --- a/drivers/perf/arm_pmu_platform.c
>> +++ b/drivers/perf/arm_pmu_platform.c
>> @@ -172,6 +172,36 @@ static int armpmu_request_irqs(struct arm_pmu *armpmu)
>> return err;
>> }
>>
>> +static void arm_brbe_probe_cpu(void *info)
>> +{
>> + struct pmu_hw_events *hw_events;
>> + struct arm_pmu *armpmu = info;
>> +
>> + /*
>> + * Return from here, if BRBE driver has not been
>> + * implemented for this PMU. This helps prevent
>> + * kernel crash later when brbe_probe() will be
>> + * called on the PMU.
>> + */
>> + if (!armpmu->brbe_probe)
>> + return;
>
> Since this is a field on struct arm_pmu, why doesn't armpmu_request_brbe()
> check this before calling smp_call_function_single(), to avoid the redundant
> IPI?

Makes sense, I will move the check inside armpmu_request_brbe() with return
code -ENODEV when not available.

>
>> +
>> + hw_events = per_cpu_ptr(armpmu->hw_events, smp_processor_id());
>> + armpmu->brbe_probe(hw_events);
>> +}
>> +
>> +static int armpmu_request_brbe(struct arm_pmu *armpmu)
>> +{
>> + int cpu, err = 0;
>> +
>> + for_each_cpu(cpu, &armpmu->supported_cpus) {
>> + err = smp_call_function_single(cpu, arm_brbe_probe_cpu, armpmu, 1);
>
> Why does this need to be called on each CPU in the supported_cpus mask?

Is not supported_cpus derived after partitioning the IRQ in pmu_parse_percpu_irq().
The idea is to fill up BRBE buffer attributes, on all such supported cpus which could
trigger PMU interrupt. Is the concern, that not all cpus in supported_cpus mask might
not be online during boot, hence IPIs could not be served, hence BRBE attributed for
them could not be fetched ?

>
> I don't see anything here to handle late hotplug, so this looks suspicious.

Right, I should add cpu hotplug handling, otherwise risk loosing BRBE support on cpus
which might have been offline during boot i.e when above IPI based probe happened ?

> Either we're missing something, or it's redundant at boot time.

Should we add cpu hotplug online-offline handlers like some other PMU drivers ? Let
me know if there are some other concerns.

cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DRVNAME,
arm_brbe_cpu_startup,
arm_brbe_cpu_teardown)

2022-11-21 12:21:25

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V5 4/7] driver/perf/arm_pmu_platform: Add support for BRBE attributes detection

On Mon, Nov 21, 2022 at 12:06:31PM +0530, Anshuman Khandual wrote:
>
>
> On 11/18/22 23:31, Mark Rutland wrote:
> > On Mon, Nov 07, 2022 at 11:55:11AM +0530, Anshuman Khandual wrote:
> >> This adds arm pmu infrastrure to probe BRBE implementation's attributes via
> >> driver exported callbacks later. The actual BRBE feature detection will be
> >> added by the driver itself.
> >>
> >> CPU specific BRBE entries, cycle count, format support gets detected during
> >> PMU init. This information gets saved in per-cpu struct pmu_hw_events which
> >> later helps in operating BRBE during a perf event context.
> >
> > Do we expect this to vary between CPUs handled by the same struct arm_pmu ?
>
> BRBE registers are per CPU, and the spec does not assert about BRBE properties
> being the same across the system, served via same the struct arm_pmu.

The same is true of the PMU, and struct arm_pmu does not cover the whole
system, it covers each *micro-architecture* within the system.

I think BRBE should be treated the same, i.e. uniform *within* a struct
arm_pmu.

> Hence it would be inaccurate to make that assumption, which might have just
> avoided all these IPI based probes during boot.

FWIW, I would be happy to IPI all CPUs during boot to verify uniformity of CPUs
within an arm_pmu; I just don't think that BRBE should be treated differently
from the rest of the PMU features.

[...]

> >> + hw_events = per_cpu_ptr(armpmu->hw_events, smp_processor_id());
> >> + armpmu->brbe_probe(hw_events);
> >> +}
> >> +
> >> +static int armpmu_request_brbe(struct arm_pmu *armpmu)
> >> +{
> >> + int cpu, err = 0;
> >> +
> >> + for_each_cpu(cpu, &armpmu->supported_cpus) {
> >> + err = smp_call_function_single(cpu, arm_brbe_probe_cpu, armpmu, 1);
> >
> > Why does this need to be called on each CPU in the supported_cpus mask?
>
> Is not supported_cpus derived after partitioning the IRQ in pmu_parse_percpu_irq().
> The idea is to fill up BRBE buffer attributes, on all such supported cpus which could
> trigger PMU interrupt. Is the concern, that not all cpus in supported_cpus mask might
> not be online during boot, hence IPIs could not be served, hence BRBE attributed for
> them could not be fetched ?

As above, I think this is solvable if we mandate that BRBE must be uniform
*within* an arm_pmu's supported CPUs; then we only need one CPU in the
supported_cpus mask to be present at boot time, as with the rest of the PMU
code.

We could *verify* that when onlining a CPU.

> > I don't see anything here to handle late hotplug, so this looks suspicious.
>
> Right, I should add cpu hotplug handling, otherwise risk loosing BRBE support on cpus
> which might have been offline during boot i.e when above IPI based probe happened ?
>
> > Either we're missing something, or it's redundant at boot time.
>
> Should we add cpu hotplug online-offline handlers like some other PMU drivers ? Let
> me know if there are some other concerns.
>
> cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DRVNAME,
> arm_brbe_cpu_startup,
> arm_brbe_cpu_teardown)

We *could* add that, but that's going to require ordering against the existing
hooks for probing arm_pmu.

Why can't this hang off the exising hooks for arm_pmu? We're treating this as
part of the PMU anyway, so I don't understand why we should probe it
separately.

Thanks,
Mark.

2022-11-28 08:36:00

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V5 4/7] driver/perf/arm_pmu_platform: Add support for BRBE attributes detection



On 11/21/22 17:09, Mark Rutland wrote:
> On Mon, Nov 21, 2022 at 12:06:31PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 11/18/22 23:31, Mark Rutland wrote:
>>> On Mon, Nov 07, 2022 at 11:55:11AM +0530, Anshuman Khandual wrote:
>>>> This adds arm pmu infrastrure to probe BRBE implementation's attributes via
>>>> driver exported callbacks later. The actual BRBE feature detection will be
>>>> added by the driver itself.
>>>>
>>>> CPU specific BRBE entries, cycle count, format support gets detected during
>>>> PMU init. This information gets saved in per-cpu struct pmu_hw_events which
>>>> later helps in operating BRBE during a perf event context.
>>>
>>> Do we expect this to vary between CPUs handled by the same struct arm_pmu ?
>>
>> BRBE registers are per CPU, and the spec does not assert about BRBE properties
>> being the same across the system, served via same the struct arm_pmu.
>
> The same is true of the PMU, and struct arm_pmu does not cover the whole
> system, it covers each *micro-architecture* within the system.
>
> I think BRBE should be treated the same, i.e. uniform *within* a struct
> arm_pmu.

Understood, detected on one and verified on all ?

>
>> Hence it would be inaccurate to make that assumption, which might have just
>> avoided all these IPI based probes during boot.
>
> FWIW, I would be happy to IPI all CPUs during boot to verify uniformity of CPUs
> within an arm_pmu; I just don't think that BRBE should be treated differently
> from the rest of the PMU features.

Hence BRBE probing should be done inside an updated __armv8pmu_probe_pmu().

static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
{
struct armv8pmu_probe_info probe = {
.pmu = cpu_pmu,
.present = false,
};
int ret;

ret = smp_call_function_any(&cpu_pmu->supported_cpus,
__armv8pmu_probe_pmu,
&probe, 1);
if (ret)
return ret;

return probe.present ? 0 : -ENODEV;
}

But if BRBE is assumed (and verified) to be same across the micro-architecture,
then following BRBE attributes when captured should be part of 'struct arm_pmu'
instead of 'struct pmu_hw_events' as is the case currently.

/* Detected BRBE attributes */
bool brbe_v1p1;
int brbe_cc;
int brbe_nr;
int brbe_format;

>
> [...]
>
>>>> + hw_events = per_cpu_ptr(armpmu->hw_events, smp_processor_id());
>>>> + armpmu->brbe_probe(hw_events);
>>>> +}
>>>> +
>>>> +static int armpmu_request_brbe(struct arm_pmu *armpmu)
>>>> +{
>>>> + int cpu, err = 0;
>>>> +
>>>> + for_each_cpu(cpu, &armpmu->supported_cpus) {
>>>> + err = smp_call_function_single(cpu, arm_brbe_probe_cpu, armpmu, 1);
>>>
>>> Why does this need to be called on each CPU in the supported_cpus mask?
>>
>> Is not supported_cpus derived after partitioning the IRQ in pmu_parse_percpu_irq().
>> The idea is to fill up BRBE buffer attributes, on all such supported cpus which could
>> trigger PMU interrupt. Is the concern, that not all cpus in supported_cpus mask might
>> not be online during boot, hence IPIs could not be served, hence BRBE attributed for
>> them could not be fetched ?
>
> As above, I think this is solvable if we mandate that BRBE must be uniform
> *within* an arm_pmu's supported CPUs; then we only need one CPU in the
> supported_cpus mask to be present at boot time, as with the rest of the PMU
> code.
>
> We could *verify* that when onlining a CPU.

Understood.

>
>>> I don't see anything here to handle late hotplug, so this looks suspicious.
>>
>> Right, I should add cpu hotplug handling, otherwise risk loosing BRBE support on cpus
>> which might have been offline during boot i.e when above IPI based probe happened ?
>>
>>> Either we're missing something, or it's redundant at boot time.
>>
>> Should we add cpu hotplug online-offline handlers like some other PMU drivers ? Let
>> me know if there are some other concerns.
>>
>> cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DRVNAME,
>> arm_brbe_cpu_startup,
>> arm_brbe_cpu_teardown)
>
> We *could* add that, but that's going to require ordering against the existing
> hooks for probing arm_pmu.

Right.

>
> Why can't this hang off the exising hooks for arm_pmu? We're treating this as
> part of the PMU anyway, so I don't understand why we should probe it
> separately.
Okay, will try and see what all changes are required to move the probing into generic
arm_pmu probe, and capture the BRBE attributes inside struct arm_pmu.

2022-11-29 06:47:00

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V5 2/7] arm64/perf: Update struct arm_pmu for BRBE



On 11/18/22 23:17, Mark Rutland wrote:
>
> Hi Anshuman,
>
> Apologies for the delayi n reviewing this.
>
> On Fri, Nov 18, 2022 at 12:09:07PM +0530, Anshuman Khandual wrote:
>> On 11/9/22 17:00, Suzuki K Poulose wrote:
>>> On 07/11/2022 06:25, Anshuman Khandual wrote:
>>>> Although BRBE is an armv8 speciifc HW feature, abstracting out its various
>>>> function callbacks at the struct arm_pmu level is preferred, as it cleaner
>>>> , easier to follow and maintain.
>>>>
>>>> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
>>>> might not fit seamlessly, when tried to be embedded via existing arm_pmu
>>>> helpers in the armv8 implementation.
>>>>
>>>> Updates the struct arm_pmu to include all required helpers that will drive
>>>> BRBE functionality for a given PMU implementation. These are the following.
>>>>
>>>> - brbe_filter    : Convert perf event filters into BRBE HW filters
>>>> - brbe_probe    : Probe BRBE HW and capture its attributes
>>>> - brbe_enable    : Enable BRBE HW with a given config
>>>> - brbe_disable    : Disable BRBE HW
>>>> - brbe_read    : Read BRBE buffer for captured branch records
>>>> - brbe_reset    : Reset BRBE buffer
>>>> - brbe_supported: Whether BRBE is supported or not
>>>>
>>>> A BRBE driver implementation needs to provide these functionalities.
>>>
>>> Could these not be hidden from the generic arm_pmu and kept in the
>>> arm64 pmu backend  ? It looks like they are quite easy to simply
>>> move these to the corresponding hooks in arm64 pmu.
>>
>> We have had this discussion multiple times in the past [1], but I still
>> believe, keeping BRBE implementation hooks at the PMU level rather than
>> embedding them with other PMU events handling, is a much better logical
>> abstraction.
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>>
>> --------------------------------------------------------------------------
>>>
>>> One thing to answer in the commit msg is why we need the hooks here.
>>> Have we concluded that adding BRBE hooks to struct arm_pmu for what is
>>> an armv8 specific feature is the right approach? I don't recall
>>> reaching that conclusion.
>>
>> Although it might be possible to have this implementation embedded in
>> the existing armv8 PMU implementation, I still believe that the BRBE
>> functionalities abstracted out at the arm_pmu level with a separate
>> config option is cleaner, easier to follow and to maintain as well.
>>
>> Besides some helpers i.e brbe_supported(), brbe_probe() and brbe_reset()
>> might not fit seamlessly, when tried to be embedded via existing arm_pmu
>> helpers in the armv8 implementation.
>>
>> Nonetheless if arm_pmu based additional BRBE helpers is absolutely a no
>> go for folks here in general, will explore arm64 based implementation.
>> ----------------------------------------------------------------------------
>>
>> I am still waiting for maintainer's take on this issue. I will be happy to
>> rework this series to move all these implementation inside arm64 callbacks
>> instead, if that is required or preferred by the maintainers. But according
>> to me, this current abstraction layout is much better.
>
> To be honest, I'm not sure what's best right now; but at the moment it's not
> clear to me why this couldn't fit within the existing hooks.
>
> Above you say brbe_supported() / brbe_probe() / brbe_reset() didn't fit
> seamlessly; can you give an example of problem? I think I'm missing something
> obvious.

I tried to move them inside armv8 implementation callbacks.

arm64_pmu_brbe_supported() can be moved inside __armv8_pmuv3_map_event(), so that
event viability can be validated during armpmu_event_init(). arm64_pmu_brbe_probe()
can be moved inside __armv8pmu_probe_pmu() as you have suggested earlier on another
thread. arm64_pmu_brbe_reset() can also be moved inside armv8pmu_enable_event(),
and also armv8pmu_reset().

The only problem being armpmu_sched_task() where earlier we had BRBE reset, but I
guess it can be replaced with entire PMU reset which does the BRBE reset as well ?

static void armpmu_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct arm_pmu *armpmu = to_arm_pmu(ctx->pmu);

if (sched_in)
armpmu->reset(armpmu);
}

2022-11-29 06:47:19

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V5 5/7] arm64/perf: Drive BRBE from perf event states



On 11/18/22 23:45, Mark Rutland wrote:
> On Mon, Nov 07, 2022 at 11:55:12AM +0530, Anshuman Khandual wrote:
>> Branch stack sampling rides along the normal perf event and all the branch
>> records get captured during the PMU interrupt. This just changes perf event
>> handling on the arm64 platform to accommodate required BRBE operations that
>> will enable branch stack sampling support.
>>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Arnaldo Carvalho de Melo <[email protected]>
>> Cc: Mark Rutland <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ---
>> arch/arm64/kernel/perf_event.c | 7 ++++++
>> drivers/perf/arm_pmu.c | 40 ++++++++++++++++++++++++++++++++++
>> 2 files changed, 47 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
>> index c97377e28288..97db333d1208 100644
>> --- a/arch/arm64/kernel/perf_event.c
>> +++ b/arch/arm64/kernel/perf_event.c
>> @@ -874,6 +874,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>> if (!armpmu_event_set_period(event))
>> continue;
>>
>> + if (has_branch_stack(event)) {
>> + cpu_pmu->brbe_read(cpuc, event);
>> + data.br_stack = &cpuc->branches->brbe_stack;
>> + data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
>> + cpu_pmu->brbe_reset(cpuc);
>> + }
>> +
>> /*
>> * Perf event overflow will queue the processing of the event as
>> * an irq_work which will be taken care of in the handling of
>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
>> index 5048a500441e..1a8dca4e513e 100644
>> --- a/drivers/perf/arm_pmu.c
>> +++ b/drivers/perf/arm_pmu.c
>> @@ -271,12 +271,22 @@ armpmu_stop(struct perf_event *event, int flags)
>> {
>> struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
>> struct hw_perf_event *hwc = &event->hw;
>> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
>>
>> /*
>> * ARM pmu always has to update the counter, so ignore
>> * PERF_EF_UPDATE, see comments in armpmu_start().
>> */
>> if (!(hwc->state & PERF_HES_STOPPED)) {
>> + if (has_branch_stack(event)) {
>> + WARN_ON_ONCE(!hw_events->brbe_users);
>> + hw_events->brbe_users--;
>> + if (!hw_events->brbe_users) {
>> + hw_events->brbe_context = NULL;
>> + armpmu->brbe_disable(hw_events);
>> + }
>> + }
>
> Can't we do the actual enable/disable we start/stop the PMU as a whole?
>
> If we just counted the numberoof users here we could do the actual
> enable/disable in armpmu_{enable,disable}() or armv8pmu_{start,stop}(), like we
> do when checking hw_events->used_mask.

Right, it can be moved inside armv8pmu_{enable, disable}

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 7b0643fe2f13..e64832bc83ba 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -789,6 +789,19 @@ static void armv8pmu_enable_event(struct perf_event *event)
* Enable counter
*/
armv8pmu_enable_event_counter(event);
+
+ if (has_branch_stack(event)) {
+ struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
+
+ if (event->ctx->task && hw_events->brbe_context != event->ctx) {
+ arm64_pmu_brbe_reset(hw_events);
+ hw_events->brbe_context = event->ctx;
+ }
+ hw_events->brbe_users++;
+ arm64_pmu_brbe_enable(event);
+ }
}

static void armv8pmu_disable_event(struct perf_event *event)
@@ -802,6 +815,14 @@ static void armv8pmu_disable_event(struct perf_event *event)
* Disable interrupt for this counter
*/
armv8pmu_disable_event_irq(event);
+
+ if (has_branch_stack(event)) {
+ struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
+ struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
+
+ if (!hw_events->brbe_users)
+ arm64_pmu_brbe_disable(event);
+ }
}


>
> [...]
>
>> @@ -349,6 +368,10 @@ armpmu_add(struct perf_event *event, int flags)
>> hw_events->events[idx] = event;
>>
>> hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> +
>> + if (has_branch_stack(event))
>> + armpmu->brbe_filter(hw_events, event);
>
> What exactly do we need to do here? Since the BRBE is shared, I'm suprised that
> there's any pwer-event configuration beyond "yes" or "no".

Entire arm64_pmu_brbe_filter() can be moved inside arm64_pmu_brbe_enable().

>
>> +
>> if (flags & PERF_EF_START)
>> armpmu_start(event, PERF_EF_RELOAD);
>>
>> @@ -443,6 +466,7 @@ __hw_perf_event_init(struct perf_event *event)
>> {
>> struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
>> struct hw_perf_event *hwc = &event->hw;
>> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
>> int mapping;
>>
>> hwc->flags = 0;
>> @@ -492,6 +516,9 @@ __hw_perf_event_init(struct perf_event *event)
>> local64_set(&hwc->period_left, hwc->sample_period);
>> }
>>
>> + if (has_branch_stack(event))
>> + armpmu->brbe_filter(hw_events, event);
>
> I do not understand why we would use hw_events here; at this point the event
> has only been created, and not even added yet; it doesn't have a counter index.
> isn't even being installed into HW.
>
> What am I missing?

perf event's requested branch sampling attributes need to be transferred into the
BRBE HW config, when the event gets scheduled on a given CPU. But this transfer
happens in two stages.

event -----> pmu_hw_events ----> BRBE HW

1. event->attr.branch_sample_type --> pmu_hw_events.[brbcr,brbfcr] (event add)
2. pmu_hw_events.[brbcr,brbfcr] --> BRBXXX_EL1 registers (event enable)

But I guess both these stages can be done in event_enable() and the intermediary
pmu_hw_events.[brbcr,brbfcr] elements can also be dropped.

>
>> +
>> return validate_group(event);
>> }
>>
>> @@ -520,6 +547,18 @@ static int armpmu_event_init(struct perf_event *event)
>> return __hw_perf_event_init(event);
>> }
>>
>> +static void armpmu_sched_task(struct perf_event_context *ctx, bool sched_in)
>> +{
>> + struct arm_pmu *armpmu = to_arm_pmu(ctx->pmu);
>> + struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
>> +
>> + if (!hw_events->brbe_users)
>> + return;
>> +
>> + if (sched_in)
>> + armpmu->brbe_reset(hw_events);
>
> I see that LBR does a save/restore, whereas IIUC here we discard without even
> reading the old values. Is that the intent? Shouldn't we snapshot them into the
> task context?

Yes, currently that is the intent. We cannot let the branch records from one task
slip across into another task corrupting the capture on the later. Only Intel LBR
implements such task context save/restore via hw_events, only when LBR call stack
feature is supported otherwise, all other sched_task() implementation just flush
or invalidate captured branches before changing the context.

We could implement such BRBE context store/restore mechanism via given BRBXXXINJ
registers/instructions. But this would make our context switch bit expensive and
also might not provide substantial benefit as well. Although this can be explored
later after the initial enablement.