2023-07-11 08:53:35

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 00/10] arm64/perf: Enable branch stack sampling

This series enables perf branch stack sampling support on arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All
relevant register definitions could be accessed here.

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on 6.5-rc1.

Changes in V13:

https://lore.kernel.org/all/[email protected]/

- Added branch callback stubs for aarch32 pmuv3 based platforms
- Updated the comments for capture_brbe_regset()
- Deleted the comments in __read_brbe_regset()
- Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Changes in V12:

https://lore.kernel.org/all/[email protected]/

- Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
- Replaced branch types with complete INSN/ALIGN prefixes/suffixes
- Replaced return branch types as simple RET/ERET
- Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
- Added 0 padding for BRBIDR0_EL1.NUMREC enum values
- Dropped helper arm_pmu_branch_stack_supported()
- Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
- Separated perf_task_ctx_cache setup from arm_pmu private allocation
- Collected changes to branch_records_alloc() in a single patch [5/10]
- Reworked and cleaned up branch_records_alloc()
- Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
- Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
- Updated the comment in branch_type_to_brbcr()
- Fixed the comment before stitch_stored_live_entries()
- Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
- Factored out helper __read_brbe_regset() from capture_brbe_regset()
- Dropped the helper copy_brbe_regset()
- Simplified stitch_stored_live_entries() with memcpy(), memmove()
- Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
- Rework brbe_attributes_probe() without 'struct brbe_hw_attr'
- Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
- Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
- Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead

Changes in V11:

https://lore.kernel.org/all/[email protected]/

- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data

Changes in V10:

https://lore.kernel.org/all/[email protected]/

- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()

Changes in V9:

https://lore.kernel.org/all/[email protected]/

- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism

Changes in V8:

https://lore.kernel.org/all/[email protected]/

- Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()
- Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call

Changes in V7:

https://lore.kernel.org/all/[email protected]/

- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganize PERF_SAMPLE_BRANCH_NO_[CYCLES|NO_FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()

Changes in V6:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformated armv8pmu_branch_valid() warnings while rejecting unsupported events

Changes in V5:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/[email protected]/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/[email protected]/

- Moved brbe_stack from the stack and now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branche records are now captured
only if the event has PERF_SAMPLE_BRANCH_KERNEL which would already
been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/[email protected]/

- Dropped branch sample filter helpers consolidation patch from this series
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/[email protected]/

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: James Clark <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Anshuman Khandual (10):
drivers: perf: arm_pmu: Add new sched_task() callback
arm64/perf: Add BRBE registers and fields
arm64/perf: Add branch stack support in struct arm_pmu
arm64/perf: Add branch stack support in struct pmu_hw_events
arm64/perf: Add branch stack support in ARMV8 PMU
arm64/perf: Enable branch stack events via FEAT_BRBE
arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
arm64/perf: Add struct brbe_regset helper functions
arm64/perf: Implement branch records save on task sched out
arm64/perf: Implement branch records save on PMU IRQ

arch/arm/include/asm/arm_pmuv3.h | 12 +
arch/arm64/include/asm/perf_event.h | 46 ++
arch/arm64/include/asm/sysreg.h | 103 ++++
arch/arm64/tools/sysreg | 158 ++++++
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 716 ++++++++++++++++++++++++++++
drivers/perf/arm_brbe.h | 270 +++++++++++
drivers/perf/arm_pmu.c | 12 +-
drivers/perf/arm_pmuv3.c | 110 ++++-
include/linux/perf/arm_pmu.h | 19 +-
11 files changed, 1431 insertions(+), 27 deletions(-)
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_brbe.h

--
2.25.1



2023-07-11 08:53:37

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields

This adds BRBE related register definitions and various other related field
macros there in. These will be used subsequently in a BRBE driver which is
being added later on.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
2 files changed, 261 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index b481935e9314..f95e30c13c8b 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -163,6 +163,109 @@
#define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
#define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)

+#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
+#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
+#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
+
+#define SYS_BRBINF0_EL1 __SYS_BRBINFO(0)
+#define SYS_BRBINF1_EL1 __SYS_BRBINFO(1)
+#define SYS_BRBINF2_EL1 __SYS_BRBINFO(2)
+#define SYS_BRBINF3_EL1 __SYS_BRBINFO(3)
+#define SYS_BRBINF4_EL1 __SYS_BRBINFO(4)
+#define SYS_BRBINF5_EL1 __SYS_BRBINFO(5)
+#define SYS_BRBINF6_EL1 __SYS_BRBINFO(6)
+#define SYS_BRBINF7_EL1 __SYS_BRBINFO(7)
+#define SYS_BRBINF8_EL1 __SYS_BRBINFO(8)
+#define SYS_BRBINF9_EL1 __SYS_BRBINFO(9)
+#define SYS_BRBINF10_EL1 __SYS_BRBINFO(10)
+#define SYS_BRBINF11_EL1 __SYS_BRBINFO(11)
+#define SYS_BRBINF12_EL1 __SYS_BRBINFO(12)
+#define SYS_BRBINF13_EL1 __SYS_BRBINFO(13)
+#define SYS_BRBINF14_EL1 __SYS_BRBINFO(14)
+#define SYS_BRBINF15_EL1 __SYS_BRBINFO(15)
+#define SYS_BRBINF16_EL1 __SYS_BRBINFO(16)
+#define SYS_BRBINF17_EL1 __SYS_BRBINFO(17)
+#define SYS_BRBINF18_EL1 __SYS_BRBINFO(18)
+#define SYS_BRBINF19_EL1 __SYS_BRBINFO(19)
+#define SYS_BRBINF20_EL1 __SYS_BRBINFO(20)
+#define SYS_BRBINF21_EL1 __SYS_BRBINFO(21)
+#define SYS_BRBINF22_EL1 __SYS_BRBINFO(22)
+#define SYS_BRBINF23_EL1 __SYS_BRBINFO(23)
+#define SYS_BRBINF24_EL1 __SYS_BRBINFO(24)
+#define SYS_BRBINF25_EL1 __SYS_BRBINFO(25)
+#define SYS_BRBINF26_EL1 __SYS_BRBINFO(26)
+#define SYS_BRBINF27_EL1 __SYS_BRBINFO(27)
+#define SYS_BRBINF28_EL1 __SYS_BRBINFO(28)
+#define SYS_BRBINF29_EL1 __SYS_BRBINFO(29)
+#define SYS_BRBINF30_EL1 __SYS_BRBINFO(30)
+#define SYS_BRBINF31_EL1 __SYS_BRBINFO(31)
+
+#define SYS_BRBSRC0_EL1 __SYS_BRBSRC(0)
+#define SYS_BRBSRC1_EL1 __SYS_BRBSRC(1)
+#define SYS_BRBSRC2_EL1 __SYS_BRBSRC(2)
+#define SYS_BRBSRC3_EL1 __SYS_BRBSRC(3)
+#define SYS_BRBSRC4_EL1 __SYS_BRBSRC(4)
+#define SYS_BRBSRC5_EL1 __SYS_BRBSRC(5)
+#define SYS_BRBSRC6_EL1 __SYS_BRBSRC(6)
+#define SYS_BRBSRC7_EL1 __SYS_BRBSRC(7)
+#define SYS_BRBSRC8_EL1 __SYS_BRBSRC(8)
+#define SYS_BRBSRC9_EL1 __SYS_BRBSRC(9)
+#define SYS_BRBSRC10_EL1 __SYS_BRBSRC(10)
+#define SYS_BRBSRC11_EL1 __SYS_BRBSRC(11)
+#define SYS_BRBSRC12_EL1 __SYS_BRBSRC(12)
+#define SYS_BRBSRC13_EL1 __SYS_BRBSRC(13)
+#define SYS_BRBSRC14_EL1 __SYS_BRBSRC(14)
+#define SYS_BRBSRC15_EL1 __SYS_BRBSRC(15)
+#define SYS_BRBSRC16_EL1 __SYS_BRBSRC(16)
+#define SYS_BRBSRC17_EL1 __SYS_BRBSRC(17)
+#define SYS_BRBSRC18_EL1 __SYS_BRBSRC(18)
+#define SYS_BRBSRC19_EL1 __SYS_BRBSRC(19)
+#define SYS_BRBSRC20_EL1 __SYS_BRBSRC(20)
+#define SYS_BRBSRC21_EL1 __SYS_BRBSRC(21)
+#define SYS_BRBSRC22_EL1 __SYS_BRBSRC(22)
+#define SYS_BRBSRC23_EL1 __SYS_BRBSRC(23)
+#define SYS_BRBSRC24_EL1 __SYS_BRBSRC(24)
+#define SYS_BRBSRC25_EL1 __SYS_BRBSRC(25)
+#define SYS_BRBSRC26_EL1 __SYS_BRBSRC(26)
+#define SYS_BRBSRC27_EL1 __SYS_BRBSRC(27)
+#define SYS_BRBSRC28_EL1 __SYS_BRBSRC(28)
+#define SYS_BRBSRC29_EL1 __SYS_BRBSRC(29)
+#define SYS_BRBSRC30_EL1 __SYS_BRBSRC(30)
+#define SYS_BRBSRC31_EL1 __SYS_BRBSRC(31)
+
+#define SYS_BRBTGT0_EL1 __SYS_BRBTGT(0)
+#define SYS_BRBTGT1_EL1 __SYS_BRBTGT(1)
+#define SYS_BRBTGT2_EL1 __SYS_BRBTGT(2)
+#define SYS_BRBTGT3_EL1 __SYS_BRBTGT(3)
+#define SYS_BRBTGT4_EL1 __SYS_BRBTGT(4)
+#define SYS_BRBTGT5_EL1 __SYS_BRBTGT(5)
+#define SYS_BRBTGT6_EL1 __SYS_BRBTGT(6)
+#define SYS_BRBTGT7_EL1 __SYS_BRBTGT(7)
+#define SYS_BRBTGT8_EL1 __SYS_BRBTGT(8)
+#define SYS_BRBTGT9_EL1 __SYS_BRBTGT(9)
+#define SYS_BRBTGT10_EL1 __SYS_BRBTGT(10)
+#define SYS_BRBTGT11_EL1 __SYS_BRBTGT(11)
+#define SYS_BRBTGT12_EL1 __SYS_BRBTGT(12)
+#define SYS_BRBTGT13_EL1 __SYS_BRBTGT(13)
+#define SYS_BRBTGT14_EL1 __SYS_BRBTGT(14)
+#define SYS_BRBTGT15_EL1 __SYS_BRBTGT(15)
+#define SYS_BRBTGT16_EL1 __SYS_BRBTGT(16)
+#define SYS_BRBTGT17_EL1 __SYS_BRBTGT(17)
+#define SYS_BRBTGT18_EL1 __SYS_BRBTGT(18)
+#define SYS_BRBTGT19_EL1 __SYS_BRBTGT(19)
+#define SYS_BRBTGT20_EL1 __SYS_BRBTGT(20)
+#define SYS_BRBTGT21_EL1 __SYS_BRBTGT(21)
+#define SYS_BRBTGT22_EL1 __SYS_BRBTGT(22)
+#define SYS_BRBTGT23_EL1 __SYS_BRBTGT(23)
+#define SYS_BRBTGT24_EL1 __SYS_BRBTGT(24)
+#define SYS_BRBTGT25_EL1 __SYS_BRBTGT(25)
+#define SYS_BRBTGT26_EL1 __SYS_BRBTGT(26)
+#define SYS_BRBTGT27_EL1 __SYS_BRBTGT(27)
+#define SYS_BRBTGT28_EL1 __SYS_BRBTGT(28)
+#define SYS_BRBTGT29_EL1 __SYS_BRBTGT(29)
+#define SYS_BRBTGT30_EL1 __SYS_BRBTGT(30)
+#define SYS_BRBTGT31_EL1 __SYS_BRBTGT(31)
+
#define SYS_MIDR_EL1 sys_reg(3, 0, 0, 0, 0)
#define SYS_MPIDR_EL1 sys_reg(3, 0, 0, 0, 5)
#define SYS_REVIDR_EL1 sys_reg(3, 0, 0, 0, 6)
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 1ea4a3dc68f8..9892af96262f 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1002,6 +1002,164 @@ UnsignedEnum 3:0 BT
EndEnum
EndSysreg

+
+SysregFields BRBINFx_EL1
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIRECT
+ 0b000001 INDIRECT
+ 0b000010 DIRECT_LINK
+ 0b000011 INDIRECT_LINK
+ 0b000101 RET
+ 0b000111 ERET
+ 0b001000 COND_DIRECT
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INSN_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALIGN_FAULT
+ 0b101011 INSN_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysregFields
+
+Sysreg BRBCR_EL1 2 1 9 0 0
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:9
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GUEST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 E1BRE
+Field 0 E0BRE
+EndSysreg
+
+Sysreg BRBFCR_EL1 2 1 9 0 1
+Res0 63:30
+Enum 29:28 BANK
+ 0b0 FIRST
+ 0b1 SECOND
+EndEnum
+Res0 27:23
+Field 22 CONDDIR
+Field 21 DIRCALL
+Field 20 INDCALL
+Field 19 RTN
+Field 18 INDIRECT
+Field 17 DIRECT
+Field 16 EnI
+Res0 15:8
+Field 7 PAUSED
+Field 6 LASTFAILED
+Res0 5:0
+EndSysreg
+
+Sysreg BRBTS_EL1 2 1 9 0 2
+Field 63:0 TS
+EndSysreg
+
+Sysreg BRBINFINJ_EL1 2 1 9 1 0
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIRECT
+ 0b000001 INDIRECT
+ 0b000010 DIRECT_LINK
+ 0b000011 INDIRECT_LINK
+ 0b000101 RET
+ 0b000111 ERET
+ 0b001000 COND_DIRECT
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INSN_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALIGN_FAULT
+ 0b101011 INSN_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysreg
+
+Sysreg BRBSRCINJ_EL1 2 1 9 1 1
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBTGTINJ_EL1 2 1 9 1 2
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBIDR0_EL1 2 1 9 2 0
+Res0 63:16
+Enum 15:12 CC
+ 0b101 20_BIT
+EndEnum
+Enum 11:8 FORMAT
+ 0b0 0
+EndEnum
+Enum 7:0 NUMREC
+ 0b0001000 8
+ 0b0010000 16
+ 0b0100000 32
+ 0b1000000 64
+EndEnum
+EndSysreg
+
Sysreg ID_AA64ZFR0_EL1 3 0 0 4 4
Res0 63:60
UnsignedEnum 59:56 F64MM
--
2.25.1


2023-07-11 08:53:44

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 10/10] arm64/perf: Implement branch records save on PMU IRQ

This modifies armv8pmu_branch_read() to concatenate live entries along with
task context stored entries and then process the resultant buffer to create
perf branch entry array for perf_sample_data. It follows the same principle
like task sched out.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/perf/arm_brbe.c | 69 +++++++++++++++++++----------------------
1 file changed, 32 insertions(+), 37 deletions(-)

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 2177632befa6..cb765aaf1b52 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -647,41 +647,44 @@ void armv8pmu_branch_reset(void)
isb();
}

-static bool capture_branch_entry(struct pmu_hw_events *cpuc,
- struct perf_event *event, int idx)
+static void brbe_regset_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int idx)
{
struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
- u64 brbinf = get_brbinf_reg(idx);
-
- /*
- * There are no valid entries anymore on the buffer.
- * Abort the branch record processing to save some
- * cycles and also reduce the capture/process load
- * for the user space as well.
- */
- if (brbe_invalid(brbinf))
- return false;
+ u64 brbinf = regset[idx].brbinf;

perf_clear_branch_entry_bitfields(entry);
if (brbe_record_is_complete(brbinf)) {
- entry->from = get_brbsrc_reg(idx);
- entry->to = get_brbtgt_reg(idx);
+ entry->from = regset[idx].brbsrc;
+ entry->to = regset[idx].brbtgt;
} else if (brbe_record_is_source_only(brbinf)) {
- entry->from = get_brbsrc_reg(idx);
+ entry->from = regset[idx].brbsrc;
entry->to = 0;
} else if (brbe_record_is_target_only(brbinf)) {
entry->from = 0;
- entry->to = get_brbtgt_reg(idx);
+ entry->to = regset[idx].brbtgt;
}
capture_brbe_flags(entry, event, brbinf);
- return true;
+}
+
+static void process_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int nr_regset)
+{
+ int idx;
+
+ for (idx = 0; idx < nr_regset; idx++)
+ brbe_regset_branch_entries(cpuc, event, regset, idx);
+
+ cpuc->branches->branch_stack.nr = nr_regset;
+ cpuc->branches->branch_stack.hw_idx = -1ULL;
}

void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
{
- int nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
+ struct arm64_perf_task_context *task_ctx = event->pmu_ctx->task_ctx_data;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
u64 brbfcr, brbcr;
- int idx = 0;

brbcr = read_sysreg_s(SYS_BRBCR_EL1);
brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
@@ -693,25 +696,17 @@ void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
isb();

- /* Loop through bank 0 */
- select_brbe_bank(BRBE_BANK_IDX_0);
- while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
- if (!capture_branch_entry(cpuc, event, idx))
- goto skip_bank_1;
- idx++;
- }
-
- /* Loop through bank 1 */
- select_brbe_bank(BRBE_BANK_IDX_1);
- while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
- if (!capture_branch_entry(cpuc, event, idx))
- break;
- idx++;
+ nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+ if (event->ctx->task) {
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ process_branch_entries(cpuc, event, task_ctx->store, nr_store);
+ task_ctx->nr_brbe_records = 0;
+ } else {
+ process_branch_entries(cpuc, event, live, nr_live);
}
-
-skip_bank_1:
- cpuc->branches->branch_stack.nr = idx;
- cpuc->branches->branch_stack.hw_idx = -1ULL;
process_branch_aborts(cpuc);

/* Unpause the buffer */
--
2.25.1


2023-07-11 08:53:58

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 09/10] arm64/perf: Implement branch records save on task sched out

This modifies current armv8pmu_sched_task(), to implement a branch records
save mechanism via armv8pmu_branch_save() when a task scheds out of a cpu.
BRBE is paused and disabled for all exception levels before branch records
get captured, which then get concatenated with all existing stored records
present in the task context maintaining the contiguity. Although the final
length of the concatenated buffer does not exceed implemented BRBE length.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm/include/asm/arm_pmuv3.h | 1 +
arch/arm64/include/asm/perf_event.h | 2 ++
drivers/perf/arm_brbe.c | 30 +++++++++++++++++++++++++++++
drivers/perf/arm_pmuv3.c | 14 ++++++++++++--
4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h
index 3d8faf4200dc..3d047d2c9430 100644
--- a/arch/arm/include/asm/arm_pmuv3.h
+++ b/arch/arm/include/asm/arm_pmuv3.h
@@ -259,5 +259,6 @@ static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
static inline void armv8pmu_branch_reset(void) { }
static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu) { return 0; }
static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx) { }
#endif
#endif
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index b0c12a5882df..36e7dfb466a6 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -40,6 +40,7 @@ void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
void armv8pmu_branch_reset(void);
int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu);
void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx);
#else
static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
{
@@ -66,6 +67,7 @@ static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
static inline void armv8pmu_branch_reset(void) { }
static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu) { return 0; }
static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx) { }
#endif
#endif
#endif
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 203cd4f350d5..2177632befa6 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -165,6 +165,36 @@ static int stitch_stored_live_entries(struct brbe_regset *stored,
return min(nr_live + nr_stored, nr_max);
}

+static int brbe_branch_save(struct brbe_regset *live, int nr_hw_entries)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ int nr_live;
+
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+
+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ return nr_live;
+}
+
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+ struct arm64_perf_task_context *task_ctx = ctx;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
+
+ nr_hw_entries = brbe_get_numrec(arm_pmu->reg_brbidr);
+ nr_live = brbe_branch_save(live, nr_hw_entries);
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ task_ctx->nr_brbe_records = nr_store;
+}
+
/*
* Generic perf branch filters supported on BRBE
*
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 408974d5c57b..aa3c7b3dcdd6 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -923,9 +923,19 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+ void *task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;

- if (sched_in && armpmu->has_branch_stack)
- armv8pmu_branch_reset();
+ if (armpmu->has_branch_stack) {
+ /* Save branch records in task_ctx on sched out */
+ if (task_ctx && !sched_in) {
+ armv8pmu_branch_save(armpmu, task_ctx);
+ return;
+ }
+
+ /* Reset branch records on sched in */
+ if (sched_in)
+ armv8pmu_branch_reset();
+ }
}

/*
--
2.25.1


2023-07-11 08:54:36

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE

This enables branch stack sampling events in ARMV8 PMU, via an architecture
feature FEAT_BRBE aka branch record buffer extension. This defines required
branch helper functions pmuv8pmu_branch_XXXXX() and the implementation here
is wrapped with a new config option CONFIG_ARM64_BRBE.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/include/asm/perf_event.h | 9 +
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 549 ++++++++++++++++++++++++++++
drivers/perf/arm_brbe.h | 257 +++++++++++++
drivers/perf/arm_pmuv3.c | 4 +
6 files changed, 831 insertions(+)
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_brbe.h

diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index ebc392ba3559..49a973571415 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -31,6 +31,14 @@ struct perf_event;
#ifdef CONFIG_PERF_EVENTS
static inline bool has_branch_stack(struct perf_event *event);

+#ifdef CONFIG_ARM64_BRBE
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event);
+bool armv8pmu_branch_attr_valid(struct perf_event *event);
+void armv8pmu_branch_enable(struct perf_event *event);
+void armv8pmu_branch_disable(struct perf_event *event);
+void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_reset(void);
+#else
static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
{
WARN_ON_ONCE(!has_branch_stack(event));
@@ -56,3 +64,4 @@ static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
static inline void armv8pmu_branch_reset(void) { }
#endif
#endif
+#endif
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index f4572a5cca72..7c8448051741 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -180,6 +180,17 @@ config ARM_SPE_PMU
Extension, which provides periodic sampling of operations in
the CPU pipeline and reports this via the perf AUX interface.

+config ARM64_BRBE
+ bool "Enable support for Branch Record Buffer Extension (BRBE)"
+ depends on PERF_EVENTS && ARM64 && ARM_PMU
+ default y
+ help
+ Enable perf support for Branch Record Buffer Extension (BRBE) which
+ records all branches taken in an execution path. This supports some
+ branch types and privilege based filtering. It captured additional
+ relevant information such as cycle count, misprediction and branch
+ type, branch privilege level etc.
+
config ARM_DMC620_PMU
tristate "Enable PMU support for the ARM DMC-620 memory controller"
depends on (ARM64 && ACPI) || COMPILE_TEST
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 16b3ec4db916..a8b7bc22e3d6 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_RISCV_PMU_SBI) += riscv_pmu_sbi.o
obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
+obj-$(CONFIG_ARM64_BRBE) += arm_brbe.o
obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
new file mode 100644
index 000000000000..79106300cf2e
--- /dev/null
+++ b/drivers/perf/arm_brbe.c
@@ -0,0 +1,549 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Branch Record Buffer Extension Driver.
+ *
+ * Copyright (C) 2022 ARM Limited
+ *
+ * Author: Anshuman Khandual <[email protected]>
+ */
+#include "arm_brbe.h"
+
+static bool valid_brbe_nr(int brbe_nr)
+{
+ return brbe_nr == BRBIDR0_EL1_NUMREC_8 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_16 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_32 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_64;
+}
+
+static bool valid_brbe_cc(int brbe_cc)
+{
+ return brbe_cc == BRBIDR0_EL1_CC_20_BIT;
+}
+
+static bool valid_brbe_format(int brbe_format)
+{
+ return brbe_format == BRBIDR0_EL1_FORMAT_0;
+}
+
+static bool valid_brbe_version(int brbe_version)
+{
+ return brbe_version == ID_AA64DFR0_EL1_BRBE_IMP ||
+ brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
+}
+
+static void select_brbe_bank(int bank)
+{
+ u64 brbfcr;
+
+ WARN_ON(bank > BRBE_BANK_IDX_1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_BANK_MASK;
+ brbfcr |= SYS_FIELD_PREP(BRBFCR_EL1, BANK, bank);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+/*
+ * Generic perf branch filters supported on BRBE
+ *
+ * New branch filters need to be evaluated whether they could be supported on
+ * BRBE. This ensures that such branch filters would not just be accepted, to
+ * fail silently. PERF_SAMPLE_BRANCH_HV is a special case that is selectively
+ * supported only on platforms where kernel is in hyp mode.
+ */
+#define BRBE_EXCLUDE_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_ABORT_TX | \
+ PERF_SAMPLE_BRANCH_IN_TX | \
+ PERF_SAMPLE_BRANCH_NO_TX | \
+ PERF_SAMPLE_BRANCH_CALL_STACK)
+
+#define BRBE_ALLOWED_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_USER | \
+ PERF_SAMPLE_BRANCH_KERNEL | \
+ PERF_SAMPLE_BRANCH_HV | \
+ PERF_SAMPLE_BRANCH_ANY | \
+ PERF_SAMPLE_BRANCH_ANY_CALL | \
+ PERF_SAMPLE_BRANCH_ANY_RETURN | \
+ PERF_SAMPLE_BRANCH_IND_CALL | \
+ PERF_SAMPLE_BRANCH_COND | \
+ PERF_SAMPLE_BRANCH_IND_JUMP | \
+ PERF_SAMPLE_BRANCH_CALL | \
+ PERF_SAMPLE_BRANCH_NO_FLAGS | \
+ PERF_SAMPLE_BRANCH_NO_CYCLES | \
+ PERF_SAMPLE_BRANCH_TYPE_SAVE | \
+ PERF_SAMPLE_BRANCH_HW_INDEX | \
+ PERF_SAMPLE_BRANCH_PRIV_SAVE)
+
+#define BRBE_PERF_BRANCH_FILTERS (BRBE_ALLOWED_BRANCH_FILTERS | \
+ BRBE_EXCLUDE_BRANCH_FILTERS)
+
+bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ u64 branch_type = event->attr.branch_sample_type;
+
+ /*
+ * Ensure both perf branch filter allowed and exclude
+ * masks are always in sync with the generic perf ABI.
+ */
+ BUILD_BUG_ON(BRBE_PERF_BRANCH_FILTERS != (PERF_SAMPLE_BRANCH_MAX - 1));
+
+ if (branch_type & ~BRBE_ALLOWED_BRANCH_FILTERS) {
+ pr_debug_once("requested branch filter not supported 0x%llx\n", branch_type);
+ return false;
+ }
+
+ /*
+ * If the event does not have at least one of the privilege
+ * branch filters as in PERF_SAMPLE_BRANCH_PLM_ALL, the core
+ * perf will adjust its value based on perf event's existing
+ * privilege level via attr.exclude_[user|kernel|hv].
+ *
+ * As event->attr.branch_sample_type might have been changed
+ * when the event reaches here, it is not possible to figure
+ * out whether the event originally had HV privilege request
+ * or got added via the core perf. Just report this situation
+ * once and continue ignoring if there are other instances.
+ */
+ if ((branch_type & PERF_SAMPLE_BRANCH_HV) && !is_kernel_in_hyp_mode())
+ pr_debug_once("hypervisor privilege filter not supported 0x%llx\n", branch_type);
+
+ return true;
+}
+
+static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
+{
+ u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
+ int brbe_version, brbe_format, brbe_cc, brbe_nr;
+
+ brbe_version = brbe;
+ brbe_format = brbe_get_format(brbidr);
+ brbe_cc = brbe_get_cc_bits(brbidr);
+ brbe_nr = brbe_get_numrec(brbidr);
+ armpmu->reg_brbidr = brbidr;
+
+ if (!valid_brbe_version(brbe_version) ||
+ !valid_brbe_format(brbe_format) ||
+ !valid_brbe_cc(brbe_cc) ||
+ !valid_brbe_nr(brbe_nr))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+void armv8pmu_branch_probe(struct arm_pmu *armpmu)
+{
+ u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+ u32 brbe;
+
+ brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+ if (!brbe)
+ return;
+
+ if (brbe_attributes_probe(armpmu, brbe))
+ return;
+
+ armpmu->has_branch_stack = 1;
+}
+
+static u64 branch_type_to_brbfcr(int branch_type)
+{
+ u64 brbfcr = 0;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
+ return brbfcr;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+ brbfcr |= BRBFCR_EL1_INDCALL;
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbfcr |= BRBFCR_EL1_RTN;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
+ brbfcr |= BRBFCR_EL1_INDCALL;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_COND)
+ brbfcr |= BRBFCR_EL1_CONDDIR;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+ brbfcr |= BRBFCR_EL1_INDIRECT;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_CALL)
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+
+ return brbfcr;
+}
+
+static u64 branch_type_to_brbcr(int branch_type)
+{
+ u64 brbcr = BRBCR_EL1_DEFAULT_TS;
+
+ /*
+ * BRBE should be paused on PMU interrupt while tracing kernel
+ * space to stop capturing further branch records. Otherwise
+ * interrupt handler branch records might get into the samples
+ * which is not desired.
+ *
+ * BRBE need not be paused on PMU interrupt while tracing only
+ * the user space, because it will automatically be inside the
+ * prohibited region. But even after PMU overflow occurs, the
+ * interrupt could still take much more cycles, before it can
+ * be taken and by that time BRBE will have been overwritten.
+ * Hence enable pause on PMU interrupt mechanism even for user
+ * only traces as well.
+ */
+ brbcr |= BRBCR_EL1_FZP;
+
+ /*
+ * When running in the hyp mode, writing into BRBCR_EL1
+ * actually writes into BRBCR_EL2 instead. Field E2BRE
+ * is also at the same position as E1BRE.
+ */
+ if (branch_type & PERF_SAMPLE_BRANCH_USER)
+ brbcr |= BRBCR_EL1_E0BRE;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
+ brbcr |= BRBCR_EL1_E1BRE;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_HV) {
+ if (is_kernel_in_hyp_mode())
+ brbcr |= BRBCR_EL1_E1BRE;
+ }
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
+ brbcr |= BRBCR_EL1_CC;
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
+ brbcr |= BRBCR_EL1_MPRED;
+
+ /*
+ * The exception and exception return branches could be
+ * captured, irrespective of the perf event's privilege.
+ * If the perf event does not have enough privilege for
+ * a given exception level, then addresses which falls
+ * under that exception level will be reported as zero
+ * for the captured branch record, creating source only
+ * or target only records.
+ */
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbcr |= BRBCR_EL1_EXCEPTION;
+ brbcr |= BRBCR_EL1_ERTN;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+ brbcr |= BRBCR_EL1_EXCEPTION;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbcr |= BRBCR_EL1_ERTN;
+
+ return brbcr & BRBCR_EL1_DEFAULT_CONFIG;
+}
+
+void armv8pmu_branch_enable(struct perf_event *event)
+{
+ u64 branch_type = event->attr.branch_sample_type;
+ u64 brbfcr, brbcr;
+
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
+ brbfcr |= branch_type_to_brbfcr(branch_type);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbcr &= ~BRBCR_EL1_DEFAULT_CONFIG;
+ brbcr |= branch_type_to_brbcr(branch_type);
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ isb();
+ armv8pmu_branch_reset();
+}
+
+void armv8pmu_branch_disable(struct perf_event *event)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+
+ brbcr &= ~(BRBCR_EL1_E0BRE | BRBCR_EL1_E1BRE);
+ brbfcr |= BRBFCR_EL1_PAUSED;
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static void brbe_set_perf_entry_type(struct perf_branch_entry *entry, u64 brbinf)
+{
+ int brbe_type = brbe_get_type(brbinf);
+
+ switch (brbe_type) {
+ case BRBINFx_EL1_TYPE_UNCOND_DIRECT:
+ entry->type = PERF_BR_UNCOND;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT:
+ entry->type = PERF_BR_IND;
+ break;
+ case BRBINFx_EL1_TYPE_DIRECT_LINK:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT_LINK:
+ entry->type = PERF_BR_IND_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_RET:
+ entry->type = PERF_BR_RET;
+ break;
+ case BRBINFx_EL1_TYPE_COND_DIRECT:
+ entry->type = PERF_BR_COND;
+ break;
+ case BRBINFx_EL1_TYPE_CALL:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_TRAP:
+ entry->type = PERF_BR_SYSCALL;
+ break;
+ case BRBINFx_EL1_TYPE_ERET:
+ entry->type = PERF_BR_ERET;
+ break;
+ case BRBINFx_EL1_TYPE_IRQ:
+ entry->type = PERF_BR_IRQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_HALT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_HALT;
+ break;
+ case BRBINFx_EL1_TYPE_SERROR:
+ entry->type = PERF_BR_SERROR;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_ALIGN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_ALGN;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_FIQ:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_FIQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_EXIT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_EXIT;
+ break;
+ default:
+ pr_warn_once("%d - unknown branch type captured\n", brbe_type);
+ entry->type = PERF_BR_UNKNOWN;
+ break;
+ }
+}
+
+static int brbe_get_perf_priv(u64 brbinf)
+{
+ int brbe_el = brbe_get_el(brbinf);
+
+ switch (brbe_el) {
+ case BRBINFx_EL1_EL_EL0:
+ return PERF_BR_PRIV_USER;
+ case BRBINFx_EL1_EL_EL1:
+ return PERF_BR_PRIV_KERNEL;
+ case BRBINFx_EL1_EL_EL2:
+ if (is_kernel_in_hyp_mode())
+ return PERF_BR_PRIV_KERNEL;
+ return PERF_BR_PRIV_HV;
+ default:
+ pr_warn_once("%d - unknown branch privilege captured\n", brbe_el);
+ return PERF_BR_PRIV_UNKNOWN;
+ }
+}
+
+static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
+ u64 brbinf)
+{
+ if (branch_sample_type(event))
+ brbe_set_perf_entry_type(entry, brbinf);
+
+ if (!branch_sample_no_cycles(event))
+ entry->cycles = brbe_get_cycles(brbinf);
+
+ if (!branch_sample_no_flags(event)) {
+ /*
+ * BRBINFx_EL1.LASTFAILED indicates that a TME transaction failed (or
+ * was cancelled) prior to this record, and some number of records
+ * prior to this one, may have been generated during an attempt to
+ * execute the transaction.
+ *
+ * We will remove such entries later in process_branch_aborts().
+ */
+ entry->abort = brbe_get_lastfailed(brbinf);
+
+ /*
+ * All these information (i.e transaction state and mispredicts)
+ * are available for source only and complete branch records.
+ */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_source_only(brbinf)) {
+ entry->mispred = brbe_get_mispredict(brbinf);
+ entry->predicted = !entry->mispred;
+ entry->in_tx = brbe_get_in_tx(brbinf);
+ }
+ }
+
+ if (branch_sample_priv(event)) {
+ /*
+ * All these information (i.e branch privilege level) are
+ * available for target only and complete branch records.
+ */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_target_only(brbinf))
+ entry->priv = brbe_get_perf_priv(brbinf);
+ }
+}
+
+/*
+ * A branch record with BRBINFx_EL1.LASTFAILED set, implies that all
+ * preceding consecutive branch records, that were in a transaction
+ * (i.e their BRBINFx_EL1.TX set) have been aborted.
+ *
+ * Similarly BRBFCR_EL1.LASTFAILED set, indicate that all preceding
+ * consecutive branch records up to the last record, which were in a
+ * transaction (i.e their BRBINFx_EL1.TX set) have been aborted.
+ *
+ * --------------------------------- -------------------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
+ * --------------------------------- -------------------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ *
+ * BRBFCR_EL1.LASTFAILED == 1
+ *
+ * BRBFCR_EL1.LASTFAILED fails all those consecutive, in transaction
+ * branches records near the end of the BRBE buffer.
+ *
+ * Architecture does not guarantee a non transaction (TX = 0) branch
+ * record between two different transactions. So it is possible that
+ * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
+ * mark more than required transactions as aborted.
+ */
+static void process_branch_aborts(struct pmu_hw_events *cpuc)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ bool lastfailed = !!(brbfcr & BRBFCR_EL1_LASTFAILED);
+ int idx = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr) - 1;
+ struct perf_branch_entry *entry;
+
+ do {
+ entry = &cpuc->branches->branch_entries[idx];
+ if (entry->in_tx) {
+ entry->abort = lastfailed;
+ } else {
+ lastfailed = entry->abort;
+ entry->abort = false;
+ }
+ } while (idx--, idx >= 0);
+}
+
+void armv8pmu_branch_reset(void)
+{
+ asm volatile(BRB_IALL);
+ isb();
+}
+
+static bool capture_branch_entry(struct pmu_hw_events *cpuc,
+ struct perf_event *event, int idx)
+{
+ struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
+ u64 brbinf = get_brbinf_reg(idx);
+
+ /*
+ * There are no valid entries anymore on the buffer.
+ * Abort the branch record processing to save some
+ * cycles and also reduce the capture/process load
+ * for the user space as well.
+ */
+ if (brbe_invalid(brbinf))
+ return false;
+
+ perf_clear_branch_entry_bitfields(entry);
+ if (brbe_record_is_complete(brbinf)) {
+ entry->from = get_brbsrc_reg(idx);
+ entry->to = get_brbtgt_reg(idx);
+ } else if (brbe_record_is_source_only(brbinf)) {
+ entry->from = get_brbsrc_reg(idx);
+ entry->to = 0;
+ } else if (brbe_record_is_target_only(brbinf)) {
+ entry->from = 0;
+ entry->to = get_brbtgt_reg(idx);
+ }
+ capture_brbe_flags(entry, event, brbinf);
+ return true;
+}
+
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+ int nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
+ u64 brbfcr, brbcr;
+ int idx = 0;
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+
+ /* Ensure pause on PMU interrupt is enabled */
+ WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
+
+ /* Pause the buffer */
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ /* Loop through bank 0 */
+ select_brbe_bank(BRBE_BANK_IDX_0);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
+ if (!capture_branch_entry(cpuc, event, idx))
+ goto skip_bank_1;
+ idx++;
+ }
+
+ /* Loop through bank 1 */
+ select_brbe_bank(BRBE_BANK_IDX_1);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
+ if (!capture_branch_entry(cpuc, event, idx))
+ break;
+ idx++;
+ }
+
+skip_bank_1:
+ cpuc->branches->branch_stack.nr = idx;
+ cpuc->branches->branch_stack.hw_idx = -1ULL;
+ process_branch_aborts(cpuc);
+
+ /* Unpause the buffer */
+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+ armv8pmu_branch_reset();
+}
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
new file mode 100644
index 000000000000..a47480eec070
--- /dev/null
+++ b/drivers/perf/arm_brbe.h
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Branch Record Buffer Extension Helpers.
+ *
+ * Copyright (C) 2022 ARM Limited
+ *
+ * Author: Anshuman Khandual <[email protected]>
+ */
+#define pr_fmt(fmt) "brbe: " fmt
+
+#include <linux/perf/arm_pmu.h>
+
+#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT | \
+ BRBFCR_EL1_INDIRECT | \
+ BRBFCR_EL1_RTN | \
+ BRBFCR_EL1_INDCALL | \
+ BRBFCR_EL1_DIRCALL | \
+ BRBFCR_EL1_CONDDIR)
+
+#define BRBFCR_EL1_DEFAULT_CONFIG (BRBFCR_EL1_BANK_MASK | \
+ BRBFCR_EL1_PAUSED | \
+ BRBFCR_EL1_EnI | \
+ BRBFCR_EL1_BRANCH_FILTERS)
+
+/*
+ * BRBTS_EL1 is currently not used for branch stack implementation
+ * purpose but BRBCR_EL1.TS needs to have a valid value from all
+ * available options. BRBCR_EL1_TS_VIRTUAL is selected for this.
+ */
+#define BRBCR_EL1_DEFAULT_TS FIELD_PREP(BRBCR_EL1_TS_MASK, BRBCR_EL1_TS_VIRTUAL)
+
+#define BRBCR_EL1_DEFAULT_CONFIG (BRBCR_EL1_EXCEPTION | \
+ BRBCR_EL1_ERTN | \
+ BRBCR_EL1_CC | \
+ BRBCR_EL1_MPRED | \
+ BRBCR_EL1_E1BRE | \
+ BRBCR_EL1_E0BRE | \
+ BRBCR_EL1_FZP | \
+ BRBCR_EL1_DEFAULT_TS)
+/*
+ * BRBE Instructions
+ *
+ * BRB_IALL : Invalidate the entire buffer
+ * BRB_INJ : Inject latest branch record derived from [BRBSRCINJ, BRBTGTINJ, BRBINFINJ]
+ */
+#define BRB_IALL __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 4) | (0x1f))
+#define BRB_INJ __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 5) | (0x1f))
+
+/*
+ * BRBE Buffer Organization
+ *
+ * BRBE buffer is arranged as multiple banks of 32 branch record
+ * entries each. An individual branch record in a given bank could
+ * be accessed, after selecting the bank in BRBFCR_EL1.BANK and
+ * accessing the registers i.e [BRBSRC, BRBTGT, BRBINF] set with
+ * indices [0..31].
+ *
+ * Bank 0
+ *
+ * --------------------------------- ------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 31 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ *
+ * Bank 1
+ *
+ * --------------------------------- ------
+ * | 32 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 33 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ */
+#define BRBE_BANK_MAX_ENTRIES 32
+
+#define BRBE_BANK0_IDX_MIN 0
+#define BRBE_BANK0_IDX_MAX 31
+#define BRBE_BANK1_IDX_MIN 32
+#define BRBE_BANK1_IDX_MAX 63
+
+struct brbe_hw_attr {
+ int brbe_version;
+ int brbe_cc;
+ int brbe_nr;
+ int brbe_format;
+};
+
+enum brbe_bank_idx {
+ BRBE_BANK_IDX_INVALID = -1,
+ BRBE_BANK_IDX_0,
+ BRBE_BANK_IDX_1,
+ BRBE_BANK_IDX_MAX
+};
+
+#define RETURN_READ_BRBSRCN(n) \
+ read_sysreg_s(SYS_BRBSRC##n##_EL1)
+
+#define RETURN_READ_BRBTGTN(n) \
+ read_sysreg_s(SYS_BRBTGT##n##_EL1)
+
+#define RETURN_READ_BRBINFN(n) \
+ read_sysreg_s(SYS_BRBINF##n##_EL1)
+
+#define BRBE_REGN_CASE(n, case_macro) \
+ case n: return case_macro(n); break
+
+#define BRBE_REGN_SWITCH(x, case_macro) \
+ do { \
+ switch (x) { \
+ BRBE_REGN_CASE(0, case_macro); \
+ BRBE_REGN_CASE(1, case_macro); \
+ BRBE_REGN_CASE(2, case_macro); \
+ BRBE_REGN_CASE(3, case_macro); \
+ BRBE_REGN_CASE(4, case_macro); \
+ BRBE_REGN_CASE(5, case_macro); \
+ BRBE_REGN_CASE(6, case_macro); \
+ BRBE_REGN_CASE(7, case_macro); \
+ BRBE_REGN_CASE(8, case_macro); \
+ BRBE_REGN_CASE(9, case_macro); \
+ BRBE_REGN_CASE(10, case_macro); \
+ BRBE_REGN_CASE(11, case_macro); \
+ BRBE_REGN_CASE(12, case_macro); \
+ BRBE_REGN_CASE(13, case_macro); \
+ BRBE_REGN_CASE(14, case_macro); \
+ BRBE_REGN_CASE(15, case_macro); \
+ BRBE_REGN_CASE(16, case_macro); \
+ BRBE_REGN_CASE(17, case_macro); \
+ BRBE_REGN_CASE(18, case_macro); \
+ BRBE_REGN_CASE(19, case_macro); \
+ BRBE_REGN_CASE(20, case_macro); \
+ BRBE_REGN_CASE(21, case_macro); \
+ BRBE_REGN_CASE(22, case_macro); \
+ BRBE_REGN_CASE(23, case_macro); \
+ BRBE_REGN_CASE(24, case_macro); \
+ BRBE_REGN_CASE(25, case_macro); \
+ BRBE_REGN_CASE(26, case_macro); \
+ BRBE_REGN_CASE(27, case_macro); \
+ BRBE_REGN_CASE(28, case_macro); \
+ BRBE_REGN_CASE(29, case_macro); \
+ BRBE_REGN_CASE(30, case_macro); \
+ BRBE_REGN_CASE(31, case_macro); \
+ default: \
+ pr_warn("unknown register index\n"); \
+ return -1; \
+ } \
+ } while (0)
+
+static inline int buffer_to_brbe_idx(int buffer_idx)
+{
+ return buffer_idx % BRBE_BANK_MAX_ENTRIES;
+}
+
+static inline u64 get_brbsrc_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBSRCN);
+}
+
+static inline u64 get_brbtgt_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBTGTN);
+}
+
+static inline u64 get_brbinf_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBINFN);
+}
+
+static inline u64 brbe_record_valid(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_VALID_MASK, brbinf);
+}
+
+static inline bool brbe_invalid(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_NONE;
+}
+
+static inline bool brbe_record_is_complete(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_FULL;
+}
+
+static inline bool brbe_record_is_source_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_SOURCE;
+}
+
+static inline bool brbe_record_is_target_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_TARGET;
+}
+
+static inline int brbe_get_in_tx(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_T_MASK, brbinf);
+}
+
+static inline int brbe_get_mispredict(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_MPRED_MASK, brbinf);
+}
+
+static inline int brbe_get_lastfailed(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_LASTFAILED_MASK, brbinf);
+}
+
+static inline int brbe_get_cycles(u64 brbinf)
+{
+ /*
+ * Captured cycle count is unknown and hence
+ * should not be passed on to the user space.
+ */
+ if (brbinf & BRBINFx_EL1_CCU)
+ return 0;
+
+ return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
+}
+
+static inline int brbe_get_type(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_TYPE_MASK, brbinf);
+}
+
+static inline int brbe_get_el(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_EL_MASK, brbinf);
+}
+
+static inline int brbe_get_numrec(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
+}
+
+static inline int brbe_get_format(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_FORMAT_MASK, brbidr);
+}
+
+static inline int brbe_get_cc_bits(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_CC_MASK, brbidr);
+}
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 10bbe02f4079..7c9e9045c24e 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -813,6 +813,10 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
if (!armpmu_event_set_period(event))
continue;

+ /*
+ * PMU IRQ should remain asserted until all branch records
+ * are captured and processed into struct perf_sample_data.
+ */
if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
armv8pmu_branch_read(cpuc, event);
perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
--
2.25.1


2023-07-11 08:56:23

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 03/10] arm64/perf: Add branch stack support in struct arm_pmu

This updates 'struct arm_pmu' for branch stack sampling support being added
later. This adds an element 'reg_trbidr' to capture BRBE attribute details.
These updates here will help in tracking any branch stack sampling support.

This also enables perf branch stack sampling event on all 'struct arm pmu',
supporting the feature but after removing the current gate that blocks such
events unconditionally in armpmu_event_init(). Instead a quick probe can be
initiated via arm_pmu->has_branch_stack to ascertain the support.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/perf/arm_pmu.c | 3 +--
include/linux/perf/arm_pmu.h | 9 ++++++++-
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index c0475a96c2a0..fe456a46fe8d 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -512,8 +512,7 @@ static int armpmu_event_init(struct perf_event *event)
!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
return -ENOENT;

- /* does not support taken branch sampling */
- if (has_branch_stack(event))
+ if (has_branch_stack(event) && !armpmu->has_branch_stack)
return -EOPNOTSUPP;

return __hw_perf_event_init(event);
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 3d3bd944d630..5ecd3d0e7645 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -104,7 +104,9 @@ struct arm_pmu {
int (*map_event)(struct perf_event *event);
void (*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
int num_events;
- bool secure_access; /* 32-bit ARM only */
+ unsigned int secure_access : 1, /* 32-bit ARM only */
+ has_branch_stack: 1, /* 64-bit ARM only */
+ reserved : 30;
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
#define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE 0x4000
@@ -120,6 +122,11 @@ struct arm_pmu {

/* Only to be used by ACPI probing code */
unsigned long acpi_cpuid;
+
+ /* Implementation specific attributes */
+#ifdef CONFIG_ARM64_BRBE
+ u64 reg_brbidr;
+#endif
};

#define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
--
2.25.1


2023-07-11 08:56:25

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 05/10] arm64/perf: Add branch stack support in ARMV8 PMU

This enables support for branch stack sampling event in ARMV8 PMU, checking
has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
for now. While here, this also defines arm_pmu's sched_task() callback with
armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm/include/asm/arm_pmuv3.h | 9 +++
arch/arm64/include/asm/perf_event.h | 31 +++++++++++
drivers/perf/arm_pmuv3.c | 86 +++++++++++++++++++++--------
3 files changed, 102 insertions(+), 24 deletions(-)

diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h
index f3cd04ff022d..d7c438939a6f 100644
--- a/arch/arm/include/asm/arm_pmuv3.h
+++ b/arch/arm/include/asm/arm_pmuv3.h
@@ -249,4 +249,13 @@ static inline bool is_pmuv3p5(int pmuver)
return pmuver >= ARMV8_PMU_DFR_VER_V3P5;
}

+/* BRBE stubs */
+#ifdef CONFIG_PERF_EVENTS
+static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event) { }
+static inline bool armv8pmu_branch_attr_valid(struct perf_event *event) { return false; }
+static inline void armv8pmu_branch_enable(struct perf_event *event) { }
+static inline void armv8pmu_branch_disable(struct perf_event *event) { }
+static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_reset(void) { }
+#endif
#endif
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index eb7071c9eb34..ebc392ba3559 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -24,4 +24,35 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
(regs)->pstate = PSR_MODE_EL1h; \
}

+struct pmu_hw_events;
+struct arm_pmu;
+struct perf_event;
+
+#ifdef CONFIG_PERF_EVENTS
+static inline bool has_branch_stack(struct perf_event *event);
+
+static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+ return false;
+}
+
+static inline void armv8pmu_branch_enable(struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_disable(struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_reset(void) { }
+#endif
#endif
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 08b3a1bf0ef6..10bbe02f4079 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -721,38 +721,21 @@ static void armv8pmu_enable_event(struct perf_event *event)
* Enable counter and interrupt, and set the counter to count
* the event that we're interested in.
*/
-
- /*
- * Disable counter
- */
armv8pmu_disable_event_counter(event);
-
- /*
- * Set event.
- */
armv8pmu_write_event_type(event);
-
- /*
- * Enable interrupt for this counter
- */
armv8pmu_enable_event_irq(event);
-
- /*
- * Enable counter
- */
armv8pmu_enable_event_counter(event);
+
+ if (has_branch_stack(event))
+ armv8pmu_branch_enable(event);
}

static void armv8pmu_disable_event(struct perf_event *event)
{
- /*
- * Disable counter
- */
- armv8pmu_disable_event_counter(event);
+ if (has_branch_stack(event))
+ armv8pmu_branch_disable(event);

- /*
- * Disable interrupt for this counter
- */
+ armv8pmu_disable_event_counter(event);
armv8pmu_disable_event_irq(event);
}

@@ -830,6 +813,11 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
if (!armpmu_event_set_period(event))
continue;

+ if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
+ armv8pmu_branch_read(cpuc, event);
+ perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
+ }
+
/*
* Perf event overflow will queue the processing of the event as
* an irq_work which will be taken care of in the handling of
@@ -928,6 +916,14 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
return event->hw.idx;
}

+static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+
+ if (sched_in && armpmu->has_branch_stack)
+ armv8pmu_branch_reset();
+}
+
/*
* Add an event filter to a given event.
*/
@@ -998,6 +994,9 @@ static void armv8pmu_reset(void *info)
pmcr |= ARMV8_PMU_PMCR_LP;

armv8pmu_pmcr_write(pmcr);
+
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_reset();
}

static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
@@ -1035,6 +1034,9 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,

hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);

+ if (has_branch_stack(event) && !armv8pmu_branch_attr_valid(event))
+ return -EOPNOTSUPP;
+
/*
* CHAIN events only work when paired with an adjacent counter, and it
* never makes sense for a user to open one in isolation, as they'll be
@@ -1151,6 +1153,33 @@ static void __armv8pmu_probe_pmu(void *info)
cpu_pmu->reg_pmmir = read_pmmir();
else
cpu_pmu->reg_pmmir = 0;
+ armv8pmu_branch_probe(cpu_pmu);
+}
+
+static int branch_records_alloc(struct arm_pmu *armpmu)
+{
+ struct branch_records __percpu *records;
+ int cpu;
+
+ records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
+ if (!records)
+ return -ENOMEM;
+
+ /*
+ * FIXME: Memory allocated via records gets completely
+ * consumed here, never required to be freed up later. Hence
+ * losing access to on stack 'records' is acceptable.
+ * Otherwise this alloc handle has to be saved some where.
+ */
+ for_each_possible_cpu(cpu) {
+ struct pmu_hw_events *events_cpu;
+ struct branch_records *records_cpu;
+
+ events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
+ records_cpu = per_cpu_ptr(records, cpu);
+ events_cpu->branches = records_cpu;
+ }
+ return 0;
}

static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
@@ -1167,7 +1196,15 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
if (ret)
return ret;

- return probe.present ? 0 : -ENODEV;
+ if (!probe.present)
+ return -ENODEV;
+
+ if (cpu_pmu->has_branch_stack) {
+ ret = branch_records_alloc(cpu_pmu);
+ if (ret)
+ return ret;
+ }
+ return 0;
}

static void armv8pmu_disable_user_access_ipi(void *unused)
@@ -1230,6 +1267,7 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;

cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
+ cpu_pmu->sched_task = armv8pmu_sched_task;

cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
--
2.25.1


2023-07-11 08:56:58

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V13 - RESEND 07/10] arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()

Short running processes i.e those getting very small cpu run time each time
when they get scheduled on, might not accumulate much branch records before
a PMU IRQ really happens. This increases possibility, for such processes to
loose much of its branch records, while being scheduled in-out of various
cpus on the system.

There is a need to save all occurred branch records during the cpu run time
while the process gets scheduled out. It requires an event context specific
buffer for such storage.

This adds PERF_ATTACH_TASK_DATA flag unconditionally, for all branch stack
sampling events, which would allocate task_ctx_data during its event init.
This also creates a platform specific task_ctx_data kmem cache which will
serve such allocation requests.

This adds a new structure 'arm64_perf_task_context' which encapsulates brbe
register set for maximum possible BRBE entries on the HW along with a valid
records tracking element.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm/include/asm/arm_pmuv3.h | 2 ++
arch/arm64/include/asm/perf_event.h | 4 ++++
drivers/perf/arm_brbe.c | 21 +++++++++++++++++++++
drivers/perf/arm_brbe.h | 13 +++++++++++++
drivers/perf/arm_pmuv3.c | 16 +++++++++++++---
5 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h
index d7c438939a6f..3d8faf4200dc 100644
--- a/arch/arm/include/asm/arm_pmuv3.h
+++ b/arch/arm/include/asm/arm_pmuv3.h
@@ -257,5 +257,7 @@ static inline void armv8pmu_branch_enable(struct perf_event *event) { }
static inline void armv8pmu_branch_disable(struct perf_event *event) { }
static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
static inline void armv8pmu_branch_reset(void) { }
+static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu) { return 0; }
+static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu) { }
#endif
#endif
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index 49a973571415..b0c12a5882df 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -38,6 +38,8 @@ void armv8pmu_branch_enable(struct perf_event *event);
void armv8pmu_branch_disable(struct perf_event *event);
void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
void armv8pmu_branch_reset(void);
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu);
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu);
#else
static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
{
@@ -62,6 +64,8 @@ static inline void armv8pmu_branch_disable(struct perf_event *event)

static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
static inline void armv8pmu_branch_reset(void) { }
+static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu) { return 0; }
+static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu) { }
#endif
#endif
#endif
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 79106300cf2e..a74459445813 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -109,6 +109,27 @@ bool armv8pmu_branch_attr_valid(struct perf_event *event)
return true;
}

+static inline struct kmem_cache *
+arm64_create_brbe_task_ctx_kmem_cache(size_t size)
+{
+ return kmem_cache_create("arm64_brbe_task_ctx", size, 0, 0, NULL);
+}
+
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
+{
+ size_t size = sizeof(struct arm64_perf_task_context);
+
+ arm_pmu->pmu.task_ctx_cache = arm64_create_brbe_task_ctx_kmem_cache(size);
+ if (!arm_pmu->pmu.task_ctx_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
+{
+ kmem_cache_destroy(arm_pmu->pmu.task_ctx_cache);
+}
+
static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
{
u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
index a47480eec070..4a72c2ba7140 100644
--- a/drivers/perf/arm_brbe.h
+++ b/drivers/perf/arm_brbe.h
@@ -80,12 +80,25 @@
* --------------------------------- ------
*/
#define BRBE_BANK_MAX_ENTRIES 32
+#define BRBE_MAX_BANK 2
+#define BRBE_MAX_ENTRIES (BRBE_BANK_MAX_ENTRIES * BRBE_MAX_BANK)

#define BRBE_BANK0_IDX_MIN 0
#define BRBE_BANK0_IDX_MAX 31
#define BRBE_BANK1_IDX_MIN 32
#define BRBE_BANK1_IDX_MAX 63

+struct brbe_regset {
+ unsigned long brbsrc;
+ unsigned long brbtgt;
+ unsigned long brbinf;
+};
+
+struct arm64_perf_task_context {
+ struct brbe_regset store[BRBE_MAX_ENTRIES];
+ int nr_brbe_records;
+};
+
struct brbe_hw_attr {
int brbe_version;
int brbe_cc;
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 7c9e9045c24e..408974d5c57b 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1038,8 +1038,12 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,

hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);

- if (has_branch_stack(event) && !armv8pmu_branch_attr_valid(event))
- return -EOPNOTSUPP;
+ if (has_branch_stack(event)) {
+ if (!armv8pmu_branch_attr_valid(event))
+ return -EOPNOTSUPP;
+
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+ }

/*
* CHAIN events only work when paired with an adjacent counter, and it
@@ -1204,9 +1208,15 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
return -ENODEV;

if (cpu_pmu->has_branch_stack) {
- ret = branch_records_alloc(cpu_pmu);
+ ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
if (ret)
return ret;
+
+ ret = branch_records_alloc(cpu_pmu);
+ if (ret) {
+ armv8pmu_task_ctx_cache_free(cpu_pmu);
+ return ret;
+ }
}
return 0;
}
--
2.25.1


2023-07-11 20:01:52

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE

Hi--

On 7/11/23 01:24, Anshuman Khandual wrote:
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index f4572a5cca72..7c8448051741 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -180,6 +180,17 @@ config ARM_SPE_PMU
> Extension, which provides periodic sampling of operations in
> the CPU pipeline and reports this via the perf AUX interface.
>
> +config ARM64_BRBE
> + bool "Enable support for Branch Record Buffer Extension (BRBE)"
> + depends on PERF_EVENTS && ARM64 && ARM_PMU
> + default y
> + help
> + Enable perf support for Branch Record Buffer Extension (BRBE) which
> + records all branches taken in an execution path. This supports some
> + branch types and privilege based filtering. It captured additional

preferably:
captures

> + relevant information such as cycle count, misprediction and branch
> + type, branch privilege level etc.

--
~Randy

2023-07-12 03:03:20

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE



On 7/12/23 00:56, Randy Dunlap wrote:
> Hi--
>
> On 7/11/23 01:24, Anshuman Khandual wrote:
>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>> index f4572a5cca72..7c8448051741 100644
>> --- a/drivers/perf/Kconfig
>> +++ b/drivers/perf/Kconfig
>> @@ -180,6 +180,17 @@ config ARM_SPE_PMU
>> Extension, which provides periodic sampling of operations in
>> the CPU pipeline and reports this via the perf AUX interface.
>>
>> +config ARM64_BRBE
>> + bool "Enable support for Branch Record Buffer Extension (BRBE)"
>> + depends on PERF_EVENTS && ARM64 && ARM_PMU
>> + default y
>> + help
>> + Enable perf support for Branch Record Buffer Extension (BRBE) which
>> + records all branches taken in an execution path. This supports some
>> + branch types and privilege based filtering. It captured additional
>
> preferably:
> captures

Agreed, will change.

>
>> + relevant information such as cycle count, misprediction and branch
>> + type, branch privilege level etc.
>

2023-07-25 07:35:06

by Yang Shen

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE



在 2023/7/11 16:24, Anshuman Khandual 写道:
> This enables branch stack sampling events in ARMV8 PMU, via an architecture
> feature FEAT_BRBE aka branch record buffer extension. This defines required
> branch helper functions pmuv8pmu_branch_XXXXX() and the implementation here
> is wrapped with a new config option CONFIG_ARM64_BRBE.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: James Clark <[email protected]>
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> arch/arm64/include/asm/perf_event.h | 9 +
> drivers/perf/Kconfig | 11 +
> drivers/perf/Makefile | 1 +
> drivers/perf/arm_brbe.c | 549 ++++++++++++++++++++++++++++
> drivers/perf/arm_brbe.h | 257 +++++++++++++
> drivers/perf/arm_pmuv3.c | 4 +
> 6 files changed, 831 insertions(+)
> create mode 100644 drivers/perf/arm_brbe.c
> create mode 100644 drivers/perf/arm_brbe.h
>
> diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> index ebc392ba3559..49a973571415 100644
> --- a/arch/arm64/include/asm/perf_event.h
> +++ b/arch/arm64/include/asm/perf_event.h
> @@ -31,6 +31,14 @@ struct perf_event;
> #ifdef CONFIG_PERF_EVENTS
> static inline bool has_branch_stack(struct perf_event *event);
>
> +#ifdef CONFIG_ARM64_BRBE
> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event);
> +bool armv8pmu_branch_attr_valid(struct perf_event *event);
> +void armv8pmu_branch_enable(struct perf_event *event);
> +void armv8pmu_branch_disable(struct perf_event *event);
> +void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
> +void armv8pmu_branch_reset(void);
> +#else
> static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> {
> WARN_ON_ONCE(!has_branch_stack(event));
> @@ -56,3 +64,4 @@ static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
> static inline void armv8pmu_branch_reset(void) { }
> #endif
> #endif
> +#endif
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index f4572a5cca72..7c8448051741 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -180,6 +180,17 @@ config ARM_SPE_PMU
> Extension, which provides periodic sampling of operations in
> the CPU pipeline and reports this via the perf AUX interface.
>
> +config ARM64_BRBE
> + bool "Enable support for Branch Record Buffer Extension (BRBE)"
> + depends on PERF_EVENTS && ARM64 && ARM_PMU
> + default y
> + help
> + Enable perf support for Branch Record Buffer Extension (BRBE) which
> + records all branches taken in an execution path. This supports some
> + branch types and privilege based filtering. It captured additional
> + relevant information such as cycle count, misprediction and branch
> + type, branch privilege level etc.
> +
> config ARM_DMC620_PMU
> tristate "Enable PMU support for the ARM DMC-620 memory controller"
> depends on (ARM64 && ACPI) || COMPILE_TEST
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 16b3ec4db916..a8b7bc22e3d6 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_RISCV_PMU_SBI) += riscv_pmu_sbi.o
> obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
> obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
> +obj-$(CONFIG_ARM64_BRBE) += arm_brbe.o
> obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
> obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
> obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> new file mode 100644
> index 000000000000..79106300cf2e
> --- /dev/null
> +++ b/drivers/perf/arm_brbe.c
> @@ -0,0 +1,549 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Branch Record Buffer Extension Driver.
> + *
> + * Copyright (C) 2022 ARM Limited
> + *
> + * Author: Anshuman Khandual <[email protected]>
> + */
> +#include "arm_brbe.h"
> +
> +static bool valid_brbe_nr(int brbe_nr)
> +{
> + return brbe_nr == BRBIDR0_EL1_NUMREC_8 ||
> + brbe_nr == BRBIDR0_EL1_NUMREC_16 ||
> + brbe_nr == BRBIDR0_EL1_NUMREC_32 ||
> + brbe_nr == BRBIDR0_EL1_NUMREC_64;
> +}
> +
> +static bool valid_brbe_cc(int brbe_cc)
> +{
> + return brbe_cc == BRBIDR0_EL1_CC_20_BIT;
> +}
> +
> +static bool valid_brbe_format(int brbe_format)
> +{
> + return brbe_format == BRBIDR0_EL1_FORMAT_0;
> +}
> +
> +static bool valid_brbe_version(int brbe_version)
> +{
> + return brbe_version == ID_AA64DFR0_EL1_BRBE_IMP ||
> + brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
> +}
> +
> +static void select_brbe_bank(int bank)
> +{
> + u64 brbfcr;
> +
> + WARN_ON(bank > BRBE_BANK_IDX_1);
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbfcr &= ~BRBFCR_EL1_BANK_MASK;
> + brbfcr |= SYS_FIELD_PREP(BRBFCR_EL1, BANK, bank);
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + isb();
> +}
> +
> +/*
> + * Generic perf branch filters supported on BRBE
> + *
> + * New branch filters need to be evaluated whether they could be supported on
> + * BRBE. This ensures that such branch filters would not just be accepted, to
> + * fail silently. PERF_SAMPLE_BRANCH_HV is a special case that is selectively
> + * supported only on platforms where kernel is in hyp mode.
> + */
> +#define BRBE_EXCLUDE_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_ABORT_TX | \
> + PERF_SAMPLE_BRANCH_IN_TX | \
> + PERF_SAMPLE_BRANCH_NO_TX | \
> + PERF_SAMPLE_BRANCH_CALL_STACK)
> +
> +#define BRBE_ALLOWED_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_USER | \
> + PERF_SAMPLE_BRANCH_KERNEL | \
> + PERF_SAMPLE_BRANCH_HV | \
> + PERF_SAMPLE_BRANCH_ANY | \
> + PERF_SAMPLE_BRANCH_ANY_CALL | \
> + PERF_SAMPLE_BRANCH_ANY_RETURN | \
> + PERF_SAMPLE_BRANCH_IND_CALL | \
> + PERF_SAMPLE_BRANCH_COND | \
> + PERF_SAMPLE_BRANCH_IND_JUMP | \
> + PERF_SAMPLE_BRANCH_CALL | \
> + PERF_SAMPLE_BRANCH_NO_FLAGS | \
> + PERF_SAMPLE_BRANCH_NO_CYCLES | \
> + PERF_SAMPLE_BRANCH_TYPE_SAVE | \
> + PERF_SAMPLE_BRANCH_HW_INDEX | \
> + PERF_SAMPLE_BRANCH_PRIV_SAVE)
> +
> +#define BRBE_PERF_BRANCH_FILTERS (BRBE_ALLOWED_BRANCH_FILTERS | \
> + BRBE_EXCLUDE_BRANCH_FILTERS)
> +
> +bool armv8pmu_branch_attr_valid(struct perf_event *event)
> +{
> + u64 branch_type = event->attr.branch_sample_type;
> +
> + /*
> + * Ensure both perf branch filter allowed and exclude
> + * masks are always in sync with the generic perf ABI.
> + */
> + BUILD_BUG_ON(BRBE_PERF_BRANCH_FILTERS != (PERF_SAMPLE_BRANCH_MAX - 1));
> +
> + if (branch_type & ~BRBE_ALLOWED_BRANCH_FILTERS) {
> + pr_debug_once("requested branch filter not supported 0x%llx\n", branch_type);
> + return false;
> + }
> +
> + /*
> + * If the event does not have at least one of the privilege
> + * branch filters as in PERF_SAMPLE_BRANCH_PLM_ALL, the core
> + * perf will adjust its value based on perf event's existing
> + * privilege level via attr.exclude_[user|kernel|hv].
> + *
> + * As event->attr.branch_sample_type might have been changed
> + * when the event reaches here, it is not possible to figure
> + * out whether the event originally had HV privilege request
> + * or got added via the core perf. Just report this situation
> + * once and continue ignoring if there are other instances.
> + */
> + if ((branch_type & PERF_SAMPLE_BRANCH_HV) && !is_kernel_in_hyp_mode())
> + pr_debug_once("hypervisor privilege filter not supported 0x%llx\n", branch_type);
> +
> + return true;
> +}
> +
> +static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
> +{
> + u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> + int brbe_version, brbe_format, brbe_cc, brbe_nr;
> +
> + brbe_version = brbe;
> + brbe_format = brbe_get_format(brbidr);
> + brbe_cc = brbe_get_cc_bits(brbidr);
> + brbe_nr = brbe_get_numrec(brbidr);
> + armpmu->reg_brbidr = brbidr;
> +
> + if (!valid_brbe_version(brbe_version) ||
> + !valid_brbe_format(brbe_format) ||
> + !valid_brbe_cc(brbe_cc) ||
> + !valid_brbe_nr(brbe_nr))
> + return -EOPNOTSUPP;
> + return 0;
> +}
> +
> +void armv8pmu_branch_probe(struct arm_pmu *armpmu)
> +{
> + u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
> + u32 brbe;
> +
> + brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
> + if (!brbe)
> + return;
> +
> + if (brbe_attributes_probe(armpmu, brbe))
> + return;
> +
> + armpmu->has_branch_stack = 1;
> +}
> +
> +static u64 branch_type_to_brbfcr(int branch_type)
> +{
> + u64 brbfcr = 0;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
> + brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
> + return brbfcr;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> + brbfcr |= BRBFCR_EL1_INDCALL;
> + brbfcr |= BRBFCR_EL1_DIRCALL;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
> + brbfcr |= BRBFCR_EL1_RTN;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
> + brbfcr |= BRBFCR_EL1_INDCALL;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_COND)
> + brbfcr |= BRBFCR_EL1_CONDDIR;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> + brbfcr |= BRBFCR_EL1_INDIRECT;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_CALL)
> + brbfcr |= BRBFCR_EL1_DIRCALL;
> +
> + return brbfcr;
> +}
> +
> +static u64 branch_type_to_brbcr(int branch_type)
> +{
> + u64 brbcr = BRBCR_EL1_DEFAULT_TS;
> +
> + /*
> + * BRBE should be paused on PMU interrupt while tracing kernel
> + * space to stop capturing further branch records. Otherwise
> + * interrupt handler branch records might get into the samples
> + * which is not desired.
> + *
> + * BRBE need not be paused on PMU interrupt while tracing only
> + * the user space, because it will automatically be inside the
> + * prohibited region. But even after PMU overflow occurs, the
> + * interrupt could still take much more cycles, before it can
> + * be taken and by that time BRBE will have been overwritten.
> + * Hence enable pause on PMU interrupt mechanism even for user
> + * only traces as well.
> + */
> + brbcr |= BRBCR_EL1_FZP;
> +
> + /*
> + * When running in the hyp mode, writing into BRBCR_EL1
> + * actually writes into BRBCR_EL2 instead. Field E2BRE
> + * is also at the same position as E1BRE.
> + */
> + if (branch_type & PERF_SAMPLE_BRANCH_USER)
> + brbcr |= BRBCR_EL1_E0BRE;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
> + brbcr |= BRBCR_EL1_E1BRE;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_HV) {
> + if (is_kernel_in_hyp_mode())
> + brbcr |= BRBCR_EL1_E1BRE;
> + }
> +
> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
> + brbcr |= BRBCR_EL1_CC;

Hi Anshuman,

Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only

valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that

when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to

BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the

BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.

As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.

Or, do you have a more standard solution?

Thanks,

Yang


> +
> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
> + brbcr |= BRBCR_EL1_MPRED;
> +
> + /*
> + * The exception and exception return branches could be
> + * captured, irrespective of the perf event's privilege.
> + * If the perf event does not have enough privilege for
> + * a given exception level, then addresses which falls
> + * under that exception level will be reported as zero
> + * for the captured branch record, creating source only
> + * or target only records.
> + */
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
> + brbcr |= BRBCR_EL1_EXCEPTION;
> + brbcr |= BRBCR_EL1_ERTN;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
> + brbcr |= BRBCR_EL1_EXCEPTION;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
> + brbcr |= BRBCR_EL1_ERTN;
> +
> + return brbcr & BRBCR_EL1_DEFAULT_CONFIG;
> +}
> +
> +void armv8pmu_branch_enable(struct perf_event *event)
> +{
> + u64 branch_type = event->attr.branch_sample_type;
> + u64 brbfcr, brbcr;
> +
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
> + brbfcr |= branch_type_to_brbfcr(branch_type);
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + isb();
> +
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> + brbcr &= ~BRBCR_EL1_DEFAULT_CONFIG;
> + brbcr |= branch_type_to_brbcr(branch_type);
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + isb();
> + armv8pmu_branch_reset();
> +}
> +
> +void armv8pmu_branch_disable(struct perf_event *event)
> +{
> + u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +
> + brbcr &= ~(BRBCR_EL1_E0BRE | BRBCR_EL1_E1BRE);
> + brbfcr |= BRBFCR_EL1_PAUSED;
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + isb();
> +}
> +
> +static void brbe_set_perf_entry_type(struct perf_branch_entry *entry, u64 brbinf)
> +{
> + int brbe_type = brbe_get_type(brbinf);
> +
> + switch (brbe_type) {
> + case BRBINFx_EL1_TYPE_UNCOND_DIRECT:
> + entry->type = PERF_BR_UNCOND;
> + break;
> + case BRBINFx_EL1_TYPE_INDIRECT:
> + entry->type = PERF_BR_IND;
> + break;
> + case BRBINFx_EL1_TYPE_DIRECT_LINK:
> + entry->type = PERF_BR_CALL;
> + break;
> + case BRBINFx_EL1_TYPE_INDIRECT_LINK:
> + entry->type = PERF_BR_IND_CALL;
> + break;
> + case BRBINFx_EL1_TYPE_RET:
> + entry->type = PERF_BR_RET;
> + break;
> + case BRBINFx_EL1_TYPE_COND_DIRECT:
> + entry->type = PERF_BR_COND;
> + break;
> + case BRBINFx_EL1_TYPE_CALL:
> + entry->type = PERF_BR_CALL;
> + break;
> + case BRBINFx_EL1_TYPE_TRAP:
> + entry->type = PERF_BR_SYSCALL;
> + break;
> + case BRBINFx_EL1_TYPE_ERET:
> + entry->type = PERF_BR_ERET;
> + break;
> + case BRBINFx_EL1_TYPE_IRQ:
> + entry->type = PERF_BR_IRQ;
> + break;
> + case BRBINFx_EL1_TYPE_DEBUG_HALT:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_ARM64_DEBUG_HALT;
> + break;
> + case BRBINFx_EL1_TYPE_SERROR:
> + entry->type = PERF_BR_SERROR;
> + break;
> + case BRBINFx_EL1_TYPE_INSN_DEBUG:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_ARM64_DEBUG_INST;
> + break;
> + case BRBINFx_EL1_TYPE_DATA_DEBUG:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_ARM64_DEBUG_DATA;
> + break;
> + case BRBINFx_EL1_TYPE_ALIGN_FAULT:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_NEW_FAULT_ALGN;
> + break;
> + case BRBINFx_EL1_TYPE_INSN_FAULT:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_NEW_FAULT_INST;
> + break;
> + case BRBINFx_EL1_TYPE_DATA_FAULT:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_NEW_FAULT_DATA;
> + break;
> + case BRBINFx_EL1_TYPE_FIQ:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_ARM64_FIQ;
> + break;
> + case BRBINFx_EL1_TYPE_DEBUG_EXIT:
> + entry->type = PERF_BR_EXTEND_ABI;
> + entry->new_type = PERF_BR_ARM64_DEBUG_EXIT;
> + break;
> + default:
> + pr_warn_once("%d - unknown branch type captured\n", brbe_type);
> + entry->type = PERF_BR_UNKNOWN;
> + break;
> + }
> +}
> +
> +static int brbe_get_perf_priv(u64 brbinf)
> +{
> + int brbe_el = brbe_get_el(brbinf);
> +
> + switch (brbe_el) {
> + case BRBINFx_EL1_EL_EL0:
> + return PERF_BR_PRIV_USER;
> + case BRBINFx_EL1_EL_EL1:
> + return PERF_BR_PRIV_KERNEL;
> + case BRBINFx_EL1_EL_EL2:
> + if (is_kernel_in_hyp_mode())
> + return PERF_BR_PRIV_KERNEL;
> + return PERF_BR_PRIV_HV;
> + default:
> + pr_warn_once("%d - unknown branch privilege captured\n", brbe_el);
> + return PERF_BR_PRIV_UNKNOWN;
> + }
> +}
> +
> +static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
> + u64 brbinf)
> +{
> + if (branch_sample_type(event))
> + brbe_set_perf_entry_type(entry, brbinf);
> +
> + if (!branch_sample_no_cycles(event))
> + entry->cycles = brbe_get_cycles(brbinf);
> +
> + if (!branch_sample_no_flags(event)) {
> + /*
> + * BRBINFx_EL1.LASTFAILED indicates that a TME transaction failed (or
> + * was cancelled) prior to this record, and some number of records
> + * prior to this one, may have been generated during an attempt to
> + * execute the transaction.
> + *
> + * We will remove such entries later in process_branch_aborts().
> + */
> + entry->abort = brbe_get_lastfailed(brbinf);
> +
> + /*
> + * All these information (i.e transaction state and mispredicts)
> + * are available for source only and complete branch records.
> + */
> + if (brbe_record_is_complete(brbinf) ||
> + brbe_record_is_source_only(brbinf)) {
> + entry->mispred = brbe_get_mispredict(brbinf);
> + entry->predicted = !entry->mispred;
> + entry->in_tx = brbe_get_in_tx(brbinf);
> + }
> + }
> +
> + if (branch_sample_priv(event)) {
> + /*
> + * All these information (i.e branch privilege level) are
> + * available for target only and complete branch records.
> + */
> + if (brbe_record_is_complete(brbinf) ||
> + brbe_record_is_target_only(brbinf))
> + entry->priv = brbe_get_perf_priv(brbinf);
> + }
> +}
> +
> +/*
> + * A branch record with BRBINFx_EL1.LASTFAILED set, implies that all
> + * preceding consecutive branch records, that were in a transaction
> + * (i.e their BRBINFx_EL1.TX set) have been aborted.
> + *
> + * Similarly BRBFCR_EL1.LASTFAILED set, indicate that all preceding
> + * consecutive branch records up to the last record, which were in a
> + * transaction (i.e their BRBINFx_EL1.TX set) have been aborted.
> + *
> + * --------------------------------- -------------------
> + * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> + * --------------------------------- -------------------
> + * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> + * --------------------------------- -------------------
> + * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> + * --------------------------------- -------------------
> + * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
> + * --------------------------------- -------------------
> + * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> + * --------------------------------- -------------------
> + * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + *
> + * BRBFCR_EL1.LASTFAILED == 1
> + *
> + * BRBFCR_EL1.LASTFAILED fails all those consecutive, in transaction
> + * branches records near the end of the BRBE buffer.
> + *
> + * Architecture does not guarantee a non transaction (TX = 0) branch
> + * record between two different transactions. So it is possible that
> + * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
> + * mark more than required transactions as aborted.
> + */
> +static void process_branch_aborts(struct pmu_hw_events *cpuc)
> +{
> + u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + bool lastfailed = !!(brbfcr & BRBFCR_EL1_LASTFAILED);
> + int idx = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr) - 1;
> + struct perf_branch_entry *entry;
> +
> + do {
> + entry = &cpuc->branches->branch_entries[idx];
> + if (entry->in_tx) {
> + entry->abort = lastfailed;
> + } else {
> + lastfailed = entry->abort;
> + entry->abort = false;
> + }
> + } while (idx--, idx >= 0);
> +}
> +
> +void armv8pmu_branch_reset(void)
> +{
> + asm volatile(BRB_IALL);
> + isb();
> +}
> +
> +static bool capture_branch_entry(struct pmu_hw_events *cpuc,
> + struct perf_event *event, int idx)
> +{
> + struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
> + u64 brbinf = get_brbinf_reg(idx);
> +
> + /*
> + * There are no valid entries anymore on the buffer.
> + * Abort the branch record processing to save some
> + * cycles and also reduce the capture/process load
> + * for the user space as well.
> + */
> + if (brbe_invalid(brbinf))
> + return false;
> +
> + perf_clear_branch_entry_bitfields(entry);
> + if (brbe_record_is_complete(brbinf)) {
> + entry->from = get_brbsrc_reg(idx);
> + entry->to = get_brbtgt_reg(idx);
> + } else if (brbe_record_is_source_only(brbinf)) {
> + entry->from = get_brbsrc_reg(idx);
> + entry->to = 0;
> + } else if (brbe_record_is_target_only(brbinf)) {
> + entry->from = 0;
> + entry->to = get_brbtgt_reg(idx);
> + }
> + capture_brbe_flags(entry, event, brbinf);
> + return true;
> +}
> +
> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> +{
> + int nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
> + u64 brbfcr, brbcr;
> + int idx = 0;
> +
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> +
> + /* Ensure pause on PMU interrupt is enabled */
> + WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> +
> + /* Pause the buffer */
> + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> + isb();
> +
> + /* Loop through bank 0 */
> + select_brbe_bank(BRBE_BANK_IDX_0);
> + while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
> + if (!capture_branch_entry(cpuc, event, idx))
> + goto skip_bank_1;
> + idx++;
> + }
> +
> + /* Loop through bank 1 */
> + select_brbe_bank(BRBE_BANK_IDX_1);
> + while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
> + if (!capture_branch_entry(cpuc, event, idx))
> + break;
> + idx++;
> + }
> +
> +skip_bank_1:
> + cpuc->branches->branch_stack.nr = idx;
> + cpuc->branches->branch_stack.hw_idx = -1ULL;
> + process_branch_aborts(cpuc);
> +
> + /* Unpause the buffer */
> + write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> + isb();
> + armv8pmu_branch_reset();
> +}
> diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
> new file mode 100644
> index 000000000000..a47480eec070
> --- /dev/null
> +++ b/drivers/perf/arm_brbe.h
> @@ -0,0 +1,257 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Branch Record Buffer Extension Helpers.
> + *
> + * Copyright (C) 2022 ARM Limited
> + *
> + * Author: Anshuman Khandual <[email protected]>
> + */
> +#define pr_fmt(fmt) "brbe: " fmt
> +
> +#include <linux/perf/arm_pmu.h>
> +
> +#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT | \
> + BRBFCR_EL1_INDIRECT | \
> + BRBFCR_EL1_RTN | \
> + BRBFCR_EL1_INDCALL | \
> + BRBFCR_EL1_DIRCALL | \
> + BRBFCR_EL1_CONDDIR)
> +
> +#define BRBFCR_EL1_DEFAULT_CONFIG (BRBFCR_EL1_BANK_MASK | \
> + BRBFCR_EL1_PAUSED | \
> + BRBFCR_EL1_EnI | \
> + BRBFCR_EL1_BRANCH_FILTERS)
> +
> +/*
> + * BRBTS_EL1 is currently not used for branch stack implementation
> + * purpose but BRBCR_EL1.TS needs to have a valid value from all
> + * available options. BRBCR_EL1_TS_VIRTUAL is selected for this.
> + */
> +#define BRBCR_EL1_DEFAULT_TS FIELD_PREP(BRBCR_EL1_TS_MASK, BRBCR_EL1_TS_VIRTUAL)
> +
> +#define BRBCR_EL1_DEFAULT_CONFIG (BRBCR_EL1_EXCEPTION | \
> + BRBCR_EL1_ERTN | \
> + BRBCR_EL1_CC | \
> + BRBCR_EL1_MPRED | \
> + BRBCR_EL1_E1BRE | \
> + BRBCR_EL1_E0BRE | \
> + BRBCR_EL1_FZP | \
> + BRBCR_EL1_DEFAULT_TS)
> +/*
> + * BRBE Instructions
> + *
> + * BRB_IALL : Invalidate the entire buffer
> + * BRB_INJ : Inject latest branch record derived from [BRBSRCINJ, BRBTGTINJ, BRBINFINJ]
> + */
> +#define BRB_IALL __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 4) | (0x1f))
> +#define BRB_INJ __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 5) | (0x1f))
> +
> +/*
> + * BRBE Buffer Organization
> + *
> + * BRBE buffer is arranged as multiple banks of 32 branch record
> + * entries each. An individual branch record in a given bank could
> + * be accessed, after selecting the bank in BRBFCR_EL1.BANK and
> + * accessing the registers i.e [BRBSRC, BRBTGT, BRBINF] set with
> + * indices [0..31].
> + *
> + * Bank 0
> + *
> + * --------------------------------- ------
> + * | 00 | BRBSRC | BRBTGT | BRBINF | | 00 |
> + * --------------------------------- ------
> + * | 01 | BRBSRC | BRBTGT | BRBINF | | 01 |
> + * --------------------------------- ------
> + * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
> + * --------------------------------- ------
> + * | 31 | BRBSRC | BRBTGT | BRBINF | | 31 |
> + * --------------------------------- ------
> + *
> + * Bank 1
> + *
> + * --------------------------------- ------
> + * | 32 | BRBSRC | BRBTGT | BRBINF | | 00 |
> + * --------------------------------- ------
> + * | 33 | BRBSRC | BRBTGT | BRBINF | | 01 |
> + * --------------------------------- ------
> + * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
> + * --------------------------------- ------
> + * | 63 | BRBSRC | BRBTGT | BRBINF | | 31 |
> + * --------------------------------- ------
> + */
> +#define BRBE_BANK_MAX_ENTRIES 32
> +
> +#define BRBE_BANK0_IDX_MIN 0
> +#define BRBE_BANK0_IDX_MAX 31
> +#define BRBE_BANK1_IDX_MIN 32
> +#define BRBE_BANK1_IDX_MAX 63
> +
> +struct brbe_hw_attr {
> + int brbe_version;
> + int brbe_cc;
> + int brbe_nr;
> + int brbe_format;
> +};
> +
> +enum brbe_bank_idx {
> + BRBE_BANK_IDX_INVALID = -1,
> + BRBE_BANK_IDX_0,
> + BRBE_BANK_IDX_1,
> + BRBE_BANK_IDX_MAX
> +};
> +
> +#define RETURN_READ_BRBSRCN(n) \
> + read_sysreg_s(SYS_BRBSRC##n##_EL1)
> +
> +#define RETURN_READ_BRBTGTN(n) \
> + read_sysreg_s(SYS_BRBTGT##n##_EL1)
> +
> +#define RETURN_READ_BRBINFN(n) \
> + read_sysreg_s(SYS_BRBINF##n##_EL1)
> +
> +#define BRBE_REGN_CASE(n, case_macro) \
> + case n: return case_macro(n); break
> +
> +#define BRBE_REGN_SWITCH(x, case_macro) \
> + do { \
> + switch (x) { \
> + BRBE_REGN_CASE(0, case_macro); \
> + BRBE_REGN_CASE(1, case_macro); \
> + BRBE_REGN_CASE(2, case_macro); \
> + BRBE_REGN_CASE(3, case_macro); \
> + BRBE_REGN_CASE(4, case_macro); \
> + BRBE_REGN_CASE(5, case_macro); \
> + BRBE_REGN_CASE(6, case_macro); \
> + BRBE_REGN_CASE(7, case_macro); \
> + BRBE_REGN_CASE(8, case_macro); \
> + BRBE_REGN_CASE(9, case_macro); \
> + BRBE_REGN_CASE(10, case_macro); \
> + BRBE_REGN_CASE(11, case_macro); \
> + BRBE_REGN_CASE(12, case_macro); \
> + BRBE_REGN_CASE(13, case_macro); \
> + BRBE_REGN_CASE(14, case_macro); \
> + BRBE_REGN_CASE(15, case_macro); \
> + BRBE_REGN_CASE(16, case_macro); \
> + BRBE_REGN_CASE(17, case_macro); \
> + BRBE_REGN_CASE(18, case_macro); \
> + BRBE_REGN_CASE(19, case_macro); \
> + BRBE_REGN_CASE(20, case_macro); \
> + BRBE_REGN_CASE(21, case_macro); \
> + BRBE_REGN_CASE(22, case_macro); \
> + BRBE_REGN_CASE(23, case_macro); \
> + BRBE_REGN_CASE(24, case_macro); \
> + BRBE_REGN_CASE(25, case_macro); \
> + BRBE_REGN_CASE(26, case_macro); \
> + BRBE_REGN_CASE(27, case_macro); \
> + BRBE_REGN_CASE(28, case_macro); \
> + BRBE_REGN_CASE(29, case_macro); \
> + BRBE_REGN_CASE(30, case_macro); \
> + BRBE_REGN_CASE(31, case_macro); \
> + default: \
> + pr_warn("unknown register index\n"); \
> + return -1; \
> + } \
> + } while (0)
> +
> +static inline int buffer_to_brbe_idx(int buffer_idx)
> +{
> + return buffer_idx % BRBE_BANK_MAX_ENTRIES;
> +}
> +
> +static inline u64 get_brbsrc_reg(int buffer_idx)
> +{
> + int brbe_idx = buffer_to_brbe_idx(buffer_idx);
> +
> + BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBSRCN);
> +}
> +
> +static inline u64 get_brbtgt_reg(int buffer_idx)
> +{
> + int brbe_idx = buffer_to_brbe_idx(buffer_idx);
> +
> + BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBTGTN);
> +}
> +
> +static inline u64 get_brbinf_reg(int buffer_idx)
> +{
> + int brbe_idx = buffer_to_brbe_idx(buffer_idx);
> +
> + BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBINFN);
> +}
> +
> +static inline u64 brbe_record_valid(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_VALID_MASK, brbinf);
> +}
> +
> +static inline bool brbe_invalid(u64 brbinf)
> +{
> + return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_NONE;
> +}
> +
> +static inline bool brbe_record_is_complete(u64 brbinf)
> +{
> + return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_FULL;
> +}
> +
> +static inline bool brbe_record_is_source_only(u64 brbinf)
> +{
> + return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_SOURCE;
> +}
> +
> +static inline bool brbe_record_is_target_only(u64 brbinf)
> +{
> + return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_TARGET;
> +}
> +
> +static inline int brbe_get_in_tx(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_T_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_mispredict(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_MPRED_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_lastfailed(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_LASTFAILED_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_cycles(u64 brbinf)
> +{
> + /*
> + * Captured cycle count is unknown and hence
> + * should not be passed on to the user space.
> + */
> + if (brbinf & BRBINFx_EL1_CCU)
> + return 0;
> +
> + return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_type(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_TYPE_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_el(u64 brbinf)
> +{
> + return FIELD_GET(BRBINFx_EL1_EL_MASK, brbinf);
> +}
> +
> +static inline int brbe_get_numrec(u64 brbidr)
> +{
> + return FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
> +}
> +
> +static inline int brbe_get_format(u64 brbidr)
> +{
> + return FIELD_GET(BRBIDR0_EL1_FORMAT_MASK, brbidr);
> +}
> +
> +static inline int brbe_get_cc_bits(u64 brbidr)
> +{
> + return FIELD_GET(BRBIDR0_EL1_CC_MASK, brbidr);
> +}
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 10bbe02f4079..7c9e9045c24e 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -813,6 +813,10 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> if (!armpmu_event_set_period(event))
> continue;
>
> + /*
> + * PMU IRQ should remain asserted until all branch records
> + * are captured and processed into struct perf_sample_data.
> + */
> if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
> armv8pmu_branch_read(cpuc, event);
> perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);


2023-07-25 12:24:52

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE

Hello Yang,

On 7/25/23 12:42, Yang Shen wrote:
>> +    if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>> +        brbcr |= BRBCR_EL1_CC;
>
> Hi Anshuman,
>
> Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only
>
> valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that
>
> when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to
>
> BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the
>
> BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.


Agreed, this is a valid problem i.e BRBCR_EL1.CC and BRBCR_EL2.CC both needs to be set
for valid cycle count information regardless if the kernel runs in EL1 or EL2. A simple
hack in the current code setting BRBCR_EL12.C, which in turn sets BRBCR_EL1.CC when the
kernel runs in EL2 solves the problem.

>
> As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.

Right, will add the definition for BRBCR_EL12 in arch/arm64/tools/sysreg

>
> Or, do you have a more standard solution?

Right, there are some nuances involved here.

Kernel could boot

a. Directly into EL2 and stays in EL2 for good
b. Directly into EL2 but switches into EL1
c. Directly into EL1 without ever going into EL2

In all the above cases BRBCR_EL1.CC and BRBCR_EL2.CC needs to be set when cycle count
is requested in the perf event interface (event->attr.branch_sample_type) via clearing
PERF_SAMPLE_BRANCH_NO_CYCLES.


- For the case as in (c) where kernel boots into EL1 directly and hence cannot ever set
EL2 register, BRBCR_EL2.CC would be a booting requirement - updated in booting.rst

- For the cases as in (a) and (b) kernel boots via EL2, hence there is an opportunity
to set both BRBCR_EL1.CC (via accessed BRBCR_EL12.CC) and BRBCR_EL2.CC. Depending on
where the kernel lands up eventually, either BRBCR_EL1.CC or BRBCR_EL2.CC will be the
toggle switch to ON or OFF cycle counting in the driver via branch_type_to_brbcr().
So a new macro __init_el2_brbe is required which will get called from init_el2_state
setting both the register fields as explained earlier.

I am working on these changes, will post the code soon.

2023-07-25 13:47:33

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE

On 25/07/2023 12:42, Anshuman Khandual wrote:
> Hello Yang,
>
> On 7/25/23 12:42, Yang Shen wrote:
>>> +    if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>>> +        brbcr |= BRBCR_EL1_CC;
>>
>> Hi Anshuman,
>>
>> Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only
>>
>> valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that
>>
>> when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to
>>
>> BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the
>>
>> BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.
>
>
> Agreed, this is a valid problem i.e BRBCR_EL1.CC and BRBCR_EL2.CC both needs to be set
> for valid cycle count information regardless if the kernel runs in EL1 or EL2. A simple
> hack in the current code setting BRBCR_EL12.C, which in turn sets BRBCR_EL1.CC when the
> kernel runs in EL2 solves the problem.
>
>>
>> As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.
>
> Right, will add the definition for BRBCR_EL12 in arch/arm64/tools/sysreg
>
>>
>> Or, do you have a more standard solution?
>
> Right, there are some nuances involved here.
>
> Kernel could boot
>
> a. Directly into EL2 and stays in EL2 for good
> b. Directly into EL2 but switches into EL1
> c. Directly into EL1 without ever going into EL2
>
> In all the above cases BRBCR_EL1.CC and BRBCR_EL2.CC needs to be set when cycle count
> is requested in the perf event interface (event->attr.branch_sample_type) via clearing
> PERF_SAMPLE_BRANCH_NO_CYCLES.
>
>
> - For the case as in (c) where kernel boots into EL1 directly and hence cannot ever set
> EL2 register, BRBCR_EL2.CC would be a booting requirement - updated in booting.rst
>
> - For the cases as in (a) and (b) kernel boots via EL2, hence there is an opportunity
> to set both BRBCR_EL1.CC (via accessed BRBCR_EL12.CC) and BRBCR_EL2.CC. Depending on

You don't need to use BRBCR_EL12, if you do it early enough, before
HCR_EL2.E2H == 1 is applied.

Suzuki


2023-07-26 06:17:49

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE



On 7/25/23 18:59, Suzuki K Poulose wrote:
> On 25/07/2023 12:42, Anshuman Khandual wrote:
>> Hello Yang,
>>
>> On 7/25/23 12:42, Yang Shen wrote:
>>>> +    if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>>>> +        brbcr |= BRBCR_EL1_CC;
>>>
>>> Hi Anshuman,
>>>
>>> Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only
>>>
>>> valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that
>>>
>>> when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to
>>>
>>> BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the
>>>
>>> BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.
>>
>>
>> Agreed, this is a valid problem i.e BRBCR_EL1.CC and BRBCR_EL2.CC both needs to be set
>> for valid cycle count information regardless if the kernel runs in EL1 or EL2. A simple
>> hack in the current code setting BRBCR_EL12.C, which in turn sets BRBCR_EL1.CC when the
>> kernel runs in EL2 solves the problem.
>>
>>>
>>> As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.
>>
>> Right, will add the definition for BRBCR_EL12 in arch/arm64/tools/sysreg
>>
>>>
>>> Or, do you have a more standard solution?
>>
>> Right, there are some nuances involved here.
>>
>> Kernel could boot
>>     
>> a. Directly into EL2 and stays in EL2 for good
>> b. Directly into EL2 but switches into EL1
>> c. Directly into EL1 without ever going into EL2
>>
>> In all the above cases BRBCR_EL1.CC and BRBCR_EL2.CC needs to be set when cycle count
>> is requested in the perf event interface (event->attr.branch_sample_type) via clearing
>> PERF_SAMPLE_BRANCH_NO_CYCLES.
>>
>>
>> - For the case as in (c) where kernel boots into EL1 directly and hence cannot ever set
>>    EL2 register, BRBCR_EL2.CC would be a booting requirement - updated in booting.rst
>>
>> - For the cases as in (a) and (b) kernel boots via EL2, hence there is an opportunity
>>    to set both BRBCR_EL1.CC (via accessed BRBCR_EL12.CC) and BRBCR_EL2.CC. Depending on
>
> You don't need to use BRBCR_EL12, if you do it early enough, before
> HCR_EL2.E2H == 1 is applied.

Agreed. Please find the code change which solves this reported problem, please
have a look and let me know if anything needs changing.

------------------------------------------------------------------------------
Documentation/arch/arm64/booting.rst | 6 ++++
arch/arm64/include/asm/el2_setup.h | 45 ++++++++++++++++++++++++++++
arch/arm64/tools/sysreg | 38 +++++++++++++++++++++++
3 files changed, 89 insertions(+)

diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
index b57776a68f15..2276df285e83 100644
--- a/Documentation/arch/arm64/booting.rst
+++ b/Documentation/arch/arm64/booting.rst
@@ -349,6 +349,12 @@ Before jumping into the kernel, the following conditions must be met:

- HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.

+ For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
+
For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):

- If EL3 is present:
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index 8e5ffb58f83e..75b04eff2dc7 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -150,6 +150,50 @@
msr cptr_el2, x0 // Disable copro. traps to EL2
.endm

+/*
+ * Enable BRBE cycle count
+ *
+ * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
+ * for the cycle counts to be available in BRBINF<N>_EL1.CC during
+ * branch record processing after a PMU interrupt. This enables CC
+ * field on both these registers while still executing inside EL2.
+ *
+ * BRBE driver would still be able to toggle branch records cycle
+ * count support via BRBCR_EL1.CC field regardless of whether the
+ * kernel end up executing in EL1 or EL2.
+ */
+.macro __init_el2_brbe
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_cc_\@
+
+ mrs_s x0, SYS_BRBCR_EL2
+ orr x0, x0, BRBCR_EL2_CC
+ msr_s SYS_BRBCR_EL2, x0
+
+ /*
+ * Accessing BRBCR_EL1 register here does not require
+ * BRBCR_EL12 addressing mode as HCR_EL2.E2H is still
+ * clear. Regardless, check for HCR_E2H and be on the
+ * safer side.
+ */
+ mrs x1, hcr_el2
+ and x1, x1, #HCR_E2H
+ cbz x1, .Lset_brbe_el1_direct_\@
+
+ mrs_s x0, SYS_BRBCR_EL12
+ orr x0, x0, BRBCR_EL12_CC
+ msr_s SYS_BRBCR_EL12, x0
+ b .Lskip_brbe_cc_\@
+
+.Lset_brbe_el1_direct_\@:
+ mrs_s x0, SYS_BRBCR_EL1
+ orr x0, x0, BRBCR_EL1_CC
+ msr_s SYS_BRBCR_EL1, x0
+
+.Lskip_brbe_cc_\@:
+.endm
+
/* Disable any fine grained traps */
.macro __init_el2_fgt
mrs x1, id_aa64mmfr0_el1
@@ -224,6 +268,7 @@
__init_el2_nvhe_idregs
__init_el2_cptr
__init_el2_fgt
+ __init_el2_brbe
.endm

#ifndef __KVM_NVHE_HYPERVISOR__
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 9892af96262f..7d1d6b3976b4 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1048,6 +1048,44 @@ Enum 1:0 VALID
EndEnum
EndSysregFields

+Sysreg BRBCR_EL12 2 5 9 0 0
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:9
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GUEST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 E1BRE
+Field 0 E0BRE
+EndSysreg
+
+Sysreg BRBCR_EL2 2 4 9 0 0
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:9
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GUEST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 E1BRE
+Field 0 E0BRE
+EndSysreg
+
Sysreg BRBCR_EL1 2 1 9 0 0
Res0 63:24
Field 23 EXCEPTION
--
2.25.1

2023-07-26 07:02:44

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE


On 7/11/23 13:54, Anshuman Khandual wrote:
> +static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
> + u64 brbinf)
> +{
> + if (branch_sample_type(event))
> + brbe_set_perf_entry_type(entry, brbinf);
> +
> + if (!branch_sample_no_cycles(event))
> + entry->cycles = brbe_get_cycles(brbinf);

Just revisiting the branch records cycle capture process in BRBE.

'cycles' field is a 16 bit field in struct perf_branch_entry.

struct perf_branch_entry {
__u64 from;
__u64 to;
__u64 mispred:1, /* target mispredicted */
predicted:1,/* target predicted */
in_tx:1, /* in transaction */
abort:1, /* transaction abort */
cycles:16, /* cycle count to last branch */
type:4, /* branch type */
spec:2, /* branch speculation info */
new_type:4, /* additional branch type */
priv:3, /* privilege level */
reserved:31;
};

static inline int brbe_get_cycles(u64 brbinf)
{
/*
* Captured cycle count is unknown and hence
* should not be passed on to the user space.
*/
if (brbinf & BRBINFx_EL1_CCU)
return 0;

return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
}

capture_brbe_flags()
{

.....
if (!branch_sample_no_cycles(event))
entry->cycles = brbe_get_cycles(brbinf);
.....

}

BRBINF<N>_EL1.CC[45:32] is a 14 bits field which gets assigned into
perf_branch_entry->cycles field which is 16 bits wide. Although the
cycle count representation here is mantissa and exponent based one.

CC bits[7:0] indicate the mantissa M
CC bits[13:8] indicate the exponent E

The actual cycle count is based on the following formula as the per
the spec [1], which needs to be derived from BRBINF<N>_EL1.CC field.

-------------------------------------------------------------------
The cycle count is expressed using the following function:

if IsZero(E) then UInt(M) else UInt('1':M:Zeros(UInt(E)-1))

If required, the cycle count is rounded to a multiple of 2(E-1) towards zero before being encoded.

A value of all ones in both the mantissa and exponent indicates the cycle count value exceeded the size of the cycle counter.
-------------------------------------------------------------------

Given that there is only 16 bits in perf_branch_entry structure for
cycles count, I assume that it needs to be derived in the user perf
tool (per the above calculation) before being interpreted ?

[1] https://developer.arm.com/documentation/ddi0601/2023-06/AArch64-Registers/BRBINF-n--EL1--Branch-Record-Buffer-Information-Register--n-?lang=en

- Anshuman



2023-07-28 17:16:08

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields

On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
> This adds BRBE related register definitions and various other related field
> macros there in. These will be used subsequently in a BRBE driver which is
> being added later on.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: James Clark <[email protected]>
> Reviewed-by: Mark Brown <[email protected]>
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
> 2 files changed, 261 insertions(+)
>
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index b481935e9314..f95e30c13c8b 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -163,6 +163,109 @@
> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>
> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))

It's that time on a Friday but... aren't these macros busted? I think you
need brackets before adding the offset, otherwise wouldn't, for example,
target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
then start accessing source register 0?

I'm surprised that the compiler doesn't warn about this, but even more
surprised that you managed to test this.

Please tell me I'm wrong!

Will

2023-07-28 17:43:48

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields



On 28/07/2023 17:20, Will Deacon wrote:
> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
>> This adds BRBE related register definitions and various other related field
>> macros there in. These will be used subsequently in a BRBE driver which is
>> being added later on.
>>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Mark Rutland <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Tested-by: James Clark <[email protected]>
>> Reviewed-by: Mark Brown <[email protected]>
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ---
>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
>> 2 files changed, 261 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index b481935e9314..f95e30c13c8b 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -163,6 +163,109 @@
>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>>
>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
>
> It's that time on a Friday but... aren't these macros busted? I think you
> need brackets before adding the offset, otherwise wouldn't, for example,
> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
> then start accessing source register 0?
>
> I'm surprised that the compiler doesn't warn about this, but even more
> surprised that you managed to test this.
>
> Please tell me I'm wrong!
>
> Will

No I think you are right, it is wrong. Luckily there is already an
extraneous bracket so you you can fix it by moving one a place down:

sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))

It's interesting because the test [1] is doing quite a bit and looking
at the branch info, and that src and targets match up to function names.
I also manually looked at the branch buffers and didn't see anything
obviously wrong like things that looked like branch infos in the source
or target fields. Will have to take another look to see if it would be
possible for the test to catch this.

James

[1]:
https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32

2023-07-31 03:30:20

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields



On 7/28/23 22:22, James Clark wrote:
>
>
> On 28/07/2023 17:20, Will Deacon wrote:
>> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
>>> This adds BRBE related register definitions and various other related field
>>> macros there in. These will be used subsequently in a BRBE driver which is
>>> being added later on.
>>>
>>> Cc: Catalin Marinas <[email protected]>
>>> Cc: Will Deacon <[email protected]>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Mark Rutland <[email protected]>
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> Tested-by: James Clark <[email protected]>
>>> Reviewed-by: Mark Brown <[email protected]>
>>> Signed-off-by: Anshuman Khandual <[email protected]>
>>> ---
>>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
>>> 2 files changed, 261 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>> index b481935e9314..f95e30c13c8b 100644
>>> --- a/arch/arm64/include/asm/sysreg.h
>>> +++ b/arch/arm64/include/asm/sysreg.h
>>> @@ -163,6 +163,109 @@
>>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
>>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>>>
>>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
>>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
>>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
>>
>> It's that time on a Friday but... aren't these macros busted? I think you
>> need brackets before adding the offset, otherwise wouldn't, for example,
>> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
>> then start accessing source register 0?
>>
>> I'm surprised that the compiler doesn't warn about this, but even more
>> surprised that you managed to test this.
>>
>> Please tell me I'm wrong!
>>
>> Will
>
> No I think you are right, it is wrong. Luckily there is already an
> extraneous bracket so you you can fix it by moving one a place down:
>
> sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
>
> It's interesting because the test [1] is doing quite a bit and looking
> at the branch info, and that src and targets match up to function names.
> I also manually looked at the branch buffers and didn't see anything
> obviously wrong like things that looked like branch infos in the source
> or target fields. Will have to take another look to see if it would be
> possible for the test to catch this.
>
> James
>
> [1]:
> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32

((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)

The additional brackets are useful in explicitly telling the compiler but
what it the compiler is just doing the right thing implicitly i.e computing
the shifting operation before doing the offset addition. During testing, all
those captured branch records looked alright. But that is no excuse, for not
doing the right thing to begin with i.e adding explicit brackets. I will fix
these in next version.

2023-07-31 08:31:56

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields



On 31/07/2023 03:33, Anshuman Khandual wrote:
>
>
> On 7/28/23 22:22, James Clark wrote:
>>
>>
>> On 28/07/2023 17:20, Will Deacon wrote:
>>> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
>>>> This adds BRBE related register definitions and various other related field
>>>> macros there in. These will be used subsequently in a BRBE driver which is
>>>> being added later on.
>>>>
>>>> Cc: Catalin Marinas <[email protected]>
>>>> Cc: Will Deacon <[email protected]>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Cc: Mark Rutland <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Tested-by: James Clark <[email protected]>
>>>> Reviewed-by: Mark Brown <[email protected]>
>>>> Signed-off-by: Anshuman Khandual <[email protected]>
>>>> ---
>>>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
>>>> 2 files changed, 261 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>> index b481935e9314..f95e30c13c8b 100644
>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>> @@ -163,6 +163,109 @@
>>>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
>>>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>>>>
>>>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
>>>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
>>>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
>>>
>>> It's that time on a Friday but... aren't these macros busted? I think you
>>> need brackets before adding the offset, otherwise wouldn't, for example,
>>> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
>>> then start accessing source register 0?
>>>
>>> I'm surprised that the compiler doesn't warn about this, but even more
>>> surprised that you managed to test this.
>>>
>>> Please tell me I'm wrong!
>>>
>>> Will
>>
>> No I think you are right, it is wrong. Luckily there is already an
>> extraneous bracket so you you can fix it by moving one a place down:
>>
>> sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
>>
>> It's interesting because the test [1] is doing quite a bit and looking
>> at the branch info, and that src and targets match up to function names.
>> I also manually looked at the branch buffers and didn't see anything
>> obviously wrong like things that looked like branch infos in the source
>> or target fields. Will have to take another look to see if it would be
>> possible for the test to catch this.
>>
>> James
>>
>> [1]:
>> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
>
> ((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)
>
> The additional brackets are useful in explicitly telling the compiler but
> what it the compiler is just doing the right thing implicitly i.e computing
> the shifting operation before doing the offset addition. During testing, all
> those captured branch records looked alright. But that is no excuse, for not
> doing the right thing to begin with i.e adding explicit brackets. I will fix
> these in next version.

Are you sure? If you see the return value here, it's 0 until register
16, then it becomes 1:

https://godbolt.org/z/c7zhbno3n

If you add the bracket it does actually change the return value.


2023-07-31 09:50:35

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields

On Mon, Jul 31, 2023 at 08:03:21AM +0530, Anshuman Khandual wrote:
>
>
> On 7/28/23 22:22, James Clark wrote:
> >
> >
> > On 28/07/2023 17:20, Will Deacon wrote:
> >> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
> >>> This adds BRBE related register definitions and various other related field
> >>> macros there in. These will be used subsequently in a BRBE driver which is
> >>> being added later on.
> >>>
> >>> Cc: Catalin Marinas <[email protected]>
> >>> Cc: Will Deacon <[email protected]>
> >>> Cc: Marc Zyngier <[email protected]>
> >>> Cc: Mark Rutland <[email protected]>
> >>> Cc: [email protected]
> >>> Cc: [email protected]
> >>> Tested-by: James Clark <[email protected]>
> >>> Reviewed-by: Mark Brown <[email protected]>
> >>> Signed-off-by: Anshuman Khandual <[email protected]>
> >>> ---
> >>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
> >>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
> >>> 2 files changed, 261 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> >>> index b481935e9314..f95e30c13c8b 100644
> >>> --- a/arch/arm64/include/asm/sysreg.h
> >>> +++ b/arch/arm64/include/asm/sysreg.h
> >>> @@ -163,6 +163,109 @@
> >>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
> >>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
> >>>
> >>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
> >>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
> >>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
> >>
> >> It's that time on a Friday but... aren't these macros busted? I think you
> >> need brackets before adding the offset, otherwise wouldn't, for example,
> >> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
> >> then start accessing source register 0?
> >>
> >> I'm surprised that the compiler doesn't warn about this, but even more
> >> surprised that you managed to test this.
> >>
> >> Please tell me I'm wrong!
> >>
> >> Will
> >
> > No I think you are right, it is wrong. Luckily there is already an
> > extraneous bracket so you you can fix it by moving one a place down:
> >
> > sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
> >
> > It's interesting because the test [1] is doing quite a bit and looking
> > at the branch info, and that src and targets match up to function names.
> > I also manually looked at the branch buffers and didn't see anything
> > obviously wrong like things that looked like branch infos in the source
> > or target fields. Will have to take another look to see if it would be
> > possible for the test to catch this.
> >
> > James
> >
> > [1]:
> > https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
>
> ((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)
>
> The additional brackets are useful in explicitly telling the compiler but
> what it the compiler is just doing the right thing implicitly i.e computing
> the shifting operation before doing the offset addition.

No; that is not correct. In c, '+' has higher precedence than '>>'.

For 'a >> b + c' the compiler *must* treat that as 'a >> (b + c)', and not as
'(a >> b) + c'

That's trivial to test:

| [mark@gravadlaks:~]% cat shiftadd.c
| #include <stdio.h>
|
| unsigned long logshiftadd(unsigned long a,
| unsigned long b,
| unsigned long c)
| {
| printf("%ld >> %ld + %ld is %ld\n",
| a, b, c, a >> b + c);
| }
|
| int main(int argc, char *argv)
| {
| logshiftadd(0, 0, 0);
| logshiftadd(0, 0, 1);
| logshiftadd(0, 0, 2);
| printf("\n");
| logshiftadd(1024, 0, 0);
| logshiftadd(1024, 0, 1);
| logshiftadd(1024, 0, 2);
| printf("\n");
| logshiftadd(1024, 2, 0);
| logshiftadd(1024, 2, 1);
| logshiftadd(1024, 2, 2);
|
| return 0;
| }
| [mark@gravadlaks:~]% gcc shiftadd.c -o shiftadd
| [mark@gravadlaks:~]% ./shiftadd
| 0 >> 0 + 0 is 0
| 0 >> 0 + 1 is 0
| 0 >> 0 + 2 is 0
|
| 1024 >> 0 + 0 is 1024
| 1024 >> 0 + 1 is 512
| 1024 >> 0 + 2 is 256
|
| 1024 >> 2 + 0 is 256
| 1024 >> 2 + 1 is 128
| 1024 >> 2 + 2 is 64

> During testing, all > those captured branch records looked alright.

I think we clearly need better testing here.

Thanks,
Mark.

2023-07-31 12:36:07

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields



On 7/31/23 14:36, Mark Rutland wrote:
> On Mon, Jul 31, 2023 at 08:03:21AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 7/28/23 22:22, James Clark wrote:
>>>
>>>
>>> On 28/07/2023 17:20, Will Deacon wrote:
>>>> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
>>>>> This adds BRBE related register definitions and various other related field
>>>>> macros there in. These will be used subsequently in a BRBE driver which is
>>>>> being added later on.
>>>>>
>>>>> Cc: Catalin Marinas <[email protected]>
>>>>> Cc: Will Deacon <[email protected]>
>>>>> Cc: Marc Zyngier <[email protected]>
>>>>> Cc: Mark Rutland <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: [email protected]
>>>>> Tested-by: James Clark <[email protected]>
>>>>> Reviewed-by: Mark Brown <[email protected]>
>>>>> Signed-off-by: Anshuman Khandual <[email protected]>
>>>>> ---
>>>>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>>>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
>>>>> 2 files changed, 261 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>>> index b481935e9314..f95e30c13c8b 100644
>>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>>> @@ -163,6 +163,109 @@
>>>>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
>>>>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>>>>>
>>>>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
>>>>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
>>>>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
>>>>
>>>> It's that time on a Friday but... aren't these macros busted? I think you
>>>> need brackets before adding the offset, otherwise wouldn't, for example,
>>>> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
>>>> then start accessing source register 0?
>>>>
>>>> I'm surprised that the compiler doesn't warn about this, but even more
>>>> surprised that you managed to test this.
>>>>
>>>> Please tell me I'm wrong!
>>>>
>>>> Will
>>>
>>> No I think you are right, it is wrong. Luckily there is already an
>>> extraneous bracket so you you can fix it by moving one a place down:
>>>
>>> sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
>>>
>>> It's interesting because the test [1] is doing quite a bit and looking
>>> at the branch info, and that src and targets match up to function names.
>>> I also manually looked at the branch buffers and didn't see anything
>>> obviously wrong like things that looked like branch infos in the source
>>> or target fields. Will have to take another look to see if it would be
>>> possible for the test to catch this.
>>>
>>> James
>>>
>>> [1]:
>>> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
>>
>> ((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)
>>
>> The additional brackets are useful in explicitly telling the compiler but
>> what it the compiler is just doing the right thing implicitly i.e computing
>> the shifting operation before doing the offset addition.
>
> No; that is not correct. In c, '+' has higher precedence than '>>'.
>
> For 'a >> b + c' the compiler *must* treat that as 'a >> (b + c)', and not as
> '(a >> b) + c'
>
> That's trivial to test:
>
> | [mark@gravadlaks:~]% cat shiftadd.c
> | #include <stdio.h>
> |
> | unsigned long logshiftadd(unsigned long a,
> | unsigned long b,
> | unsigned long c)
> | {
> | printf("%ld >> %ld + %ld is %ld\n",
> | a, b, c, a >> b + c);
> | }
> |
> | int main(int argc, char *argv)
> | {
> | logshiftadd(0, 0, 0);
> | logshiftadd(0, 0, 1);
> | logshiftadd(0, 0, 2);
> | printf("\n");
> | logshiftadd(1024, 0, 0);
> | logshiftadd(1024, 0, 1);
> | logshiftadd(1024, 0, 2);
> | printf("\n");
> | logshiftadd(1024, 2, 0);
> | logshiftadd(1024, 2, 1);
> | logshiftadd(1024, 2, 2);
> |
> | return 0;
> | }
> | [mark@gravadlaks:~]% gcc shiftadd.c -o shiftadd
> | [mark@gravadlaks:~]% ./shiftadd
> | 0 >> 0 + 0 is 0
> | 0 >> 0 + 1 is 0
> | 0 >> 0 + 2 is 0
> |
> | 1024 >> 0 + 0 is 1024
> | 1024 >> 0 + 1 is 512
> | 1024 >> 0 + 2 is 256
> |
> | 1024 >> 2 + 0 is 256
> | 1024 >> 2 + 1 is 128
> | 1024 >> 2 + 2 is 64

Understood.

>
>> During testing, all > those captured branch records looked alright.
>
> I think we clearly need better testing here
I am still thinking - how could this might have been missed. Could it be
possible that these wrongly computed higher indices were getting folded
back/rolled over into the same legal range indices for a given bank. If
they did, branch record processing would have still captured almost all
of them, may be in an incorrect order. Branch order does matter for the
stitched mechanism.

2023-07-31 14:16:20

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 00/10] arm64/perf: Enable branch stack sampling

Hi Anshuman,

On Tue, Jul 11, 2023 at 01:54:45PM +0530, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> relevant register definitions could be accessed here.
>
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>
> This series applies on 6.5-rc1.
>
> Changes in V13:

I had a go at reviewing this series and, aside from the macro issue I've
already pointed out, I really struggled with the way that you've put the
series together:

- You incrementally introduce dead code, forcing the reviewer to keep
previous patches in their head awaiting for a caller to come along
later.

Example: Patch 4 literally just adds a new struct to the kernel.

- You change arch/arm/, where this driver shouldn't even be _compiled_
despite adding CONFIG_ARM64_BRBE.

Example: Patch 5 adds some BRBE stubs to
arch/arm/include/asm/arm_pmuv3.h

- You undo/rework code that was introduced earlier in the series

Example: armv8pmu_branch_read() is introduced as a useless stub in
patch 5, rewritten in patch 6 and then rewritten again in
patch 10. Why should I waste time reviewing three versions
of this function?

- You make unrelated cosmetic changes to the existing code inside
patches adding new features.

Example: Patch 5 randomly removes some comments from the existing
code.

- The commit messages are, at best, useless and err more on the side
of nonsensical.

Example: Look at patch 3:

| This updates 'struct arm_pmu' for branch stack sampling support being added
| later. This adds an element 'reg_trbidr' to capture BRBE attribute details.
| These updates here will help in tracking any branch stack sampling support.
|
| This also enables perf branch stack sampling event on all 'struct arm pmu',
| supporting the feature but after removing the current gate that blocks such
| events unconditionally in armpmu_event_init(). Instead a quick probe can be
| initiated via arm_pmu->has_branch_stack to ascertain the support.

If I remove everything that isn't just describing the code, I'm left with:

- 'reg_trbidr' captures BRBE attribute details
- These updates here will help in tracking any branch stack sampling support.
- perf branch stack sampling event is now enabled when it is supported
- Probing is quick

But crucial information is missing:

* What is BRBE?
* What is a BRBE attribute?
* How are the details of an attribute captured?
* How do these "updates" (which ones?) help in tracking branch stack sampling?
* What is being tracked and why?
* How quick is the probing and why do we care?
* What is the perf branch stack sampling event and what does it mean
to enable it? Does it offer something useful to the user?
* Why do we want any of this?

(these examples are not intended to be an exhaustive list of things that
need fixing)

Overall, this makes the code needlessly difficult to review. However, I
don't reckon it's too much effort on your side to fix the things above.
You've been doing this for long enough (on the author and reviewer side)
that I hope you see what I'm getting at. If not, try reviewing your own
patches right before you hit 'git send-email'; I pretty much always find
a problem with my own code that way.

So, please, can you post a v14 which:

1. Fixes the broken register access macros
2. Adds some meaningful tests at the end of the series
3. Squashes the new driver code (i.e. at least everything in
arm_brbe.c and possibly just everything under drivers/perf/) down
into a single patch
4. Does any _necessary_ cleanup or refactoring at the start of the
series, leaving out cosmetic stuff for now
5. Rewrites the commit messages following the guidelines in
submitting-patches.rst. You don't need to talk about specific C
expressions; we have the code for that already and if it's doing
something subtle then you can add a comment.
6. Resolves the open CYCLES_COUNT issue from Yang and Suzuki

Cheers,

Will

2023-08-02 13:14:45

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE

On 26/07/2023 06:32, Anshuman Khandual wrote:
>
>
> On 7/25/23 18:59, Suzuki K Poulose wrote:
>> On 25/07/2023 12:42, Anshuman Khandual wrote:
>>> Hello Yang,
>>>
>>> On 7/25/23 12:42, Yang Shen wrote:
>>>>> +    if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>>>>> +        brbcr |= BRBCR_EL1_CC;
>>>>
>>>> Hi Anshuman,
>>>>
>>>> Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only
>>>>
>>>> valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that
>>>>
>>>> when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to
>>>>
>>>> BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the
>>>>
>>>> BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.
>>>
>>>
>>> Agreed, this is a valid problem i.e BRBCR_EL1.CC and BRBCR_EL2.CC both needs to be set
>>> for valid cycle count information regardless if the kernel runs in EL1 or EL2. A simple
>>> hack in the current code setting BRBCR_EL12.C, which in turn sets BRBCR_EL1.CC when the
>>> kernel runs in EL2 solves the problem.
>>>
>>>>
>>>> As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.
>>>
>>> Right, will add the definition for BRBCR_EL12 in arch/arm64/tools/sysreg
>>>
>>>>
>>>> Or, do you have a more standard solution?
>>>
>>> Right, there are some nuances involved here.
>>>
>>> Kernel could boot
>>>
>>> a. Directly into EL2 and stays in EL2 for good
>>> b. Directly into EL2 but switches into EL1
>>> c. Directly into EL1 without ever going into EL2
>>>
>>> In all the above cases BRBCR_EL1.CC and BRBCR_EL2.CC needs to be set when cycle count
>>> is requested in the perf event interface (event->attr.branch_sample_type) via clearing
>>> PERF_SAMPLE_BRANCH_NO_CYCLES.
>>>
>>>
>>> - For the case as in (c) where kernel boots into EL1 directly and hence cannot ever set
>>>    EL2 register, BRBCR_EL2.CC would be a booting requirement - updated in booting.rst
>>>
>>> - For the cases as in (a) and (b) kernel boots via EL2, hence there is an opportunity
>>>    to set both BRBCR_EL1.CC (via accessed BRBCR_EL12.CC) and BRBCR_EL2.CC. Depending on
>>
>> You don't need to use BRBCR_EL12, if you do it early enough, before
>> HCR_EL2.E2H == 1 is applied.
>
> Agreed. Please find the code change which solves this reported problem, please
> have a look and let me know if anything needs changing.
>
> ------------------------------------------------------------------------------
> Documentation/arch/arm64/booting.rst | 6 ++++
> arch/arm64/include/asm/el2_setup.h | 45 ++++++++++++++++++++++++++++
> arch/arm64/tools/sysreg | 38 +++++++++++++++++++++++
> 3 files changed, 89 insertions(+)
>
> diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
> index b57776a68f15..2276df285e83 100644
> --- a/Documentation/arch/arm64/booting.rst
> +++ b/Documentation/arch/arm64/booting.rst
> @@ -349,6 +349,12 @@ Before jumping into the kernel, the following conditions must be met:
>
> - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
>
> + For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
> +
> + - If the kernel is entered at EL1 and EL2 is present:
> +
> + - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
> +
> For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
>
> - If EL3 is present:
> diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
> index 8e5ffb58f83e..75b04eff2dc7 100644
> --- a/arch/arm64/include/asm/el2_setup.h
> +++ b/arch/arm64/include/asm/el2_setup.h
> @@ -150,6 +150,50 @@
> msr cptr_el2, x0 // Disable copro. traps to EL2
> .endm
>
> +/*
> + * Enable BRBE cycle count
> + *
> + * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
> + * for the cycle counts to be available in BRBINF<N>_EL1.CC during
> + * branch record processing after a PMU interrupt. This enables CC
> + * field on both these registers while still executing inside EL2.
> + *
> + * BRBE driver would still be able to toggle branch records cycle
> + * count support via BRBCR_EL1.CC field regardless of whether the
> + * kernel end up executing in EL1 or EL2.
> + */
> +.macro __init_el2_brbe
> + mrs x1, id_aa64dfr0_el1
> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
> + cbz x1, .Lskip_brbe_cc_\@
> +
> + mrs_s x0, SYS_BRBCR_EL2
> + orr x0, x0, BRBCR_EL2_CC
> + msr_s SYS_BRBCR_EL2, x0
> +
> + /*
> + * Accessing BRBCR_EL1 register here does not require
> + * BRBCR_EL12 addressing mode as HCR_EL2.E2H is still
> + * clear. Regardless, check for HCR_E2H and be on the
> + * safer side.
> + */
> + mrs x1, hcr_el2
> + and x1, x1, #HCR_E2H
> + cbz x1, .Lset_brbe_el1_direct_\@
> +
> + mrs_s x0, SYS_BRBCR_EL12
> + orr x0, x0, BRBCR_EL12_CC
> + msr_s SYS_BRBCR_EL12, x0
> + b .Lskip_brbe_cc_\@
> +
> +.Lset_brbe_el1_direct_\@:
> + mrs_s x0, SYS_BRBCR_EL1
> + orr x0, x0, BRBCR_EL1_CC
> + msr_s SYS_BRBCR_EL1, x0
> +
> +.Lskip_brbe_cc_\@:
> +.endm
> +
> /* Disable any fine grained traps */
> .macro __init_el2_fgt
> mrs x1, id_aa64mmfr0_el1
> @@ -224,6 +268,7 @@
> __init_el2_nvhe_idregs
> __init_el2_cptr
> __init_el2_fgt
> + __init_el2_brbe
> .endm
>
> #ifndef __KVM_NVHE_HYPERVISOR__
> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
> index 9892af96262f..7d1d6b3976b4 100644
> --- a/arch/arm64/tools/sysreg
> +++ b/arch/arm64/tools/sysreg
> @@ -1048,6 +1048,44 @@ Enum 1:0 VALID
> EndEnum
> EndSysregFields
>
> +Sysreg BRBCR_EL12 2 5 9 0 0
> +Res0 63:24
> +Field 23 EXCEPTION
> +Field 22 ERTN
> +Res0 21:9
> +Field 8 FZP
> +Res0 7
> +Enum 6:5 TS
> + 0b01 VIRTUAL
> + 0b10 GUEST_PHYSICAL
> + 0b11 PHYSICAL
> +EndEnum
> +Field 4 MPRED
> +Field 3 CC
> +Res0 2
> +Field 1 E1BRE
> +Field 0 E0BRE
> +EndSysreg

As this is exactly same as BRBCR_EL1, please could we use SysregFields
for BRBCR_EL1 and reuse it here ?



> +
> +Sysreg BRBCR_EL2 2 4 9 0 0
> +Res0 63:24
> +Field 23 EXCEPTION
> +Field 22 ERTN
> +Res0 21:9
> +Field 8 FZP
> +Res0 7
> +Enum 6:5 TS
> + 0b01 VIRTUAL
> + 0b10 GUEST_PHYSICAL
> + 0b11 PHYSICAL
> +EndEnum
> +Field 4 MPRED
> +Field 3 CC
> +Res0 2
> +Field 1 E1BRE

E2BRE?
> +Field 0 E0BRE

E0HBRE?

Rest looks good to me

Suzuki

> +EndSysreg
> +
> Sysreg BRBCR_EL1 2 1 9 0 0
> Res0 63:24
> Field 23 EXCEPTION


2023-08-02 14:26:20

by Rajnesh Kanwal

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 09/10] arm64/perf: Implement branch records save on task sched out

>diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
>index 203cd4f350d5..2177632befa6 100644
>--- a/drivers/perf/arm_brbe.c
>+++ b/drivers/perf/arm_brbe.c
>@@ -165,6 +165,36 @@ static int stitch_stored_live_entries(struct brbe_regset *stored,
> return min(nr_live + nr_stored, nr_max);
> }
>
>+static int brbe_branch_save(struct brbe_regset *live, int nr_hw_entries)
>+{
>+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
>+ int nr_live;
>+
>+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
>+ isb();
>+
>+ nr_live = capture_brbe_regset(live, nr_hw_entries);
>+
>+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
>+ isb();
>+
>+ return nr_live;
>+}
>+
>+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>+{
>+ struct arm64_perf_task_context *task_ctx = ctx;
>+ struct brbe_regset live[BRBE_MAX_ENTRIES];
>+ int nr_live, nr_store, nr_hw_entries;
>+
>+ nr_hw_entries = brbe_get_numrec(arm_pmu->reg_brbidr);
>+ nr_live = brbe_branch_save(live, nr_hw_entries);
>+ nr_store = task_ctx->nr_brbe_records;
>+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
>+ nr_live, nr_hw_entries);
>+ task_ctx->nr_brbe_records = nr_store;
>+}

Asking out-of-curiosity. Have you thought about virtualization use
case. Current LBR implementation create an event for the guest
and save/restore happens using the sched_task callback. (Correct me
if I am wrong). Given you are only saving and processing those saved
entries, how do you plan to expose the entries to the guest?

Thanks
Rajnesh

>+
> /*
> * Generic perf branch filters supported on BRBE
> *
>diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
>index 408974d5c57b..aa3c7b3dcdd6 100644
>--- a/drivers/perf/arm_pmuv3.c
>+++ b/drivers/perf/arm_pmuv3.c
>@@ -923,9 +923,19 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
> static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
> {
> struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>+ void *task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
>
>- if (sched_in && armpmu->has_branch_stack)
>- armv8pmu_branch_reset();
>+ if (armpmu->has_branch_stack) {
>+ /* Save branch records in task_ctx on sched out */
>+ if (task_ctx && !sched_in) {
>+ armv8pmu_branch_save(armpmu, task_ctx);
>+ return;
>+ }
>+
>+ /* Reset branch records on sched in */
>+ if (sched_in)
>+ armv8pmu_branch_reset();
>+ }
> }

2023-08-02 19:38:40

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 09/10] arm64/perf: Implement branch records save on task sched out

On Wed, 02 Aug 2023 12:59:31 +0100,
Rajnesh Kanwal <[email protected]> wrote:
>
> >diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> >index 203cd4f350d5..2177632befa6 100644
> >--- a/drivers/perf/arm_brbe.c
> >+++ b/drivers/perf/arm_brbe.c
> >@@ -165,6 +165,36 @@ static int stitch_stored_live_entries(struct brbe_regset *stored,
> > return min(nr_live + nr_stored, nr_max);
> > }
> >
> >+static int brbe_branch_save(struct brbe_regset *live, int nr_hw_entries)
> >+{
> >+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> >+ int nr_live;
> >+
> >+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> >+ isb();
> >+
> >+ nr_live = capture_brbe_regset(live, nr_hw_entries);
> >+
> >+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> >+ isb();
> >+
> >+ return nr_live;
> >+}
> >+
> >+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
> >+{
> >+ struct arm64_perf_task_context *task_ctx = ctx;
> >+ struct brbe_regset live[BRBE_MAX_ENTRIES];
> >+ int nr_live, nr_store, nr_hw_entries;
> >+
> >+ nr_hw_entries = brbe_get_numrec(arm_pmu->reg_brbidr);
> >+ nr_live = brbe_branch_save(live, nr_hw_entries);
> >+ nr_store = task_ctx->nr_brbe_records;
> >+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
> >+ nr_live, nr_hw_entries);
> >+ task_ctx->nr_brbe_records = nr_store;
> >+}
>
> Asking out-of-curiosity. Have you thought about virtualization use
> case. Current LBR implementation create an event for the guest
> and save/restore happens using the sched_task callback. (Correct me
> if I am wrong). Given you are only saving and processing those saved
> entries, how do you plan to expose the entries to the guest?

Two possibilities:

- either we perform a full save/restore of the registers so that host
and guest have isolated states

- or we trap all BRBE accesses and piggy-back on the perf framework
for that, much like we already do for the PMU.

M.

--
Without deviation from the norm, progress is not possible.

2023-08-03 02:48:08

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE



On 8/2/23 18:10, Suzuki K Poulose wrote:
> On 26/07/2023 06:32, Anshuman Khandual wrote:
>>
>>
>> On 7/25/23 18:59, Suzuki K Poulose wrote:
>>> On 25/07/2023 12:42, Anshuman Khandual wrote:
>>>> Hello Yang,
>>>>
>>>> On 7/25/23 12:42, Yang Shen wrote:
>>>>>> +    if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>>>>>> +        brbcr |= BRBCR_EL1_CC;
>>>>>
>>>>> Hi Anshuman,
>>>>>
>>>>> Here is problem about enable CYCLES_COUNT. The SPEC defines that the CYCLES_COUNT is only
>>>>>
>>>>> valid when the BRECR_EL1.CC & BRBCR_EL2.CC is true. And here the SPEC also defines that
>>>>>
>>>>> when PSTATE.EL == EL2 and HCR_EL2.E2h == '1', 'MSR BRBCR_EL1, <Xt>' means writing to
>>>>>
>>>>> BRBCR_EL2 actually. So 'armv8pmu_branch_enable' can only set the BRBCR_EL2.CC, while the
>>>>>
>>>>> BRECR_EL1.CC is still 0. The CYCLES_COUNT will be always 0 in records.
>>>>
>>>>
>>>> Agreed, this is a valid problem i.e BRBCR_EL1.CC and BRBCR_EL2.CC both needs to be set
>>>> for valid cycle count information regardless if the kernel runs in EL1 or EL2. A simple
>>>> hack in the current code setting BRBCR_EL12.C, which in turn sets BRBCR_EL1.CC when the
>>>> kernel runs in EL2 solves the problem.
>>>>
>>>>>
>>>>> As a solution, maybe BRBCR_EL12 should be added for driver according to the registers definition.
>>>>
>>>> Right, will add the definition for BRBCR_EL12 in arch/arm64/tools/sysreg
>>>>
>>>>>
>>>>> Or, do you have a more standard solution?
>>>>
>>>> Right, there are some nuances involved here.
>>>>
>>>> Kernel could boot
>>>>      a. Directly into EL2 and stays in EL2 for good
>>>> b. Directly into EL2 but switches into EL1
>>>> c. Directly into EL1 without ever going into EL2
>>>>
>>>> In all the above cases BRBCR_EL1.CC and BRBCR_EL2.CC needs to be set when cycle count
>>>> is requested in the perf event interface (event->attr.branch_sample_type) via clearing
>>>> PERF_SAMPLE_BRANCH_NO_CYCLES.
>>>>
>>>>
>>>> - For the case as in (c) where kernel boots into EL1 directly and hence cannot ever set
>>>>     EL2 register, BRBCR_EL2.CC would be a booting requirement - updated in booting.rst
>>>>
>>>> - For the cases as in (a) and (b) kernel boots via EL2, hence there is an opportunity
>>>>     to set both BRBCR_EL1.CC (via accessed BRBCR_EL12.CC) and BRBCR_EL2.CC. Depending on
>>>
>>> You don't need to use BRBCR_EL12, if you do it early enough, before
>>> HCR_EL2.E2H == 1 is applied.
>>
>> Agreed. Please find the code change which solves this reported problem, please
>> have a look and let me know if anything needs changing.
>>
>> ------------------------------------------------------------------------------
>>   Documentation/arch/arm64/booting.rst |  6 ++++
>>   arch/arm64/include/asm/el2_setup.h   | 45 ++++++++++++++++++++++++++++
>>   arch/arm64/tools/sysreg              | 38 +++++++++++++++++++++++
>>   3 files changed, 89 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
>> index b57776a68f15..2276df285e83 100644
>> --- a/Documentation/arch/arm64/booting.rst
>> +++ b/Documentation/arch/arm64/booting.rst
>> @@ -349,6 +349,12 @@ Before jumping into the kernel, the following conditions must be met:
>>         - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
>>   +  For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
>> +
>> +  - If the kernel is entered at EL1 and EL2 is present:
>> +
>> +    - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
>> +
>>     For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
>>       - If EL3 is present:
>> diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
>> index 8e5ffb58f83e..75b04eff2dc7 100644
>> --- a/arch/arm64/include/asm/el2_setup.h
>> +++ b/arch/arm64/include/asm/el2_setup.h
>> @@ -150,6 +150,50 @@
>>       msr    cptr_el2, x0            // Disable copro. traps to EL2
>>   .endm
>>   +/*
>> + * Enable BRBE cycle count
>> + *
>> + * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
>> + * for the cycle counts to be available in BRBINF<N>_EL1.CC during
>> + * branch record processing after a PMU interrupt. This enables CC
>> + * field on both these registers while still executing inside EL2.
>> + *
>> + * BRBE driver would still be able to toggle branch records cycle
>> + * count support via BRBCR_EL1.CC field regardless of whether the
>> + * kernel end up executing in EL1 or EL2.
>> + */
>> +.macro __init_el2_brbe
>> +    mrs    x1, id_aa64dfr0_el1
>> +    ubfx    x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
>> +    cbz    x1, .Lskip_brbe_cc_\@
>> +
>> +    mrs_s    x0, SYS_BRBCR_EL2
>> +    orr    x0, x0, BRBCR_EL2_CC
>> +    msr_s    SYS_BRBCR_EL2, x0
>> +
>> +    /*
>> +     * Accessing BRBCR_EL1 register here does not require
>> +     * BRBCR_EL12 addressing mode as HCR_EL2.E2H is still
>> +     * clear. Regardless, check for HCR_E2H and be on the
>> +     * safer side.
>> +     */
>> +    mrs    x1, hcr_el2
>> +    and    x1, x1, #HCR_E2H
>> +    cbz    x1, .Lset_brbe_el1_direct_\@
>> +
>> +    mrs_s    x0, SYS_BRBCR_EL12
>> +    orr    x0, x0, BRBCR_EL12_CC
>> +    msr_s    SYS_BRBCR_EL12, x0
>> +    b    .Lskip_brbe_cc_\@
>> +
>> +.Lset_brbe_el1_direct_\@:
>> +    mrs_s    x0, SYS_BRBCR_EL1
>> +    orr    x0, x0, BRBCR_EL1_CC
>> +    msr_s    SYS_BRBCR_EL1, x0
>> +
>> +.Lskip_brbe_cc_\@:
>> +.endm
>> +
>>   /* Disable any fine grained traps */
>>   .macro __init_el2_fgt
>>       mrs    x1, id_aa64mmfr0_el1
>> @@ -224,6 +268,7 @@
>>       __init_el2_nvhe_idregs
>>       __init_el2_cptr
>>       __init_el2_fgt
>> +    __init_el2_brbe
>>   .endm
>>     #ifndef __KVM_NVHE_HYPERVISOR__
>> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
>> index 9892af96262f..7d1d6b3976b4 100644
>> --- a/arch/arm64/tools/sysreg
>> +++ b/arch/arm64/tools/sysreg
>> @@ -1048,6 +1048,44 @@ Enum    1:0    VALID
>>   EndEnum
>>   EndSysregFields
>>   +Sysreg    BRBCR_EL12    2    5    9    0    0
>> +Res0    63:24
>> +Field    23     EXCEPTION
>> +Field    22     ERTN
>> +Res0    21:9
>> +Field    8     FZP
>> +Res0    7
>> +Enum    6:5    TS
>> +    0b01    VIRTUAL
>> +    0b10    GUEST_PHYSICAL
>> +    0b11    PHYSICAL
>> +EndEnum
>> +Field    4    MPRED
>> +Field    3    CC
>> +Res0    2
>> +Field    1    E1BRE
>> +Field    0    E0BRE
>> +EndSysreg
>
> As this is exactly same as BRBCR_EL1, please could we use SysregFields
> for BRBCR_EL1 and reuse it here ?

Sure, will add 'SysregFields BRBCR_ELx' enlisting the register fields to
be used as 'Fields BRBCR_ELx' both for BRBCR_EL1 and BRBCR_EL12. I guess
BRBCR_EL2 still remains unchanged as it has 'E2BRE' and 'E0HBRE' fields.

But as a consequence all BRBCR_EL1_XXX fields used in the driver and its
header need to be converted as BRBCR_ELx_XXX. Will do these changes.

>
>
>
>> +
>> +Sysreg    BRBCR_EL2    2    4    9    0    0
>> +Res0    63:24
>> +Field    23     EXCEPTION
>> +Field    22     ERTN
>> +Res0    21:9
>> +Field    8     FZP
>> +Res0    7
>> +Enum    6:5    TS
>> +    0b01    VIRTUAL
>> +    0b10    GUEST_PHYSICAL
>> +    0b11    PHYSICAL
>> +EndEnum
>> +Field    4    MPRED
>> +Field    3    CC
>> +Res0    2
>> +Field    1    E1BRE
>
> E2BRE?
>> +Field    0    E0BRE
>
> E0HBRE?

Sure, I have already updated this in the development tree. This was just a
hack for the discussion purpose.

>
> Rest looks good to me
>
> Suzuki
>
>> +EndSysreg
>> +
>>   Sysreg    BRBCR_EL1    2    1    9    0    0
>>   Res0    63:24
>>   Field    23     EXCEPTION
>

2023-08-16 20:26:47

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields

On Tue, Aug 15, 2023 at 11:17:19AM +0100, James Clark wrote:
>
>
> On 31/07/2023 10:06, Mark Rutland wrote:
> > On Mon, Jul 31, 2023 at 08:03:21AM +0530, Anshuman Khandual wrote:
> >>
> >>
> >> On 7/28/23 22:22, James Clark wrote:
> >>>
> >>>
> >>> On 28/07/2023 17:20, Will Deacon wrote:
> >>>> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
> >>>>> This adds BRBE related register definitions and various other related field
> >>>>> macros there in. These will be used subsequently in a BRBE driver which is
> >>>>> being added later on.
> >>>>>
> >>>>> Cc: Catalin Marinas <[email protected]>
> >>>>> Cc: Will Deacon <[email protected]>
> >>>>> Cc: Marc Zyngier <[email protected]>
> >>>>> Cc: Mark Rutland <[email protected]>
> >>>>> Cc: [email protected]
> >>>>> Cc: [email protected]
> >>>>> Tested-by: James Clark <[email protected]>
> >>>>> Reviewed-by: Mark Brown <[email protected]>
> >>>>> Signed-off-by: Anshuman Khandual <[email protected]>
> >>>>> ---
> >>>>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
> >>>>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
> >>>>> 2 files changed, 261 insertions(+)
> >>>>>
> >>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> >>>>> index b481935e9314..f95e30c13c8b 100644
> >>>>> --- a/arch/arm64/include/asm/sysreg.h
> >>>>> +++ b/arch/arm64/include/asm/sysreg.h
> >>>>> @@ -163,6 +163,109 @@
> >>>>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
> >>>>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
> >>>>>
> >>>>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
> >>>>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
> >>>>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
> >>>>
> >>>> It's that time on a Friday but... aren't these macros busted? I think you
> >>>> need brackets before adding the offset, otherwise wouldn't, for example,
> >>>> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
> >>>> then start accessing source register 0?
> >>>>
> >>>> I'm surprised that the compiler doesn't warn about this, but even more
> >>>> surprised that you managed to test this.
> >>>>
> >>>> Please tell me I'm wrong!
> >>>>
> >>>> Will
> >>>
> >>> No I think you are right, it is wrong. Luckily there is already an
> >>> extraneous bracket so you you can fix it by moving one a place down:
> >>>
> >>> sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
> >>>
> >>> It's interesting because the test [1] is doing quite a bit and looking
> >>> at the branch info, and that src and targets match up to function names.
> >>> I also manually looked at the branch buffers and didn't see anything
> >>> obviously wrong like things that looked like branch infos in the source
> >>> or target fields. Will have to take another look to see if it would be
> >>> possible for the test to catch this.
> >>>
> >>> James
> >>>
> >>> [1]:
> >>> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
> >>
> >> ((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)
> >>
> >> The additional brackets are useful in explicitly telling the compiler but
> >> what it the compiler is just doing the right thing implicitly i.e computing
> >> the shifting operation before doing the offset addition.
> >
> > No; that is not correct. In c, '+' has higher precedence than '>>'.
> >
> > For 'a >> b + c' the compiler *must* treat that as 'a >> (b + c)', and not as
> > '(a >> b) + c'
> >
> > That's trivial to test:
> >
> > | [mark@gravadlaks:~]% cat shiftadd.c
> > | #include <stdio.h>
> > |
> > | unsigned long logshiftadd(unsigned long a,
> > | unsigned long b,
> > | unsigned long c)
> > | {
> > | printf("%ld >> %ld + %ld is %ld\n",
> > | a, b, c, a >> b + c);
> > | }
> > |
> > | int main(int argc, char *argv)
> > | {
> > | logshiftadd(0, 0, 0);
> > | logshiftadd(0, 0, 1);
> > | logshiftadd(0, 0, 2);
> > | printf("\n");
> > | logshiftadd(1024, 0, 0);
> > | logshiftadd(1024, 0, 1);
> > | logshiftadd(1024, 0, 2);
> > | printf("\n");
> > | logshiftadd(1024, 2, 0);
> > | logshiftadd(1024, 2, 1);
> > | logshiftadd(1024, 2, 2);
> > |
> > | return 0;
> > | }
> > | [mark@gravadlaks:~]% gcc shiftadd.c -o shiftadd
> > | [mark@gravadlaks:~]% ./shiftadd
> > | 0 >> 0 + 0 is 0
> > | 0 >> 0 + 1 is 0
> > | 0 >> 0 + 2 is 0
> > |
> > | 1024 >> 0 + 0 is 1024
> > | 1024 >> 0 + 1 is 512
> > | 1024 >> 0 + 2 is 256
> > |
> > | 1024 >> 2 + 0 is 256
> > | 1024 >> 2 + 1 is 128
> > | 1024 >> 2 + 2 is 64
> >
> >> During testing, all > those captured branch records looked alright.
> >
> > I think we clearly need better testing here.
> >
> > Thanks,
> > Mark.
>
> Hi Will and Mark,
>
> So I started looking into the test both with and without the fix,
> strangely I couldn't see any difference in the branch outputs, or
> anywhere in the driver where it would be flipping or filtering anything
> to make it only appear to be working. This was a bit confusing, but
> added up with the original point that the test was passing and it was
> actually doing something.
>
> So I started going deeper and found what the issue (non-issue) is.
>
> Firstly why is there no warning:
>
> The expression is stringified and passed to the assembler, so it skips
> the C compiler warning settings. I can send a patch to fix this, but all
> we need to do is get the compiler to evaluate the argument and then
> throw it away, luckily there are no other issues found even with an
> allyesconfig, so BRBE was the only thing with this bug:
>
> #define read_sysreg_s(r) ({
> u64 __val;
> + u32 __maybe_unused __check_r = (u32)(r);
> asm volatile(__mrs_s("%0", r) : "=r" (__val));
> __val;
> })
>
>
> Secondly, why does BRBE actually work:
>
> Well the assembler (at least in my Clang toolchain) has a different
> order of operations to C. I put a minimal repro here:
> https://godbolt.org/z/YP9adh5xh
>
> You can see the op2 should be a 0b100000 (0x20) for BRBSRC and it
> appears to be, you can also see that moving the bracket makes no
> difference in this case.
>
> For some more evidence, the disassembler I have locally actually gives
> the correct register name, even when the bracket is wrong, and diffing
> the .o file gives no difference when moving the bracket:
>
> 0000000000000008 <main>:
> 8: d503245f bti c
> c: d503201f nop
> 10: d503201f nop
> 14: 2a1f03e0 mov w0, wzr
> 18: d5318028 mrs x8, brbsrc0_el1
> 1c: d5318128 mrs x8, brbsrc1_el1
> 20: d65f03c0 ret
>
> Seems completely crazy to me that this is actually the case. So maybe I
> am also wrong. Don't know if this counts as a compiler bug or it's just
> supposed to be like that.

From a quick dig, it's supposed to be like that: the GNU assembler uses a
different operator precedence to C, and clang's assembler does the same for
compatibility. What a great.

Compare:

https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_chapter/as_6.html#SEC66

... with:

https://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B#Operator_precedence

Adding the brackets will make this work in either case, so I think that's the
right thing to do for now.

Thanks,
Mark.

2023-08-18 10:05:36

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 02/10] arm64/perf: Add BRBE registers and fields



On 31/07/2023 10:06, Mark Rutland wrote:
> On Mon, Jul 31, 2023 at 08:03:21AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 7/28/23 22:22, James Clark wrote:
>>>
>>>
>>> On 28/07/2023 17:20, Will Deacon wrote:
>>>> On Tue, Jul 11, 2023 at 01:54:47PM +0530, Anshuman Khandual wrote:
>>>>> This adds BRBE related register definitions and various other related field
>>>>> macros there in. These will be used subsequently in a BRBE driver which is
>>>>> being added later on.
>>>>>
>>>>> Cc: Catalin Marinas <[email protected]>
>>>>> Cc: Will Deacon <[email protected]>
>>>>> Cc: Marc Zyngier <[email protected]>
>>>>> Cc: Mark Rutland <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: [email protected]
>>>>> Tested-by: James Clark <[email protected]>
>>>>> Reviewed-by: Mark Brown <[email protected]>
>>>>> Signed-off-by: Anshuman Khandual <[email protected]>
>>>>> ---
>>>>> arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>>>> arch/arm64/tools/sysreg | 158 ++++++++++++++++++++++++++++++++
>>>>> 2 files changed, 261 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>>> index b481935e9314..f95e30c13c8b 100644
>>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>>> @@ -163,6 +163,109 @@
>>>>> #define SYS_DBGDTRTX_EL0 sys_reg(2, 3, 0, 5, 0)
>>>>> #define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)
>>>>>
>>>>> +#define __SYS_BRBINFO(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
>>>>> +#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
>>>>> +#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))
>>>>
>>>> It's that time on a Friday but... aren't these macros busted? I think you
>>>> need brackets before adding the offset, otherwise wouldn't, for example,
>>>> target registers 0-15 all access info register 0 and __SYS_BRBTGT(16) would
>>>> then start accessing source register 0?
>>>>
>>>> I'm surprised that the compiler doesn't warn about this, but even more
>>>> surprised that you managed to test this.
>>>>
>>>> Please tell me I'm wrong!
>>>>
>>>> Will
>>>
>>> No I think you are right, it is wrong. Luckily there is already an
>>> extraneous bracket so you you can fix it by moving one a place down:
>>>
>>> sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
>>>
>>> It's interesting because the test [1] is doing quite a bit and looking
>>> at the branch info, and that src and targets match up to function names.
>>> I also manually looked at the branch buffers and didn't see anything
>>> obviously wrong like things that looked like branch infos in the source
>>> or target fields. Will have to take another look to see if it would be
>>> possible for the test to catch this.
>>>
>>> James
>>>
>>> [1]:
>>> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
>>
>> ((((n) & 0x10)) >> 2 + 2) ---> ((((n) & 0x10) >> 2) + 2)
>>
>> The additional brackets are useful in explicitly telling the compiler but
>> what it the compiler is just doing the right thing implicitly i.e computing
>> the shifting operation before doing the offset addition.
>
> No; that is not correct. In c, '+' has higher precedence than '>>'.
>
> For 'a >> b + c' the compiler *must* treat that as 'a >> (b + c)', and not as
> '(a >> b) + c'
>
> That's trivial to test:
>
> | [mark@gravadlaks:~]% cat shiftadd.c
> | #include <stdio.h>
> |
> | unsigned long logshiftadd(unsigned long a,
> | unsigned long b,
> | unsigned long c)
> | {
> | printf("%ld >> %ld + %ld is %ld\n",
> | a, b, c, a >> b + c);
> | }
> |
> | int main(int argc, char *argv)
> | {
> | logshiftadd(0, 0, 0);
> | logshiftadd(0, 0, 1);
> | logshiftadd(0, 0, 2);
> | printf("\n");
> | logshiftadd(1024, 0, 0);
> | logshiftadd(1024, 0, 1);
> | logshiftadd(1024, 0, 2);
> | printf("\n");
> | logshiftadd(1024, 2, 0);
> | logshiftadd(1024, 2, 1);
> | logshiftadd(1024, 2, 2);
> |
> | return 0;
> | }
> | [mark@gravadlaks:~]% gcc shiftadd.c -o shiftadd
> | [mark@gravadlaks:~]% ./shiftadd
> | 0 >> 0 + 0 is 0
> | 0 >> 0 + 1 is 0
> | 0 >> 0 + 2 is 0
> |
> | 1024 >> 0 + 0 is 1024
> | 1024 >> 0 + 1 is 512
> | 1024 >> 0 + 2 is 256
> |
> | 1024 >> 2 + 0 is 256
> | 1024 >> 2 + 1 is 128
> | 1024 >> 2 + 2 is 64
>
>> During testing, all > those captured branch records looked alright.
>
> I think we clearly need better testing here.
>
> Thanks,
> Mark.

Hi Will and Mark,

So I started looking into the test both with and without the fix,
strangely I couldn't see any difference in the branch outputs, or
anywhere in the driver where it would be flipping or filtering anything
to make it only appear to be working. This was a bit confusing, but
added up with the original point that the test was passing and it was
actually doing something.

So I started going deeper and found what the issue (non-issue) is.

Firstly why is there no warning:

The expression is stringified and passed to the assembler, so it skips
the C compiler warning settings. I can send a patch to fix this, but all
we need to do is get the compiler to evaluate the argument and then
throw it away, luckily there are no other issues found even with an
allyesconfig, so BRBE was the only thing with this bug:

#define read_sysreg_s(r) ({
u64 __val;
+ u32 __maybe_unused __check_r = (u32)(r);
asm volatile(__mrs_s("%0", r) : "=r" (__val));
__val;
})


Secondly, why does BRBE actually work:

Well the assembler (at least in my Clang toolchain) has a different
order of operations to C. I put a minimal repro here:
https://godbolt.org/z/YP9adh5xh

You can see the op2 should be a 0b100000 (0x20) for BRBSRC and it
appears to be, you can also see that moving the bracket makes no
difference in this case.

For some more evidence, the disassembler I have locally actually gives
the correct register name, even when the bracket is wrong, and diffing
the .o file gives no difference when moving the bracket:

0000000000000008 <main>:
8: d503245f bti c
c: d503201f nop
10: d503201f nop
14: 2a1f03e0 mov w0, wzr
18: d5318028 mrs x8, brbsrc0_el1
1c: d5318128 mrs x8, brbsrc1_el1
20: d65f03c0 ret

Seems completely crazy to me that this is actually the case. So maybe I
am also wrong. Don't know if this counts as a compiler bug or it's just
supposed to be like that.

Thanks
James

2023-08-21 11:49:45

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 00/10] arm64/perf: Enable branch stack sampling

Hello Will,

Separated this part out just to understand it better.

On 8/18/23 23:26, Will Deacon wrote:
>> These stubs are necessary to protect PMU drivers built on arm32 which share
>> basic branch record processing abstraction at ARMV8 PMU level. It compiles
>> successfully both on arm32 and arm64 platforms with these changes. Subject
>> line for this patch clearly mentions that as well.

> But they shouldn't be needed. When CONFIG_ARM64_BRBE is not selected, no
> branch record processing functions should be needed, empty or otherwise.

But that is not how the code is organized currently. CONFIG_ARM64_BRBE enables
the driver to bring in BRBE HW specific branch stack callback implementation
details. But these callbacks are always present at ARMV8 PMU level even without
BRBE implementation as well. Hence these call sites are present, regardless of
CONFIG_ARM64_BRBE.

> The callers should not exist in the first place (i.e. the empty definitions

But how to achieve that ? Branch stack needs to be driven along side normal PMU
events, which in turn get driven from 'struct arm_pmu' callbacks. Hence branch
stack callbacks too needs to be at ARMV8 PMU level, and closely tied to normal
PMU event handling callbacks. Let's examine from where all these new callbacks
are called from.

* armv8pmu_disable_event() ---> armv8pmu_branch_disable()
* armv8pmu_handle_irq() ---> armv8pmu_branch_read()
* armv8pmu_sched_task() ---> armv8pmu_branch_save()
* armv8pmu_sched_task() ---> armv8pmu_branch_reset()
* armv8pmu_reset() ---> armv8pmu_branch_reset()
* __armv8_pmuv3_map_event() ---> armv8pmu_branch_attr_valid()
* __armv8pmu_probe_pmu() ---> armv8pmu_branch_probe()
* armv8pmu_probe_pmu() ---> armv8pmu_task_ctx_cache_alloc()
* armv8pmu_probe_pmu() ---> branch_records_alloc()

Please note that, branch_records_alloc() being deferred allocation is only one
that is platform agnostic.

I did not intend to make any of the BRBE details visible at ARMV8 PMU level i.e
right inside armv8pmu_XXXX() definitions, as ARMV8 PMU is shared between arm32
and arm64 platforms.

Hence these new branch stack callbacks are added along with required fallback
stubs for build protection, so that the HW implementations can be hidden inside
a new driver wrapped in CONFIG_ARM64_BRBE. Please note that these new branch
stack functions are not 'struct arm_pmu' callbacks but instead normal helpers.

> should live in the core driver code, not in the arch header, or the calling

Driver code is compiled with CONFIG_ARM64_BRBE, hence the real definitions are
there. Instead default stubs are required only when armv8pmu_branch_XXX() call
backs are not defined. But are you suggesting that these stubs be moved inside
drivers/perf/arm_pmuv3.c (where all call sites reside), so that there will be
just one stub copy both for arm32 and arm64 platforms removing their duplicate
definitions from arch headers i.e arch/arm64/include/asm/perf_event.h and
arch/arm/include/asm/arm_pmuv3.h ?

> code should not be compiled at all).

Branch stack sampling is always enabled from core perf perspective without any
config option to wrap around, hence calling code cannot be selectively enabled
or disabled.

- Anshuman

2023-09-27 09:14:43

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V13 - RESEND 00/10] arm64/perf: Enable branch stack sampling


On 7/11/23 13:54, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> relevant register definitions could be accessed here.
>
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>
> This series applies on 6.5-rc1.
>
> Changes in V13:
>
> https://lore.kernel.org/all/[email protected]/
>
> - Added branch callback stubs for aarch32 pmuv3 based platforms
> - Updated the comments for capture_brbe_regset()
> - Deleted the comments in __read_brbe_regset()
> - Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Hello All,

Upcoming V14 series is still being worked out, and might take some more time. But
meanwhile please find its corresponding development branch for experimentation
purpose, if required. Thank you.

https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v14_rc3)

- Anshuman