2024-04-05 02:47:11

by Anshuman Khandual

Subject: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

This series enables perf branch stack sampling support on the arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All the
relevant register definitions can be found here:

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on top of 6.9-rc2.

The series is also hosted below for quick access, review and testing:

https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)

There are still some open questions, e.g. handling multiple perf events with
different privilege branch filters scheduled on the same PMU, and supporting
guest branch stack tracing from the host. Suggestions are also welcome on
supporting BRBE inside the guest. The series has been completely re-organized
as suggested earlier.

- Anshuman

========== Perf Branch Stack Sampling Support (arm64 platforms) ===========

Currently the arm64 platform does not support perf branch stack sampling, so
any event requesting branch stack records, i.e. PERF_SAMPLE_BRANCH_STACK
marked in event->attr.sample_type, will be rejected in armpmu_event_init().

static int armpmu_event_init(struct perf_event *event)
{
	........
	/* does not support taken branch sampling */
	if (has_branch_stack(event))
		return -EOPNOTSUPP;
	........
}

$perf record -j any,u,k ls
Error:
cycles:P: PMU Hardware or event type doesn't support branch stack sampling.

-------------------- CONFIG_ARM64_BRBE and FEAT_BRBE ----------------------

After this series, the perf branch stack sampling feature gets enabled on
arm64 platforms where the FEAT_BRBE HW feature is supported and
CONFIG_ARM64_BRBE is also selected during build. Let's observe all possible
scenarios here.

1. Feature not built (!CONFIG_ARM64_BRBE):

Falls back to the current behaviour, i.e. the event gets rejected.

2. Feature built but HW not supported (CONFIG_ARM64_BRBE && !FEAT_BRBE):

Falls back to the current behaviour, i.e. the event gets rejected.

3. Feature built and HW supported (CONFIG_ARM64_BRBE && FEAT_BRBE):

The platform supports branch stack sampling requests. Let's observe this
through a simple example.

$perf record -j any_call,u,k,save_type ls

[Please refer to the perf-record man page for all possible branch filter options]

$perf report
-------------------------- Snip ----------------------
# Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
# ........ ....... .................... ............................................ ............................................ ..................
#
3.52% ls [kernel.kallsyms] [k] sched_clock_noinstr [k] arch_counter_get_cntpct 16
3.52% ls [kernel.kallsyms] [k] sched_clock [k] sched_clock_noinstr 9
1.85% ls [kernel.kallsyms] [k] sched_clock_cpu [k] sched_clock 5
1.80% ls [kernel.kallsyms] [k] irqtime_account_irq [k] sched_clock_cpu 20
1.58% ls [kernel.kallsyms] [k] gic_handle_irq [k] generic_handle_domain_irq 19
1.58% ls [kernel.kallsyms] [k] call_on_irq_stack [k] gic_handle_irq 9
1.58% ls [kernel.kallsyms] [k] do_interrupt_handler [k] call_on_irq_stack 23
1.58% ls [kernel.kallsyms] [k] generic_handle_domain_irq [k] __irq_resolve_mapping 6
1.58% ls [kernel.kallsyms] [k] __irq_resolve_mapping [k] __rcu_read_lock 10
-------------------------- Snip ----------------------

$perf report -D | grep cycles
-------------------------- Snip ----------------------
.... 1: ffff800080dd3334 -> ffff800080dd759c 39 cycles P 0 IND_CALL
.... 2: ffff800080ffaea0 -> ffff800080ffb688 16 cycles P 0 IND_CALL
.... 3: ffff800080139918 -> ffff800080ffae64 9 cycles P 0 CALL
.... 4: ffff800080dd3324 -> ffff8000801398f8 7 cycles P 0 CALL
.... 5: ffff8000800f8548 -> ffff800080dd330c 21 cycles P 0 IND_CALL
.... 6: ffff8000800f864c -> ffff8000800f84ec 6 cycles P 0 CALL
.... 7: ffff8000800f86dc -> ffff8000800f8638 11 cycles P 0 CALL
.... 8: ffff8000800f86d4 -> ffff800081008630 16 cycles P 0 CALL
-------------------------- Snip ----------------------

perf script and other tooling can also be applied on the captured perf.data.
Similarly, branch stack sampling records can be collected directly via the
perf_event_open() system call, after setting up 'struct perf_event_attr' as
required.

event->attr.sample_type |= PERF_SAMPLE_BRANCH_STACK
event->attr.branch_sample_type |= PERF_SAMPLE_BRANCH_<FILTER_1> |
PERF_SAMPLE_BRANCH_<FILTER_2> |
PERF_SAMPLE_BRANCH_<FILTER_3> |
...............................

But not all branch filters might be supported on a given platform.
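For reference, a minimal user space sketch of such a perf_event_open() call
is shown below. This is not part of the series; the event type, sample period
and branch filters chosen here are illustrative assumptions only.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_branch_stack_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_USER;
	attr.disabled = 1;

	/* monitor the calling thread on any CPU */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}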

----------------------- BRBE Branch Filters Support -----------------------

- The following branch filters are supported on arm64:

PERF_SAMPLE_BRANCH_USER /* Branch privilege filters */
PERF_SAMPLE_BRANCH_KERNEL
PERF_SAMPLE_BRANCH_HV

PERF_SAMPLE_BRANCH_ANY /* Branch type filters */
PERF_SAMPLE_BRANCH_ANY_CALL
PERF_SAMPLE_BRANCH_ANY_RETURN
PERF_SAMPLE_BRANCH_IND_CALL
PERF_SAMPLE_BRANCH_COND
PERF_SAMPLE_BRANCH_IND_JUMP
PERF_SAMPLE_BRANCH_CALL

PERF_SAMPLE_BRANCH_NO_FLAGS /* Branch record flags */
PERF_SAMPLE_BRANCH_NO_CYCLES
PERF_SAMPLE_BRANCH_TYPE_SAVE
PERF_SAMPLE_BRANCH_HW_INDEX
PERF_SAMPLE_BRANCH_PRIV_SAVE

- The following branch filters are not supported on arm64:

PERF_SAMPLE_BRANCH_ABORT_TX
PERF_SAMPLE_BRANCH_IN_TX
PERF_SAMPLE_BRANCH_NO_TX
PERF_SAMPLE_BRANCH_CALL_STACK

Events requesting any of the above unsupported branch filters get rejected.
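As a rough sketch of how this rejection can work (kernel-style pseudo-helper;
the helper name is an assumption, though the driver does maintain an
equivalent BRBE_EXCLUDE_BRANCH_FILTERS mask):

#include <linux/perf_event.h>

/* Sketch only: mirrors the unsupported filters listed above */
#define BRBE_EXCLUDE_BRANCH_FILTERS	(PERF_SAMPLE_BRANCH_ABORT_TX | \
					 PERF_SAMPLE_BRANCH_IN_TX | \
					 PERF_SAMPLE_BRANCH_NO_TX | \
					 PERF_SAMPLE_BRANCH_CALL_STACK)

/* Reject events requesting any branch filter outside the supported set */
static bool branch_sample_type_supported(u64 branch_sample_type)
{
	return !(branch_sample_type & BRBE_EXCLUDE_BRANCH_FILTERS);
}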

------------------ Possible 'branch_sample_type' Mismatch -----------------

The branch stack sampling attribute 'event->attr.branch_sample_type'
generally remains the same for all events during a perf record session.

$perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]

event_1->attr.branch_sample_type == event_2->attr.branch_sample_type

This 'branch_sample_type' is used to configure the BRBE hardware when both
events, i.e. <event_1> and <event_2>, get scheduled on a given PMU. But with
PMU HW event privilege filter inheritance, 'branch_sample_type' does not
remain the same for all events. Consider the following example:

$perf record -e cycles:u -e instructions:k -j any,save_type ls

cycles->attr.branch_sample_type != instructions->attr.branch_sample_type

This is because the cycles event inherits PERF_SAMPLE_BRANCH_USER while the
instructions event inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution
configures the BRBE hardware with the 'branch_sample_type' of the last event
added to the PMU, and hence captured branch records only get passed on to
matching events during a PMU interrupt.

static int
armpmu_add(struct perf_event *event, int flags)
{
	........
	if (has_branch_stack(event)) {
		/*
		 * Reset branch records buffer if a new task event gets
		 * scheduled on a PMU which might have existing records.
		 * Otherwise older branch records present in the buffer
		 * might leak into the new task event.
		 */
		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
			hw_events->brbe_context = event->ctx;
			if (armpmu->branch_reset)
				armpmu->branch_reset();
		}
		hw_events->brbe_users++;
Here -------> 	hw_events->brbe_sample_type = event->attr.branch_sample_type;
	}
	........
}

Instead of overriding the existing 'branch_sample_type', both could also be
merged, as sketched below.
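A minimal sketch of that alternative (an assumption for discussion, not what
this series implements) would simply OR the new request into the existing
configuration at the line marked above:

	/* merge instead of override; record-to-event matching still needed */
	hw_events->brbe_sample_type |= event->attr.branch_sample_type;

Matching captured branch records back to individual events during the PMU
interrupt would then need additional per-event filtering.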

--------------------------- Virtualisation support ------------------------

- Branch stack sampling is not currently supported inside the guest (TODO)

- FEAT_BRBE is advertised as absent by clearing ID_AA64DFR0_EL1.BRBE
- Future support in the guest requires emulating FEAT_BRBE

- Branch stack sampling of the guest is not supported from the host (TODO)

- Tracing the guest with event->attr.exclude_guest = 0
- There are multiple challenges involved in mixing events with
mismatched branch_sample_type and exclude_guest, and in passing
captured BRBE records on to the intended events during a PMU interrupt

- Guest access to BRBE registers and instructions has been blocked

- BRBE state save is not required for the VHE host (EL2) to guest (EL1) transition

- BRBE state is saved for the NVHE host (EL1) to guest (EL1) transition

-------------------------------- Testing ---------------------------------

- Cross compiled for both arm64 and arm32 platforms
- Passes all branch tests with 'perf test branch' on arm64

-------------------------------- Questions -------------------------------

- Instead of configuring the BRBE HW with the branch_sample_type of the last
event added to the PMU as proposed, could those be merged together,
e.g. all privilege requests ORed, to form a common BRBE configuration so
that all events get branch records after a PMU interrupt?

Changes in V17:

- Added back Reviewed-by tags from Mark Brown
- Updated the commit message regarding the field BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3
- Added leading 0s for all values as BRBIDR0_EL1.NUMREC is an 8 bit field
- Added leading 0s for all values as BRBFCR_EL1.BANK is a 2 bit field
- Reordered BRBCR_EL1/BRBCR_EL12/BRBCR_EL2 registers as per sysreg encodings
- Renamed s/FIRST/BANK_0 and s/SECOND/BANK_1 in BRBFCR_EL1.BANK
- Renamed s/UNCOND_DIRECT/DIRECT_UNCOND in BRBINFx_EL1.TYPE
- Renamed s/COND_DIRECT/DIRECT_COND in BRBINFx_EL1.TYPE
- Dropped __SYS_BRBINF/__SYS_BRBSRC/__SYS_BRBTGT and their expansions
- Moved all existing BRBE registers from sysreg.h header to tools/sysreg format
- Updated the commit message including about sys_insn_descs[]
- Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
- Moved the BRBE instructions into sys_insn_descs[] array
- ARM PMUV3 changes have been moved into the BRBE driver patch instead
- Moved down branch_stack_add() in armpmu_add() after event's basic checks
- Added new callbacks in struct arm_pmu e.g branch_stack_[init|add|del]()
- Renamed struct arm_pmu callback branch_reset() as branch_stack_reset()
- Dropped the comment in armpmu_event_init()
- Renamed 'pmu_hw_events' elements from 'brbe_' to more generic 'branch_'
- Separated out from the BRBE driver implementation patch
- Dropped the comment in __init_el2_brbe()
- Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
- Updated __init_el2_brbe() with __check_hvhe() constructs
- Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control
- Dropped Documentation/arch/arm64/brbe.rst
- Renamed armv8pmu_branch_reset() as armv8pmu_branch_stack_reset()
- Separated out booting.rst and EL2 boot requirements into a new patch
- Dropped process_branch_aborts() completely
- Added a warning if transaction states get detected unexpectedly
- Dropped enum brbe_bank_idx from the driver
- Defined armv8pmu_branch_stack_init/add/del() callbacks in the driver
- Changed BRBE driver to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
- Dropped isb() call sites in __debug_[save|restore]_brbe()
- Changed to [read|write]_sysreg_el1() accessors in __debug_[save|restore]_brbe()

Changes in V16:

https://lore.kernel.org/all/[email protected]/

- Updated BRBINFx_EL1.TYPE = 0b110000 as field IMPDEF_TRAP_EL3
- Updated BRBCR_ELx[9] as field FZPSS
- Updated BRBINFINJ_EL1 to use sysreg field BRBINFx_EL1
- Added BRB_INF_SRC_TGT_EL1 macro for corresponding BRB_[INF|SRC|TGT] expansion
- Renamed arm_brbe.h as arm_pmuv3_branch.h
- Updated perf_sample_save_brstack()'s new argument requirements with NULL
- Fixed typo (s/informations/information) in Documentation/arch/arm64/brbe.rst
- Added SPDX-License-Identifier in Documentation/arch/arm64/brbe.rst
- Added new PERF_SAMPLE_BRANCH_COUNTERS into BRBE_EXCLUDE_BRANCH_FILTERS
- Dropped BRBFCR_EL1 and BRBCR_EL1 from enum vcpu_sysreg
- Reverted back the KVM NVHE patch - use host_debug_state based 'brbcr_el1'
element and dropped the previous dependency on James' coresight series

Changes in V15:

https://lore.kernel.org/all/[email protected]/

- Added a comment for armv8pmu_branch_probe() regarding single cpu probe
- Added a text in brbe.rst regarding single cpu probe
- Dropped runtime BRBE enable for setting DEBUG_STATE_SAVE_BRBE
- Dropped zero_branch_stack based zero branch records mechanism
- Replaced BRBFCR_EL1_DEFAULT_CONFIG with BRBFCR_EL1_CONFIG_MASK
- Added BRBFCR_EL1_CONFIG_MASK masking in branch_type_to_brbfcr()
- Moved BRBE helpers from arm_brbe.h into arm_brbe.c
- Moved armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
- Moved armv8_pmu_xxx() stub definitions inside arm_brbe.h for arm32 (!CONFIG_ARM64_BRBE)
- Included arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
- Dropped BRBE custom pr_fmt()
- Dropped CONFIG_PERF_EVENTS wrapping from header entries
- Flush branch records when a cpu bound event follows a task bound event
- Dropped BRBFCR_EL1 from __debug_save_brbe()/__debug_restore_brbe()
- Always save the live SYS_BRBCR_EL1 in host context and then check if
BRBE was enabled before resetting SYS_BRBCR_EL1 for the host

Changes in V14:

https://lore.kernel.org/all/[email protected]/

- This series has been reorganised as suggested during V13
- There are just eight patches now i.e 5 enablement and 3 perf branch tests

- Fixed brackets problem in __SYS_BRBINFO/BRBSRC/BRBTGT() macros
- Renamed the macro i.e s/__SYS_BRBINFO/__SYS_BRBINF/
- Renamed s/BRB_IALL/BRB_IALL_INSN and s/BRBE_INJ/BRB_INJ_INSN
- Moved BRB_IALL_INSN and SYS_BRB_INSN instructions to sysreg patch
- Changed E1BRE as ExBRE in sysreg fields inside BRBCR_ELx
- Used BRBCR_ELx for defining all BRBCR_EL1, BRBCR_EL2, and BRBCR_EL12 (new)

- Folded the following patches into a single patch i.e [PATCH 3/8]

drivers: perf: arm_pmu: Add new sched_task() callback
arm64/perf: Add branch stack support in struct arm_pmu
arm64/perf: Add branch stack support in struct pmu_hw_events
arm64/perf: Add branch stack support in ARMV8 PMU
arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()

- All armv8pmu_branch_xxxx() stub definitions have been moved inside
include/linux/perf/arm_pmuv3.h for easy access from both arm32 and arm64
- Added brbe_users, brbe_context and brbe_sample_type in struct pmu_hw_events
- Added comments for all the above new elements in struct pmu_hw_events
- Added branch_reset() and sched_task() callbacks
- Changed and optimized branch records processing during a PMU IRQ
- NO branch records get captured for event with mismatched brbe_sample_type
- Branch record context is tracked from armpmu_del() & armpmu_add()
- Branch record hardware is driven from armv8pmu_start() & armv8pmu_stop()
- Dropped NULL check for 'pmu_ctx' inside armv8pmu_sched_task()
- Moved down PERF_ATTACH_TASK_DATA assignment with a preceding comment
- In conflicting branch sample type requests, first event takes precedence

- Folded the following patches from V13 into a single patch i.e
[PATCH 4/8]

arm64/perf: Enable branch stack events via FEAT_BRBE
arm64/perf: Add struct brbe_regset helper functions
arm64/perf: Implement branch records save on task sched out
arm64/perf: Implement branch records save on PMU IRQ

- Fixed the year in copyright statement
- Added Documentation/arch/arm64/brbe.rst
- Updated Documentation/arch/arm64/booting.rst (BRBCR_EL2.CC for EL1 entry)
- Added __init_el2_brbe() which enables branch record cycle count support
- Disabled EL2 traps in __init_el2_fgt() while accessing BRBE registers and
executing instructions
- Changed CONFIG_ARM64_BRBE user visible description
- Fixed a typo in CONFIG_ARM64_BRBE config option description text
- Added BUILD_BUG_ON() co-relating BRBE_BANK_MAX_ENTRIES and MAX_BRANCH_RECORDS
- Dropped arm64_create_brbe_task_ctx_kmem_cache()
- Moved down comment for PERF_SAMPLE_BRANCH_KERNEL in branch_type_to_brbcr()
- Renamed BRBCR_ELx_DEFAULT_CONFIG as BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_DEFAULT_TS with BRBCR_ELx_TS_MASK in BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_E1BRE instances with BRBCR_ELx_ExBRE

- Added BRBE specific branch stack sampling perf test patches into the series
- Added a patch to prevent guest accesses into BRBE registers and instructions
- Added a patch to save the BRBE host context in NVHE environment
- Updated most commit messages

Changes in V13:

https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/

- Added branch callback stubs for aarch32 pmuv3 based platforms
- Updated the comments for capture_brbe_regset()
- Deleted the comments in __read_brbe_regset()
- Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Changes in V12:

https://lore.kernel.org/all/[email protected]/

- Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
- Replaced branch types with complete INSN/ALIGN prefixes/suffixes
- Replaced return branch types as simple RET/ERET
- Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
- Added 0 padding for BRBIDR0_EL1.NUMREC enum values
- Dropped helper arm_pmu_branch_stack_supported()
- Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
- Separated perf_task_ctx_cache setup from arm_pmu private allocation
- Collected changes to branch_records_alloc() in a single patch [5/10]
- Reworked and cleaned up branch_records_alloc()
- Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
- Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
- Updated the comment in branch_type_to_brbcr()
- Fixed the comment before stitch_stored_live_entries()
- Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
- Factored out helper __read_brbe_regset() from capture_brbe_regset()
- Dropped the helper copy_brbe_regset()
- Simplified stitch_stored_live_entries() with memcpy(), memmove()
- Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
- Rework brbe_attributes_probe() without 'struct brbe_hw_attr'
- Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
- Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
- Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead

Changes in V11:

https://lore.kernel.org/all/[email protected]/

- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data

Changes in V10:

https://lore.kernel.org/all/[email protected]/

- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()

Changes in V9:

https://lore.kernel.org/all/[email protected]/

- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism

Changes in V8:

https://lore.kernel.org/all/[email protected]/

- Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()
- Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call

Changes in V7:

https://lore.kernel.org/all/[email protected]/

- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganize PERF_SAMPLE_BRANCH_NO_[CYCLES|NO_FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()

Changes in V6:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformated armv8pmu_branch_valid() warnings while rejecting unsupported events

Changes in V5:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/[email protected]/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/[email protected]/

- Moved brbe_stack from the stack and now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branch records are now captured
only if the event has PERF_SAMPLE_BRANCH_KERNEL, which would already
have been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/[email protected]/

- Dropped branch sample filter helpers consolidation patch from this series
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/[email protected]/

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: James Clark <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Anshuman Khandual (6):
arm64/sysreg: Add BRBE registers and fields
KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
drivers: perf: arm_pmu: Add infrastructure for branch stack sampling
arm64/boot: Enable EL2 requirements for BRBE
drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
KVM: arm64: nvhe: Disable branch generation in nVHE guests

James Clark (3):
perf: test: Speed up running brstack test on an Arm model
perf: test: Remove empty lines from branch filter test output
perf: test: Extend branch stack sampling test for Arm64 BRBE

Documentation/arch/arm64/booting.rst | 26 +
arch/arm64/include/asm/el2_setup.h | 90 ++-
arch/arm64/include/asm/kvm_host.h | 5 +-
arch/arm64/include/asm/sysreg.h | 17 +-
arch/arm64/kvm/debug.c | 5 +
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 31 +
arch/arm64/kvm/sys_regs.c | 56 ++
arch/arm64/tools/sysreg | 131 ++++
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 968 +++++++++++++++++++++++++
drivers/perf/arm_pmu.c | 25 +-
drivers/perf/arm_pmuv3.c | 146 +++-
drivers/perf/arm_pmuv3_branch.h | 73 ++
include/linux/perf/arm_pmu.h | 37 +-
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/test_brstack.sh | 57 +-
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 +
tools/perf/tests/workloads/traploop.c | 39 +
20 files changed, 1696 insertions(+), 26 deletions(-)
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_pmuv3_branch.h
create mode 100644 tools/perf/tests/workloads/traploop.c

--
2.25.1



2024-04-05 02:47:19

by Anshuman Khandual

Subject: [PATCH V17 1/9] arm64/sysreg: Add BRBE registers and fields

This adds BRBE related register definitions and various other related field
macros there in. These will be used subsequently in a BRBE driver, which is
being added later on. While here, this drops redundant register definitions
from the header i.e (arch/arm64/include/asm/sysreg.h).

BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3 register field value has been derived from
latest ARM DDI 0601 ID121123, AKA 2023-12 instead of latest ARM ARM i.e ARM
DDI 0487J.a. Please find the definition here.

https://developer.arm.com/documentation/ddi0601/2023-12/

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Mark Brown <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- Added back Reviewed-by tags from Mark Brown
- Updated the commit message regarding the field BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3
- Added leading 0s for all values as BRBIDR0_EL1.NUMREC is an 8 bit field
- Added leading 0s for all values as BRBFCR_EL1.BANK is a 2 bit field
- Reordered BRBCR_EL1/BRBCR_EL12/BRBCR_EL2 registers as per sysreg encodings
- Renamed s/FIRST/BANK_0 and s/SECOND/BANK_1 in BRBFCR_EL1.BANK
- Renamed s/UNCOND_DIRECT/DIRECT_UNCOND in BRBINFx_EL1.TYPE
- Renamed s/COND_DIRECT/DIRECT_COND in BRBINFx_EL1.TYPE
- Dropped __SYS_BRBINF/__SYS_BRBSRC/__SYS_BRBTGT and their expansions
- Moved all existing BRBE registers from sysreg.h header to tools/sysreg format

arch/arm64/include/asm/sysreg.h | 17 ++---
arch/arm64/tools/sysreg | 131 ++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 9e8999592f3a..6db64ce5b12b 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -195,16 +195,8 @@
#define SYS_DBGVCR32_EL2 sys_reg(2, 4, 0, 7, 0)

#define SYS_BRBINF_EL1(n) sys_reg(2, 1, 8, (n & 15), (((n & 16) >> 2) | 0))
-#define SYS_BRBINFINJ_EL1 sys_reg(2, 1, 9, 1, 0)
#define SYS_BRBSRC_EL1(n) sys_reg(2, 1, 8, (n & 15), (((n & 16) >> 2) | 1))
-#define SYS_BRBSRCINJ_EL1 sys_reg(2, 1, 9, 1, 1)
#define SYS_BRBTGT_EL1(n) sys_reg(2, 1, 8, (n & 15), (((n & 16) >> 2) | 2))
-#define SYS_BRBTGTINJ_EL1 sys_reg(2, 1, 9, 1, 2)
-#define SYS_BRBTS_EL1 sys_reg(2, 1, 9, 0, 2)
-
-#define SYS_BRBCR_EL1 sys_reg(2, 1, 9, 0, 0)
-#define SYS_BRBFCR_EL1 sys_reg(2, 1, 9, 0, 1)
-#define SYS_BRBIDR0_EL1 sys_reg(2, 1, 9, 2, 0)

#define SYS_TRCITECR_EL1 sys_reg(3, 0, 1, 2, 3)
#define SYS_TRCACATR(m) sys_reg(2, 1, 2, ((m & 7) << 1), (2 | (m >> 3)))
@@ -270,8 +262,6 @@
/* ETM */
#define SYS_TRCOSLAR sys_reg(2, 1, 1, 0, 4)

-#define SYS_BRBCR_EL2 sys_reg(2, 4, 9, 0, 0)
-
#define SYS_MIDR_EL1 sys_reg(3, 0, 0, 0, 0)
#define SYS_MPIDR_EL1 sys_reg(3, 0, 0, 0, 5)
#define SYS_REVIDR_EL1 sys_reg(3, 0, 0, 0, 6)
@@ -601,7 +591,6 @@
#define SYS_CNTHV_CVAL_EL2 sys_reg(3, 4, 14, 3, 2)

/* VHE encodings for architectural EL0/1 system registers */
-#define SYS_BRBCR_EL12 sys_reg(2, 5, 9, 0, 0)
#define SYS_SCTLR_EL12 sys_reg(3, 5, 1, 0, 0)
#define SYS_CPACR_EL12 sys_reg(3, 5, 1, 0, 2)
#define SYS_SCTLR2_EL12 sys_reg(3, 5, 1, 0, 3)
@@ -794,6 +783,12 @@
#define OP_COSP_RCTX sys_insn(1, 3, 7, 3, 6)
#define OP_CPP_RCTX sys_insn(1, 3, 7, 3, 7)

+/*
+ * BRBE Instructions
+ */
+#define BRB_IALL_INSN __emit_inst(0xd5000000 | OP_BRB_IALL | (0x1f))
+#define BRB_INJ_INSN __emit_inst(0xd5000000 | OP_BRB_INJ | (0x1f))
+
/* Common SCTLR_ELx flags. */
#define SCTLR_ELx_ENTP2 (BIT(60))
#define SCTLR_ELx_DSSBS (BIT(44))
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index a4c1dd4741a4..e5bf99234c74 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1025,6 +1025,137 @@ UnsignedEnum 3:0 MTEPERM
EndEnum
EndSysreg

+
+SysregFields BRBINFx_EL1
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 DIRECT_UNCOND
+ 0b000001 INDIRECT
+ 0b000010 DIRECT_LINK
+ 0b000011 INDIRECT_LINK
+ 0b000101 RET
+ 0b000111 ERET
+ 0b001000 DIRECT_COND
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INSN_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALIGN_FAULT
+ 0b101011 INSN_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b110000 IMPDEF_TRAP_EL3
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysregFields
+
+SysregFields BRBCR_ELx
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:10
+Field 9 FZPSS
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GUEST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 ExBRE
+Field 0 E0BRE
+EndSysregFields
+
+Sysreg BRBCR_EL1 2 1 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
+Sysreg BRBFCR_EL1 2 1 9 0 1
+Res0 63:30
+Enum 29:28 BANK
+ 0b00 BANK_0
+ 0b01 BANK_1
+EndEnum
+Res0 27:23
+Field 22 CONDDIR
+Field 21 DIRCALL
+Field 20 INDCALL
+Field 19 RTN
+Field 18 INDIRECT
+Field 17 DIRECT
+Field 16 EnI
+Res0 15:8
+Field 7 PAUSED
+Field 6 LASTFAILED
+Res0 5:0
+EndSysreg
+
+Sysreg BRBTS_EL1 2 1 9 0 2
+Field 63:0 TS
+EndSysreg
+
+Sysreg BRBINFINJ_EL1 2 1 9 1 0
+Fields BRBINFx_EL1
+EndSysreg
+
+Sysreg BRBSRCINJ_EL1 2 1 9 1 1
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBTGTINJ_EL1 2 1 9 1 2
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBIDR0_EL1 2 1 9 2 0
+Res0 63:16
+Enum 15:12 CC
+ 0b101 20_BIT
+EndEnum
+Enum 11:8 FORMAT
+ 0b0 0
+EndEnum
+Enum 7:0 NUMREC
+ 0b00001000 8
+ 0b00010000 16
+ 0b00100000 32
+ 0b01000000 64
+EndEnum
+EndSysreg
+
+Sysreg BRBCR_EL2 2 4 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
+Sysreg BRBCR_EL12 2 5 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
Sysreg ID_AA64ZFR0_EL1 3 0 0 4 4
Res0 63:60
UnsignedEnum 59:56 F64MM
--
2.25.1


2024-04-05 02:48:17

by Anshuman Khandual

Subject: [PATCH V17 2/9] KVM: arm64: Explicitly handle BRBE traps as UNDEFINED

The Branch Record Buffer Extension (BRBE) adds a number of system registers
and instructions, which we don't currently intend to expose to guests. Our
existing logic handles this safely, but this could be improved with some
explicit handling of BRBE.

The presence of BRBE is currently hidden from guests as the cpufeature
code's ftr_id_aa64dfr0[] table doesn't have an entry for the BRBE field,
and so this will be zero in the sanitised value of ID_AA64DFR0 exposed to
guests via read_sanitised_id_aa64dfr0_el1(). As the ftr_id_aa64dfr0[] table
may gain an entry for the BRBE field in future, for robustness we should
explicitly mask out the BRBE field in read_sanitised_id_aa64dfr0_el1().

The BRBE system registers and instructions are currently trapped by the
existing configuration of the fine-grained traps. As the registers are not
described in the sys_reg_descs[] table and the instructions are not described
in the sys_insn_descs[] table, emulate_sys_reg() will warn that these are
unknown before injecting an UNDEFINED exception into the guest.
Well-behaved guests shouldn't try to use the registers or instructions, but
badly-behaved guests could use these, resulting in unnecessary warnings. To
avoid those warnings, we should explicitly handle the BRBE registers and
instructions as UNDEFINED.

Address the above by having read_sanitised_id_aa64dfr0_el1() mask out the
ID_AA64DFR0.BRBE field, by adding sys_reg_descs[] entries for all of the BRBE
system registers, and by adding sys_insn_descs[] entries for all of the BRBE
instructions, treating them all as UNDEFINED.

Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: James Morse <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- Updated the commit message including about sys_insn_descs[]
- Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
- Moved the BRBE instructions into sys_insn_descs[] array

arch/arm64/kvm/sys_regs.c | 56 +++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c9f4f387155f..3981aa32c5a3 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1304,6 +1304,11 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
return 0;
}

+#define BRB_INF_SRC_TGT_EL1(n) \
+ { SYS_DESC(SYS_BRBINF_EL1(n)), undef_access }, \
+ { SYS_DESC(SYS_BRBSRC_EL1(n)), undef_access }, \
+ { SYS_DESC(SYS_BRBTGT_EL1(n)), undef_access } \
+
/* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
#define DBG_BCR_BVR_WCR_WVR_EL1(n) \
{ SYS_DESC(SYS_DBGBVRn_EL1(n)), \
@@ -1708,6 +1713,9 @@ static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
/* Hide SPE from guests */
val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;

+ /* Hide BRBE from guests */
+ val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
+
return val;
}

@@ -2226,6 +2234,52 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_DBGCLAIMCLR_EL1), trap_raz_wi },
{ SYS_DESC(SYS_DBGAUTHSTATUS_EL1), trap_dbgauthstatus_el1 },

+ /*
+ * BRBE branch record sysreg address space is interleaved between
+ * corresponding BRBINF<N>_EL1, BRBSRC<N>_EL1, and BRBTGT<N>_EL1.
+ */
+ BRB_INF_SRC_TGT_EL1(0),
+ BRB_INF_SRC_TGT_EL1(16),
+ BRB_INF_SRC_TGT_EL1(1),
+ BRB_INF_SRC_TGT_EL1(17),
+ BRB_INF_SRC_TGT_EL1(2),
+ BRB_INF_SRC_TGT_EL1(18),
+ BRB_INF_SRC_TGT_EL1(3),
+ BRB_INF_SRC_TGT_EL1(19),
+ BRB_INF_SRC_TGT_EL1(4),
+ BRB_INF_SRC_TGT_EL1(20),
+ BRB_INF_SRC_TGT_EL1(5),
+ BRB_INF_SRC_TGT_EL1(21),
+ BRB_INF_SRC_TGT_EL1(6),
+ BRB_INF_SRC_TGT_EL1(22),
+ BRB_INF_SRC_TGT_EL1(7),
+ BRB_INF_SRC_TGT_EL1(23),
+ BRB_INF_SRC_TGT_EL1(8),
+ BRB_INF_SRC_TGT_EL1(24),
+ BRB_INF_SRC_TGT_EL1(9),
+ BRB_INF_SRC_TGT_EL1(25),
+ BRB_INF_SRC_TGT_EL1(10),
+ BRB_INF_SRC_TGT_EL1(26),
+ BRB_INF_SRC_TGT_EL1(11),
+ BRB_INF_SRC_TGT_EL1(27),
+ BRB_INF_SRC_TGT_EL1(12),
+ BRB_INF_SRC_TGT_EL1(28),
+ BRB_INF_SRC_TGT_EL1(13),
+ BRB_INF_SRC_TGT_EL1(29),
+ BRB_INF_SRC_TGT_EL1(14),
+ BRB_INF_SRC_TGT_EL1(30),
+ BRB_INF_SRC_TGT_EL1(15),
+ BRB_INF_SRC_TGT_EL1(31),
+
+ /* Remaining BRBE sysreg address space */
+ { SYS_DESC(SYS_BRBCR_EL1), undef_access },
+ { SYS_DESC(SYS_BRBFCR_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTS_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINFINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRCINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGTINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBIDR0_EL1), undef_access },
+
{ SYS_DESC(SYS_MDCCSR_EL0), trap_raz_wi },
{ SYS_DESC(SYS_DBGDTR_EL0), trap_raz_wi },
// DBGDTR[TR]X_EL0 share the same encoding
@@ -2738,6 +2792,8 @@ static struct sys_reg_desc sys_insn_descs[] = {
{ SYS_DESC(SYS_DC_CISW), access_dcsw },
{ SYS_DESC(SYS_DC_CIGSW), access_dcgsw },
{ SYS_DESC(SYS_DC_CIGDSW), access_dcgsw },
+ { SYS_DESC(OP_BRB_IALL), undef_access },
+ { SYS_DESC(OP_BRB_INJ), undef_access },
};

static const struct sys_reg_desc *first_idreg;
--
2.25.1


2024-04-05 02:48:37

by Anshuman Khandual

Subject: [PATCH V17 4/9] arm64/boot: Enable EL2 requirements for BRBE

Fine grained trap controls for BRBE register and instruction accesses need
to be configured in the HDFGRTR_EL2, HDFGWTR_EL2 and HFGITR_EL2 registers
when the kernel enters at EL1 but EL2 is present. This changes
__init_el2_fgt() as required.

Similarly, cycle and mis-prediction capture need to be enabled in BRBCR_EL1
and BRBCR_EL2 when the kernel enters either EL1 or EL2. This adds a new
__init_el2_brbe() to achieve this objective.

This also updates Documentation/arch/arm64/booting.rst with all the above
EL2 requirements along with the MDCR_EL3.SBRBE requirement.

First, this replaces an existing hard coded value (1 << 62) with the
corresponding macro HDFGRTR_EL2_nPMSNEVFR_EL1_MASK.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- New patch added in the series
- Separated out from the BRBE driver implementation patch
- Dropped the comment in __init_el2_brbe()
- Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
- Updated __init_el2_brbe() with __check_hvhe() constructs
- Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control

Documentation/arch/arm64/booting.rst | 26 ++++++++
arch/arm64/include/asm/el2_setup.h | 90 +++++++++++++++++++++++++++-
2 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
index b57776a68f15..512210da7dd2 100644
--- a/Documentation/arch/arm64/booting.rst
+++ b/Documentation/arch/arm64/booting.rst
@@ -349,6 +349,32 @@ Before jumping into the kernel, the following conditions must be met:

- HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.

+ For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
+
+ - If the kernel is entered at EL2 and EL1 is present:
+
+ - BRBCR_EL1.CC (bit 3) must be initialised to 0b1.
+ - BRBCR_EL1.MPRED (bit 4) must be initialised to 0b1.
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
+ - BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.
+
+ - HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
+ - HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
+ - HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1.
+
+ - HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
+ - HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
+
+ - HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
+ - HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1.
+
+ - If EL3 is present:
+
+ - MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b11.
+
For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):

- If EL3 is present:
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index b7afaa026842..7c12a8e658d4 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -154,6 +154,41 @@
.Lskip_set_cptr_\@:
.endm

+#ifdef CONFIG_ARM64_BRBE
+/*
+ * Enable BRBE cycle count and miss-prediction
+ *
+ * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
+ * for the cycle counts to be available in BRBINF<N>_EL1.CC during
+ * branch record processing after a PMU interrupt. This enables CC
+ * field on both these registers while still executing inside EL2.
+ *
+ * BRBE driver would still be able to toggle branch records cycle
+ * count support via BRBCR_EL1.CC field regardless of whether the
+ * kernel ends up executing in EL1 or EL2.
+ *
+ * The same principle applies for branch record mis-prediction info
+ * as well, thus requiring MPRED field to be set on both BRBCR_EL1
+ * and BRBCR_EL2 while still executing inside EL2.
+ */
+.macro __init_el2_brbe
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_\@
+
+ mov_q x0, BRBCR_ELx_CC | BRBCR_ELx_MPRED
+ msr_s SYS_BRBCR_EL2, x0
+
+ __check_hvhe .Lset_brbe_nvhe_\@, x1
+ msr_s SYS_BRBCR_EL12, x0 // VHE
+ b .Lskip_brbe_\@
+
+.Lset_brbe_nvhe_\@:
+ msr_s SYS_BRBCR_EL1, x0 // NVHE
+.Lskip_brbe_\@:
+.endm
+#endif /* CONFIG_ARM64_BRBE */
+
/* Disable any fine grained traps */
.macro __init_el2_fgt
mrs x1, id_aa64mmfr0_el1
@@ -161,16 +196,48 @@
cbz x1, .Lskip_fgt_\@

mov x0, xzr
+ mov x2, xzr
mrs x1, id_aa64dfr0_el1
ubfx x1, x1, #ID_AA64DFR0_EL1_PMSVer_SHIFT, #4
cmp x1, #3
b.lt .Lset_debug_fgt_\@
+
/* Disable PMSNEVFR_EL1 read and write traps */
- orr x0, x0, #(1 << 62)
+ orr x0, x0, #HDFGRTR_EL2_nPMSNEVFR_EL1_MASK
+ orr x2, x2, #HDFGWTR_EL2_nPMSNEVFR_EL1_MASK

.Lset_debug_fgt_\@:
+#ifdef CONFIG_ARM64_BRBE
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_reg_fgt_\@
+
+ /*
+ * Disable read traps for the following registers
+ *
+ * [BRBSRC|BRBTGT|BRBINF]_EL1
+ * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
+ */
+ orr x0, x0, #HDFGRTR_EL2_nBRBDATA_MASK
+
+ /*
+ * Disable write traps for the following registers
+ *
+ * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
+ */
+ orr x2, x2, #HDFGWTR_EL2_nBRBDATA_MASK
+
+ /* Disable read and write traps for [BRBCR|BRBFCR]_EL1 */
+ orr x0, x0, #HDFGRTR_EL2_nBRBCTL_MASK
+ orr x2, x2, #HDFGWTR_EL2_nBRBCTL_MASK
+
+ /* Disable read traps for BRBIDR_EL1 */
+ orr x0, x0, #HDFGRTR_EL2_nBRBIDR_MASK
+
+.Lskip_brbe_reg_fgt_\@:
+#endif /* CONFIG_ARM64_BRBE */
msr_s SYS_HDFGRTR_EL2, x0
- msr_s SYS_HDFGWTR_EL2, x0
+ msr_s SYS_HDFGWTR_EL2, x2

mov x0, xzr
mrs x1, id_aa64pfr1_el1
@@ -193,7 +260,21 @@
.Lset_fgt_\@:
msr_s SYS_HFGRTR_EL2, x0
msr_s SYS_HFGWTR_EL2, x0
- msr_s SYS_HFGITR_EL2, xzr
+ mov x0, xzr
+#ifdef CONFIG_ARM64_BRBE
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_insn_fgt_\@
+
+ /* Disable traps for BRBIALL instruction */
+ orr x0, x0, #HFGITR_EL2_nBRBIALL_MASK
+
+ /* Disable traps for BRBINJ instruction */
+ orr x0, x0, #HFGITR_EL2_nBRBINJ_MASK
+
+.Lskip_brbe_insn_fgt_\@:
+#endif /* CONFIG_ARM64_BRBE */
+ msr_s SYS_HFGITR_EL2, x0

mrs x1, id_aa64pfr0_el1 // AMU traps UNDEF without AMU
ubfx x1, x1, #ID_AA64PFR0_EL1_AMU_SHIFT, #4
@@ -228,6 +309,9 @@
__init_el2_nvhe_idregs
__init_el2_cptr
__init_el2_fgt
+#ifdef CONFIG_ARM64_BRBE
+ __init_el2_brbe
+#endif
.endm

#ifndef __KVM_NVHE_HYPERVISOR__
--
2.25.1


2024-04-05 02:48:55

by Anshuman Khandual

Subject: [PATCH V17 5/9] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE

This extends the recently added branch stack sampling framework in the ARMV8
PMU to enable such events via a new architecture feature called Branch Record
Buffer Extension aka BRBE. This implements all the armv8pmu_branch_xxx()
callbacks expected at the ARMV8 PMU level, required to drive perf branch
stack sampling events. This adds a new config option CONFIG_ARM64_BRBE to
encapsulate this BRBE based implementation, available only on ARM64 platforms.

BRBE hardware captures a branch record via three distinct system registers
representing the branch source address, branch target address, and other
branch information. A BRBE buffer implementation is organized as multiple
banks of 32 branch records each, where each record is a collection of the
BRBSRC_EL1, BRBTGT_EL1 and BRBINF_EL1 registers. The total number of BRBE
record entries, i.e. BRBE_MAX_ENTRIES, cannot exceed MAX_BRANCH_RECORDS as
defined for the ARM PMU.

Branch stack sampling is enabled and disabled along with regular PMU events.
This adds the required function callbacks in armv8pmu_branch_xxx() format to
drive the PMU branch stack hardware when supported. This also adds fallback
stub definitions for these callbacks for PMUs which do not have the required
support.

BRBE hardware attributes get captured in a new reg_brbidr element in struct
arm_pmu during armv8pmu_branch_probe(), which is called from the broader
probing function __armv8pmu_probe_pmu(). Attributes such as the number of
branch record entries implemented in the hardware can be derived from
armpmu->reg_brbidr.

BRBE gets enabled via armv8pmu_branch_enable(), which also derives the branch
filter and additional requirements from the event's 'attr.branch_sample_type'
and configures them via the BRBFCR_EL1 and BRBCR_EL1 registers.

A PMU event overflow triggers an IRQ, where the current branch records get
captured, stitched together with older records available in 'task_ctx', and
then processed for the core perf ring buffer. Task context switch outs
incrementally save the current branch records into the event's
'pmu_ctx->task_ctx_data' to optimize the workload's branch record samples.

In case multiple events with different branch sample type requests converge
on the same PMU, BRBE gets configured with the branch filters of the last
event's branch sample type. No branch records will be captured or processed
for an event if the BRBE hardware config does not match its branch sample
type while handling the PMU IRQ.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- Dropped Documentation/arch/arm64/brbe.rst
- Renamed armv8pmu_branch_reset() as armv8pmu_branch_stack_reset()
- Separated out booting.rst and EL2 boot requirements into a new patch
- Moved ARM PMUV3 changes into this patch
- Dropped process_branch_aborts() completely
- Added a warning if transaction states get detected unexpectedly
- Dropped enum brbe_bank_idx from the driver
- Defined armv8pmu_branch_stack_init/add/del() callbacks in the driver
- Changed BRBE driver to use existing SYS_BRBSRC/TGT/INF_EL1(n) format

drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 968 ++++++++++++++++++++++++++++++++
drivers/perf/arm_pmuv3.c | 146 ++++-
drivers/perf/arm_pmuv3_branch.h | 73 +++
include/linux/perf/arm_pmu.h | 5 +
6 files changed, 1203 insertions(+), 1 deletion(-)
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_pmuv3_branch.h

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 7526a9e714fa..a8ce723642f0 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -204,6 +204,17 @@ config ARM_SPE_PMU
Extension, which provides periodic sampling of operations in
the CPU pipeline and reports this via the perf AUX interface.

+config ARM64_BRBE
+ bool "Enable support for branch stack sampling using FEAT_BRBE"
+ depends on PERF_EVENTS && ARM64 && ARM_PMU
+ default y
+ help
+ Enable perf support for Branch Record Buffer Extension (BRBE) which
+ records all branches taken in an execution path. This supports some
+ branch types and privilege based filtering. It captures additional
+ relevant information such as cycle count, misprediction, branch
+ type and branch privilege level.
+
config ARM_DMC620_PMU
tristate "Enable PMU support for the ARM DMC-620 memory controller"
depends on (ARM64 && ACPI) || COMPILE_TEST
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 29b1c28203ef..7f9b2b67fea2 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_STARFIVE_STARLINK_PMU) += starfive_starlink_pmu.o
obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
+obj-$(CONFIG_ARM64_BRBE) += arm_brbe.o
obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
new file mode 100644
index 000000000000..762c643506a1
--- /dev/null
+++ b/drivers/perf/arm_brbe.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Branch Record Buffer Extension Driver.
+ *
+ * Copyright (C) 2022-2023 ARM Limited
+ *
+ * Author: Anshuman Khandual <[email protected]>
+ */
+#include "arm_pmuv3_branch.h"
+
+#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT | \
+ BRBFCR_EL1_INDIRECT | \
+ BRBFCR_EL1_RTN | \
+ BRBFCR_EL1_INDCALL | \
+ BRBFCR_EL1_DIRCALL | \
+ BRBFCR_EL1_CONDDIR)
+
+#define BRBFCR_EL1_CONFIG_MASK (BRBFCR_EL1_BANK_MASK | \
+ BRBFCR_EL1_PAUSED | \
+ BRBFCR_EL1_EnI | \
+ BRBFCR_EL1_BRANCH_FILTERS)
+
+/*
+ * BRBTS_EL1 is currently not used for branch stack implementation
+ * purpose but BRBCR_ELx.TS needs to have a valid value from all
+ * available options. BRBCR_ELx_TS_VIRTUAL is selected for this.
+ */
+#define BRBCR_ELx_DEFAULT_TS FIELD_PREP(BRBCR_ELx_TS_MASK, BRBCR_ELx_TS_VIRTUAL)
+
+#define BRBCR_ELx_CONFIG_MASK (BRBCR_ELx_EXCEPTION | \
+ BRBCR_ELx_ERTN | \
+ BRBCR_ELx_CC | \
+ BRBCR_ELx_MPRED | \
+ BRBCR_ELx_ExBRE | \
+ BRBCR_ELx_E0BRE | \
+ BRBCR_ELx_FZP | \
+ BRBCR_ELx_TS_MASK)
+
+/*
+ * BRBE Buffer Organization
+ *
+ * BRBE buffer is arranged as multiple banks of 32 branch record
+ * entries each. An individual branch record in a given bank could
+ * be accessed, after selecting the bank in BRBFCR_EL1.BANK and
+ * accessing the registers i.e [BRBSRC, BRBTGT, BRBINF] set with
+ * indices [0..31].
+ *
+ * Bank 0
+ *
+ * --------------------------------- ------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 31 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ *
+ * Bank 1
+ *
+ * --------------------------------- ------
+ * | 32 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 33 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ */
+#define BRBE_BANK_MAX_ENTRIES 32
+#define BRBE_MAX_BANK 2
+#define BRBE_MAX_ENTRIES (BRBE_BANK_MAX_ENTRIES * BRBE_MAX_BANK)
+
+#define BRBE_BANK0_IDX_MIN 0
+#define BRBE_BANK0_IDX_MAX 31
+#define BRBE_BANK1_IDX_MIN 32
+#define BRBE_BANK1_IDX_MAX 63
+
+struct brbe_regset {
+ unsigned long brbsrc;
+ unsigned long brbtgt;
+ unsigned long brbinf;
+};
+
+struct arm64_perf_task_context {
+ struct brbe_regset store[BRBE_MAX_ENTRIES];
+ int nr_brbe_records;
+};
+
+struct brbe_hw_attr {
+ int brbe_version;
+ int brbe_cc;
+ int brbe_nr;
+ int brbe_format;
+};
+
+#define RETURN_READ_BRBSRCN(n) \
+ read_sysreg_s(SYS_BRBSRC_EL1(n))
+
+#define RETURN_READ_BRBTGTN(n) \
+ read_sysreg_s(SYS_BRBTGT_EL1(n))
+
+#define RETURN_READ_BRBINFN(n) \
+ read_sysreg_s(SYS_BRBINF_EL1(n))
+
+#define BRBE_REGN_CASE(n, case_macro) \
+ case n: return case_macro(n); break
+
+#define BRBE_REGN_SWITCH(x, case_macro) \
+ do { \
+ switch (x) { \
+ BRBE_REGN_CASE(0, case_macro); \
+ BRBE_REGN_CASE(1, case_macro); \
+ BRBE_REGN_CASE(2, case_macro); \
+ BRBE_REGN_CASE(3, case_macro); \
+ BRBE_REGN_CASE(4, case_macro); \
+ BRBE_REGN_CASE(5, case_macro); \
+ BRBE_REGN_CASE(6, case_macro); \
+ BRBE_REGN_CASE(7, case_macro); \
+ BRBE_REGN_CASE(8, case_macro); \
+ BRBE_REGN_CASE(9, case_macro); \
+ BRBE_REGN_CASE(10, case_macro); \
+ BRBE_REGN_CASE(11, case_macro); \
+ BRBE_REGN_CASE(12, case_macro); \
+ BRBE_REGN_CASE(13, case_macro); \
+ BRBE_REGN_CASE(14, case_macro); \
+ BRBE_REGN_CASE(15, case_macro); \
+ BRBE_REGN_CASE(16, case_macro); \
+ BRBE_REGN_CASE(17, case_macro); \
+ BRBE_REGN_CASE(18, case_macro); \
+ BRBE_REGN_CASE(19, case_macro); \
+ BRBE_REGN_CASE(20, case_macro); \
+ BRBE_REGN_CASE(21, case_macro); \
+ BRBE_REGN_CASE(22, case_macro); \
+ BRBE_REGN_CASE(23, case_macro); \
+ BRBE_REGN_CASE(24, case_macro); \
+ BRBE_REGN_CASE(25, case_macro); \
+ BRBE_REGN_CASE(26, case_macro); \
+ BRBE_REGN_CASE(27, case_macro); \
+ BRBE_REGN_CASE(28, case_macro); \
+ BRBE_REGN_CASE(29, case_macro); \
+ BRBE_REGN_CASE(30, case_macro); \
+ BRBE_REGN_CASE(31, case_macro); \
+ default: \
+ pr_warn("unknown register index\n"); \
+ return -1; \
+ } \
+ } while (0)
+
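+/*
+ * Convert a linear buffer index [0..63] into the per-bank register
+ * index [0..31]. For example, buffer index 40 maps to register index
+ * 8 in bank 1; the bank itself is selected separately via
+ * BRBFCR_EL1.BANK in select_brbe_bank().
+ */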
+static inline int buffer_to_brbe_idx(int buffer_idx)
+{
+ return buffer_idx % BRBE_BANK_MAX_ENTRIES;
+}
+
+static inline u64 get_brbsrc_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBSRCN);
+}
+
+static inline u64 get_brbtgt_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBTGTN);
+}
+
+static inline u64 get_brbinf_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBINFN);
+}
+
+static inline u64 brbe_record_valid(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_VALID_MASK, brbinf);
+}
+
+static inline bool brbe_invalid(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_NONE;
+}
+
+static inline bool brbe_record_is_complete(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_FULL;
+}
+
+static inline bool brbe_record_is_source_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_SOURCE;
+}
+
+static inline bool brbe_record_is_target_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_TARGET;
+}
+
+static inline int brbe_get_in_tx(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_T_MASK, brbinf);
+}
+
+static inline int brbe_get_mispredict(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_MPRED_MASK, brbinf);
+}
+
+static inline int brbe_get_lastfailed(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_LASTFAILED_MASK, brbinf);
+}
+
+static inline int brbe_get_cycles(u64 brbinf)
+{
+ /*
+ * Captured cycle count is unknown and hence
+ * should not be passed on to the user space.
+ */
+ if (brbinf & BRBINFx_EL1_CCU)
+ return 0;
+
+ return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
+}
+
+static inline int brbe_get_type(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_TYPE_MASK, brbinf);
+}
+
+static inline int brbe_get_el(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_EL_MASK, brbinf);
+}
+
+static inline int brbe_get_numrec(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
+}
+
+static inline int brbe_get_format(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_FORMAT_MASK, brbidr);
+}
+
+static inline int brbe_get_cc_bits(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_CC_MASK, brbidr);
+}
+
+void armv8pmu_branch_stack_reset(void)
+{
+ asm volatile(BRB_IALL_INSN);
+ isb();
+}
+
+void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *hw_events)
+{
+ /*
+ * Reset branch records buffer if a new CPU bound event
+ * gets scheduled on a PMU. Otherwise existing branch
+ * records present in the buffer might just leak into
+ * such events.
+ *
+	 * Also reset the current 'hw_events->branch_context' because
+	 * any previous task bound event would now have lost its
+	 * opportunity for continuous branch records.
+ */
+ if (!event->ctx->task) {
+ hw_events->branch_context = NULL;
+ armv8pmu_branch_stack_reset();
+ }
+
+ /*
+ * Reset branch records buffer if a new task event gets
+ * scheduled on a PMU which might have existing records.
+ * Otherwise older branch records present in the buffer
+ * might leak into the new task event.
+ */
+ if (event->ctx->task && hw_events->branch_context != event->ctx) {
+ hw_events->branch_context = event->ctx;
+ armv8pmu_branch_stack_reset();
+ }
+ hw_events->branch_users++;
+ hw_events->branch_sample_type = event->attr.branch_sample_type;
+}
+
+void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
+{
+ WARN_ON_ONCE(!hw_events->branch_users);
+ hw_events->branch_users--;
+ if (!hw_events->branch_users) {
+ hw_events->branch_context = NULL;
+ hw_events->branch_sample_type = 0;
+ }
+}
+
+static bool valid_brbe_nr(int brbe_nr)
+{
+ return brbe_nr == BRBIDR0_EL1_NUMREC_8 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_16 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_32 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_64;
+}
+
+static bool valid_brbe_cc(int brbe_cc)
+{
+ return brbe_cc == BRBIDR0_EL1_CC_20_BIT;
+}
+
+static bool valid_brbe_format(int brbe_format)
+{
+ return brbe_format == BRBIDR0_EL1_FORMAT_0;
+}
+
+static bool valid_brbe_version(int brbe_version)
+{
+ return brbe_version == ID_AA64DFR0_EL1_BRBE_IMP ||
+ brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
+}
+
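+/*
+ * Select the given bank [0..1] so that subsequent BRB[SRC|TGT|INF]
+ * register accesses target records [0..31] or [32..63] respectively.
+ * The trailing ISB ensures the bank switch takes effect before any
+ * subsequent register access.
+ */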
+static void select_brbe_bank(int bank)
+{
+ u64 brbfcr;
+
+ WARN_ON(bank > 1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_BANK_MASK;
+ brbfcr |= SYS_FIELD_PREP(BRBFCR_EL1, BANK, bank);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
+{
+ entry->brbinf = get_brbinf_reg(idx);
+
+ if (brbe_invalid(entry->brbinf))
+ return false;
+
+ entry->brbsrc = get_brbsrc_reg(idx);
+ entry->brbtgt = get_brbtgt_reg(idx);
+ return true;
+}
+
+/*
+ * Read all BRBE entries in HW until the first invalid entry.
+ *
+ * The caller must ensure that the BRBE is not concurrently modifying these
+ * branch entries.
+ */
+static int capture_brbe_regset(struct brbe_regset *buf, int nr_hw_entries)
+{
+ int idx = 0;
+
+ select_brbe_bank(0);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
+ if (!__read_brbe_regset(&buf[idx], idx))
+ return idx;
+ idx++;
+ }
+
+ select_brbe_bank(1);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
+ if (!__read_brbe_regset(&buf[idx], idx))
+ return idx;
+ idx++;
+ }
+ return idx;
+}
+
+/*
+ * This function concatenates branch records from the stored and live
+ * buffers, up to a maximum of nr_max records, with the stored buffer
+ * holding the result. The concatenated buffer contains all the branch
+ * records from the live buffer, followed by as many stored records as
+ * fit while keeping the combined length within 'nr_max'. For example,
+ * with nr_stored = 5, nr_live = 8 and nr_max = 10, the result holds
+ * L0..L7 followed by S0 and S1.
+ *
+ * Stored records Live records
+ * ------------------------------------------------^
+ * | S0 | L0 | Newest |
+ * --------------------------------- |
+ * | S1 | L1 | |
+ * --------------------------------- |
+ * | S2 | L2 | |
+ * --------------------------------- |
+ * | S3 | L3 | |
+ * --------------------------------- |
+ * | S4 | L4 | nr_max
+ * --------------------------------- |
+ * | | L5 | |
+ * --------------------------------- |
+ * | | L6 | |
+ * --------------------------------- |
+ * | | L7 | |
+ * --------------------------------- |
+ * | | | |
+ * --------------------------------- |
+ * | | | Oldest |
+ * ------------------------------------------------V
+ *
+ *
+ * S0 is the newest of the stored records, whereas L7 is the oldest of
+ * the live records. Unless the live buffer was detected as full, thus
+ * potentially dropping some older records, L7 and S0 are contiguous in
+ * time for a user task context. The stitched buffer below represents
+ * the maximum possible number of branch records, contiguous in time.
+ *
+ * Stored records Live records
+ * ------------------------------------------------^
+ * | L0 | L0 | Newest |
+ * --------------------------------- |
+ * | L1 | L1 | |
+ * --------------------------------- |
+ * | L2 | L2 | |
+ * --------------------------------- |
+ * | L3 | L3 | |
+ * --------------------------------- |
+ * | L4 | L4 | nr_max
+ * --------------------------------- |
+ * | L5 | L5 | |
+ * --------------------------------- |
+ * | L6 | L6 | |
+ * --------------------------------- |
+ * | L7 | L7 | |
+ * --------------------------------- |
+ * | S0 | | |
+ * --------------------------------- |
+ * | S1 | | Oldest |
+ * ------------------------------------------------V
+ * | S2 | <----|
+ * ----------------- |
+ * | S3 | <----| Dropped off after nr_max
+ * ----------------- |
+ * | S4 | <----|
+ * -----------------
+ */
+static int stitch_stored_live_entries(struct brbe_regset *stored,
+ struct brbe_regset *live,
+ int nr_stored, int nr_live,
+ int nr_max)
+{
+ int nr_move = min(nr_stored, nr_max - nr_live);
+
+ /* Move the tail of the buffer to make room for the new entries */
+ memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));
+
+ /* Copy the new entries into the head of the buffer */
+ memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));
+
+ /* Return the number of entries in the stitched buffer */
+ return min(nr_live + nr_stored, nr_max);
+}
+
+static int brbe_branch_save(struct brbe_regset *live, int nr_hw_entries)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ int nr_live;
+
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+
+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ return nr_live;
+}
+
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+ struct arm64_perf_task_context *task_ctx = ctx;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
+
+ nr_hw_entries = brbe_get_numrec(arm_pmu->reg_brbidr);
+ nr_live = brbe_branch_save(live, nr_hw_entries);
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ task_ctx->nr_brbe_records = nr_store;
+}
+
+/*
+ * Generic perf branch filters supported on BRBE
+ *
+ * New branch filters need to be evaluated for whether they can be
+ * supported on BRBE. This ensures that such branch filters are not
+ * silently accepted only to fail later. PERF_SAMPLE_BRANCH_HV is a
+ * special case that is only supported on platforms where the kernel
+ * runs in hyp mode.
+ */
+#define BRBE_EXCLUDE_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_ABORT_TX | \
+ PERF_SAMPLE_BRANCH_IN_TX | \
+ PERF_SAMPLE_BRANCH_NO_TX | \
+ PERF_SAMPLE_BRANCH_CALL_STACK | \
+ PERF_SAMPLE_BRANCH_COUNTERS)
+
+#define BRBE_ALLOWED_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_USER | \
+ PERF_SAMPLE_BRANCH_KERNEL | \
+ PERF_SAMPLE_BRANCH_HV | \
+ PERF_SAMPLE_BRANCH_ANY | \
+ PERF_SAMPLE_BRANCH_ANY_CALL | \
+ PERF_SAMPLE_BRANCH_ANY_RETURN | \
+ PERF_SAMPLE_BRANCH_IND_CALL | \
+ PERF_SAMPLE_BRANCH_COND | \
+ PERF_SAMPLE_BRANCH_IND_JUMP | \
+ PERF_SAMPLE_BRANCH_CALL | \
+ PERF_SAMPLE_BRANCH_NO_FLAGS | \
+ PERF_SAMPLE_BRANCH_NO_CYCLES | \
+ PERF_SAMPLE_BRANCH_TYPE_SAVE | \
+ PERF_SAMPLE_BRANCH_HW_INDEX | \
+ PERF_SAMPLE_BRANCH_PRIV_SAVE)
+
+#define BRBE_PERF_BRANCH_FILTERS (BRBE_ALLOWED_BRANCH_FILTERS | \
+ BRBE_EXCLUDE_BRANCH_FILTERS)
+
+bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ u64 branch_type = event->attr.branch_sample_type;
+
+ /*
+ * Ensure both perf branch filter allowed and exclude
+ * masks are always in sync with the generic perf ABI.
+ */
+ BUILD_BUG_ON(BRBE_PERF_BRANCH_FILTERS != (PERF_SAMPLE_BRANCH_MAX - 1));
+
+ if (branch_type & ~BRBE_ALLOWED_BRANCH_FILTERS) {
+ pr_debug_once("requested branch filter not supported 0x%llx\n", branch_type);
+ return false;
+ }
+
+	/*
+	 * If the event does not have at least one of the privilege
+	 * branch filters as in PERF_SAMPLE_BRANCH_PLM_ALL, core perf
+	 * will adjust its value based on the perf event's existing
+	 * privilege level via attr.exclude_[user|kernel|hv].
+	 *
+	 * As event->attr.branch_sample_type might have been changed
+	 * by the time the event reaches here, it is not possible to
+	 * figure out whether the HV privilege was requested originally
+	 * or got added by core perf. Just report this situation once
+	 * and silently ignore any further instances.
+	 */
+ if ((branch_type & PERF_SAMPLE_BRANCH_HV) && !is_kernel_in_hyp_mode())
+ pr_debug_once("hypervisor privilege filter not supported 0x%llx\n", branch_type);
+
+ return true;
+}
+
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
+{
+ size_t size = sizeof(struct arm64_perf_task_context);
+
+ arm_pmu->pmu.task_ctx_cache = kmem_cache_create("arm64_brbe_task_ctx", size, 0, 0, NULL);
+ if (!arm_pmu->pmu.task_ctx_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
+{
+ kmem_cache_destroy(arm_pmu->pmu.task_ctx_cache);
+}
+
+static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
+{
+ u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
+ int brbe_version, brbe_format, brbe_cc, brbe_nr;
+
+ brbe_version = brbe;
+ brbe_format = brbe_get_format(brbidr);
+ brbe_cc = brbe_get_cc_bits(brbidr);
+ brbe_nr = brbe_get_numrec(brbidr);
+ armpmu->reg_brbidr = brbidr;
+
+ if (!valid_brbe_version(brbe_version) ||
+ !valid_brbe_format(brbe_format) ||
+ !valid_brbe_cc(brbe_cc) ||
+ !valid_brbe_nr(brbe_nr))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+void armv8pmu_branch_probe(struct arm_pmu *armpmu)
+{
+ u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+ u32 brbe;
+
+ /*
+ * BRBE implementation's branch entries cannot exceed maximum
+ * branch records supported at the ARM PMU level abstraction.
+ * Otherwise there is always a possibility of array overflow,
+ * while processing BRBE branch records.
+ */
+	BUILD_BUG_ON(BRBE_MAX_ENTRIES > MAX_BRANCH_RECORDS);
+
+ brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+ if (!brbe)
+ return;
+
+ if (brbe_attributes_probe(armpmu, brbe))
+ return;
+
+ armpmu->has_branch_stack = 1;
+}
+
+/*
+ * BRBE supports the following functional branch type filters while
+ * generating branch records. These branch filters can be enabled,
+ * either individually or as a group i.e ORing multiple filters
+ * with each other.
+ *
+ * BRBFCR_EL1_CONDDIR - Conditional direct branch
+ * BRBFCR_EL1_DIRCALL - Direct call
+ * BRBFCR_EL1_INDCALL - Indirect call
+ * BRBFCR_EL1_INDIRECT - Indirect branch
+ * BRBFCR_EL1_DIRECT - Direct branch
+ * BRBFCR_EL1_RTN - Subroutine return
+ */
+static u64 branch_type_to_brbfcr(int branch_type)
+{
+ u64 brbfcr = 0;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
+ return brbfcr;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+ brbfcr |= BRBFCR_EL1_INDCALL;
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbfcr |= BRBFCR_EL1_RTN;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
+ brbfcr |= BRBFCR_EL1_INDCALL;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_COND)
+ brbfcr |= BRBFCR_EL1_CONDDIR;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+ brbfcr |= BRBFCR_EL1_INDIRECT;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_CALL)
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+
+ return brbfcr & BRBFCR_EL1_CONFIG_MASK;
+}
+
+/*
+ * BRBE supports the following privilege mode filters while generating
+ * branch records.
+ *
+ * BRBCR_ELx_E0BRE - EL0 branch records
+ * BRBCR_ELx_ExBRE - EL1/EL2 branch records
+ *
+ * BRBE also supports the following additional functional branch type
+ * filters while generating branch records.
+ *
+ * BRBCR_ELx_EXCEPTION - Exception
+ * BRBCR_ELx_ERTN - Exception return
+ */
+static u64 branch_type_to_brbcr(int branch_type)
+{
+ u64 brbcr = BRBCR_ELx_DEFAULT_TS;
+
+	/*
+	 * BRBE should be paused on PMU interrupt while tracing kernel
+	 * space to stop capturing further branch records. Otherwise
+	 * interrupt handler branch records might get into the samples,
+	 * which is not desired.
+	 *
+	 * BRBE need not be paused on PMU interrupt while tracing only
+	 * the user space, because it will automatically be inside the
+	 * prohibited region. But many cycles might elapse between the
+	 * PMU overflow and the interrupt actually being taken, by which
+	 * time older records could have been overwritten. Hence enable
+	 * the pause on PMU interrupt mechanism even for user only
+	 * traces as well.
+	 */
+ brbcr |= BRBCR_ELx_FZP;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_USER)
+ brbcr |= BRBCR_ELx_E0BRE;
+
+ /*
+	 * When running in hyp mode, writing into BRBCR_EL1
+ * actually writes into BRBCR_EL2 instead. Field E2BRE
+ * is also at the same position as E1BRE.
+ */
+ if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
+ brbcr |= BRBCR_ELx_ExBRE;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_HV) {
+ if (is_kernel_in_hyp_mode())
+ brbcr |= BRBCR_ELx_ExBRE;
+ }
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
+ brbcr |= BRBCR_ELx_CC;
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
+ brbcr |= BRBCR_ELx_MPRED;
+
+	/*
+	 * The exception and exception return branches could be
+	 * captured, irrespective of the perf event's privilege.
+	 * If the perf event does not have enough privilege for
+	 * a given exception level, then addresses which fall
+	 * under that exception level will be reported as zero
+	 * in the captured branch record, creating source only
+	 * or target only records.
+	 */
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbcr |= BRBCR_ELx_EXCEPTION;
+ brbcr |= BRBCR_ELx_ERTN;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+ brbcr |= BRBCR_ELx_EXCEPTION;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbcr |= BRBCR_ELx_ERTN;
+
+ return brbcr & BRBCR_ELx_CONFIG_MASK;
+}
+
+void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
+{
+ struct pmu_hw_events *cpuc = this_cpu_ptr(arm_pmu->hw_events);
+ u64 brbfcr, brbcr;
+
+ if (!(cpuc->branch_sample_type && cpuc->branch_users))
+ return;
+
+	/*
+	 * BRBE gets reconfigured whenever a new branch sample type
+	 * request arrives, overriding any previous branch filters.
+	 */
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_CONFIG_MASK;
+ brbfcr |= branch_type_to_brbfcr(cpuc->branch_sample_type);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbcr &= ~BRBCR_ELx_CONFIG_MASK;
+ brbcr |= branch_type_to_brbcr(cpuc->branch_sample_type);
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ isb();
+}
+
+void armv8pmu_branch_disable(void)
+{
+ u64 brbfcr, brbcr;
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbcr &= ~(BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE);
+ brbfcr |= BRBFCR_EL1_PAUSED;
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static void brbe_set_perf_entry_type(struct perf_branch_entry *entry, u64 brbinf)
+{
+ int brbe_type = brbe_get_type(brbinf);
+
+ switch (brbe_type) {
+ case BRBINFx_EL1_TYPE_DIRECT_UNCOND:
+ entry->type = PERF_BR_UNCOND;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT:
+ entry->type = PERF_BR_IND;
+ break;
+ case BRBINFx_EL1_TYPE_DIRECT_LINK:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT_LINK:
+ entry->type = PERF_BR_IND_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_RET:
+ entry->type = PERF_BR_RET;
+ break;
+ case BRBINFx_EL1_TYPE_DIRECT_COND:
+ entry->type = PERF_BR_COND;
+ break;
+ case BRBINFx_EL1_TYPE_CALL:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_TRAP:
+ entry->type = PERF_BR_SYSCALL;
+ break;
+ case BRBINFx_EL1_TYPE_ERET:
+ entry->type = PERF_BR_ERET;
+ break;
+ case BRBINFx_EL1_TYPE_IRQ:
+ entry->type = PERF_BR_IRQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_HALT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_HALT;
+ break;
+ case BRBINFx_EL1_TYPE_SERROR:
+ entry->type = PERF_BR_SERROR;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_ALIGN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_ALGN;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_FIQ:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_FIQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_EXIT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_EXIT;
+ break;
+ default:
+ pr_warn_once("%d - unknown branch type captured\n", brbe_type);
+ entry->type = PERF_BR_UNKNOWN;
+ break;
+ }
+}
+
+static int brbe_get_perf_priv(u64 brbinf)
+{
+ int brbe_el = brbe_get_el(brbinf);
+
+ switch (brbe_el) {
+ case BRBINFx_EL1_EL_EL0:
+ return PERF_BR_PRIV_USER;
+ case BRBINFx_EL1_EL_EL1:
+ return PERF_BR_PRIV_KERNEL;
+ case BRBINFx_EL1_EL_EL2:
+ if (is_kernel_in_hyp_mode())
+ return PERF_BR_PRIV_KERNEL;
+ return PERF_BR_PRIV_HV;
+ default:
+ pr_warn_once("%d - unknown branch privilege captured\n", brbe_el);
+ return PERF_BR_PRIV_UNKNOWN;
+ }
+}
+
+static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
+ u64 brbinf)
+{
+ if (branch_sample_type(event))
+ brbe_set_perf_entry_type(entry, brbinf);
+
+ if (!branch_sample_no_cycles(event))
+ entry->cycles = brbe_get_cycles(brbinf);
+
+ if (!branch_sample_no_flags(event)) {
+ /*
+ * BRBINFx_EL1.LASTFAILED indicates that a TME transaction failed (or
+ * was cancelled) prior to this record, and some number of records
+ * prior to this one, may have been generated during an attempt to
+ * execute the transaction.
+ */
+ entry->abort = brbe_get_lastfailed(brbinf);
+
+		/*
+		 * This information (i.e transaction state and mispredicts)
+		 * is only available for source only and complete branch
+		 * records.
+		 */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_source_only(brbinf)) {
+ entry->mispred = brbe_get_mispredict(brbinf);
+ entry->predicted = !entry->mispred;
+ entry->in_tx = brbe_get_in_tx(brbinf);
+ }
+
+ /*
+		 * Currently the TME feature is neither implemented in any
+		 * hardware nor supported in the kernel. Just warn here once
+		 * if TME related information shows up rather unexpectedly.
+ */
+ if (entry->abort || entry->in_tx)
+ pr_warn_once("Unknown transaction states %d %d\n",
+ entry->abort, entry->in_tx);
+ }
+
+ if (branch_sample_priv(event)) {
+		/*
+		 * This information (i.e branch privilege level) is only
+		 * available for target only and complete branch records.
+		 */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_target_only(brbinf))
+ entry->priv = brbe_get_perf_priv(brbinf);
+ }
+}
+
+static void brbe_regset_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int idx)
+{
+ struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
+ u64 brbinf = regset[idx].brbinf;
+
+ perf_clear_branch_entry_bitfields(entry);
+ if (brbe_record_is_complete(brbinf)) {
+ entry->from = regset[idx].brbsrc;
+ entry->to = regset[idx].brbtgt;
+ } else if (brbe_record_is_source_only(brbinf)) {
+ entry->from = regset[idx].brbsrc;
+ entry->to = 0;
+ } else if (brbe_record_is_target_only(brbinf)) {
+ entry->from = 0;
+ entry->to = regset[idx].brbtgt;
+ }
+ capture_brbe_flags(entry, event, brbinf);
+}
+
+static void process_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int nr_regset)
+{
+ int idx;
+
+ for (idx = 0; idx < nr_regset; idx++)
+ brbe_regset_branch_entries(cpuc, event, regset, idx);
+
+ cpuc->branches->branch_stack.nr = nr_regset;
+ cpuc->branches->branch_stack.hw_idx = -1ULL;
+}
+
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+ struct arm64_perf_task_context *task_ctx = event->pmu_ctx->task_ctx_data;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
+
+ nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+ if (event->ctx->task) {
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ process_branch_entries(cpuc, event, task_ctx->store, nr_store);
+ task_ctx->nr_brbe_records = 0;
+ } else {
+ process_branch_entries(cpuc, event, live, nr_live);
+ }
+}
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 23fa6c5da82c..6137ae4ba7c3 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -26,6 +26,7 @@
#include <linux/nmi.h>

#include <asm/arm_pmuv3.h>
+#include "arm_pmuv3_branch.h"

/* ARMv8 Cortex-A53 specific event types. */
#define ARMV8_A53_PERFCTR_PREF_LINEFILL 0xC2
@@ -829,14 +830,56 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);

kvm_vcpu_pmu_resync_el0();
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_enable(cpu_pmu);
}

static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
{
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_disable();
+
/* Disable all counters */
armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
}

+static void read_branch_records(struct pmu_hw_events *cpuc,
+ struct perf_event *event,
+ struct perf_sample_data *data,
+ bool *branch_captured)
+{
+ /*
+ * CPU specific branch records buffer must have been allocated already
+ * for the hardware records to be captured and processed further.
+ */
+ if (WARN_ON(!cpuc->branches))
+ return;
+
+	/*
+	 * If the overflowed event's branch_sample_type does not match the
+	 * branch filters configured in the BRBE HW, the captured branch
+	 * records cannot be correlated to the overflowed event. Report to
+	 * the user as if no branch records have been captured, and flush
+	 * the branch records. The same applies when the current task
+	 * context does not match that of the overflowed event.
+	 */
+ if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
+ (event->ctx->task && cpuc->branch_context != event->ctx))
+ return;
+
+	/*
+	 * Read the branch records from the hardware only once after the
+	 * PMU IRQ has been triggered. The same records can then be used
+	 * for other events that might have overflowed simultaneously,
+	 * saving many CPU cycles.
+	 */
+ if (!*branch_captured) {
+ armv8pmu_branch_read(cpuc, event);
+ *branch_captured = true;
+ }
+ perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
+}
+
static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
{
u32 pmovsr;
@@ -844,6 +887,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
struct pt_regs *regs;
int idx;
+ bool branch_captured = false;

/*
* Get and reset the IRQ flags
@@ -887,6 +931,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
if (!armpmu_event_set_period(event))
continue;

+ /*
+ * PMU IRQ should remain asserted until all branch records
+ * are captured and processed into struct perf_sample_data.
+ */
+ if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
+ read_branch_records(cpuc, event, &data, &branch_captured);
+
/*
* Perf event overflow will queue the processing of the event as
* an irq_work which will be taken care of in the handling of
@@ -896,6 +947,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
cpu_pmu->disable(event);
}
armv8pmu_start(cpu_pmu);
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_stack_reset();

return IRQ_HANDLED;
}
@@ -985,6 +1038,40 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
return event->hw.idx;
}

+static bool armv8pmu_branch_stack_init(struct perf_event *event)
+{
+ if (armv8pmu_branch_attr_valid(event)) {
+ /*
+ * If a task gets scheduled out, the current branch records
+	 * get saved in the task's context data, which can later be
+ * used to fill in the records upon an event overflow. Let's
+ * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
+ * all branch stack sampling perf events.
+ */
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+ return true;
+ }
+ return false;
+}
+
+static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+ void *task_ctx = pmu_ctx->task_ctx_data;
+
+ if (armpmu->has_branch_stack) {
+ /* Save branch records in task_ctx on sched out */
+ if (task_ctx && !sched_in) {
+ armv8pmu_branch_save(armpmu, task_ctx);
+ return;
+ }
+
+ /* Reset branch records on sched in */
+ if (sched_in)
+ armv8pmu_branch_stack_reset();
+ }
+}
+
/*
* Add an event filter to a given event.
*/
@@ -1077,6 +1164,9 @@ static void armv8pmu_reset(void *info)
pmcr |= ARMV8_PMU_PMCR_LP;

armv8pmu_pmcr_write(pmcr);
+
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_stack_reset();
}

static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
@@ -1229,6 +1319,41 @@ static void __armv8pmu_probe_pmu(void *info)
cpu_pmu->reg_pmmir = read_pmmir();
else
cpu_pmu->reg_pmmir = 0;
+
+	/*
+	 * BRBE is probed on a single cpu for a
+	 * given PMU. The remaining cpus are assumed
+	 * to have the exact same BRBE implementation.
+	 */
+ armv8pmu_branch_probe(cpu_pmu);
+}
+
+static int branch_records_alloc(struct arm_pmu *armpmu)
+{
+ struct branch_records __percpu *records;
+ int cpu;
+
+ records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
+ if (!records)
+ return -ENOMEM;
+
+	/*
+	 * The percpu memory allocated for 'records' gets completely
+	 * consumed here and never needs to be freed up later. Hence
+	 * permanently losing access to this anchor i.e 'records' is
+	 * acceptable.
+	 *
+	 * Otherwise this allocation handle would have to be saved for
+	 * a free_percpu() release later, if required.
+	 */
+ for_each_possible_cpu(cpu) {
+ struct pmu_hw_events *events_cpu;
+ struct branch_records *records_cpu;
+
+ events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
+ records_cpu = per_cpu_ptr(records, cpu);
+ events_cpu->branches = records_cpu;
+ }
+ return 0;
}

static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
@@ -1245,7 +1370,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
if (ret)
return ret;

- return probe.present ? 0 : -ENODEV;
+ if (!probe.present)
+ return -ENODEV;
+
+ if (cpu_pmu->has_branch_stack) {
+ ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
+ if (ret)
+ return ret;
+
+ ret = branch_records_alloc(cpu_pmu);
+ if (ret) {
+ armv8pmu_task_ctx_cache_free(cpu_pmu);
+ return ret;
+ }
+ }
+ return 0;
}

static void armv8pmu_disable_user_access_ipi(void *unused)
@@ -1304,6 +1443,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;

cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
+ cpu_pmu->sched_task = armv8pmu_sched_task;
+ cpu_pmu->branch_stack_init = armv8pmu_branch_stack_init;
+ cpu_pmu->branch_stack_add = armv8pmu_branch_stack_add;
+ cpu_pmu->branch_stack_del = armv8pmu_branch_stack_del;
+ cpu_pmu->branch_stack_reset = armv8pmu_branch_stack_reset;

cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
diff --git a/drivers/perf/arm_pmuv3_branch.h b/drivers/perf/arm_pmuv3_branch.h
new file mode 100644
index 000000000000..392f3c46e300
--- /dev/null
+++ b/drivers/perf/arm_pmuv3_branch.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Branch Record Buffer Extension Helpers.
+ *
+ * Copyright (C) 2022-2023 ARM Limited
+ *
+ * Author: Anshuman Khandual <[email protected]>
+ */
+#include <linux/perf/arm_pmu.h>
+
+#ifdef CONFIG_ARM64_BRBE
+void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *cpuc);
+void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *cpuc);
+void armv8pmu_branch_stack_reset(void);
+void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
+bool armv8pmu_branch_attr_valid(struct perf_event *event);
+void armv8pmu_branch_enable(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_disable(void);
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
+ struct perf_event *event);
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx);
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu);
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu);
+#else
+static inline void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *cpuc)
+{
+}
+
+static inline void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *cpuc)
+{
+}
+
+static inline void armv8pmu_branch_stack_reset(void)
+{
+}
+
+static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
+{
+}
+
+static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+ return false;
+}
+
+static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
+{
+}
+
+static inline void armv8pmu_branch_disable(void)
+{
+}
+
+static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+}
+
+static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
+{
+ return 0;
+}
+
+static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
+{
+}
+#endif
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 9eda16dd684e..a8f916aa6823 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -141,6 +141,11 @@ struct arm_pmu {
/* store the PMMIR_EL1 to expose slots */
u64 reg_pmmir;

+#ifdef CONFIG_ARM64_BRBE
+ /* store the BRBIDR0_EL1 capturing attributes */
+ u64 reg_brbidr;
+#endif
+
/* Only to be used by ACPI probing code */
unsigned long acpi_cpuid;
};
--
2.25.1


2024-04-05 02:49:04

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V17 6/9] KVM: arm64: nvhe: Disable branch generation in nVHE guests

Disable the BRBE before we enter the guest, saving its status, and enable it
back once we get out of the guest. This avoids capturing branch records from
the guest kernel or userspace, which would confuse the host samples.
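
The world switch handling is roughly as below (a sketch of the idea only;
'host_brbcr' stands in for vcpu->arch.host_debug_state.brbcr_el1, and the
actual hook points are the nVHE debug save/restore paths in the diff):

	/* host -> guest: stop branch record generation */
	host_brbcr = read_sysreg_el1(SYS_BRBCR);
	write_sysreg_el1(0, SYS_BRBCR);

	/* guest -> host: restore the host BRBE controls */
	write_sysreg_el1(host_brbcr, SYS_BRBCR);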

Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: James Morse <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
CC: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- Dropped isb() call sites in __debug_[save|restore]_brbe()
- Changed to [read|write]_sysreg_el1() accessors in __debug_[save|restore]_brbe()

arch/arm64/include/asm/kvm_host.h | 5 ++++-
arch/arm64/kvm/debug.c | 5 +++++
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 31 ++++++++++++++++++++++++++++++
3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9e8a496fb284..a105bea8ecd0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -609,7 +609,7 @@ struct kvm_vcpu_arch {
u8 cflags;

/* Input flags to the hypervisor code, potentially cleared after use */
- u8 iflags;
+ u16 iflags;

/* State flags for kernel bookkeeping, unused by the hypervisor code */
u8 sflags;
@@ -650,6 +650,7 @@ struct kvm_vcpu_arch {
u64 pmscr_el1;
/* Self-hosted trace */
u64 trfcr_el1;
+ u64 brbcr_el1;
} host_debug_state;

/* VGIC state */
@@ -819,6 +820,8 @@ struct kvm_vcpu_arch {
#define DEBUG_STATE_SAVE_TRBE __vcpu_single_flag(iflags, BIT(6))
/* vcpu running in HYP context */
#define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
+/* Save BRBE context if active */
+#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(8))

/* SVE enabled for host EL0 */
#define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index ce8886122ed3..8fa648943f0f 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -336,10 +336,15 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_TraceBuffer_SHIFT) &&
!(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
+
+	/* Check if we have BRBE implemented and available on the host */
+ if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT))
+ vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
}

void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
{
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
+ vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
}
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 7746ea507b6f..a6dec3646afc 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -79,6 +79,32 @@ static void __debug_restore_trace(u64 trfcr_el1)
write_sysreg_el1(trfcr_el1, SYS_TRFCR);
}

+static void __debug_save_brbe(u64 *brbcr_el1)
+{
+ *brbcr_el1 = 0;
+
+ /* Check if the BRBE is enabled */
+ if (!(read_sysreg_el1(SYS_BRBCR) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
+ return;
+
+ /*
+	 * Prohibit branch record generation while we are in the guest.
+ * Since access to BRBCR_EL1 is trapped, the guest can't
+ * modify the filtering set by the host.
+ */
+ *brbcr_el1 = read_sysreg_el1(SYS_BRBCR);
+ write_sysreg_el1(0, SYS_BRBCR);
+}
+
+static void __debug_restore_brbe(u64 brbcr_el1)
+{
+ if (!brbcr_el1)
+ return;
+
+ /* Restore BRBE controls */
+ write_sysreg_el1(brbcr_el1, SYS_BRBCR);
+}
+
void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
{
/* Disable and flush SPE data generation */
@@ -87,6 +113,9 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
/* Disable and flush Self-Hosted Trace generation */
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE))
__debug_save_trace(&vcpu->arch.host_debug_state.trfcr_el1);
+ /* Disable BRBE branch records */
+ if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_BRBE))
+ __debug_save_brbe(&vcpu->arch.host_debug_state.brbcr_el1);
}

void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -100,6 +129,8 @@ void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu)
__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE))
__debug_restore_trace(vcpu->arch.host_debug_state.trfcr_el1);
+ if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_BRBE))
+ __debug_restore_brbe(vcpu->arch.host_debug_state.brbcr_el1);
}

void __debug_switch_to_host(struct kvm_vcpu *vcpu)
--
2.25.1


2024-04-05 02:49:29

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V17 7/9] perf: test: Speed up running brstack test on an Arm model

From: James Clark <[email protected]>

The test runs quite slowly in the model, so replace "xargs -n1" with
"tr ' ' '\n'", which does the same thing but in single-digit minutes
instead of double-digit minutes.

Also reduce the number of loops in the test application. Unfortunately
this causes intermittent failures on x86, presumably because the
sampling interval is too big to pick up any loops, so keep it the same
there.

Cc: Mark Rutland <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
tools/perf/tests/shell/test_brstack.sh | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index 5f14d0cb013f..5ea64d0c4a6f 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -18,7 +18,6 @@ fi
skip_test_missing_symbol brstack_bench

TMPDIR=$(mktemp -d /tmp/__perf_test.program.XXXXX)
-TESTPROG="perf test -w brstack"

cleanup() {
rm -rf $TMPDIR
@@ -26,11 +25,21 @@ cleanup() {

trap cleanup EXIT TERM INT

+is_arm64() {
+ uname -m | grep -q aarch64
+}
+
+if is_arm64; then
+ TESTPROG="perf test -w brstack 5000"
+else
+ TESTPROG="perf test -w brstack"
+fi
+
test_user_branches() {
echo "Testing user branch stack sampling"

perf record -o $TMPDIR/perf.data --branch-filter any,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstacksym | xargs -n1 > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstacksym | tr ' ' '\n' > $TMPDIR/perf.script

# example of branch entries:
# brstack_foo+0x14/brstack_bar+0x40/P/-/-/0/CALL
@@ -59,7 +68,7 @@ test_filter() {
echo "Testing branch stack filtering permutation ($test_filter_filter,$test_filter_expect)"

perf record -o $TMPDIR/perf.data --branch-filter $test_filter_filter,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstack | xargs -n1 > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script

# fail if we find any branch type that doesn't match any of the expected ones
# also consider UNKNOWN branch types (-)
--
2.25.1


2024-04-05 02:49:41

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V17 8/9] perf: test: Remove empty lines from branch filter test output

From: James Clark <[email protected]>

In the perf script command, spaces are turned into newlines. But when
there is a double space, this results in empty lines which fail the
subsequent inverse grep test, so strip the empty lines.

Cc: Mark Rutland <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
tools/perf/tests/shell/test_brstack.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index 5ea64d0c4a6f..928790f35747 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -68,7 +68,7 @@ test_filter() {
echo "Testing branch stack filtering permutation ($test_filter_filter,$test_filter_expect)"

perf record -o $TMPDIR/perf.data --branch-filter $test_filter_filter,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' | sed '/^[[:space:]]*$/d' > $TMPDIR/perf.script

# fail if we find any branch type that doesn't match any of the expected ones
# also consider UNKNOWN branch types (-)
--
2.25.1


2024-04-05 02:49:54

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V17 9/9] perf: test: Extend branch stack sampling test for Arm64 BRBE

From: James Clark <[email protected]>

Add Arm64 BRBE-specific testing to the existing branch stack sampling test.
The test currently passes on the Arm FVP RevC model, but no hardware has
been tested yet.

Cc: Mark Rutland <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Co-developed-by: German Gomez <[email protected]>
Signed-off-by: German Gomez <[email protected]>
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/test_brstack.sh | 42 ++++++++++++++++++++++++--
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 ++
tools/perf/tests/workloads/traploop.c | 39 ++++++++++++++++++++++++
5 files changed, 82 insertions(+), 3 deletions(-)
create mode 100644 tools/perf/tests/workloads/traploop.c

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index d13ee7683d9d..40083c4a7654 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -149,6 +149,7 @@ static struct test_workload *workloads[] = {
&workload__sqrtloop,
&workload__brstack,
&workload__datasym,
+ &workload__traploop
};

static int num_subtests(const struct test_suite *t)
diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index 928790f35747..6a4069c930e8 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -53,12 +53,43 @@ test_user_branches() {
grep -E -m1 "^brstack_foo\+[^ ]*/brstack_bench\+[^ ]*/RET/.*$" $TMPDIR/perf.script
grep -E -m1 "^brstack_bench\+[^ ]*/brstack_bench\+[^ ]*/COND/.*$" $TMPDIR/perf.script
grep -E -m1 "^brstack\+[^ ]*/brstack\+[^ ]*/UNCOND/.*$" $TMPDIR/perf.script
+
+ if is_arm64; then
+ # in arm64 with BRBE, we get IRQ entries that correspond
+ # to any point in the process
+ grep -m1 "/IRQ/" $TMPDIR/perf.script
+ fi
set +x

# some branch types are still not being tested:
# IND COND_CALL COND_RET SYSCALL SYSRET IRQ SERROR NO_TX
}

+test_arm64_trap_eret_branches() {
+ echo "Testing trap & eret branches (arm64 brbe)"
+ perf record -o $TMPDIR/perf.data --branch-filter any,save_type,u -- \
+ perf test -w traploop 250
+ perf script -i $TMPDIR/perf.data --fields brstacksym | tr ' ' '\n' > $TMPDIR/perf.script
+ set -x
+ # BRBINF<n>.TYPE == TRAP are mapped to PERF_BR_SYSCALL by the BRBE driver
+ grep -E -m1 "^trap_bench\+[^ ]*/\[unknown\][^ ]*/SYSCALL/" $TMPDIR/perf.script
+ grep -E -m1 "^\[unknown\][^ ]*/trap_bench\+[^ ]*/ERET/" $TMPDIR/perf.script
+ set +x
+}
+
+test_arm64_kernel_branches() {
+ echo "Testing kernel branches (arm64 brbe)"
+ # skip if perf doesn't have enough privileges
+ if ! perf record --branch-filter any,k -o- -- true > /dev/null; then
+ echo "[skipped: not enough privileges]"
+ return 0
+ fi
+ perf record -o $TMPDIR/perf.data --branch-filter any,k -- uname -a
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script
+ grep -E -m1 "0xffff[0-9a-f]{12}" $TMPDIR/perf.script
+	! grep -E -m1 "0x0000[0-9a-f]{12}" $TMPDIR/perf.script
+}
+
# first argument <arg0> is the argument passed to "--branch-stack <arg0>,save_type,u"
# second argument are the expected branch types for the given filter
test_filter() {
@@ -81,11 +112,16 @@ set -e

test_user_branches

-test_filter "any_call" "CALL|IND_CALL|COND_CALL|SYSCALL|IRQ"
+if is_arm64; then
+ test_arm64_trap_eret_branches
+ test_arm64_kernel_branches
+fi
+
+test_filter "any_call" "CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|FAULT_DATA|FAULT_INST"
test_filter "call" "CALL|SYSCALL"
test_filter "cond" "COND"
test_filter "any_ret" "RET|COND_RET|SYSRET|ERET"

test_filter "call,cond" "CALL|SYSCALL|COND"
-test_filter "any_call,cond" "CALL|IND_CALL|COND_CALL|IRQ|SYSCALL|COND"
-test_filter "cond,any_call,any_ret" "COND|CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|RET|COND_RET|SYSRET|ERET"
+test_filter "any_call,cond" "CALL|IND_CALL|COND_CALL|IRQ|SYSCALL|COND|FAULT_DATA|FAULT_INST"
+test_filter "cond,any_call,any_ret" "COND|CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|RET|COND_RET|SYSRET|ERET|FAULT_DATA|FAULT_INST"
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 3aa7701ee0e9..e4c677adccb6 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -205,6 +205,7 @@ DECLARE_WORKLOAD(leafloop);
DECLARE_WORKLOAD(sqrtloop);
DECLARE_WORKLOAD(brstack);
DECLARE_WORKLOAD(datasym);
+DECLARE_WORKLOAD(traploop);

extern const char *dso_to_test;
extern const char *test_objdump_path;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index a1f34d5861e3..a9dc93d8468b 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -6,8 +6,10 @@ perf-y += leafloop.o
perf-y += sqrtloop.o
perf-y += brstack.o
perf-y += datasym.o
+perf-y += traploop.o

CFLAGS_sqrtloop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_leafloop.o = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
CFLAGS_brstack.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_datasym.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
+CFLAGS_traploop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/traploop.c b/tools/perf/tests/workloads/traploop.c
new file mode 100644
index 000000000000..7dac94897e49
--- /dev/null
+++ b/tools/perf/tests/workloads/traploop.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include "../tests.h"
+
+#define BENCH_RUNS 999999
+
+static volatile int cnt;
+
+#ifdef __aarch64__
+static void trap_bench(void)
+{
+ unsigned long val;
+
+ asm("mrs %0, ID_AA64ISAR0_EL1" : "=r" (val)); /* TRAP + ERET */
+}
+#else
+static void trap_bench(void)
+{
+
+}
+#endif
+
+static int traploop(int argc, const char **argv)
+{
+ int num_loops = BENCH_RUNS;
+
+ if (argc > 0)
+ num_loops = atoi(argv[0]);
+
+ while (1) {
+ if ((cnt++) > num_loops)
+ break;
+
+ trap_bench();
+ }
+ return 0;
+}
+
+DEFINE_WORKLOAD(traploop);
--
2.25.1


2024-04-05 02:51:15

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V17 3/9] drivers: perf: arm_pmu: Add infrastructure for branch stack sampling

In order to support the Branch Record Buffer Extension (BRBE), we need to
extend the arm_pmu framework with some basic infrastructure for the branch
stack sampling which arm_pmu drivers can opt-in using a new feature flag
called 'has_branch_stack'. Subsequent patches will use this to add support
for BRBE in the PMUv3 driver.

Branch stack sampling support, i.e capturing branch records during execution
in core perf, rides along with normal HW events being scheduled on the PMU.
This prepares the ARMv8 PMU framework for branch stack support on relevant
PMUs with the required HW implementation.

With BRBE, the hardware records branches into a hardware FIFO, which will
be sampled by software when perf events overflow. A task may be context-
switched an arbitrary number of times between overflows, and to avoid
losing samples we need to save the current records when a task is context-
switched out. To do this we'll need to use the pmu::sched_task() callback,
and we'll also need to allocate some per-task storage space via event flag
PERF_ATTACH_TASK_DATA.
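
As a rough sketch (not part of this patch; 'my_pmu' and the callback names
are hypothetical), a driver would opt in along these lines, with its
branch_stack_init() callback also setting PERF_ATTACH_TASK_DATA in
event->attach_state so that the per-task storage gets allocated:

	static int my_pmu_probe(struct arm_pmu *pmu)
	{
		...
		pmu->has_branch_stack = 1;
		pmu->sched_task = my_pmu_sched_task;
		pmu->branch_stack_init = my_pmu_branch_stack_init;
		pmu->branch_stack_add = my_pmu_branch_stack_add;
		pmu->branch_stack_del = my_pmu_branch_stack_del;
		pmu->branch_stack_reset = my_pmu_branch_stack_reset;
		return 0;
	}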

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V17:

- ARM PMUV3 changes have been moved into the BRBE driver patch instead
- Moved down branch_stack_add() in armpmu_add() after event's basic checks
- Added new callbacks in struct arm_pmu e.g branch_stack_[init|add|del]()
- Renamed struct arm_pmu callback branch_reset() as branch_stack_reset()
- Dropped the comment in armpmu_event_init()
- Renamed 'pmu_hw_events' elements from 'brbe_' to more generic 'branch_'

drivers/perf/arm_pmu.c | 25 ++++++++++++++++++++++---
include/linux/perf/arm_pmu.h | 32 +++++++++++++++++++++++++++++++-
2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 8458fe2cebb4..e694103e811f 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -317,6 +317,9 @@ armpmu_del(struct perf_event *event, int flags)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;

+ if (has_branch_stack(event))
+ armpmu->branch_stack_del(event, hw_events);
+
armpmu_stop(event, PERF_EF_UPDATE);
hw_events->events[idx] = NULL;
armpmu->clear_event_idx(hw_events, event);
@@ -342,6 +345,9 @@ armpmu_add(struct perf_event *event, int flags)
if (idx < 0)
return idx;

+ if (has_branch_stack(event))
+ armpmu->branch_stack_add(event, hw_events);
+
/*
* If there is an event in the counter we are going to use then make
* sure it is disabled.
@@ -511,13 +517,25 @@ static int armpmu_event_init(struct perf_event *event)
!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
return -ENOENT;

- /* does not support taken branch sampling */
- if (has_branch_stack(event))
- return -EOPNOTSUPP;
+ if (has_branch_stack(event)) {
+ if (!armpmu->has_branch_stack)
+ return -EOPNOTSUPP;
+
+ if (!armpmu->branch_stack_init(event))
+ return -EOPNOTSUPP;
+ }

return __hw_perf_event_init(event);
}

+static void armpmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+
+ if (armpmu->sched_task)
+ armpmu->sched_task(pmu_ctx, sched_in);
+}
+
static void armpmu_enable(struct pmu *pmu)
{
struct arm_pmu *armpmu = to_arm_pmu(pmu);
@@ -864,6 +882,7 @@ struct arm_pmu *armpmu_alloc(void)
}

pmu->pmu = (struct pmu) {
+ .sched_task = armpmu_sched_task,
.pmu_enable = armpmu_enable,
.pmu_disable = armpmu_disable,
.event_init = armpmu_event_init,
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index b3b34f6670cf..9eda16dd684e 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -46,6 +46,18 @@ static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_63BIT) == ARMPMU_EVT_63BIT);
}, \
}

+/*
+ * Maximum branch record entries which could be processed
+ * for core perf branch stack sampling support, regardless
+ * of the hardware support available on a given ARM PMU.
+ */
+#define MAX_BRANCH_RECORDS 64
+
+struct branch_records {
+ struct perf_branch_stack branch_stack;
+ struct perf_branch_entry branch_entries[MAX_BRANCH_RECORDS];
+};
+
/* The events for a given PMU register set. */
struct pmu_hw_events {
/*
@@ -66,6 +78,17 @@ struct pmu_hw_events {
struct arm_pmu *percpu_pmu;

int irq;
+
+ struct branch_records *branches;
+
+ /* Active context for task events */
+ void *branch_context;
+
+ /* Active events requesting branch records */
+ unsigned int branch_users;
+
+ /* Active branch sample type filters */
+ unsigned long branch_sample_type;
};

enum armpmu_attr_groups {
@@ -96,8 +119,15 @@ struct arm_pmu {
void (*stop)(struct arm_pmu *);
void (*reset)(void *);
int (*map_event)(struct perf_event *event);
+ void (*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
+ bool (*branch_stack_init)(struct perf_event *event);
+ void (*branch_stack_add)(struct perf_event *event, struct pmu_hw_events *cpuc);
+ void (*branch_stack_del)(struct perf_event *event, struct pmu_hw_events *cpuc);
+ void (*branch_stack_reset)(void);
int num_events;
- bool secure_access; /* 32-bit ARM only */
+ unsigned int secure_access : 1, /* 32-bit ARM only */
+ has_branch_stack: 1, /* 64-bit ARM only */
+ reserved : 30;
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
#define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE 0x4000
--
2.25.1


2024-05-21 13:24:52

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 1/9] arm64/sysreg: Add BRBE registers and fields

On Fri, Apr 05, 2024 at 08:16:31AM +0530, Anshuman Khandual wrote:
> This adds BRBE related register definitions and various other related field
> macros there in. These will be used subsequently in a BRBE driver, which is
> being added later on. While here, this drops redundant register definitions
> from the header i.e (arch/arm64/include/asm/sysreg.h).
>
> BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3 register field value has been derived from
> latest ARM DDI 0601 ID121123, AKA 2023-12 instead of latest ARM ARM i.e ARM
> DDI 0487J.a. Please find the definition here.
>
> https://developer.arm.com/documentation/ddi0601/2023-12/

The K.a release of the ARM ARM has now been published, and that includes
the "0b110000 IMPLEMENTATION DEFINED exception to EL3" value, so we can
point to that now.

With that in mind, for the commit message we can say:

| This patch adds definitions related to the Branch Record Buffer
| Extension (BRBE) as per ARM DDI 0487K.a. These will be used by KVM and
| a BRBE driver in subsequent patches.
|
| Some existing BRBE definitions in asm/sysreg.h are replaced with
| equivalent generated definitions.

Aside from a couple of minor issues below, this looks good to me. With
those fixed up (and the commit message above):

Reviewed-by: Mark Rutland <[email protected]>

[...]

> +Sysreg BRBIDR0_EL1 2 1 9 2 0
> +Res0 63:16
> +Enum 15:12 CC
> + 0b101 20_BIT
> +EndEnum

Please pad this to the width of the field, i.e.

0b0101 20_BIT

> +Enum 11:8 FORMAT
> + 0b0 0
> +EndEnum

Likewise, this should be padded to the width of the field.

As with the BRBFCR_EL1.BANK definitions, I reckon it's worth giving the
enum value a prefix, so please make this:

0b0000 FORMAT_0

Mark.

2024-05-21 13:26:36

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 2/9] KVM: arm64: Explicitly handle BRBE traps as UNDEFINED

Hi Anshuman,

On Fri, Apr 05, 2024 at 08:16:32AM +0530, Anshuman Khandual wrote:
> The Branch Record Buffer Extension (BRBE) adds a number of system registers
> and instructions, which we don't currently intend to expose to guests. Our
> existing logic handles this safely, but this could be improved with some
> explicit handling of BRBE.
>
> The presence of BRBE is currently hidden from guests as the cpufeature
> code's ftr_id_aa64dfr0[] table doesn't have an entry for the BRBE field,
> and so this will be zero in the sanitised value of ID_AA64DFR0 exposed to
> guests via read_sanitised_id_aa64dfr0_el1(). As the ftr_id_aa64dfr0[] table
> may gain an entry for the BRBE field in future, for robustness we should
> explicitly mask out the BRBE field in read_sanitised_id_aa64dfr0_el1().
>
> The BRBE system registers and instructions are currently trapped by the
> existing configuration of the fine-grained traps. As neither the registers
> are not described in the sys_reg_descs[] nor the instructions are described
> in the sys_insn_descs[] table, emulate_sys_reg() will warn that these are
> unknown before injecting an UNDEFINED exception into the guest.

That last sentence doesn't make sense, and I think it has been mangled.
My suggested text in v16 was:

| As the registers and instructions are not described in the
| sys_reg_descs[] table, emulate_sys_reg() will warn that these are
| unknown before injecting an UNDEFINED exception into the guest.

.. and I'd be happy with changing that to:

| As neither the registers nor the instructions are described in the
| sys_reg_descs[] table, emulate_sys_reg() will warn that these are
| unknown before injecting an UNDEFINED exception into the guest.

> Well-behaved guests shouldn't try to use the registers or instructions, but
> badly-behaved guests could use these, resulting in unnecessary warnings.

I see that I had mangled this sentence originally -- thanks for
correcting that; this looks fine to me.

> To avoid those warnings, we should explicitly handle the BRBE
> registers, and instructions as UNDEFINED.

I think the added comma is unnecessary and makes this hard to read. My
suggested text in v16 was:

| To avoid those warnings, we should explicitly handle the BRBE
| registers and instructions as UNDEFINED.

> Address the above by having read_sanitised_id_aa64dfr0_el1() mask out the
> ID_AA64DFR0.BRBE field, and by adding sys_reg_descs entries for all of the
> BRBE system registers, also by adding sys_insn_descs entries for all of the
> BRBE instructions, treating these all as UNDEFINED.

Similarly, I think this was clearer as I originally suggested:

| Address the above by having read_sanitised_id_aa64dfr0_el1() mask out
| the ID_AA64DFR0.BRBE field, and by adding sys_reg_desc entries for all
| of the BRBE system registers and instructions, treating these all as
| UNDEFINED.

.. or we can simplify that to:

| Address the above by having read_sanitised_id_aa64dfr0_el1() mask out
| the ID_AA64DFR0.BRBE field, and explicitly handling all of the BRBE
| system registers and instructions as UNDEFINED.

> Cc: Marc Zyngier <[email protected]>
> Cc: Oliver Upton <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Suzuki K Poulose <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Anshuman Khandual <[email protected]>
> ----
> Changes in V17:
>
> - Updated the commit message including about sys_insn_descs[]
> - Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
> - Moved the BRBE instructions into sys_insn_descs[] array

Aside from my nits on the commit message above, these changes all look
good to me. So with the commit message cleaned up as above:

Reviewed-by: Mark Rutland <[email protected]>

Mark.

>
> arch/arm64/kvm/sys_regs.c | 56 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index c9f4f387155f..3981aa32c5a3 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1304,6 +1304,11 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
> return 0;
> }
>
> +#define BRB_INF_SRC_TGT_EL1(n) \
> + { SYS_DESC(SYS_BRBINF_EL1(n)), undef_access }, \
> + { SYS_DESC(SYS_BRBSRC_EL1(n)), undef_access }, \
> + { SYS_DESC(SYS_BRBTGT_EL1(n)), undef_access } \
> +
> /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
> #define DBG_BCR_BVR_WCR_WVR_EL1(n) \
> { SYS_DESC(SYS_DBGBVRn_EL1(n)), \
> @@ -1708,6 +1713,9 @@ static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> /* Hide SPE from guests */
> val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;
>
> + /* Hide BRBE from guests */
> + val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
> +
> return val;
> }
>
> @@ -2226,6 +2234,52 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> { SYS_DESC(SYS_DBGCLAIMCLR_EL1), trap_raz_wi },
> { SYS_DESC(SYS_DBGAUTHSTATUS_EL1), trap_dbgauthstatus_el1 },
>
> + /*
> + * BRBE branch record sysreg address space is interleaved between
> + * corresponding BRBINF<N>_EL1, BRBSRC<N>_EL1, and BRBTGT<N>_EL1.
> + */
> + BRB_INF_SRC_TGT_EL1(0),
> + BRB_INF_SRC_TGT_EL1(16),
> + BRB_INF_SRC_TGT_EL1(1),
> + BRB_INF_SRC_TGT_EL1(17),
> + BRB_INF_SRC_TGT_EL1(2),
> + BRB_INF_SRC_TGT_EL1(18),
> + BRB_INF_SRC_TGT_EL1(3),
> + BRB_INF_SRC_TGT_EL1(19),
> + BRB_INF_SRC_TGT_EL1(4),
> + BRB_INF_SRC_TGT_EL1(20),
> + BRB_INF_SRC_TGT_EL1(5),
> + BRB_INF_SRC_TGT_EL1(21),
> + BRB_INF_SRC_TGT_EL1(6),
> + BRB_INF_SRC_TGT_EL1(22),
> + BRB_INF_SRC_TGT_EL1(7),
> + BRB_INF_SRC_TGT_EL1(23),
> + BRB_INF_SRC_TGT_EL1(8),
> + BRB_INF_SRC_TGT_EL1(24),
> + BRB_INF_SRC_TGT_EL1(9),
> + BRB_INF_SRC_TGT_EL1(25),
> + BRB_INF_SRC_TGT_EL1(10),
> + BRB_INF_SRC_TGT_EL1(26),
> + BRB_INF_SRC_TGT_EL1(11),
> + BRB_INF_SRC_TGT_EL1(27),
> + BRB_INF_SRC_TGT_EL1(12),
> + BRB_INF_SRC_TGT_EL1(28),
> + BRB_INF_SRC_TGT_EL1(13),
> + BRB_INF_SRC_TGT_EL1(29),
> + BRB_INF_SRC_TGT_EL1(14),
> + BRB_INF_SRC_TGT_EL1(30),
> + BRB_INF_SRC_TGT_EL1(15),
> + BRB_INF_SRC_TGT_EL1(31),
> +
> + /* Remaining BRBE sysreg address space */
> + { SYS_DESC(SYS_BRBCR_EL1), undef_access },
> + { SYS_DESC(SYS_BRBFCR_EL1), undef_access },
> + { SYS_DESC(SYS_BRBTS_EL1), undef_access },
> + { SYS_DESC(SYS_BRBINFINJ_EL1), undef_access },
> + { SYS_DESC(SYS_BRBSRCINJ_EL1), undef_access },
> + { SYS_DESC(SYS_BRBTGTINJ_EL1), undef_access },
> + { SYS_DESC(SYS_BRBIDR0_EL1), undef_access },
> +
> { SYS_DESC(SYS_MDCCSR_EL0), trap_raz_wi },
> { SYS_DESC(SYS_DBGDTR_EL0), trap_raz_wi },
> // DBGDTR[TR]X_EL0 share the same encoding
> @@ -2738,6 +2792,8 @@ static struct sys_reg_desc sys_insn_descs[] = {
> { SYS_DESC(SYS_DC_CISW), access_dcsw },
> { SYS_DESC(SYS_DC_CIGSW), access_dcgsw },
> { SYS_DESC(SYS_DC_CIGDSW), access_dcgsw },
> + { SYS_DESC(OP_BRB_IALL), undef_access },
> + { SYS_DESC(OP_BRB_INJ), undef_access },
> };
>
> static const struct sys_reg_desc *first_idreg;
> --
> 2.25.1
>

2024-05-21 13:45:14

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 3/9] drivers: perf: arm_pmu: Add infrastructure for branch stack sampling

On Fri, Apr 05, 2024 at 08:16:33AM +0530, Anshuman Khandual wrote:
> In order to support the Branch Record Buffer Extension (BRBE), we need to
> extend the arm_pmu framework with some basic infrastructure for the branch
> stack sampling which arm_pmu drivers can opt-in using a new feature flag
> called 'has_branch_stack'. Subsequent patches will use this to add support
> for BRBE in the PMUv3 driver.

Please, just use the *exact* wording I asked for last time:

| In order to support the Branch Record Buffer Extension (BRBE), we need to
| extend the arm_pmu framework with some basic infrastructure for branch stack
| sampling which arm_pmu drivers can opt-in to using. Subsequent patches will
| use this to add support for BRBE in the PMUv3 driver.

At this point in the commit message, the 'has_branch_stack' flag doesn't
matter, and dropping the 'to' after 'opt-in' makes this painful to read.

> Branch stack sampling support i.e capturing branch records during execution
> in core perf, rides along with normal HW events being scheduled on the PMU.
> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
> with required HW implementation.

Please delete this paragraph.

> With BRBE, the hardware records branches into a hardware FIFO, which will
> be sampled by software when perf events overflow. A task may be context-
> switched an arbitrary number of times between overflows, and to avoid
> losing samples we need to save the current records when a task is context-
> switched out. To do these we'll need to use the pmu::sched_task() callback,
> and we'll also need to allocate some per-task storage space via event flag
> PERF_ATTACH_TASK_DATA.

[...]

> /* The events for a given PMU register set. */
> struct pmu_hw_events {
> /*
> @@ -66,6 +78,17 @@ struct pmu_hw_events {
> struct arm_pmu *percpu_pmu;
>
> int irq;
> +
> + struct branch_records *branches;
> +
> + /* Active context for task events */
> + void *branch_context;
> +
> + /* Active events requesting branch records */
> + unsigned int branch_users;
> +
> + /* Active branch sample type filters */
> + unsigned long branch_sample_type;
> };

At this point in the series I understand why we have the 'branches' and
'branch_users' fields, but the 'branch_context' and 'branch_sample_type'
fields haven't been introduced and are not obvious.

What exactly is branch_context, and why is that a 'void *' ?

I can understand if that's a PMU-specific structure to track the active
branch records, but if so I don't understand why 'branch_sample_type'
isn't folded into that.

Mark.

2024-05-29 10:53:07

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 4/9] arm64/boot: Enable EL2 requirements for BRBE

On Fri, Apr 05, 2024 at 08:16:34AM +0530, Anshuman Khandual wrote:
> Fine grained trap control for BRBE registers, and instructions access need
> to be configured in HDFGRTR_EL2, HDFGWTR_EL2 and HFGITR_EL2 registers when
> kernel enters at EL1 but EL2 is present. This changes __init_el2_fgt() as
> required.
>
> Similarly cycle and mis-prediction capture need to be enabled in BRBCR_EL1
> and BRBCR_EL2 when the kernel enters either into EL1 or EL2. This adds new
> __init_el2_brbe() to achieve this objective.
>
> This also updates Documentation/arch/arm64/booting.rst with all the above
> EL2 along with MDCR_EL3.SBRBE requirements.
>
> First this replaces an existing hard encoding (1 << 62) with corresponding
> applicable macro HDFGRTR_EL2_nPMSNEVFR_EL1_MASK.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Oliver Upton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Anshuman Khandual <[email protected]>
> ----
> Changes in V17:
>
> - New patch added in the series
> - Separated out from the BRBE driver implementation patch
> - Dropped the comment in __init_el2_brbe()
> - Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
> - Updated __init_el2_brbe() with __check_hvhe() constructs
> - Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control
>
> Documentation/arch/arm64/booting.rst | 26 ++++++++
> arch/arm64/include/asm/el2_setup.h | 90 +++++++++++++++++++++++++++-
> 2 files changed, 113 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
> index b57776a68f15..512210da7dd2 100644
> --- a/Documentation/arch/arm64/booting.rst
> +++ b/Documentation/arch/arm64/booting.rst
> @@ -349,6 +349,32 @@ Before jumping into the kernel, the following conditions must be met:
>
> - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
>
> + For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
> +
> + - If the kernel is entered at EL2 and EL1 is present:
> +
> + - BRBCR_EL1.CC (bit 3) must be initialised to 0b1.
> + - BRBCR_EL1.MPRED (bit 4) must be initialised to 0b1.

IIUC this isn't necessary; if the kernel is entered at EL2, it's capable
of initializing the EL1 regs, and it doesn't look like this silently
affects something we'd need in the absence of a BRBE driver.

AFAICT the __init_el2_brbe() code you add below handles this, so I think
this is redundant and can be deleted.

> + - If the kernel is entered at EL1 and EL2 is present:
> +
> + - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
> + - BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.
> +
> + - HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
> + - HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
> + - HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1.
> +
> + - HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
> + - HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
> +
> + - HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
> + - HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1.
> +
> + - If EL3 is present:
> +
> + - MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b11.

Minor nit: please list the EL3 requirements first, that way this can be
read in EL3->EL2->EL1 order to match the FW boot-flow order.

> +
> For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
>
> - If EL3 is present:
> diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
> index b7afaa026842..7c12a8e658d4 100644
> --- a/arch/arm64/include/asm/el2_setup.h
> +++ b/arch/arm64/include/asm/el2_setup.h
> @@ -154,6 +154,41 @@
> .Lskip_set_cptr_\@:
> .endm
>
> +#ifdef CONFIG_ARM64_BRBE
> +/*
> + * Enable BRBE cycle count and miss-prediction
> + *
> + * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
> + * for the cycle counts to be available in BRBINF<N>_EL1.CC during
> + * branch record processing after a PMU interrupt. This enables CC
> + * field on both these registers while still executing inside EL2.

Huh, it's a bit of an oddity to do that for a register that gets the E2H
treatment, but that is what the ARM ARM says, looking at the pseudocode
in ARM DDI 0487K.a:

| // BRBCycleCountingEnabled()
| // =========================
| // Returns TRUE if the recording of cycle counts is allowed,
| // FALSE otherwise.
| boolean BRBCycleCountingEnabled()
| if HaveEL(EL2) && BRBCR_EL2.CC == '0' then return FALSE;
| if BRBCR_EL1.CC == '0' then return FALSE;
| return TRUE;

.. and likewise for MPRED:

| // BRBEMispredictAllowed()
| // =======================
| // Returns TRUE if the recording of branch misprediction is allowed,
| // FALSE otherwise.
| boolean BRBEMispredictAllowed()
| if HaveEL(EL2) && BRBCR_EL2.MPRED == '0' then return FALSE;
| if BRBCR_EL1.MPRED == '0' then return FALSE;
| return TRUE;

.. though BRBCycleCountingEnabled() isn't actually used anywhere, while
BRBEMispredictAllowed is used by BRBEBranch(), since that does:

| (ccu, cc) = BranchEncCycleCount();
| ...
| bit mispredict = if BRBEMispredictAllowed() && BranchMispredict() then '1' else '0';

.. where BranchEncCycleCount() is a stub that doesn't mention
BRBCycleCountingEnabled() at all, so it's not clear to me whether CCU is
guaranteed to be set.

> + *
> + * BRBE driver would still be able to toggle branch records cycle
> + * count support via BRBCR_EL1.CC field regardless of whether the
> + * kernel ends up executing in EL1 or EL2.
> + *
> + * The same principle applies for branch record mis-prediction info
> + * as well, thus requiring MPRED field to be set on both BRBCR_EL1
> + * and BRBCR_EL2 while still executing inside EL2.
> + */

I think we can clarify this comment to:

/*
* Enable BRBE to record cycle counts and branch mispredicts.
*
* At any EL, to record cycle counts BRBE requires that both
* BRBCR_EL2.CC=1 and BRBCR_EL1.CC=1.
*
* At any EL, to record branch mispredicts BRBE requires that both
* BRBCR_EL2.MPRED=1 and BRBCR_EL1.MPRED=1.
*
* When HCR_EL2.E2H=1, the BRBCR_EL1 encoding is redirected to
* BRBCR_EL2, but the {CC,MPRED} bits in the real BRBCR_EL1 register
* still apply.
*
* Set {CC,MPRED} in both BRBCR_EL2 and BRBCR_EL1 so that at runtime we
* only need to enable/disable these in BRBCR_EL1 regardless of whether
* the kernel ends up executing in EL1 or EL2.
*/

> +.macro __init_el2_brbe
> + mrs x1, id_aa64dfr0_el1
> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
> + cbz x1, .Lskip_brbe_\@
> +
> + mov_q x0, BRBCR_ELx_CC | BRBCR_ELx_MPRED
> + msr_s SYS_BRBCR_EL2, x0
> +
> + __check_hvhe .Lset_brbe_nvhe_\@, x1
> + msr_s SYS_BRBCR_EL12, x0 // VHE
> + b .Lskip_brbe_\@
> +
> +.Lset_brbe_nvhe_\@:
> + msr_s SYS_BRBCR_EL1, x0 // NVHE
> +.Lskip_brbe_\@:
> +.endm
> +#endif /* CONFIG_ARM64_BRBE */
> +
> /* Disable any fine grained traps */
> .macro __init_el2_fgt
> mrs x1, id_aa64mmfr0_el1
> @@ -161,16 +196,48 @@
> cbz x1, .Lskip_fgt_\@
>
> mov x0, xzr
> + mov x2, xzr
> mrs x1, id_aa64dfr0_el1
> ubfx x1, x1, #ID_AA64DFR0_EL1_PMSVer_SHIFT, #4
> cmp x1, #3
> b.lt .Lset_debug_fgt_\@
> +
> /* Disable PMSNEVFR_EL1 read and write traps */
> - orr x0, x0, #(1 << 62)
> + orr x0, x0, #HDFGRTR_EL2_nPMSNEVFR_EL1_MASK
> + orr x2, x2, #HDFGWTR_EL2_nPMSNEVFR_EL1_MASK
>
> .Lset_debug_fgt_\@:
> +#ifdef CONFIG_ARM64_BRBE
> + mrs x1, id_aa64dfr0_el1
> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
> + cbz x1, .Lskip_brbe_reg_fgt_\@
> +
> + /*
> + * Disable read traps for the following registers
> + *
> + * [BRBSRC|BRBTGT|BRBINF]_EL1
> + * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
> + */
> + orr x0, x0, #HDFGRTR_EL2_nBRBDATA_MASK
> +
> + /*
> + * Disable write traps for the following registers
> + *
> + * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
> + */
> + orr x2, x2, #HDFGWTR_EL2_nBRBDATA_MASK
> +
> + /* Disable read and write traps for [BRBCR|BRBFCR]_EL1 */
> + orr x0, x0, #HDFGRTR_EL2_nBRBCTL_MASK
> + orr x2, x2, #HDFGWTR_EL2_nBRBCTL_MASK
> +
> + /* Disable read traps for BRBIDR_EL1 */
> + orr x0, x0, #HDFGRTR_EL2_nBRBIDR_MASK
> +
> +.Lskip_brbe_reg_fgt_\@:
> +#endif /* CONFIG_ARM64_BRBE */
> msr_s SYS_HDFGRTR_EL2, x0
> - msr_s SYS_HDFGWTR_EL2, x0
> + msr_s SYS_HDFGWTR_EL2, x2
>
> mov x0, xzr
> mrs x1, id_aa64pfr1_el1
> @@ -193,7 +260,21 @@
> .Lset_fgt_\@:
> msr_s SYS_HFGRTR_EL2, x0
> msr_s SYS_HFGWTR_EL2, x0
> - msr_s SYS_HFGITR_EL2, xzr
> + mov x0, xzr
> +#ifdef CONFIG_ARM64_BRBE
> + mrs x1, id_aa64dfr0_el1
> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
> + cbz x1, .Lskip_brbe_insn_fgt_\@
> +
> + /* Disable traps for BRBIALL instruction */
> + orr x0, x0, #HFGITR_EL2_nBRBIALL_MASK
> +
> + /* Disable traps for BRBINJ instruction */
> + orr x0, x0, #HFGITR_EL2_nBRBINJ_MASK
> +
> +.Lskip_brbe_insn_fgt_\@:
> +#endif /* CONFIG_ARM64_BRBE */
> + msr_s SYS_HFGITR_EL2, x0
>
> mrs x1, id_aa64pfr0_el1 // AMU traps UNDEF without AMU
> ubfx x1, x1, #ID_AA64PFR0_EL1_AMU_SHIFT, #4
> @@ -228,6 +309,9 @@
> __init_el2_nvhe_idregs
> __init_el2_cptr
> __init_el2_fgt
> +#ifdef CONFIG_ARM64_BRBE
> + __init_el2_brbe
> +#endif

This largely looks fine, but I note that we haven't bothered with
ifdeffery for PMU and SPE, so I suspect it might be worth getting rid of
the ifdeffery for BRBE.

Mark.

2024-05-30 09:47:50

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 05/04/2024 03:46, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
>
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>
> This series applies on 6.9-rc2.
>
> Also this series is being hosted below for quick access, review and test.
>
> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>
> There are still some open questions regarding handling multiple perf events
> with different privilege branch filters getting on the same PMU, supporting
> guest branch stack tracing from the host etc. Finally also looking for some
> suggestions regarding supporting BRBE inside the guest. The series has been
> re-organized completely as suggested earlier.
>
> - Anshuman
>
[...]
>
> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>
> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> remain the same for all the events during a perf record session.
>
> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>
> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>
> This 'branch_sample_type' is used to configure the BRBE hardware, when both
> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> remain the same for all events. Let's consider the following example
>
> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>
> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>
> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> BRBE hardware with 'branch_sample_type' from last event to be added in the
> PMU and hence captured branch records only get passed on to matching events
> during a PMU interrupt.
>

Hi Anshuman,

Surely because of this example we should merge? At least we have to try
to make the most common basic command lines work. Unless we expect all
tools to know whether the branch buffer is shared between PMUs on each
architecture or not. The driver knows though, so can merge the settings
because it all has to go into one BRBE.

Merging the settings in tools would be a much harder problem.

Any user that doesn't have permission for anything in the merged result
can continue to get nothing.

And we can always add filtering in the kernel later on if we want to, to
make it appear to behave even more normally.

> static int
> armpmu_add(struct perf_event *event, int flags)
> {
> ........
> if (has_branch_stack(event)) {
> /*
> * Reset branch records buffer if a new task event gets
> * scheduled on a PMU which might have existing records.
> * Otherwise older branch records present in the buffer
> * might leak into the new task event.
> */
> if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> hw_events->brbe_context = event->ctx;
> if (armpmu->branch_reset)
> armpmu->branch_reset();
> }
> hw_events->brbe_users++;
> Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type;
> }
> ........
> }
>
> Instead of overriding existing 'branch_sample_type', both could be merged.
>

I can't see any use case where anyone would want the override behavior.
Especially if you imagine multiple users not even aware of each other.
Either the current "no records for mismatches" or the merged one makes sense.
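
Concretely, at the point marked "Here" in the quoted snippet, something
like the following would do (illustrative only, reusing the cover
letter's field names):

	/* Accumulate filters so BRBE is programmed with the union */
	hw_events->brbe_sample_type |= event->attr.branch_sample_type;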

James

2024-05-30 10:03:28

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 05/04/2024 03:46, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
>
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>
> This series applies on 6.9-rc2.
>
> Also this series is being hosted below for quick access, review and test.
>
> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>
> There are still some open questions regarding handling multiple perf events
> with different privilege branch filters getting on the same PMU, supporting
> guest branch stack tracing from the host etc. Finally also looking for some
> suggestions regarding supporting BRBE inside the guest. The series has been
> re-organized completely as suggested earlier.

For guest support I'm still of this opinion:

* No support for the host looking into guests (the addresses don't
make sense anyway without also running Perf record in the guest)
* Save and restore the host buffer and registers on guest switch (if
it was ever used by either host or guest)
* Let the guest do whatever it wants with BRBE without any
virtualisation

Merging this with the current PMU virtualisation stuff seems like a lot
of work for no use case (host looking into guests). Having said that, it
might not even be worth discussing on this patchset apart from "no guest
support", and we can do it later to avoid confusion that it's being
proposed for this version.

James

2024-05-30 17:41:30

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
> On 05/04/2024 03:46, Anshuman Khandual wrote:
> > This series enables perf branch stack sampling support on arm64 platform
> > via a new arch feature called Branch Record Buffer Extension (BRBE). All
> > the relevant register definitions could be accessed here.
> >
> > https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> >
> > This series applies on 6.9-rc2.
> >
> > Also this series is being hosted below for quick access, review and test.
> >
> > https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
> >
> > There are still some open questions regarding handling multiple perf events
> > with different privilege branch filters getting on the same PMU, supporting
> > guest branch stack tracing from the host etc. Finally also looking for some
> > suggestions regarding supporting BRBE inside the guest. The series has been
> > re-organized completely as suggested earlier.
> >
> > - Anshuman
> >
> [...]
> >
> > ------------------ Possible 'branch_sample_type' Mismatch -----------------
> >
> > Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> > remain the same for all the events during a perf record session.
> >
> > $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> >
> > event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> >
> > This 'branch_sample_type' is used to configure the BRBE hardware, when both
> > events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> > PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> > remain the same for all events. Let's consider the following example
> >
> > $perf record -e cycles:u -e instructions:k -j any,save_type ls
> >
> > cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> >
> > Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> > inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> > BRBE hardware with 'branch_sample_type' from last event to be added in the
> > PMU and hence captured branch records only get passed on to matching events
> > during a PMU interrupt.
> >
>
> Hi Anshuman,
>
> Surely because of this example we should merge? At least we have to try
> to make the most common basic command lines work. Unless we expect all
> tools to know whether the branch buffer is shared between PMUs on each
> architecture or not. The driver knows though, so can merge the settings
> because it all has to go into one BRBE.

The difficulty here is that these are opened as independent events (not
in the same event group), and so from the driver's PoV, this is no
different to two users independently doing:

perf record -e event:u -j any,save_type -p ${SOME_PID}

perf record -e event:k -j any,save_type -p ${SOME_PID}

.. where either would be surprised to get the merged result.

So, if we merge the filters into the most permissive set, we *must*
filter them when handing them to userspace so that each event gets the
expected set of branch records.

Assuming we do that, for Anshuman's case above:

perf record -e cycles:u -e instructions:k -j any,save_type ls

.. the overflow of the cycles event will only record user branch
records, and the overflow of the instructions event will only record
kernel branch records.

I think it's arguable either way as to whether users want that or merged
records; we should probably figure out with the tools folk what the
desired behaviour is for that command line, but as above I don't think
that we can have the kernel give both events merged records unless
userspace asks for that explicitly.

> Merging the settings in tools would be a much harder problem.

I can see that it'd be harder to do that up-front when parsing each
event independently, but there is a phase where the tool knows about all
of the events and *might* be able to do that, where the driver doesn't
really know at any point that these events are related.

Regardless, I assume that if the user passes:

perf record -e cycles:u -e instructions:k -j any,u,k,save_type ls

.. both events will be opened with PERF_SAMPLE_BRANCH_USER and
PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
is sufficient.

> Any user that doesn't have permission for anything in the merged result
> can continue to get nothing.
>
> And we can always add filtering in the kernel later on if we want to, to
> make it appear to behave even more normally.

As above, I think if we merge the HW filters in the kernel then the
kernel must do SW filtering. I don't think that's something we can leave
for later.

> > static int
> > armpmu_add(struct perf_event *event, int flags)
> > {
> > ........
> > if (has_branch_stack(event)) {
> > /*
> > * Reset branch records buffer if a new task event gets
> > * scheduled on a PMU which might have existing records.
> > * Otherwise older branch records present in the buffer
> > * might leak into the new task event.
> > */
> > if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> > hw_events->brbe_context = event->ctx;
> > if (armpmu->branch_reset)
> > armpmu->branch_reset();
> > }
> > hw_events->brbe_users++;
> > Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type;
> > }
> > ........
> > }
> >
> > Instead of overriding existing 'branch_sample_type', both could be merged.
> >
>
> I can't see any use case where anyone would want the override behavior.
> Especially if you imagine multiple users not even aware of each other.

I completely agree that one event overriding another is not an
acceptable solution.

> Either the current "no records for mismatches" or the merged one make sense.

I think our options are:

1) Do not allow events with conflicting HW filters to be scheduled (e.g.
treating this like a counter constraint).

2) Allow events with conflicting HW filters to be scheduled, merge the
active HW filters, and SW filter the records in the kernel.

3) Allow events with conflicting branch filters to be scheduled, but
only those which match the "active" filter get records.

So far (2) seems to me like the best option, and IIUC that's what x86
does with LBR. I suspect that also justifies discarding records at
context switch time, since we can't know how many relevant records were
discarded due to conflicting records (and so cannot safely stitch
records), and users have to expect that they may get fewer records than
may exist in HW anyway.
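
For (2), the SW filtering at overflow time could be something like the
sketch below. This is illustrative only -- the helper name and the
address-based privilege check are assumptions, not the actual driver
API -- but the idea is that each captured record is checked against the
overflowing event's own filter before being copied into that event's
branch stack:

	/*
	 * Sketch: drop records whose privilege the event did not ask
	 * for, inferring privilege from the branch target address.
	 */
	static bool branch_matches_event(struct perf_event *event,
					 const struct perf_branch_entry *entry)
	{
		u64 br = event->attr.branch_sample_type;
		bool to_kernel = entry->to >= TASK_SIZE;

		if (to_kernel && !(br & PERF_SAMPLE_BRANCH_KERNEL))
			return false;
		if (!to_kernel && !(br & PERF_SAMPLE_BRANCH_USER))
			return false;

		return true;
	}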

Mark.

2024-05-31 13:03:48

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

On Thu, May 30, 2024 at 06:41:14PM +0100, Mark Rutland wrote:
> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
> > On 05/04/2024 03:46, Anshuman Khandual wrote:
> > > ------------------ Possible 'branch_sample_type' Mismatch -----------------
> > >
> > > Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> > > remain the same for all the events during a perf record session.
> > >
> > > $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> > >
> > > event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> > >
> > > This 'branch_sample_type' is used to configure the BRBE hardware, when both
> > > events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> > > PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> > > remain the same for all events. Let's consider the following example
> > >
> > > $perf record -e cycles:u -e instructions:k -j any,save_type ls
> > >
> > > cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> > >
> > > Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
> > > inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
> > > BRBE hardware with 'branch_sample_type' from last event to be added in the
> > > PMU and hence captured branch records only get passed on to matching events
> > > during a PMU interrupt.
> > >
> >
> > Hi Anshuman,
> >
> > Surely because of this example we should merge? At least we have to try
> > to make the most common basic command lines work. Unless we expect all
> > tools to know whether the branch buffer is shared between PMUs on each
> > architecture or not. The driver knows though, so can merge the settings
> > because it all has to go into one BRBE.
>
> The difficulty here is that these are opened as independent events (not
> in the same event group), and so from the driver's PoV, this is no
> different to two users independently doing:
>
> perf record -e event:u -j any,save_type -p ${SOME_PID}
>
> perf record -e event:k -j any,save_type -p ${SOME_PID}
>
> .. where either would be surprised to get the merged result.

I took a look at how x86 handles this, and it looks like they may have the
problem we'd like to avoid. AFAICT, intel_pmu_lbr_add() blats cpuc->br_sel with
the branch selection of the last event added.

So I took a look at what happens on my x86-64 desktop running v5.10.0-9-amd64
from Debian 11.

Running the following program:

| int main (int argc, char *argv[])
| {
| for (;;) {
| asm volatile("" ::: "memory");
| }
|
| return 0;
| }

I set /proc/sys/kernel/perf_event_paranoid to 2 and started two independent
perf sessions:

perf record -e cycles:u -j any -o perf-user.data -p 1320224

sudo perf record -e cycles:k -j any -o perf-kernel.data -p 1320224

.. after ~10 seconds, I killed both sessions with ^C.

When I subsequently do 'perf report -i perf-kernel.data', I see:

| Samples: 295 of event 'cycles:k', Event count (approx.): 295
| Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
| 99.66% loop loop [k] main [k] main -
| 0.34% loop [kernel.kallsyms] [k] native_irq_return_iret [k] main -

.. where the user symbols are surprising.

Similarly for 'perf report -i perf-user.data', I see:

| Samples: 198K of event 'cycles:u', Event count (approx.): 198739
| Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
| 99.99% loop loop [.] main [.] main -
| 0.00% loop [unknown] [.] 0xffffffff87801007 [.] main -
| 0.00% loop [unknown] [.] 0xffffffff86e05626 [.] 0xffffffff86e05629 -
| 0.00% loop [unknown] [.] 0xffffffff86e0563d [.] 0xffffffff86e0c850 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c86f [.] 0xffffffff86e6b3f0 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c884 [.] 0xffffffff86e11ed0 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c88a [.] 0xffffffff86e13850 -
| 0.00% loop [unknown] [.] 0xffffffff86e11eee [.] 0xffffffff86e0c889 -
| 0.00% loop [unknown] [.] 0xffffffff86e13885 [.] 0xffffffff86e13888 -
| 0.00% loop [unknown] [.] 0xffffffff86e13889 [.] 0xffffffff86e138a1 -
| 0.00% loop [unknown] [.] 0xffffffff86e138a9 [.] 0xffffffff86e6b320 -
| 0.00% loop [unknown] [.] 0xffffffff86e138c3 [.] 0xffffffff86e6b3f0 -
| 0.00% loop [unknown] [.] 0xffffffff86e6b33a [.] 0xffffffff86e138ae -
| 0.00% loop [unknown] [.] 0xffffffff86e6b3fb [.] 0xffffffff86e0c874 -
| 0.00% loop [unknown] [.] 0xffffffff86ff6c91 [.] 0xffffffff87a01ca0 -
| 0.00% loop [unknown] [.] 0xffffffff87a01ca0 [.] 0xffffffff87a01ca5 -
| 0.00% loop [unknown] [.] 0xffffffff87a01ca5 [.] 0xffffffff87a01cb1 -
| 0.00% loop [unknown] [.] 0xffffffff87a01cb5 [.] 0xffffffff86e05600 -

Where the unknown (kernel!) samples are surprising.

Peter, do you have any opinion on this?

My thinking is that the "last scheduled event branch selection wins"
isn't the behaviour we actually want, and either:

(a) Conflicting events shouldn't be scheduled concurrently (e.g. treat
that like running out of counters).

(b) The HW filters should be configured to allow anything permitted by
any of the events, and SW filtering should remove the unexpected
records on a per-event basis.

.. but I imagine (b) is hard maybe? I don't know if LBR tells you which
CPU mode the src/dst were in.

Mark.

2024-06-03 05:13:12

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 1/9] arm64/sysreg: Add BRBE registers and fields



On 5/21/24 18:54, Mark Rutland wrote:
> On Fri, Apr 05, 2024 at 08:16:31AM +0530, Anshuman Khandual wrote:
>> This adds BRBE related register definitions and various other related field
>> macros there in. These will be used subsequently in a BRBE driver, which is
>> being added later on. While here, this drops redundant register definitions
>> from the header i.e (arch/arm64/include/asm/sysreg.h).
>>
>> BRBINFx_EL1_TYPE_IMPDEF_TRAP_EL3 register field value has been derived from
>> latest ARM DDI 0601 ID121123, AKA 2023-12 instead of latest ARM ARM i.e ARM
>> DDI 0487J.a. Please find the definition here.
>>
>> https://developer.arm.com/documentation/ddi0601/2023-12/
>
> The K.a release of the ARM ARM has now been published, and that includes
> the "0b110000 IMPLEMENTATION DEFINED exception to EL3" value, so we can
> point to that now.

Sure.

>
> With that in mind, for the commit message we can say:
>
> | This patch adds definitions related to the Branch Record Buffer
> | Extension (BRBE) as per ARM DDI 0487K.a. These will be used by KVM and
> | a BRBE driver in subsequent patches.
> |
> | Some existing BRBE definitions in asm/sysreg.h are replaced with
> | equivalent generated definitions.

Replaced the commit message wording as suggested.

>
> Aside from a couple of minor issues below, this looks good to me. With
> those fixed up (and the commit message above):
>
> Reviewed-by: Mark Rutland <[email protected]>

Thanks Mark.

>
> [...]
>
>> +Sysreg BRBIDR0_EL1 2 1 9 2 0
>> +Res0 63:16
>> +Enum 15:12 CC
>> + 0b101 20_BIT
>> +EndEnum
>
> Please pad this to the width of the field, i.e.
>
> 0b0101 20_BIT

Done.

>
>> +Enum 11:8 FORMAT
>> + 0b0 0
>> +EndEnum
>
> Likewise, this should be padded to the width of the field.

Done.

>
> As with the BRBFCR_EL1.BANK definitions, I reckon it's worth giving the
> enum value a prefix, so please make this:
>
> 0b0000 FORMAT_0

Sure, will change as suggested.

2024-06-03 05:31:31

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 2/9] KVM: arm64: Explicitly handle BRBE traps as UNDEFINED



On 5/21/24 18:56, Mark Rutland wrote:
> Hi Anshuman,
>
> On Fri, Apr 05, 2024 at 08:16:32AM +0530, Anshuman Khandual wrote:
>> The Branch Record Buffer Extension (BRBE) adds a number of system registers
>> and instructions, which we don't currently intend to expose to guests. Our
>> existing logic handles this safely, but this could be improved with some
>> explicit handling of BRBE.
>>
>> The presence of BRBE is currently hidden from guests as the cpufeature
>> code's ftr_id_aa64dfr0[] table doesn't have an entry for the BRBE field,
>> and so this will be zero in the sanitised value of ID_AA64DFR0 exposed to
>> guests via read_sanitised_id_aa64dfr0_el1(). As the ftr_id_aa64dfr0[] table
>> may gain an entry for the BRBE field in future, for robustness we should
>> explicitly mask out the BRBE field in read_sanitised_id_aa64dfr0_el1().
>>
>> The BRBE system registers and instructions are currently trapped by the
>> existing configuration of the fine-grained traps. As neither the registers
>> are not described in the sys_reg_descs[] nor the instructions are described
>> in the sys_insn_descs[] table, emulate_sys_reg() will warn that these are
>> unknown before injecting an UNDEFINED exception into the guest.
>
> That last sentence doesn't make sense, and I think it has been mangled.
> My suggested text in v16 was:
>
> | As the registers and instructions are not described in the
> | sys_reg_descs[] table, emulate_sys_reg() will warn that these are
> | unknown before injecting an UNDEFINED exception into the guest.
>
> ... and I'd be happy with changing that to:
>
> | As neither the registers nor the instructions are described in the
> | sys_reg_descs[] table, emulate_sys_reg() will warn that these are
> | unknown before injecting an UNDEFINED exception into the guest.

Sure, will replace the last sentence as suggested.

>
>> Well-behaved guests shouldn't try to use the registers or instructions, but
>> badly-behaved guests could use these, resulting in unnecessary warnings.
>
> I see that I had mangled this sentence originally -- thanks for
> correcting that; this looks fine to me.
>
>> To avoid those warnings, we should explicitly handle the BRBE
>> registers, and instructions as UNDEFINED.
>
> I think the added comma is unnecessary and makes this hard to read. My
> suggested text in v16 was:
>
> | To avoid those warnings, we should explicitly handle the BRBE
> | registers and instructions as UNDEFINED.

Sure, will change.

>
>> Address the above by having read_sanitised_id_aa64dfr0_el1() mask out the
>> ID_AA64DFR0.BRBE field, and by adding sys_reg_descs entries for all of the
>> BRBE system registers, also by adding sys_insn_descs entries for all of the
>> BRBE instructions, treating these all as UNDEFINED.
>
> Similarly, I think this was clearer as I originally suggested:
>
> | Address the above by having read_sanitised_id_aa64dfr0_el1() mask out
> | the ID_AA64DFR0.BRBE field, and by adding sys_reg_desc entries for all
> | of the BRBE system registers and instructions, treating these all as
> | UNDEFINED.
>
> ... or we can simplify that to:
>
> | Address the above by having read_sanitised_id_aa64dfr0_el1() mask out
> | the ID_AA64DFR0.BRBE field, and explicitly handling all of the BRBE
> | system registers and instructions as UNDEFINED.

Sure, will change the above as suggested.

After all these changes, the final commit message looks like the following.

KVM: arm64: Explicitly handle BRBE traps as UNDEFINED

The Branch Record Buffer Extension (BRBE) adds a number of system registers
and instructions, which we don't currently intend to expose to guests. Our
existing logic handles this safely, but this could be improved with some
explicit handling of BRBE.

The presence of BRBE is currently hidden from guests as the cpufeature
code's ftr_id_aa64dfr0[] table doesn't have an entry for the BRBE field,
and so this will be zero in the sanitised value of ID_AA64DFR0 exposed to
guests via read_sanitised_id_aa64dfr0_el1(). As the ftr_id_aa64dfr0[] table
may gain an entry for the BRBE field in future, for robustness we should
explicitly mask out the BRBE field in read_sanitised_id_aa64dfr0_el1().

The BRBE system registers and instructions are currently trapped by the
existing configuration of the fine-grained traps. As neither the registers
nor the instructions are described in the sys_reg_descs[] table,
emulate_sys_reg() will warn that these are unknown before injecting an
UNDEFINED exception into the guest.

Well-behaved guests shouldn't try to use the registers or instructions, but
badly-behaved guests could use these, resulting in unnecessary warnings. To
avoid those warnings, we should explicitly handle the BRBE registers and
instructions as UNDEFINED.

Address the above by having read_sanitised_id_aa64dfr0_el1() mask out the
ID_AA64DFR0.BRBE field, and explicitly handling all of the BRBE system
registers and instructions as UNDEFINED.

>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Oliver Upton <[email protected]>
>> Cc: James Morse <[email protected]>
>> Cc: Suzuki K Poulose <[email protected]>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ----
>> Changes in V17:
>>
>> - Updated the commit message including about sys_insn_descs[]
>> - Changed KVM to use existing SYS_BRBSRC/TGT/INF_EL1(n) format
>> - Moved the BRBE instructions into sys_insn_descs[] array
>
> Aside from my nits on the commit message above, these changes all look
> good to me. So with the commit message cleaned up as above:
>
> Reviewed-by: Mark Rutland <[email protected]>

Thanks Mark.

>
> Mark.
>
>>
>> arch/arm64/kvm/sys_regs.c | 56 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 56 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index c9f4f387155f..3981aa32c5a3 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -1304,6 +1304,11 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
>> return 0;
>> }
>>
>> +#define BRB_INF_SRC_TGT_EL1(n) \
>> + { SYS_DESC(SYS_BRBINF_EL1(n)), undef_access }, \
>> + { SYS_DESC(SYS_BRBSRC_EL1(n)), undef_access }, \
>> + { SYS_DESC(SYS_BRBTGT_EL1(n)), undef_access } \
>> +
>> /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
>> #define DBG_BCR_BVR_WCR_WVR_EL1(n) \
>> { SYS_DESC(SYS_DBGBVRn_EL1(n)), \
>> @@ -1708,6 +1713,9 @@ static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
>> /* Hide SPE from guests */
>> val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;
>>
>> + /* Hide BRBE from guests */
>> + val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
>> +
>> return val;
>> }
>>
>> @@ -2226,6 +2234,52 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>> { SYS_DESC(SYS_DBGCLAIMCLR_EL1), trap_raz_wi },
>> { SYS_DESC(SYS_DBGAUTHSTATUS_EL1), trap_dbgauthstatus_el1 },
>>
>> + /*
>> + * BRBE branch record sysreg address space is interleaved between
>> + * corresponding BRBINF<N>_EL1, BRBSRC<N>_EL1, and BRBTGT<N>_EL1.
>> + */
>> + BRB_INF_SRC_TGT_EL1(0),
>> + BRB_INF_SRC_TGT_EL1(16),
>> + BRB_INF_SRC_TGT_EL1(1),
>> + BRB_INF_SRC_TGT_EL1(17),
>> + BRB_INF_SRC_TGT_EL1(2),
>> + BRB_INF_SRC_TGT_EL1(18),
>> + BRB_INF_SRC_TGT_EL1(3),
>> + BRB_INF_SRC_TGT_EL1(19),
>> + BRB_INF_SRC_TGT_EL1(4),
>> + BRB_INF_SRC_TGT_EL1(20),
>> + BRB_INF_SRC_TGT_EL1(5),
>> + BRB_INF_SRC_TGT_EL1(21),
>> + BRB_INF_SRC_TGT_EL1(6),
>> + BRB_INF_SRC_TGT_EL1(22),
>> + BRB_INF_SRC_TGT_EL1(7),
>> + BRB_INF_SRC_TGT_EL1(23),
>> + BRB_INF_SRC_TGT_EL1(8),
>> + BRB_INF_SRC_TGT_EL1(24),
>> + BRB_INF_SRC_TGT_EL1(9),
>> + BRB_INF_SRC_TGT_EL1(25),
>> + BRB_INF_SRC_TGT_EL1(10),
>> + BRB_INF_SRC_TGT_EL1(26),
>> + BRB_INF_SRC_TGT_EL1(11),
>> + BRB_INF_SRC_TGT_EL1(27),
>> + BRB_INF_SRC_TGT_EL1(12),
>> + BRB_INF_SRC_TGT_EL1(28),
>> + BRB_INF_SRC_TGT_EL1(13),
>> + BRB_INF_SRC_TGT_EL1(29),
>> + BRB_INF_SRC_TGT_EL1(14),
>> + BRB_INF_SRC_TGT_EL1(30),
>> + BRB_INF_SRC_TGT_EL1(15),
>> + BRB_INF_SRC_TGT_EL1(31),
>> +
>> + /* Remaining BRBE sysreg address space */
>> + { SYS_DESC(SYS_BRBCR_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBFCR_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBTS_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBINFINJ_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBSRCINJ_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBTGTINJ_EL1), undef_access },
>> + { SYS_DESC(SYS_BRBIDR0_EL1), undef_access },
>> +
>> { SYS_DESC(SYS_MDCCSR_EL0), trap_raz_wi },
>> { SYS_DESC(SYS_DBGDTR_EL0), trap_raz_wi },
>> // DBGDTR[TR]X_EL0 share the same encoding
>> @@ -2738,6 +2792,8 @@ static struct sys_reg_desc sys_insn_descs[] = {
>> { SYS_DESC(SYS_DC_CISW), access_dcsw },
>> { SYS_DESC(SYS_DC_CIGSW), access_dcgsw },
>> { SYS_DESC(SYS_DC_CIGDSW), access_dcgsw },
>> + { SYS_DESC(OP_BRB_IALL), undef_access },
>> + { SYS_DESC(OP_BRB_INJ), undef_access },
>> };
>>
>> static const struct sys_reg_desc *first_idreg;
>> --
>> 2.25.1
>>

2024-06-03 06:43:07

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 3/9] drivers: perf: arm_pmu: Add infrastructure for branch stack sampling



On 5/21/24 19:14, Mark Rutland wrote:
> On Fri, Apr 05, 2024 at 08:16:33AM +0530, Anshuman Khandual wrote:
>> In order to support the Branch Record Buffer Extension (BRBE), we need to
>> extend the arm_pmu framework with some basic infrastructure for the branch
>> stack sampling which arm_pmu drivers can opt-in using a new feature flag
>> called 'has_branch_stack'. Subsequent patches will use this to add support
>> for BRBE in the PMUv3 driver.
>
> Please, just use the *exact* wording I asked for last time:
>
> | In order to support the Branch Record Buffer Extension (BRBE), we need to
> | extend the arm_pmu framework with some basic infrastructure for branch stack
> | sampling which arm_pmu drivers can opt-in to using. Subsequent patches will
> | use this to add support for BRBE in the PMUv3 driver.
>
> At this point in the commit message, the 'has_branch_stack' flag doesn't
> matter, and dropping the 'to' after 'opt-in' makes this painful to read.

Okay, will replace with the original paragraph.

>
>> Branch stack sampling support i.e capturing branch records during execution
>> in core perf, rides along with normal HW events being scheduled on the PMU.
>> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
>> with required HW implementation.
>
> Please delete this paragraph.

Done.

>
>> With BRBE, the hardware records branches into a hardware FIFO, which will
>> be sampled by software when perf events overflow. A task may be context-
>> switched an arbitrary number of times between overflows, and to avoid
>> losing samples we need to save the current records when a task is context-
>> switched out. To do these we'll need to use the pmu::sched_task() callback,
>> and we'll also need to allocate some per-task storage space via event flag
>> PERF_ATTACH_TASK_DATA.
>
> [...]
>
>> /* The events for a given PMU register set. */
>> struct pmu_hw_events {
>> /*
>> @@ -66,6 +78,17 @@ struct pmu_hw_events {
>> struct arm_pmu *percpu_pmu;
>>
>> int irq;
>> +
>> + struct branch_records *branches;
>> +
>> + /* Active context for task events */
>> + void *branch_context;
>> +
>> + /* Active events requesting branch records */
>> + unsigned int branch_users;
>> +
>> + /* Active branch sample type filters */
>> + unsigned long branch_sample_type;
>> };
>
> At this point in the series I understand why we have the 'branches' and
> 'branch_users' fields, but the 'branch_context' and 'branch_sample_type'
> fields haven't been introduced and are not obvious.
>
> What exactly is branch_context, and why is that a 'void *' ?

branch_context tracks event->ctx which is 'struct perf_event_context *'. The
'void *' seemed more generic in case this tracking structure changes later.
But this could be changed to 'struct perf_event_context *' if required.

>
> I can understand if that's a PMU-specific structure to track the active
> branch records, but if so I don't understand why 'branch_sample_type'
> isn't folded into that.

branch_sample_type is applicable to both cpu-bound and task-bound events,
whereas branch_context is applicable only to task-bound events, tracking
their active branch records that need to be dropped (or saved) in case a
cpu-bound event comes in.
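
i.e. roughly the following in armpmu_add(), as quoted in the cover
letter (a sketch using this series' field and callback names, not the
final code):

	/*
	 * A task event from a different perf_event_context is being
	 * added: reset the branch records so the previous context's
	 * records cannot leak into the new event's samples.
	 */
	if (event->ctx->task && hw_events->branch_context != event->ctx) {
		hw_events->branch_context = event->ctx;
		if (armpmu->branch_stack_reset)
			armpmu->branch_stack_reset();
	}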

2024-06-03 09:16:49

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 4/9] arm64/boot: Enable EL2 requirements for BRBE



On 5/29/24 16:21, Mark Rutland wrote:
> On Fri, Apr 05, 2024 at 08:16:34AM +0530, Anshuman Khandual wrote:
>> Fine grained trap control for BRBE registers, and instructions access need
>> to be configured in HDFGRTR_EL2, HDFGWTR_EL2 and HFGITR_EL2 registers when
>> kernel enters at EL1 but EL2 is present. This changes __init_el2_fgt() as
>> required.
>>
>> Similarly cycle and mis-prediction capture need to be enabled in BRBCR_EL1
>> and BRBCR_EL2 when the kernel enters either into EL1 or EL2. This adds new
>> __init_el2_brbe() to achieve this objective.
>>
>> This also updates Documentation/arch/arm64/booting.rst with all the above
>> EL2 along with MDCR_EL3.SBRBE requirements.
>>
>> First this replaces an existing hard encoding (1 << 62) with corresponding
>> applicable macro HDFGRTR_EL2_nPMSNEVFR_EL1_MASK.
>>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Jonathan Corbet <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Oliver Upton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ----
>> Changes in V17:
>>
>> - New patch added in the series
>> - Separated out from the BRBE driver implementation patch
>> - Dropped the comment in __init_el2_brbe()
>> - Updated __init_el2_brbe() with BRBCR_EL2.MPRED requirements
>> - Updated __init_el2_brbe() with __check_hvhe() constructs
>> - Updated booting.rst regarding MPRED, MDCR_EL3 and fine grained control
>>
>> Documentation/arch/arm64/booting.rst | 26 ++++++++
>> arch/arm64/include/asm/el2_setup.h | 90 +++++++++++++++++++++++++++-
>> 2 files changed, 113 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
>> index b57776a68f15..512210da7dd2 100644
>> --- a/Documentation/arch/arm64/booting.rst
>> +++ b/Documentation/arch/arm64/booting.rst
>> @@ -349,6 +349,32 @@ Before jumping into the kernel, the following conditions must be met:
>>
>> - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
>>
>> + For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
>> +
>> + - If the kernel is entered at EL2 and EL1 is present:
>> +
>> + - BRBCR_EL1.CC (bit 3) must be initialised to 0b1.
>> + - BRBCR_EL1.MPRED (bit 4) must be initialised to 0b1.
>
> IIUC this isn't necessary; if the kernel is entered at EL2, it's capable
> of initializing the EL1 regs, and it doesn't look like this silently
> affects something we'd need in the absence of a BRBE driver.

No, this does not affect anything other than the BRBE driver.

>
> AFAICT the __init_el2_brbe() code you add below handles this, so I think
> this is redundant and can be deleted.

Did not understand the above. __init_el2_brbe() handles setting both BRBCR_EL2
and BRBCR_EL1 for the CC and MPRED config irrespective of whether the kernel
enters EL2 directly or enters EL1 via EL2. But should that not be documented
here for both those scenarios? Or, because once the kernel is in EL2 it can
configure EL1 as required, is it no longer a booting requirement?

>
>> + - If the kernel is entered at EL1 and EL2 is present:
>> +
>> + - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
>> + - BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.
>> +
>> + - HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
>> + - HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
>> + - HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1.
>> +
>> + - HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
>> + - HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1.
>> +
>> + - HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
>> + - HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1.
>> +
>> + - If EL3 is present:
>> +
>> + - MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b11.
>
> Minor nit: please list the EL3 requirements first, that way this can be
> read in EL3->EL2->EL1 order to match the FW boot-flow order.

Makes sense, will change the order.

>
>> +
>> For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
>>
>> - If EL3 is present:
>> diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
>> index b7afaa026842..7c12a8e658d4 100644
>> --- a/arch/arm64/include/asm/el2_setup.h
>> +++ b/arch/arm64/include/asm/el2_setup.h
>> @@ -154,6 +154,41 @@
>> .Lskip_set_cptr_\@:
>> .endm
>>
>> +#ifdef CONFIG_ARM64_BRBE
>> +/*
>> + * Enable BRBE cycle count and miss-prediction
>> + *
>> + * BRBE requires both BRBCR_EL1.CC and BRBCR_EL2.CC fields, be set
>> + * for the cycle counts to be available in BRBINF<N>_EL1.CC during
>> + * branch record processing after a PMU interrupt. This enables CC
>> + * field on both these registers while still executing inside EL2.
>
> Huh, it's a bit of an oddity to do that for a register that gets the E2H
> treatment, but that is what the ARM ARM says, looking at the pseudocode
> in ARM DDI 0487K.a:
>
> | // BRBCycleCountingEnabled()
> | // =========================
> | // Returns TRUE if the recording of cycle counts is allowed,
> | // FALSE otherwise.
> | boolean BRBCycleCountingEnabled()
> | if HaveEL(EL2) && BRBCR_EL2.CC == '0' then return FALSE;
> | if BRBCR_EL1.CC == '0' then return FALSE;
> | return TRUE;
>
> ... and likewise for MPRED:
>
> | // BRBEMispredictAllowed()
> | // =======================
> | // Returns TRUE if the recording of branch misprediction is allowed,
> | // FALSE otherwise.
> | boolean BRBEMispredictAllowed()
> | if HaveEL(EL2) && BRBCR_EL2.MPRED == '0' then return FALSE;
> | if BRBCR_EL1.MPRED == '0' then return FALSE;
> | return TRUE;
>
> ... though BRBCycleCountingEnabled() isn't actually used anywhere, while
> BRBEMispredictAllowed is used by BRBEBranch(), since that does:
>
> | (ccu, cc) = BranchEncCycleCount();
> | ...
> | bit mispredict = if BRBEMispredictAllowed() && BranchMispredict() then '1' else '0';
>
> ... where BranchEncCycleCount() is a stub that doesn't mention
> BRBCycleCountingEnabled() at all, so it's not clear to me whether CCU is
> guaranteed to be set.
>
>> + *
>> + * BRBE driver would still be able to toggle branch records cycle
>> + * count support via BRBCR_EL1.CC field regardless of whether the
>> + * kernel ends up executing in EL1 or EL2.
>> + *
>> + * The same principle applies for branch record mis-prediction info
>> + * as well, thus requiring MPRED field to be set on both BRBCR_EL1
>> + * and BRBCR_EL2 while still executing inside EL2.
>> + */
>
> I think we can clarify this comment to:
>
> /*
> * Enable BRBE to record cycle counts and branch mispredicts.
> *
> * At any EL, to record cycle counts BRBE requires that both
> * BRBCR_EL2.CC=1 and BRBCR_EL1.CC=1.
> *
> * At any EL, to record branch mispredicts BRBE requires that both
> * BRBCR_EL2.MPRED=1 and BRBCR_EL1.MPRED=1.
> *
> * When HCR_EL2.E2H=1, the BRBCR_EL1 encoding is redirected to
> * BRBCR_EL2, but the {CC,MPRED} bits in the real BRBCR_EL1 register
> * still apply.
> *
> * Set {CC,MPRED} in both BRBCR_EL2 and BRBCR_EL1 so that at runtime we
> * only need to enable/disable these in BRBCR_EL1 regardless of whether
> * the kernel ends up executing in EL1 or EL2.
> */

Sure, will update this comment.

>
>> +.macro __init_el2_brbe
>> + mrs x1, id_aa64dfr0_el1
>> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
>> + cbz x1, .Lskip_brbe_\@
>> +
>> + mov_q x0, BRBCR_ELx_CC | BRBCR_ELx_MPRED
>> + msr_s SYS_BRBCR_EL2, x0
>> +
>> + __check_hvhe .Lset_brbe_nvhe_\@, x1
>> + msr_s SYS_BRBCR_EL12, x0 // VHE
>> + b .Lskip_brbe_\@
>> +
>> +.Lset_brbe_nvhe_\@:
>> + msr_s SYS_BRBCR_EL1, x0 // NVHE
>> +.Lskip_brbe_\@:
>> +.endm
>> +#endif /* CONFIG_ARM64_BRBE */
>> +
>> /* Disable any fine grained traps */
>> .macro __init_el2_fgt
>> mrs x1, id_aa64mmfr0_el1
>> @@ -161,16 +196,48 @@
>> cbz x1, .Lskip_fgt_\@
>>
>> mov x0, xzr
>> + mov x2, xzr
>> mrs x1, id_aa64dfr0_el1
>> ubfx x1, x1, #ID_AA64DFR0_EL1_PMSVer_SHIFT, #4
>> cmp x1, #3
>> b.lt .Lset_debug_fgt_\@
>> +
>> /* Disable PMSNEVFR_EL1 read and write traps */
>> - orr x0, x0, #(1 << 62)
>> + orr x0, x0, #HDFGRTR_EL2_nPMSNEVFR_EL1_MASK
>> + orr x2, x2, #HDFGWTR_EL2_nPMSNEVFR_EL1_MASK
>>
>> .Lset_debug_fgt_\@:
>> +#ifdef CONFIG_ARM64_BRBE
>> + mrs x1, id_aa64dfr0_el1
>> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
>> + cbz x1, .Lskip_brbe_reg_fgt_\@
>> +
>> + /*
>> + * Disable read traps for the following registers
>> + *
>> + * [BRBSRC|BRBTGT|BRBINF]_EL1
>> + * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
>> + */
>> + orr x0, x0, #HDFGRTR_EL2_nBRBDATA_MASK
>> +
>> + /*
>> + * Disable write traps for the following registers
>> + *
>> + * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
>> + */
>> + orr x2, x2, #HDFGWTR_EL2_nBRBDATA_MASK
>> +
>> + /* Disable read and write traps for [BRBCR|BRBFCR]_EL1 */
>> + orr x0, x0, #HDFGRTR_EL2_nBRBCTL_MASK
>> + orr x2, x2, #HDFGWTR_EL2_nBRBCTL_MASK
>> +
>> + /* Disable read traps for BRBIDR_EL1 */
>> + orr x0, x0, #HDFGRTR_EL2_nBRBIDR_MASK
>> +
>> +.Lskip_brbe_reg_fgt_\@:
>> +#endif /* CONFIG_ARM64_BRBE */
>> msr_s SYS_HDFGRTR_EL2, x0
>> - msr_s SYS_HDFGWTR_EL2, x0
>> + msr_s SYS_HDFGWTR_EL2, x2
>>
>> mov x0, xzr
>> mrs x1, id_aa64pfr1_el1
>> @@ -193,7 +260,21 @@
>> .Lset_fgt_\@:
>> msr_s SYS_HFGRTR_EL2, x0
>> msr_s SYS_HFGWTR_EL2, x0
>> - msr_s SYS_HFGITR_EL2, xzr
>> + mov x0, xzr
>> +#ifdef CONFIG_ARM64_BRBE
>> + mrs x1, id_aa64dfr0_el1
>> + ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
>> + cbz x1, .Lskip_brbe_insn_fgt_\@
>> +
>> + /* Disable traps for BRBIALL instruction */
>> + orr x0, x0, #HFGITR_EL2_nBRBIALL_MASK
>> +
>> + /* Disable traps for BRBINJ instruction */
>> + orr x0, x0, #HFGITR_EL2_nBRBINJ_MASK
>> +
>> +.Lskip_brbe_insn_fgt_\@:
>> +#endif /* CONFIG_ARM64_BRBE */
>> + msr_s SYS_HFGITR_EL2, x0
>>
>> mrs x1, id_aa64pfr0_el1 // AMU traps UNDEF without AMU
>> ubfx x1, x1, #ID_AA64PFR0_EL1_AMU_SHIFT, #4
>> @@ -228,6 +309,9 @@
>> __init_el2_nvhe_idregs
>> __init_el2_cptr
>> __init_el2_fgt
>> +#ifdef CONFIG_ARM64_BRBE
>> + __init_el2_brbe
>> +#endif
>
> This largely looks fine, but I note that we haven't bothered with
> ifdeffery for PMU and SPE, so I suspect it might be worth getting rid of
> the ifdeffery for BRBE.

Sure, will drop the #ifdef

>
> Mark.
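
For reference, the gating logic in the quoted DDI 0487K.a pseudocode reduces to
the following C sketch (a minimal rendering; have_el2() and the brbcr_*()
accessors are hypothetical stand-ins for the pseudocode functions):

/*
 * Minimal C rendering of BRBCycleCountingEnabled() and
 * BRBEMispredictAllowed(): when EL2 exists, both the EL2 and the
 * EL1 control bits must be set for the information to be recorded.
 */
static bool brbe_cycle_counting_enabled(void)
{
	if (have_el2() && !brbcr_el2_cc())
		return false;
	return brbcr_el1_cc();
}

static bool brbe_mispredict_allowed(void)
{
	if (have_el2() && !brbcr_el2_mpred())
		return false;
	return brbcr_el1_mpred();
}

This is why __init_el2_brbe() above sets {CC,MPRED} in both BRBCR_EL2 and
BRBCR_EL1 up front, regardless of which EL the kernel ends up running at.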

2024-06-03 09:25:20

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 5/30/24 15:33, James Clark wrote:
>
>
> On 05/04/2024 03:46, Anshuman Khandual wrote:
>> [...]
>>
>> There are still some open questions regarding handling multiple perf events
>> with different privilege branch filters getting on the same PMU, supporting
>> guest branch stack tracing from the host etc. Finally also looking for some
>> suggestions regarding supporting BRBE inside the guest. The series has been
>> re-organized completely as suggested earlier.
>
> For guest support I'm still of this opinion:
>
> * No support for the host looking into guests (the addresses don't
> make sense anyway without also running Perf record in the guest)
> * Save and restore the host buffer and registers on guest switch (if
> it was ever used by either host or guest)
> * Let the guest do whatever it wants with BRBE without any
> virtualisation
>
> Merging this with the current PMU virtualisation stuff seems like a lot
> of work for no use case (host looking into guests). Having said that, it
> might not even be worth discussing on this patchset apart from "no guest
> support", and we can do it later to avoid confusion that it's being
> proposed for this version.

Agreed, let's just have "no guest support" for now in this proposed series
without any additional changes, to keep things simpler and separated.
I will also update the cover letter next time around to make this clear.

2024-06-03 09:45:09

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 4/9] arm64/boot: Enable EL2 requirements for BRBE

On Mon, Jun 03, 2024 at 02:41:32PM +0530, Anshuman Khandual wrote:
> On 5/29/24 16:21, Mark Rutland wrote:
> > On Fri, Apr 05, 2024 at 08:16:34AM +0530, Anshuman Khandual wrote:

> >> + For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
> >> +
> >> + - If the kernel is entered at EL2 and EL1 is present:
> >> +
> >> + - BRBCR_EL1.CC (bit 3) must be initialised to 0b1.
> >> + - BRBCR_EL1.MPRED (bit 4) must be initialised to 0b1.
> >
> > IIUC this isn't necessary; if the kernel is entered at EL2, it's capable
> > of initializing the EL1 regs, and it doesn't look like this silently
> > affects something we'd need in the absence of a BRBE driver.
>
> No, this does not affect anything other than the BRBE driver.

Ok.

> > AFAICT the __init_el2_brbe() code you add below handles this, so I think
> > this is redundant and can be deleted.
>
> Did not understand the above. __init_el2_brbe() handles setting both BRBCR_EL2
> and BRBCR_EL1 for the CC and MPRED config, irrespective of whether the kernel
> enters EL2 directly or enters EL1 via EL2. But shouldn't that be documented here
> for both those scenarios? Or, because once the kernel is in EL2 it can configure
> EL1 as required, is it no longer a booting requirement?

The latter -- since the kernel can set this up, and only needs to do
that for the BRBE driver to work, this doesn't need to be a requirement
on the FW/bootloader. So we can drop this when the kernel is booted at EL2.

Mark.

2024-06-03 09:45:52

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

On Mon, Jun 03, 2024 at 02:48:58PM +0530, Anshuman Khandual wrote:
> [...]
>
> Agreed, let's just have "no guest support" for now in this proposed series
> without any additional changes, to keep things simpler and separated.
> I will also update the cover letter next time around to make this clear.

FWIW, that sounds good to me.

Mark.

2024-06-06 03:58:27

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 5/30/24 15:17, James Clark wrote:
>
>
> On 05/04/2024 03:46, Anshuman Khandual wrote:
> [...]
>>
>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>
>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>> remain the same for all the events during a perf record session.
>>
>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>
>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>
>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>> remain the same for all events. Let's consider the following example
>>
>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>
>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>
>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>> PMU and hence captured branch records only get passed on to matching events
>> during a PMU interrupt.
>>
>
> Hi Anshuman,
>
> Surely because of this example we should merge? At least we have to try
> to make the most common basic command lines work. Unless we expect all
> tools to know whether the branch buffer is shared between PMUs on each
> architecture or not. The driver knows though, so can merge the settings
> because it all has to go into one BRBE.
>
> Merging the settings in tools would be a much harder problem.

Alright, makes sense.

>
> Any user that doesn't have permission for anything in the merged result
> can continue to get nothing.
>
> And we can always add filtering in the kernel later on if we want to, to
> make it appear to behave even more normally.

Understood.

>
>> static int
>> armpmu_add(struct perf_event *event, int flags)
>> {
>> ........
>> if (has_branch_stack(event)) {
>> /*
>> * Reset branch records buffer if a new task event gets
>> * scheduled on a PMU which might have existing records.
>> * Otherwise older branch records present in the buffer
>> * might leak into the new task event.
>> */
>> if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>> hw_events->brbe_context = event->ctx;
>> if (armpmu->branch_reset)
>> armpmu->branch_reset();
>> }
>> hw_events->brbe_users++;
>> Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type;
>> }
>> ........
>> }
>>
>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>
>
> I can't see any use case where anyone would want the override behavior.
> Especially if you imagine multiple users not even aware of each other.
> Either the current "no records for mismatches" or the merged one makes sense.

Hence I had listed all three of the available options.

2024-06-06 04:58:55

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

On 5/30/24 23:11, Mark Rutland wrote:
> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>> [...]
>>>
>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>
>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>> remain the same for all the events during a perf record session.
>>>
>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>
>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>
>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both
>>> events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
>>> remain the same for all events. Let's consider the following example
>>>
>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>
>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>
>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event
>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>> BRBE hardware with 'branch_sample_type' from last event to be added in the
>>> PMU and hence captured branch records only get passed on to matching events
>>> during a PMU interrupt.
>>>
>>
>> Hi Anshuman,
>>
>> Surely because of this example we should merge? At least we have to try
>> to make the most common basic command lines work. Unless we expect all
>> tools to know whether the branch buffer is shared between PMUs on each
>> architecture or not. The driver knows though, so can merge the settings
>> because it all has to go into one BRBE.
>
> The difficulty here is that these are opened as independent events (not
> in the same event group), and so from the driver's PoV, this is no
> different to two users independently doing:
>
> perf record -e event:u -j any,save_type -p ${SOME_PID}
>
> perf record -e event:k -j any,save_type -p ${SOME_PID}
>
> ... where either would be surprised to get the merged result.

Right, that's the problem. The proposed idea here ensures that each event
will get only the branch records it expects, even though the sample size
might get reduced, as the HW filter overrides might not be evenly split
between the events during the perf session.

>
> So, if we merge the filters into the most permissive set, we *must*
> filter them when handing them to userspace so that each event gets the
> expected set of branch records.

Agreed, if the branch filters get merged into the most permissive set via
OR semantics, then the results must be filtered before being pushed into
the ring buffer for each individual event that has overflowed during the
PMU IRQ.

>
> Assuming we do that, for Anshuman's case above:
>
> perf record -e cycles:u -e instructions:k -j any,save_type ls
>
> ... the overflow of the cycles event will only record user branch
> records, and the overflow of the instructions event will only record
> kernel branch records.

Right.

>
> I think it's arguable either way as to whether users want that or merged
> records; we should probably figure out with the tools folk what the
> desired behaviour is for that command line, but as above I don't think
> that we can have the kernel give both events merged records unless
> userspace asks for that explicitly.

Right, we should not give merged results unless explicitly asked for by the
event. Otherwise that might break the semantics.

>
>> Merging the settings in tools would be a much harder problem.
>
> I can see that it'd be harder to do that up-front when parsing each
> event independently, but there is a phase where the tool knows about all
> of the events and *might* be able to do that, where the driver doesn't
> really know at any point that these events are related.
>
> Regardless, I assume that if the user passes:
>
> perf record -e cycles:u -e instructions:k -j any,u,k,save_type ls
>
> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
> is sufficient.

Kernel filtering will not be required in this case, as "-j any,u,k" overrides
both events' individual privilege requests, i.e. cycles:u and instructions:k. So
both events will receive branch records related to PERF_SAMPLE_BRANCH_USER and
PERF_SAMPLE_BRANCH_KERNEL. From the branch sample perspective, the privilege
filter is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
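
In code terms the check reduces to comparing privilege bits; a small
illustrative sketch (PRIV_MASK and the helper are hypothetical, not from the
series):

/*
 * Hypothetical sketch: with "-j any,u,k" every event carries both
 * privilege bits, so the merged HW filter matches each event's own
 * request and no in-kernel SW filtering is needed for privilege.
 */
#define PRIV_MASK	(PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL)

static bool priv_needs_sw_filter(u64 merged_type, u64 event_type)
{
	/*
	 * SW filtering is needed only when the event requested fewer
	 * privilege levels than the merged BRBE configuration captures.
	 */
	return (merged_type & PRIV_MASK) != (event_type & PRIV_MASK);
}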

>
>> Any user that doesn't have permission for anything in the merged result
>> can continue to get nothing.
>>
>> And we can always add filtering in the kernel later on if we want to, to
>> make it appear to behave even more normally.
>
> As above, I think if we merge the HW filters in the kernel then the
> kernel must do SW filtering. I don't think that's something we can leave
> for later.

Alright.

>
>>> static int
>>> armpmu_add(struct perf_event *event, int flags)
>>> {
>>> ........
>>> if (has_branch_stack(event)) {
>>> /*
>>> * Reset branch records buffer if a new task event gets
>>> * scheduled on a PMU which might have existing records.
>>> * Otherwise older branch records present in the buffer
>>> * might leak into the new task event.
>>> */
>>> if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>> hw_events->brbe_context = event->ctx;
>>> if (armpmu->branch_reset)
>>> armpmu->branch_reset();
>>> }
>>> hw_events->brbe_users++;
>>> Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>> }
>>> ........
>>> }
>>>
>>> Instead of overriding existing 'branch_sample_type', both could be merged.
>>>
>>
>> I can't see any use case where anyone would want the override behavior.
>> Especially if you imagine multiple users not even aware of each other.
>
> I completely agree that one event overriding another is not an
> acceptable solution.
>
>> Either the current "no records for mismatches" or the merged one makes sense.
>
> I think our options are:
>
> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
> treating this like a counter constraint).

That's the easiest solution and will keep things simple, but the downside is
that the sample size will probably get much reduced. Such scenarios should be
rare though, and hence won't matter much.
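
As an illustrative sketch only (not something proposed in the series), option
(1) could be enforced at add time roughly as below, reusing the fields from the
quoted armv8pmu_branch_stack_add():

/*
 * Hypothetical sketch of option (1): treat a branch filter conflict
 * like a counter constraint and fail scheduling of the new event.
 */
static int branch_stack_try_add(struct perf_event *event,
				struct pmu_hw_events *hw_events)
{
	/* The first branch stack user on this PMU sets the active filter */
	if (!hw_events->branch_users) {
		hw_events->branch_sample_type = event->attr.branch_sample_type;
		hw_events->branch_users++;
		return 0;
	}

	/* Conflicting filters cannot share the single BRBE buffer */
	if (hw_events->branch_sample_type != event->attr.branch_sample_type)
		return -EAGAIN;

	hw_events->branch_users++;
	return 0;
}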

>
> 2) Allow events with conflicting HW filters to be scheduled, merge the
> active HW filters, and SW filter the records in the kernel.
>
> 3) Allow events with conflicting branch filters to be scheduled, but
> only those which match the "active" filter get records.

That's the proposed solution. The "active" filter gets decided by whichever
event comes last, thus overriding the previous one, and the PMU interrupt
handler delivers branch records only to the matching events.

>
> So far (2) seems to me like the best option, and IIUC that's what x86
> does with LBR. I suspect that also justifies discarding records at
> context switch time, since we can't know how many relevant records were
> discarded due to conflicting records (and so cannot safely stitch
> records), and users have to expect that they may get fewer records than
> may exist in HW anyway.

So if we implement merging of branch filter requests and SW filtering for the
captured branch records, should we also drop the save-and-stitch mechanism
completely?

Coming back to the implementation for option (2), the following code change
(tentative and untested) will merge branch filter requests and drop the event
filter check during the PMU interrupt.

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 45ac2d0ca04c..9afa4e48d957 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
armv8pmu_branch_stack_reset();
}
hw_events->branch_users++;
- hw_events->branch_sample_type = event->attr.branch_sample_type;
+
+ /*
+ * Merge all branch filter requests from different perf
+ * events being added into this PMU. This includes both
+ * privilege and branch type filters.
+ */
+ hw_events->branch_sample_type |= event->attr.branch_sample_type;
}

void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 6137ae4ba7c3..c5311d5365cc 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
return;

/*
- * Overflowed event's branch_sample_type does not match the configured
- * branch filters in the BRBE HW. So the captured branch records here
- * cannot be co-related to the overflowed event. Report to the user as
- * if no branch records have been captured, and flush branch records.
- * The same scenario is applicable when the current task context does
- * not match with overflown event.
+ * When the current task context does not match the PMU overflown
+ * event, the captured branch records here cannot be correlated to the
+ * overflowed event. Report to the user as if no branch records have
+ * been captured, and flush branch records.
*/
- if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
- (event->ctx->task && cpuc->branch_context != event->ctx))
+ if (event->ctx->task && cpuc->branch_context != event->ctx)
return;

/*

and something like the following change (tentative and untested) implements the
required SW branch record filtering.

diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index c5311d5365cc..d2390657c466 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
}

+static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
+{
+ u64 br_type = event->attr.branch_sample_type;
+ u64 mask = BIT(PERF_BR_UNCOND);
+
+ if (br_type & PERF_SAMPLE_BRANCH_ANY)
+ return true;
+
+ if (entry->type == PERF_BR_UNKNOWN)
+ return true;
+
+ if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+ mask |= BIT(PERF_BR_IND);
+
+ if (br_type & PERF_SAMPLE_BRANCH_COND) {
+ mask &= ~BIT(PERF_BR_UNCOND);
+ mask |= BIT(PERF_BR_COND);
+ }
+
+ if (br_type & PERF_SAMPLE_BRANCH_CALL)
+ mask |= BIT(PERF_BR_CALL);
+
+ if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+ mask |= BIT(PERF_BR_IND_CALL);
+
+ if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+ mask |= BIT(PERF_BR_CALL);
+ mask |= BIT(PERF_BR_IRQ);
+ mask |= BIT(PERF_BR_SYSCALL);
+ mask |= BIT(PERF_BR_SERROR);
+ if (br_type & PERF_SAMPLE_BRANCH_COND)
+ mask |= BIT(PERF_BR_COND_CALL);
+ }
+
+ if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
+ mask |= BIT(PERF_BR_RET);
+ mask |= BIT(PERF_BR_ERET);
+ mask |= BIT(PERF_BR_SYSRET);
+ if (br_type & PERF_SAMPLE_BRANCH_COND)
+ mask |= BIT(PERF_BR_COND_RET);
+ }
+ /* PERF_BR_* are sequential enum values, so test the matching bit */
+ return mask & BIT(entry->type);
+}
+
+static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
+{
+ u64 br_type = event->attr.branch_sample_type;
+
+ if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
+ if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
+ return false;
+ }
+
+ if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
+ if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
+ return false;
+ }
+
+ if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
+ if (is_kernel_in_hyp_mode()) {
+ if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
+ return false;
+ }
+ }
+ return true;
+}
+
+static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
+ struct branch_records *event_records)
+{
+ struct perf_branch_entry *entry;
+ int idx, count;
+
+ memset(event_records, 0, sizeof(struct branch_records));
+ for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
+ entry = &cpuc->branches->branch_entries[idx];
+ if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
+ continue;
+
+ memcpy(&event_records->branch_entries[count], entry, sizeof(struct perf_branch_entry));
+ count++;
+ }
+ /* Report how many entries survived the filtering */
+ event_records->branch_stack.nr = count;
+}
+
static void read_branch_records(struct pmu_hw_events *cpuc,
struct perf_event *event,
struct perf_sample_data *data,
bool *branch_captured)
{
+ struct branch_records event_records;
+
/*
* CPU specific branch records buffer must have been allocated already
* for the hardware records to be captured and processed further.
@@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
armv8pmu_branch_read(cpuc, event);
*branch_captured = true;
}
+
+ /*
+ * Filter captured branch records
+ *
+ * PMU captured branch records would contain samples applicable for
+ * the aggregated branch filters, for all events that got scheduled
+ * on this PMU. Hence the branch records need to be filtered first
+ * so that each individual event gets the samples it had requested.
+ */
+ if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
+ filter_branch_records(event, cpuc, &event_records);
+ perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
+ return;
+ }
perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
}


2024-06-06 06:28:00

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 6/6/24 10:28, Anshuman Khandual wrote:
> [...]
> @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
> armv8pmu_branch_read(cpuc, event);
> *branch_captured = true;
> }
> +
> + /*
> + * Filter captured branch records
> + *
> + * PMU captured branch records would contain samples applicable for
> + * the aggregated branch filters, for all events that got scheduled
> + * on this PMU. Hence the branch records need to be filtered first
> + * so that each individual event gets the samples it had requested.
> + */
> + if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
> + filter_branch_records(event, cpuc, &event_records);
> + perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
> + return;
> + }
> perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
> }

But now with the above SW filtering enabled during branch record processing,
there are two possible problems:

- Potential for RCU stalls, which have been observed at times
- IRQ handling gets delayed while the records are processed repeatedly for each event

2024-06-06 11:01:37

by James Clark

[permalink] [raw]
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling



On 06/06/2024 05:58, Anshuman Khandual wrote:
> [...]
> and something like the following change (tentative and untested) implements the
> required SW branch record filtering.
>
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c5311d5365cc..d2390657c466 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
> armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
> }
>
> +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> + u64 br_type = event->attr.branch_sample_type;
> + u64 mask = BIT(PERF_BR_UNCOND);
> +
> + if (br_type & PERF_SAMPLE_BRANCH_ANY)
> + return true;
> +
> + if (entry->type == PERF_BR_UNKNOWN)
> + return true;
> +
> + if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> + mask |= BIT(PERF_BR_IND);
> +
> + if (br_type & PERF_SAMPLE_BRANCH_COND) {
> + mask &= ~BIT(PERF_BR_UNCOND);
> + mask |= BIT(PERF_BR_COND);
> + }
> +
> + if (br_type & PERF_SAMPLE_BRANCH_CALL)
> + mask |= BIT(PERF_BR_CALL);
> +
> + if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> + mask |= BIT(PERF_BR_IND_CALL);
> +
> + if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> + mask |= BIT(PERF_BR_CALL);
> + mask |= BIT(PERF_BR_IRQ);
> + mask |= BIT(PERF_BR_SYSCALL);
> + mask |= BIT(PERF_BR_SERROR);
> + if (br_type & PERF_SAMPLE_BRANCH_COND)
> + mask |= BIT(PERF_BR_COND_CALL);
> + }
> +
> + if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> + mask |= BIT(PERF_BR_RET);
> + mask |= BIT(PERF_BR_ERET);
> + mask |= BIT(PERF_BR_SYSRET);
> + if (br_type & PERF_SAMPLE_BRANCH_COND)
> + mask |= BIT(PERF_BR_COND_RET);
> + }
> + /* PERF_BR_* are sequential enum values, so test the matching bit */
> + return mask & BIT(entry->type);
> +}

You can build this mask once for each event because the options don't
change. At the moment it's built for each record, which seems excessive.

> [...]
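
Building on the point above, the per-record mask construction could be hoisted
to event add time; a rough sketch (the br_type_mask caching field is
hypothetical, not part of the series):

/*
 * Rough sketch: derive the branch type mask once per event instead
 * of once per record, so the per-record check becomes a single bit
 * test. PERF_BR_* are enum values, hence the BIT() usage.
 */
static u64 branch_type_to_mask(u64 br_type)
{
	u64 mask = BIT(PERF_BR_UNCOND);

	if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
		mask |= BIT(PERF_BR_IND);
	if (br_type & PERF_SAMPLE_BRANCH_COND) {
		mask &= ~BIT(PERF_BR_UNCOND);
		mask |= BIT(PERF_BR_COND);
	}
	/* ... remaining PERF_SAMPLE_BRANCH_* cases as in the patch ... */
	return mask;
}

/* Cached once at event add time, e.g.:
 * event->hw.br_type_mask = branch_type_to_mask(event->attr.branch_sample_type);
 */
static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
{
	if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
		return true;
	if (entry->type == PERF_BR_UNKNOWN)
		return true;
	return event->hw.br_type_mask & BIT(entry->type);
}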