2021-03-11 00:10:26

by Rob Herring (Arm)

Subject: [PATCH v6 00/10] libperf and arm64 userspace counter access support

Hi all,

Another version of libperf and arm64 userspace counter access support.
This originally resurrected Raphael's series[1] to enable userspace counter
access on arm64. My previous versions are here[2][3][4][5][6]. A git
branch is here[7].

Changes in v6:
- Reworking of the handling of 64-bit counters and user access. There's
a new config1 flag to request user access. This takes priority over
the 64-bit flag and the user will get the maximum size the h/w
supports without chaining.
- The libperf evsel mmap struct is stored in its own xyarray
- New tests for user 64-bit and 32-bit counters
- Rebase to v5.12-rc2

Changes in v5:
- Limit enabling/disabling access to CPUs associated with the PMU
(supported_cpus) and with the mm_struct matching current->active_mm.
The x86 method of using mm_cpumask doesn't work for arm64 as it is not
updated.
- Only set cap_user_rdpmc if event is on current cpu. See patch 2.
- Create an mmap for every event in an evsel. This results in some changes
to the libperf mmap API from the last version.
- Rebase to v5.11-rc2

Changes in v4:
- Dropped 'arm64: pmu: Add hook to handle pmu-related undefined instructions'.
The onus is on userspace to pin itself to a homogeneous subset of CPUs
and avoid any aborts on heterogeneous systems, so the hook is not needed.
- Make perf_evsel__mmap() take pages rather than bytes for size
- Fix building arm64 heterogeneous test.

Changes in v3:
- Dropped removing x86 rdpmc test until libperf tests can run via 'perf test'
- Added verbose prints for tests
- Split adding perf_evsel__mmap() to separate patch

The following changes to the arm64 support have been made compared to
Raphael's last version:

The major change is support for heterogeneous systems with some
restrictions. Specifically, userspace must pin itself to like CPUs, open
a specific PMU by type, and use h/w specific events. The tests have been
reworked to demonstrate this.

Chained events are not supported. The problem with supporting chained
events was there's no way to distinguish between a chained event and a
native 64-bit counter. We could add some flag, but do self monitoring
processes really need that? Native 64-bit counters are supported if the
PMU h/w supports them. As there's already an explicit ABI to request 64-bit
counters, userspace can request a 64-bit counter and, if user access is not
enabled, retry with a 32-bit counter.
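
With the new config1 user access bit in this version, a self-monitoring
process can instead request user access and 64-bit together and read
pmc_width to see what it was actually given. A rough, untested sketch
using the raw syscall (the event choice is just an example):

  #include <linux/perf_event.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Ask for user access + 64-bit (config1 = 3); the kernel grants the
   * widest counter it can provide without chaining. */
  static struct perf_event_mmap_page *open_self_counter(int *fdp)
  {
          struct perf_event_attr attr = {
                  .size = sizeof(attr),
                  .type = PERF_TYPE_HARDWARE,
                  .config = PERF_COUNT_HW_INSTRUCTIONS,
                  .config1 = 0x3,         /* bit 0: 64-bit, bit 1: user access */
                  .exclude_kernel = 1,
          };
          struct perf_event_mmap_page *pg;
          int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

          if (fd < 0)
                  return NULL;

          pg = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
          if (pg == MAP_FAILED || !pg->cap_user_rdpmc) {
                  /* direct access not granted */
                  if (pg != MAP_FAILED)
                          munmap(pg, sysconf(_SC_PAGESIZE));
                  close(fd);
                  return NULL;
          }

          /* pg->pmc_width now reports the granted width (32 or 64) */
          *fdp = fd;
          return pg;
  }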

Prior versions broke the build on arm32 (surprisingly never caught by
0-day). As a result, event_mapped and event_unmapped implementations have
been moved into the arm64 code.

There was a bug in that pmc_width was not set in the user page. The tests
now check for this.

The documentation has been converted to rST. I've added sections on
chained events and heterogeneous systems.

The tests have been expanded to test the cycle counter access.

Rob

[1] https://lore.kernel.org/r/[email protected]/
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lore.kernel.org/r/[email protected]/
[4] https://lore.kernel.org/r/[email protected]/
[5] https://lore.kernel.org/r/[email protected]/
[6] https://lore.kernel.org/r/[email protected]/
[7] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git user-perf-event-v6

Raphael Gault (3):
arm64: pmu: Add function implementation to update event index in
userpage
arm64: perf: Enable PMU counter direct access for perf event
Documentation: arm64: Document PMU counters access from userspace

Rob Herring (7):
tools/include: Add an initial math64.h
libperf: Add evsel mmap support
libperf: tests: Add support for verbose printing
libperf: Add support for user space counter access
libperf: Add arm64 support to perf_mmap__read_self()
perf: arm64: Add test for userspace counter access on heterogeneous
systems
perf: arm64: Add tests for 32-bit and 64-bit counter size userspace
access

Documentation/arm64/index.rst | 1 +
.../arm64/perf_counter_user_access.rst | 60 +++++
arch/arm64/include/asm/mmu.h | 5 +
arch/arm64/include/asm/mmu_context.h | 2 +
arch/arm64/include/asm/perf_event.h | 14 ++
arch/arm64/kernel/perf_event.c | 104 ++++++++-
include/linux/perf/arm_pmu.h | 2 +
tools/include/linux/math64.h | 75 ++++++
tools/lib/perf/Documentation/libperf.txt | 2 +
tools/lib/perf/evsel.c | 55 +++++
tools/lib/perf/include/internal/evsel.h | 1 +
tools/lib/perf/include/internal/mmap.h | 3 +
tools/lib/perf/include/internal/tests.h | 32 +++
tools/lib/perf/include/perf/evsel.h | 2 +
tools/lib/perf/libperf.map | 2 +
tools/lib/perf/mmap.c | 186 +++++++++++++++
tools/lib/perf/tests/Makefile | 6 +-
tools/lib/perf/tests/test-evsel.c | 65 ++++++
tools/perf/arch/arm64/include/arch-tests.h | 12 +
tools/perf/arch/arm64/tests/Build | 1 +
tools/perf/arch/arm64/tests/arch-tests.c | 12 +
tools/perf/arch/arm64/tests/user-events.c | 215 ++++++++++++++++++
22 files changed, 852 insertions(+), 5 deletions(-)
create mode 100644 Documentation/arm64/perf_counter_user_access.rst
create mode 100644 tools/include/linux/math64.h
create mode 100644 tools/perf/arch/arm64/tests/user-events.c

--
2.27.0


2021-03-11 00:10:27

by Rob Herring (Arm)

Subject: [PATCH v6 03/10] tools/include: Add an initial math64.h

Add an initial math64.h similar to linux/math64.h with functions
mul_u64_u64_div64() and mul_u64_u32_shr(). This isn't a direct copy of
include/linux/math64.h as that doesn't define mul_u64_u64_div64().

The implementation was written by Peter Zijlstra based on linux/math64.h
and div64.h[1]. The original __int128 version of mul_u64_u64_div64() was
not optimal on arm64 because __int128 division results in a call out to
__udivti3, so I dropped that variant.

[1] https://lore.kernel.org/lkml/[email protected]/
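
For context, both helpers are used later in this series by
perf_mmap__read_self() (patch 6), roughly like this (abridged):

  /* convert cycles to time and scale the count by time enabled/running */
  delta = time_offset + mul_u64_u32_shr(cyc, time_mult, time_shift);
  cnt   = mul_u64_u64_div64(cnt, count->ena, count->run);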

Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
---
tools/include/linux/math64.h | 75 ++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
create mode 100644 tools/include/linux/math64.h

diff --git a/tools/include/linux/math64.h b/tools/include/linux/math64.h
new file mode 100644
index 000000000000..4ad45d5943dc
--- /dev/null
+++ b/tools/include/linux/math64.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MATH64_H
+#define _LINUX_MATH64_H
+
+#include <linux/types.h>
+
+#ifdef __x86_64__
+static inline u64 mul_u64_u64_div64(u64 a, u64 b, u64 c)
+{
+ u64 q;
+
+ asm ("mulq %2; divq %3" : "=a" (q)
+ : "a" (a), "rm" (b), "rm" (c)
+ : "rdx");
+
+ return q;
+}
+#define mul_u64_u64_div64 mul_u64_u64_div64
+#endif
+
+#ifdef __SIZEOF_INT128__
+static inline u64 mul_u64_u32_shr(u64 a, u32 b, unsigned int shift)
+{
+ return (u64)(((unsigned __int128)a * b) >> shift);
+}
+
+#else
+
+#ifdef __i386__
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+ u32 high, low;
+
+ asm ("mull %[b]" : "=a" (low), "=d" (high)
+ : [a] "a" (a), [b] "rm" (b) );
+
+ return low | ((u64)high) << 32;
+}
+#else
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+ return (u64)a * b;
+}
+#endif
+
+static inline u64 mul_u64_u32_shr(u64 a, u32 b, unsigned int shift)
+{
+ u32 ah, al;
+ u64 ret;
+
+ al = a;
+ ah = a >> 32;
+
+ ret = mul_u32_u32(al, b) >> shift;
+ if (ah)
+ ret += mul_u32_u32(ah, b) << (32 - shift);
+
+ return ret;
+}
+
+#endif /* __SIZEOF_INT128__ */
+
+#ifndef mul_u64_u64_div64
+static inline u64 mul_u64_u64_div64(u64 a, u64 b, u64 c)
+{
+ u64 quot, rem;
+
+ quot = a / c;
+ rem = a % c;
+
+ return quot * b + (rem * b) / c;
+}
+#endif
+
+#endif /* _LINUX_MATH64_H */
--
2.27.0

2021-03-11 00:10:27

by Rob Herring (Arm)

Subject: [PATCH v6 01/10] arm64: pmu: Add function implementation to update event index in userpage

From: Raphael Gault <[email protected]>

In order for userspace to be able to access the counter directly,
we need to provide the index of the counter via the userpage.
We thus override the event_idx callback to retrieve the perf_event
index and convert it to the armv8 hardware index (see the sketch
after the list below).

Since the arm_pmu driver can be used by any implementation, even
if not armv8, two components play a role in making sure the
behaviour is correct and consistent with the PMU capabilities:

* the ARMPMU_EL0_RD_CNTR flag, which denotes the capability to access
the counter from userspace.
* the event_idx callback, which is implemented and initialized by
the PMU implementation; if no callback is provided, the default
behaviour applies, returning 0 as the index value.
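
For reference, the userspace side (patches 6 and 7 below) turns the
published index back into a counter read roughly like this (a sketch,
not the exact test code):

  u32 idx = READ_ONCE(pc->index);
  u64 val;

  if (!idx)                               /* 0 means no userspace access */
          return -1;
  val = read_perf_counter(idx - 1);       /* 31 selects the cycle counter */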

Signed-off-by: Raphael Gault <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
---
arch/arm64/kernel/perf_event.c | 18 ++++++++++++++++++
include/linux/perf/arm_pmu.h | 2 ++
2 files changed, 20 insertions(+)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 4658fcf88c2b..387838496955 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -871,6 +871,22 @@ static void armv8pmu_clear_event_idx(struct pmu_hw_events *cpuc,
clear_bit(idx - 1, cpuc->used_mask);
}

+static int armv8pmu_access_event_idx(struct perf_event *event)
+{
+ if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+ return 0;
+
+ /*
+ * We remap the cycle counter index to 32 to
+ * match the offset applied to the rest of
+ * the counter indices.
+ */
+ if (event->hw.idx == ARMV8_IDX_CYCLE_COUNTER)
+ return 32;
+
+ return event->hw.idx;
+}
+
/*
* Add an event filter to a given event.
*/
@@ -1098,6 +1114,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
cpu_pmu->filter_match = armv8pmu_filter_match;

+ cpu_pmu->pmu.event_idx = armv8pmu_access_event_idx;
+
cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
cpu_pmu->attr_groups[ARMPMU_ATTR_GROUP_EVENTS] = events ?
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 505480217cf1..d29aa981d989 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -26,6 +26,8 @@
*/
/* Event uses a 64bit counter */
#define ARMPMU_EVT_64BIT 1
+/* Allow access to hardware counter from userspace */
+#define ARMPMU_EL0_RD_CNTR 2

#define HW_OP_UNSUPPORTED 0xFFFF
#define C(_x) PERF_COUNT_HW_CACHE_##_x
--
2.27.0

2021-03-11 00:10:27

by Rob Herring (Arm)

Subject: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

From: Raphael Gault <[email protected]>

Keep track of events opened with direct access to the hardware counters
and modify permissions while they are open.

The strategy used here is the same one x86 uses: every time an event
is mapped, the permissions are set if required. The atomic field added
to the mm_context keeps track of how many such events are open and
lets us deactivate the permissions once all of them are unmapped.
We also need to update the permissions in the context switch code so
that tasks keep the right permissions.

In order to enable 64-bit counters for userspace when available, a new
config1 bit is added for userspace to indicate it wants userspace counter
access. This bit allows the kernel to decide whether chaining should be
disabled, since chaining and userspace access are incompatible.
The modes for config1 are as follows:

config1 = 0 or 2 : user access enabled and always 32-bit
config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
config1 = 3 : user access enabled and counter size matches underlying counter.

User access is enabled with config1 == 0 so that we match x86 behavior
and don't need Arm specific code (outside the counter read).
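
As a concrete example of the ABI above (matching what the new tests in
patch 9 request), a self-monitoring task that wants user access and the
widest counter available would use:

  struct perf_event_attr attr = {
          .type           = PERF_TYPE_HARDWARE,
          .config         = PERF_COUNT_HW_INSTRUCTIONS,
          .config1        = 0x3,  /* bit 0: 64-bit, bit 1: user access */
          .exclude_kernel = 1,
  };

The pmc_width field in the user page then reports whether a 64-bit or a
32-bit counter was actually granted.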

Signed-off-by: Raphael Gault <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
---
I'm not completely sure if using current->active_mm in an IPI is okay?
It seems to work in my testing.

Peter Z says (event->oncpu == smp_processor_id()) in the user page
update is always true, but my testing says otherwise[1].

v6:
- Add new attr.config1 rdpmc bit for userspace to hint it wants
userspace access when also requesting 64-bit counters.

v5:
- Only set cap_user_rdpmc if event is on current cpu
- Limit enabling/disabling access to CPUs associated with the PMU
(supported_cpus) and with the mm_struct matching current->active_mm.

v2:
- Move mapped/unmapped into arm64 code. Fixes arm32.
- Rebase on cap_user_time_short changes

Changes from Raphael's v4:
- Drop homogeneous check
- Disable access for chained counters
- Set pmc_width in user page

[1] https://lore.kernel.org/lkml/CAL_JsqK+eKef5NaVnBfARCjRE3MYhfBfe54F9YHKbsTnWqLmLw@mail.gmail.com/

---
arch/arm64/include/asm/mmu.h | 5 ++
arch/arm64/include/asm/mmu_context.h | 2 +
arch/arm64/include/asm/perf_event.h | 14 +++++
arch/arm64/kernel/perf_event.c | 86 +++++++++++++++++++++++++++-
4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 75beffe2ee8a..ee08447455da 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -18,6 +18,11 @@

typedef struct {
atomic64_t id;
+ /*
+ * non-zero if userspace have access to hardware
+ * counters directly.
+ */
+ atomic_t pmu_direct_access;
#ifdef CONFIG_COMPAT
void *sigpage;
#endif
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 70ce8c1d2b07..ccb5ff417b42 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -21,6 +21,7 @@
#include <asm/proc-fns.h>
#include <asm-generic/mm_hooks.h>
#include <asm/cputype.h>
+#include <asm/perf_event.h>
#include <asm/sysreg.h>
#include <asm/tlbflush.h>

@@ -230,6 +231,7 @@ static inline void __switch_mm(struct mm_struct *next)
}

check_and_switch_context(next);
+ perf_switch_user_access(next);
}

static inline void
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index 60731f602d3e..112f3f63b79e 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -8,6 +8,7 @@

#include <asm/stack_pointer.h>
#include <asm/ptrace.h>
+#include <linux/mm_types.h>

#define ARMV8_PMU_MAX_COUNTERS 32
#define ARMV8_PMU_COUNTER_MASK (ARMV8_PMU_MAX_COUNTERS - 1)
@@ -254,4 +255,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
(regs)->pstate = PSR_MODE_EL1h; \
}

+static inline void perf_switch_user_access(struct mm_struct *mm)
+{
+ if (!IS_ENABLED(CONFIG_PERF_EVENTS))
+ return;
+
+ if (atomic_read(&mm->context.pmu_direct_access)) {
+ write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
+ pmuserenr_el0);
+ } else {
+ write_sysreg(0, pmuserenr_el0);
+ }
+}
+
#endif
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 387838496955..9ad3cc523ef4 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -288,15 +288,22 @@ static const struct attribute_group armv8_pmuv3_events_attr_group = {

PMU_FORMAT_ATTR(event, "config:0-15");
PMU_FORMAT_ATTR(long, "config1:0");
+PMU_FORMAT_ATTR(rdpmc, "config1:1");

static inline bool armv8pmu_event_is_64bit(struct perf_event *event)
{
return event->attr.config1 & 0x1;
}

+static inline bool armv8pmu_event_want_user_access(struct perf_event *event)
+{
+ return event->attr.config1 & 0x2;
+}
+
static struct attribute *armv8_pmuv3_format_attrs[] = {
&format_attr_event.attr,
&format_attr_long.attr,
+ &format_attr_rdpmc.attr,
NULL,
};

@@ -356,6 +363,7 @@ static inline bool armv8pmu_event_is_chained(struct perf_event *event)
struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);

return !WARN_ON(idx < 0) &&
+ !armv8pmu_event_want_user_access(event) &&
armv8pmu_event_is_64bit(event) &&
!armv8pmu_has_long_event(cpu_pmu) &&
(idx != ARMV8_IDX_CYCLE_COUNTER);
@@ -849,13 +857,17 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
if (evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) {
if (!test_and_set_bit(ARMV8_IDX_CYCLE_COUNTER, cpuc->used_mask))
return ARMV8_IDX_CYCLE_COUNTER;
+ else if (armv8pmu_event_is_64bit(event) &&
+ armv8pmu_event_want_user_access(event) &&
+ !armv8pmu_has_long_event(cpu_pmu))
+ return -EAGAIN;
}

/*
* Otherwise use events counters
*/
- if (armv8pmu_event_is_64bit(event) &&
- !armv8pmu_has_long_event(cpu_pmu))
+ if (armv8pmu_event_is_64bit(event) && !armv8pmu_has_long_event(cpu_pmu) &&
+ !armv8pmu_event_want_user_access(event))
return armv8pmu_get_chain_idx(cpuc, cpu_pmu);
else
return armv8pmu_get_single_idx(cpuc, cpu_pmu);
@@ -887,6 +899,46 @@ static int armv8pmu_access_event_idx(struct perf_event *event)
return event->hw.idx;
}

+static void refresh_pmuserenr(void *mm)
+{
+ if (mm == current->active_mm)
+ perf_switch_user_access(mm);
+}
+
+static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
+
+ if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+ return;
+
+ /*
+ * This function relies on not being called concurrently in two
+ * tasks in the same mm. Otherwise one task could observe
+ * pmu_direct_access > 1 and return all the way back to
+ * userspace with user access disabled while another task is still
+ * doing on_each_cpu_mask() to enable user access.
+ *
+ * For now, this can't happen because all callers hold mmap_lock
+ * for write. If this changes, we'll need a different solution.
+ */
+ lockdep_assert_held_write(&mm->mmap_lock);
+
+ if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
+ on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
+}
+
+static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
+
+ if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+ return;
+
+ if (atomic_dec_and_test(&mm->context.pmu_direct_access))
+ on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
+}
+
/*
* Add an event filter to a given event.
*/
@@ -980,9 +1032,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
&armv8_pmuv3_perf_cache_map,
ARMV8_PMU_EVTYPE_EVENT);

- if (armv8pmu_event_is_64bit(event))
+ if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event)) {
+ event->hw.flags |= ARMPMU_EL0_RD_CNTR;
+ /*
+ * At this point, the counter is not assigned. If a 64-bit
+ * counter is requested, we must make sure the h/w has 64-bit
+ * counters if we set the event size to 64-bit because chaining
+ * is not supported with userspace access. This may still fail
+ * later on if the CPU cycle counter is in use.
+ */
+ if (armv8pmu_event_is_64bit(event) &&
+ (armv8pmu_has_long_event(armpmu) ||
+ hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES))
+ event->hw.flags |= ARMPMU_EVT_64BIT;
+ } else if (armv8pmu_event_is_64bit(event))
event->hw.flags |= ARMPMU_EVT_64BIT;

+
/* Only expose micro/arch events supported by this PMU */
if ((hw_event_id > 0) && (hw_event_id < ARMV8_PMUV3_MAX_COMMON_EVENTS)
&& test_bit(hw_event_id, armpmu->pmceid_bitmap)) {
@@ -1115,6 +1181,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->filter_match = armv8pmu_filter_match;

cpu_pmu->pmu.event_idx = armv8pmu_access_event_idx;
+ cpu_pmu->pmu.event_mapped = armv8pmu_event_mapped;
+ cpu_pmu->pmu.event_unmapped = armv8pmu_event_unmapped;

cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
@@ -1290,6 +1358,18 @@ void arch_perf_update_userpage(struct perf_event *event,
userpg->cap_user_time = 0;
userpg->cap_user_time_zero = 0;
userpg->cap_user_time_short = 0;
+ userpg->cap_user_rdpmc = !!(event->hw.flags & ARMPMU_EL0_RD_CNTR) &&
+ (event->oncpu == smp_processor_id());
+
+ if (userpg->cap_user_rdpmc) {
+ struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
+
+ if (armv8pmu_event_is_64bit(event) &&
+ (armv8pmu_has_long_event(cpu_pmu) || (userpg->index == 32)))
+ userpg->pmc_width = 64;
+ else
+ userpg->pmc_width = 32;
+ }

do {
rd = sched_clock_read_begin(&seq);
--
2.27.0

2021-03-11 00:10:28

by Rob Herring (Arm)

Subject: [PATCH v6 05/10] libperf: tests: Add support for verbose printing

Add __T_VERBOSE() so tests can add verbose output. The verbose output is
enabled with the '-v' command line option.
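
For example, a test can emit extra detail that only shows up when the
test binary is run with -v (or via the Makefile change below); the
variables here are illustrative:

  __T_VERBOSE("loop = %u, count = %llu\n", loops, (unsigned long long)delta);

The verbose output goes to stderr so it doesn't interleave with the
pass/fail status printed on stdout.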

Signed-off-by: Rob Herring <[email protected]>
---
v5:
- Pass verbose flag to static tests
- Fix getopt loop with unsigned char (arm64)
v3:
- New patch
---
tools/lib/perf/include/internal/tests.h | 32 +++++++++++++++++++++++++
tools/lib/perf/tests/Makefile | 6 +++--
2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/tools/lib/perf/include/internal/tests.h b/tools/lib/perf/include/internal/tests.h
index 2093e8868a67..29425c2dabe1 100644
--- a/tools/lib/perf/include/internal/tests.h
+++ b/tools/lib/perf/include/internal/tests.h
@@ -3,11 +3,32 @@
#define __LIBPERF_INTERNAL_TESTS_H

#include <stdio.h>
+#include <unistd.h>

int tests_failed;
+int tests_verbose;
+
+static inline int get_verbose(char **argv, int argc)
+{
+ int c;
+ int verbose = 0;
+
+ while ((c = getopt(argc, argv, "v")) != -1) {
+ switch (c)
+ {
+ case 'v':
+ verbose = 1;
+ break;
+ default:
+ break;
+ }
+ }
+ return verbose;
+}

#define __T_START \
do { \
+ tests_verbose = get_verbose(argv, argc); \
fprintf(stdout, "- running %s...", __FILE__); \
fflush(NULL); \
tests_failed = 0; \
@@ -30,4 +51,15 @@ do {
} \
} while (0)

+#define __T_VERBOSE(...) \
+do { \
+ if (tests_verbose) { \
+ if (tests_verbose == 1) { \
+ fputc('\n', stderr); \
+ tests_verbose++; \
+ } \
+ fprintf(stderr, ##__VA_ARGS__); \
+ } \
+} while (0)
+
#endif /* __LIBPERF_INTERNAL_TESTS_H */
diff --git a/tools/lib/perf/tests/Makefile b/tools/lib/perf/tests/Makefile
index 96841775feaf..b536cc9a26dd 100644
--- a/tools/lib/perf/tests/Makefile
+++ b/tools/lib/perf/tests/Makefile
@@ -5,6 +5,8 @@ TESTS = test-cpumap test-threadmap test-evlist test-evsel
TESTS_SO := $(addsuffix -so,$(TESTS))
TESTS_A := $(addsuffix -a,$(TESTS))

+TEST_ARGS := $(if $(V),-v)
+
# Set compile option CFLAGS
ifdef EXTRA_CFLAGS
CFLAGS := $(EXTRA_CFLAGS)
@@ -28,9 +30,9 @@ all: $(TESTS_A) $(TESTS_SO)

run:
@echo "running static:"
- @for i in $(TESTS_A); do ./$$i; done
+ @for i in $(TESTS_A); do ./$$i $(TEST_ARGS); done
@echo "running dynamic:"
- @for i in $(TESTS_SO); do LD_LIBRARY_PATH=../ ./$$i; done
+ @for i in $(TESTS_SO); do LD_LIBRARY_PATH=../ ./$$i $(TEST_ARGS); done

clean:
$(call QUIET_CLEAN, tests)$(RM) $(TESTS_A) $(TESTS_SO)
--
2.27.0

2021-03-11 00:10:28

by Rob Herring (Arm)

Subject: [PATCH v6 06/10] libperf: Add support for user space counter access

x86 and arm64 can both support direct access of event counters in
userspace. The access sequence is less than trivial and currently exists
in perf test code (tools/perf/arch/x86/tests/rdpmc.c) with copies in
projects such as PAPI and libpfm4.

In order to support userspace access, an event must be mmapped first
with perf_evsel__mmap(). Then subsequent calls to perf_evsel__read()
will use the fast path (assuming the arch supports it).
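
The intended call sequence, mirroring the new test below ('attr' is a
pre-filled perf_event_attr and error handling is omitted), is roughly:

  struct perf_thread_map *threads = perf_thread_map__new_dummy();
  struct perf_evsel *evsel;
  struct perf_counts_values counts = { .val = 0 };

  perf_thread_map__set_pid(threads, 0, 0);        /* monitor ourselves */
  evsel = perf_evsel__new(&attr);
  perf_evsel__open(evsel, NULL, threads);
  perf_evsel__mmap(evsel, 0);                     /* enables the fast path */
  perf_evsel__read(evsel, 0, 0, &counts);         /* no syscall if rdpmc capable */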

Signed-off-by: Rob Herring <[email protected]>
---
v6:
- Adapt to mmap changes adding MMAP NULL check
v5:
- Make raw count s64 instead of u64 so that counter width shifting
works
- Adapt to mmap changes
v4:
- Update perf_evsel__mmap size to pages
v3:
- Split out perf_evsel__mmap() to separate patch
---
tools/lib/perf/evsel.c | 4 ++
tools/lib/perf/include/internal/mmap.h | 3 +
tools/lib/perf/mmap.c | 88 ++++++++++++++++++++++++++
tools/lib/perf/tests/test-evsel.c | 65 +++++++++++++++++++
4 files changed, 160 insertions(+)

diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index 1057e9b15528..4d67343d36c9 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -242,6 +242,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;

+ if (MMAP(evsel, cpu, thread) &&
+ !perf_mmap__read_self(MMAP(evsel, cpu, thread), count))
+ return 0;
+
if (readn(FD(evsel, cpu, thread), count->values, size) <= 0)
return -errno;

diff --git a/tools/lib/perf/include/internal/mmap.h b/tools/lib/perf/include/internal/mmap.h
index be7556e0a2b2..5e3422f40ed5 100644
--- a/tools/lib/perf/include/internal/mmap.h
+++ b/tools/lib/perf/include/internal/mmap.h
@@ -11,6 +11,7 @@
#define PERF_SAMPLE_MAX_SIZE (1 << 16)

struct perf_mmap;
+struct perf_counts_values;

typedef void (*libperf_unmap_cb_t)(struct perf_mmap *map);

@@ -52,4 +53,6 @@ void perf_mmap__put(struct perf_mmap *map);

u64 perf_mmap__read_head(struct perf_mmap *map);

+int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count);
+
#endif /* __LIBPERF_INTERNAL_MMAP_H */
diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 79d5ed6c38cc..915469f00cf4 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -8,9 +8,11 @@
#include <linux/perf_event.h>
#include <perf/mmap.h>
#include <perf/event.h>
+#include <perf/evsel.h>
#include <internal/mmap.h>
#include <internal/lib.h>
#include <linux/kernel.h>
+#include <linux/math64.h>
#include "internal.h"

void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
@@ -273,3 +275,89 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map)

return event;
}
+
+#if defined(__i386__) || defined(__x86_64__)
+static u64 read_perf_counter(unsigned int counter)
+{
+ unsigned int low, high;
+
+ asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
+
+ return low | ((u64)high) << 32;
+}
+
+static u64 read_timestamp(void)
+{
+ unsigned int low, high;
+
+ asm volatile("rdtsc" : "=a" (low), "=d" (high));
+
+ return low | ((u64)high) << 32;
+}
+#else
+static u64 read_perf_counter(unsigned int counter) { return 0; }
+static u64 read_timestamp(void) { return 0; }
+#endif
+
+int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count)
+{
+ struct perf_event_mmap_page *pc = map->base;
+ u32 seq, idx, time_mult = 0, time_shift = 0;
+ u64 cnt, cyc = 0, time_offset = 0, time_cycles = 0, time_mask = ~0ULL;
+
+ if (!pc || !pc->cap_user_rdpmc)
+ return -1;
+
+ do {
+ seq = READ_ONCE(pc->lock);
+ barrier();
+
+ count->ena = READ_ONCE(pc->time_enabled);
+ count->run = READ_ONCE(pc->time_running);
+
+ if (pc->cap_user_time && count->ena != count->run) {
+ cyc = read_timestamp();
+ time_mult = READ_ONCE(pc->time_mult);
+ time_shift = READ_ONCE(pc->time_shift);
+ time_offset = READ_ONCE(pc->time_offset);
+
+ if (pc->cap_user_time_short) {
+ time_cycles = READ_ONCE(pc->time_cycles);
+ time_mask = READ_ONCE(pc->time_mask);
+ }
+ }
+
+ idx = READ_ONCE(pc->index);
+ cnt = READ_ONCE(pc->offset);
+ if (pc->cap_user_rdpmc && idx) {
+ s64 evcnt = read_perf_counter(idx - 1);
+ u16 width = READ_ONCE(pc->pmc_width);
+
+ evcnt <<= 64 - width;
+ evcnt >>= 64 - width;
+ cnt += evcnt;
+ } else
+ return -1;
+
+ barrier();
+ } while (READ_ONCE(pc->lock) != seq);
+
+ if (count->ena != count->run) {
+ u64 delta;
+
+ /* Adjust for cap_usr_time_short, a nop if not */
+ cyc = time_cycles + ((cyc - time_cycles) & time_mask);
+
+ delta = time_offset + mul_u64_u32_shr(cyc, time_mult, time_shift);
+
+ count->ena += delta;
+ if (idx)
+ count->run += delta;
+
+ cnt = mul_u64_u64_div64(cnt, count->ena, count->run);
+ }
+
+ count->val = cnt;
+
+ return 0;
+}
diff --git a/tools/lib/perf/tests/test-evsel.c b/tools/lib/perf/tests/test-evsel.c
index 0ad82d7a2a51..54fb4809b9ee 100644
--- a/tools/lib/perf/tests/test-evsel.c
+++ b/tools/lib/perf/tests/test-evsel.c
@@ -120,6 +120,69 @@ static int test_stat_thread_enable(void)
return 0;
}

+static int test_stat_user_read(int event)
+{
+ struct perf_counts_values counts = { .val = 0 };
+ struct perf_thread_map *threads;
+ struct perf_evsel *evsel;
+ struct perf_event_mmap_page *pc;
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = event,
+ };
+ int err, i;
+
+ threads = perf_thread_map__new_dummy();
+ __T("failed to create threads", threads);
+
+ perf_thread_map__set_pid(threads, 0, 0);
+
+ evsel = perf_evsel__new(&attr);
+ __T("failed to create evsel", evsel);
+
+ err = perf_evsel__open(evsel, NULL, threads);
+ __T("failed to open evsel", err == 0);
+
+ err = perf_evsel__mmap(evsel, 0);
+ __T("failed to mmap evsel", err == 0);
+
+ pc = perf_evsel__mmap_base(evsel, 0, 0);
+
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+ __T("userspace counter access not supported", pc->cap_user_rdpmc);
+ __T("userspace counter access not enabled", pc->index);
+ __T("userspace counter width not set", pc->pmc_width >= 32);
+#endif
+
+ perf_evsel__read(evsel, 0, 0, &counts);
+ __T("failed to read value for evsel", counts.val != 0);
+
+ for (i = 0; i < 5; i++) {
+ volatile int count = 0x10000 << i;
+ __u64 start, end, last = 0;
+
+ __T_VERBOSE("\tloop = %u, ", count);
+
+ perf_evsel__read(evsel, 0, 0, &counts);
+ start = counts.val;
+
+ while (count--) ;
+
+ perf_evsel__read(evsel, 0, 0, &counts);
+ end = counts.val;
+
+ __T("invalid counter data", (end - start) > last);
+ last = end - start;
+ __T_VERBOSE("count = %llu\n", end - start);
+ }
+
+ perf_evsel__close(evsel);
+ perf_evsel__delete(evsel);
+
+ perf_thread_map__put(threads);
+ return 0;
+}
+
int main(int argc, char **argv)
{
__T_START;
@@ -129,6 +192,8 @@ int main(int argc, char **argv)
test_stat_cpu();
test_stat_thread();
test_stat_thread_enable();
+ test_stat_user_read(PERF_COUNT_HW_INSTRUCTIONS);
+ test_stat_user_read(PERF_COUNT_HW_CPU_CYCLES);

__T_END;
return tests_failed == 0 ? 0 : -1;
--
2.27.0

2021-03-11 00:10:28

by Rob Herring (Arm)

Subject: [PATCH v6 04/10] libperf: Add evsel mmap support

In order to support userspace access, an event must be mmapped. While
there's already mmap support for evlist, the use case is a bit different
from self-monitoring with userspace access. So let's add a new
perf_evsel__mmap() function to mmap an evsel. This allows implementing
userspace access as a fast path for perf_evsel__read().

The mmapped address is returned by perf_evsel__mmap_base(), which is
primarily for users/tests to check whether userspace access is enabled.
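
A short sketch of the two new calls (after perf_evsel__open() has
succeeded; error handling trimmed):

  struct perf_event_mmap_page *pc;
  bool direct;

  perf_evsel__mmap(evsel, 0);                     /* 0 data pages, as in the tests */
  pc = perf_evsel__mmap_base(evsel, 0, 0);        /* cpu 0, thread 0 */
  direct = pc && pc->cap_user_rdpmc;              /* did the kernel grant access? */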

Signed-off-by: Rob Herring <[email protected]>
---
v6:
- split mmap struct into its own xyarray
v5:
- Create an mmap for every underlying event opened. Due to this, we
need a different way to get the mmap ptr, so perf_evsel__mmap_base()
is introduced.
v4:
- Change perf_evsel__mmap size to pages instead of bytes
v3:
- New patch split out from user access patch
---
tools/lib/perf/Documentation/libperf.txt | 2 +
tools/lib/perf/evsel.c | 51 ++++++++++++++++++++++++
tools/lib/perf/include/internal/evsel.h | 1 +
tools/lib/perf/include/perf/evsel.h | 2 +
tools/lib/perf/libperf.map | 2 +
5 files changed, 58 insertions(+)

diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
index 0c74c30ed23a..a2c73df191ca 100644
--- a/tools/lib/perf/Documentation/libperf.txt
+++ b/tools/lib/perf/Documentation/libperf.txt
@@ -136,6 +136,8 @@ SYNOPSIS
struct perf_thread_map *threads);
void perf_evsel__close(struct perf_evsel *evsel);
void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu);
+ int perf_evsel__mmap(struct perf_evsel *evsel, int pages);
+ void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int thread);
int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
struct perf_counts_values *count);
int perf_evsel__enable(struct perf_evsel *evsel);
diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index 4dc06289f4c7..1057e9b15528 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -11,10 +11,12 @@
#include <stdlib.h>
#include <internal/xyarray.h>
#include <internal/cpumap.h>
+#include <internal/mmap.h>
#include <internal/threadmap.h>
#include <internal/lib.h>
#include <linux/string.h>
#include <sys/ioctl.h>
+#include <sys/mman.h>

void perf_evsel__init(struct perf_evsel *evsel, struct perf_event_attr *attr)
{
@@ -38,6 +40,7 @@ void perf_evsel__delete(struct perf_evsel *evsel)
}

#define FD(e, x, y) (*(int *) xyarray__entry(e->fd, x, y))
+#define MMAP(e, x, y) (e->mmap ? ((struct perf_mmap *) xyarray__entry(e->mmap, x, y)) : NULL)

int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
{
@@ -55,6 +58,13 @@ int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
return evsel->fd != NULL ? 0 : -ENOMEM;
}

+static int perf_evsel__alloc_mmap(struct perf_evsel *evsel, int ncpus, int nthreads)
+{
+ evsel->mmap = xyarray__new(ncpus, nthreads, sizeof(struct perf_mmap));
+
+ return evsel->mmap != NULL ? 0 : -ENOMEM;
+}
+
static int
sys_perf_event_open(struct perf_event_attr *attr,
pid_t pid, int cpu, int group_fd,
@@ -137,6 +147,8 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
{
xyarray__delete(evsel->fd);
evsel->fd = NULL;
+ xyarray__delete(evsel->mmap);
+ evsel->mmap = NULL;
}

void perf_evsel__close(struct perf_evsel *evsel)
@@ -156,6 +168,45 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
perf_evsel__close_fd_cpu(evsel, cpu);
}

+int perf_evsel__mmap(struct perf_evsel *evsel, int pages)
+{
+ int ret, cpu, thread;
+ struct perf_mmap_param mp = {
+ .prot = PROT_READ | PROT_WRITE,
+ .mask = (pages * page_size) - 1,
+ };
+
+ if (evsel->mmap == NULL &&
+ perf_evsel__alloc_mmap(evsel, xyarray__max_x(evsel->fd), xyarray__max_y(evsel->fd)) < 0)
+ return -ENOMEM;
+
+ for (cpu = 0; cpu < xyarray__max_x(evsel->fd); cpu++) {
+ for (thread = 0; thread < xyarray__max_y(evsel->fd); thread++) {
+ int fd = FD(evsel, cpu, thread);
+ struct perf_mmap *map = MMAP(evsel, cpu, thread);
+
+ if (fd < 0)
+ continue;
+
+ perf_mmap__init(map, NULL, false, NULL);
+
+ ret = perf_mmap__mmap(map, &mp, fd, cpu);
+ if (ret)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int thread)
+{
+ if (FD(evsel, cpu, thread) < 0 || MMAP(evsel, cpu, thread) == NULL)
+ return NULL;
+
+ return MMAP(evsel, cpu, thread)->base;
+}
+
int perf_evsel__read_size(struct perf_evsel *evsel)
{
u64 read_format = evsel->attr.read_format;
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index 1ffd083b235e..1c067d088bc6 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -41,6 +41,7 @@ struct perf_evsel {
struct perf_cpu_map *own_cpus;
struct perf_thread_map *threads;
struct xyarray *fd;
+ struct xyarray *mmap;
struct xyarray *sample_id;
u64 *id;
u32 ids;
diff --git a/tools/lib/perf/include/perf/evsel.h b/tools/lib/perf/include/perf/evsel.h
index c82ec39a4ad0..9f5265f2f39f 100644
--- a/tools/lib/perf/include/perf/evsel.h
+++ b/tools/lib/perf/include/perf/evsel.h
@@ -27,6 +27,8 @@ LIBPERF_API int perf_evsel__open(struct perf_evsel *evsel, struct perf_cpu_map *
struct perf_thread_map *threads);
LIBPERF_API void perf_evsel__close(struct perf_evsel *evsel);
LIBPERF_API void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu);
+LIBPERF_API int perf_evsel__mmap(struct perf_evsel *evsel, int pages);
+LIBPERF_API void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int thread);
LIBPERF_API int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
struct perf_counts_values *count);
LIBPERF_API int perf_evsel__enable(struct perf_evsel *evsel);
diff --git a/tools/lib/perf/libperf.map b/tools/lib/perf/libperf.map
index 7be1af8a546c..0b993de15830 100644
--- a/tools/lib/perf/libperf.map
+++ b/tools/lib/perf/libperf.map
@@ -23,6 +23,8 @@ LIBPERF_0.0.1 {
perf_evsel__disable;
perf_evsel__open;
perf_evsel__close;
+ perf_evsel__mmap;
+ perf_evsel__mmap_base;
perf_evsel__read;
perf_evsel__cpus;
perf_evsel__threads;
--
2.27.0

2021-03-11 00:10:29

by Rob Herring (Arm)

Subject: [PATCH v6 07/10] libperf: Add arm64 support to perf_mmap__read_self()

Add the arm64 variants for read_perf_counter() and read_timestamp().
Unfortunately the counter number is encoded into the instruction, so the
code is a bit verbose to enumerate all possible counters.

Signed-off-by: Rob Herring <[email protected]>
---
tools/lib/perf/mmap.c | 98 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 98 insertions(+)

diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 915469f00cf4..216e6c6212a2 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -13,6 +13,7 @@
#include <internal/lib.h>
#include <linux/kernel.h>
#include <linux/math64.h>
+#include <linux/stringify.h>
#include "internal.h"

void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
@@ -294,6 +295,103 @@ static u64 read_timestamp(void)

return low | ((u64)high) << 32;
}
+#elif defined(__aarch64__)
+#define read_sysreg(r) ({ \
+ u64 __val; \
+ asm volatile("mrs %0, " __stringify(r) : "=r" (__val)); \
+ __val; \
+})
+
+static u64 read_pmccntr(void)
+{
+ return read_sysreg(pmccntr_el0);
+}
+
+#define PMEVCNTR_READ(idx) \
+ static u64 read_pmevcntr_##idx(void) { \
+ return read_sysreg(pmevcntr##idx##_el0); \
+ }
+
+PMEVCNTR_READ(0);
+PMEVCNTR_READ(1);
+PMEVCNTR_READ(2);
+PMEVCNTR_READ(3);
+PMEVCNTR_READ(4);
+PMEVCNTR_READ(5);
+PMEVCNTR_READ(6);
+PMEVCNTR_READ(7);
+PMEVCNTR_READ(8);
+PMEVCNTR_READ(9);
+PMEVCNTR_READ(10);
+PMEVCNTR_READ(11);
+PMEVCNTR_READ(12);
+PMEVCNTR_READ(13);
+PMEVCNTR_READ(14);
+PMEVCNTR_READ(15);
+PMEVCNTR_READ(16);
+PMEVCNTR_READ(17);
+PMEVCNTR_READ(18);
+PMEVCNTR_READ(19);
+PMEVCNTR_READ(20);
+PMEVCNTR_READ(21);
+PMEVCNTR_READ(22);
+PMEVCNTR_READ(23);
+PMEVCNTR_READ(24);
+PMEVCNTR_READ(25);
+PMEVCNTR_READ(26);
+PMEVCNTR_READ(27);
+PMEVCNTR_READ(28);
+PMEVCNTR_READ(29);
+PMEVCNTR_READ(30);
+
+/*
+ * Read a value direct from PMEVCNTR<idx>
+ */
+static u64 read_perf_counter(unsigned int counter)
+{
+ static u64 (* const read_f[])(void) = {
+ read_pmevcntr_0,
+ read_pmevcntr_1,
+ read_pmevcntr_2,
+ read_pmevcntr_3,
+ read_pmevcntr_4,
+ read_pmevcntr_5,
+ read_pmevcntr_6,
+ read_pmevcntr_7,
+ read_pmevcntr_8,
+ read_pmevcntr_9,
+ read_pmevcntr_10,
+ read_pmevcntr_11,
+ read_pmevcntr_12,
+ read_pmevcntr_13,
+ read_pmevcntr_14,
+ read_pmevcntr_15,
+ read_pmevcntr_16,
+ read_pmevcntr_17,
+ read_pmevcntr_18,
+ read_pmevcntr_19,
+ read_pmevcntr_20,
+ read_pmevcntr_21,
+ read_pmevcntr_22,
+ read_pmevcntr_23,
+ read_pmevcntr_24,
+ read_pmevcntr_25,
+ read_pmevcntr_26,
+ read_pmevcntr_27,
+ read_pmevcntr_28,
+ read_pmevcntr_29,
+ read_pmevcntr_30,
+ read_pmccntr
+ };
+
+ if (counter < ARRAY_SIZE(read_f))
+ return (read_f[counter])();
+
+ return 0;
+}
+
+static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
+
#else
static u64 read_perf_counter(unsigned int counter) { return 0; }
static u64 read_timestamp(void) { return 0; }
--
2.27.0

2021-03-11 00:10:29

by Rob Herring (Arm)

Subject: [PATCH v6 08/10] perf: arm64: Add test for userspace counter access on heterogeneous systems

On heterogeneous systems, userspace counter access only works with some
restrictions: the userspace process must be pinned to a homogeneous
subset of CPUs and must open the corresponding PMU for those CPUs. This
commit adds a test implementing these requirements (sketched below).
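
A sketch of what those requirements look like for a raw
perf_event_open() user (the test below does the equivalent through
libperf and perf's pmu helpers; 'cpu', 'attr' and 'pmu_type' are assumed
to be set up by the caller, and the event number is an example):

  cpu_set_t set;

  CPU_ZERO(&set);
  CPU_SET(cpu, &set);                     /* a CPU served by one core PMU */
  sched_setaffinity(0, sizeof(set), &set);

  /* open by PMU type, e.g. /sys/bus/event_source/devices/<pmu>/type */
  attr.type = pmu_type;
  attr.config = 0x8;                      /* h/w specific event (INST_RETIRED) */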

Signed-off-by: Rob Herring <[email protected]>
---
v6:
- Add a check on cap_user_rdpmc
v5:
- Adapt to libperf mmap API changes
v4:
- Update perf_evsel__mmap params
v2:
- Drop all but heterogeneous test as others covered by libperf tests
- Rework to use libperf
---
tools/perf/arch/arm64/include/arch-tests.h | 7 +
tools/perf/arch/arm64/tests/Build | 1 +
tools/perf/arch/arm64/tests/arch-tests.c | 4 +
tools/perf/arch/arm64/tests/user-events.c | 177 +++++++++++++++++++++
4 files changed, 189 insertions(+)
create mode 100644 tools/perf/arch/arm64/tests/user-events.c

diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
index 90ec4c8cb880..380ad34a3f09 100644
--- a/tools/perf/arch/arm64/include/arch-tests.h
+++ b/tools/perf/arch/arm64/include/arch-tests.h
@@ -2,11 +2,18 @@
#ifndef ARCH_TESTS_H
#define ARCH_TESTS_H

+#include <linux/compiler.h>
+
#ifdef HAVE_DWARF_UNWIND_SUPPORT
struct thread;
struct perf_sample;
+int test__arch_unwind_sample(struct perf_sample *sample,
+ struct thread *thread);
#endif

extern struct test arch_tests[];
+int test__rd_pinned(struct test __maybe_unused *test,
+ int __maybe_unused subtest);
+

#endif
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index a61c06bdb757..3f9a20c17fc6 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,4 +1,5 @@
perf-y += regs_load.o
perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o

+perf-y += user-events.o
perf-y += arch-tests.o
diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
index 5b1543c98022..80ce7bd3c16d 100644
--- a/tools/perf/arch/arm64/tests/arch-tests.c
+++ b/tools/perf/arch/arm64/tests/arch-tests.c
@@ -10,6 +10,10 @@ struct test arch_tests[] = {
.func = test__dwarf_unwind,
},
#endif
+ {
+ .desc = "Pinned CPU user counter access",
+ .func = test__rd_pinned,
+ },
{
.func = NULL,
},
diff --git a/tools/perf/arch/arm64/tests/user-events.c b/tools/perf/arch/arm64/tests/user-events.c
new file mode 100644
index 000000000000..c8efc6b369e6
--- /dev/null
+++ b/tools/perf/arch/arm64/tests/user-events.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+#include <sched.h>
+#include <cpumap.h>
+
+#include <perf/core.h>
+#include <perf/threadmap.h>
+#include <perf/evsel.h>
+
+#include "pmu.h"
+#include "debug.h"
+#include "tests/tests.h"
+#include "arch-tests.h"
+
+static int run_test(struct perf_evsel *evsel)
+{
+ int n;
+ volatile int tmp = 0;
+ u64 delta, i, loops = 1000;
+ struct perf_counts_values counts = { .val = 0 };
+
+ for (n = 0; n < 6; n++) {
+ u64 stamp, now;
+
+ perf_evsel__read(evsel, 0, 0, &counts);
+ stamp = counts.val;
+
+ for (i = 0; i < loops; i++)
+ tmp++;
+
+ perf_evsel__read(evsel, 0, 0, &counts);
+ now = counts.val;
+ loops *= 10;
+
+ delta = now - stamp;
+ pr_debug("%14d: %14llu\n", n, (long long)delta);
+
+ if (!delta)
+ break;
+ }
+ return delta ? 0 : -1;
+}
+
+static struct perf_pmu *pmu_for_cpu(int cpu)
+{
+ int acpu, idx;
+ struct perf_pmu *pmu = NULL;
+
+ while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+ if (pmu->is_uncore)
+ continue;
+ perf_cpu_map__for_each_cpu(acpu, idx, pmu->cpus)
+ if (acpu == cpu)
+ return pmu;
+ }
+ return NULL;
+}
+
+static bool pmu_is_homogeneous(void)
+{
+ int core_cnt = 0;
+ struct perf_pmu *pmu = NULL;
+
+ while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+ if (!pmu->is_uncore && !perf_cpu_map__empty(pmu->cpus))
+ core_cnt++;
+ }
+ return core_cnt == 1;
+}
+
+static int libperf_print(enum libperf_print_level level,
+ const char *fmt, va_list ap)
+{
+ (void)level;
+ return vfprintf(stderr, fmt, ap);
+}
+
+static struct perf_evsel *perf_init(struct perf_event_attr *attr)
+{
+ int err;
+ struct perf_thread_map *threads;
+ struct perf_evsel *evsel;
+ struct perf_event_mmap_page *pc;
+
+ libperf_init(libperf_print);
+
+ threads = perf_thread_map__new_dummy();
+ if (!threads) {
+ pr_err("failed to create threads\n");
+ return NULL;
+ }
+
+ perf_thread_map__set_pid(threads, 0, 0);
+
+ evsel = perf_evsel__new(attr);
+ if (!evsel) {
+ pr_err("failed to create evsel\n");
+ goto out_thread;
+ }
+
+ err = perf_evsel__open(evsel, NULL, threads);
+ if (err) {
+ pr_err("failed to open evsel\n");
+ goto out_open;
+ }
+
+ if (perf_evsel__mmap(evsel, 0)) {
+ pr_err("failed to mmap evsel\n");
+ goto out_mmap;
+ }
+
+ pc = perf_evsel__mmap_base(evsel, 0, 0);
+ if (!pc->cap_user_rdpmc) {
+ pr_err("userspace access not enabled\n");
+ goto out_mmap;
+ }
+
+ return evsel;
+
+out_mmap:
+ perf_evsel__close(evsel);
+out_open:
+ perf_evsel__delete(evsel);
+out_thread:
+ perf_thread_map__put(threads);
+ return NULL;
+}
+
+int test__rd_pinned(struct test __maybe_unused *test,
+ int __maybe_unused subtest)
+{
+ int cpu, cputmp, ret = -1;
+ struct perf_evsel *evsel;
+ struct perf_event_attr attr = {
+ .config = 0x8, /* Instruction count */
+ .config1 = 0, /* 32-bit counter */
+ .exclude_kernel = 1,
+ };
+ cpu_set_t cpu_set;
+ struct perf_pmu *pmu;
+
+ if (pmu_is_homogeneous())
+ return TEST_SKIP;
+
+ cpu = sched_getcpu();
+ pmu = pmu_for_cpu(cpu);
+ if (!pmu)
+ return -1;
+ attr.type = pmu->type;
+
+ CPU_ZERO(&cpu_set);
+ perf_cpu_map__for_each_cpu(cpu, cputmp, pmu->cpus)
+ CPU_SET(cpu, &cpu_set);
+ if (sched_setaffinity(0, sizeof(cpu_set), &cpu_set) < 0)
+ pr_err("Could not set affinity\n");
+
+ evsel = perf_init(&attr);
+ if (!evsel)
+ return -1;
+
+ perf_cpu_map__for_each_cpu(cpu, cputmp, pmu->cpus) {
+ CPU_ZERO(&cpu_set);
+ CPU_SET(cpu, &cpu_set);
+ if (sched_setaffinity(0, sizeof(cpu_set), &cpu_set) < 0)
+ pr_err("Could not set affinity\n");
+
+ pr_debug("Running on CPU %d\n", cpu);
+
+ ret = run_test(evsel);
+ if (ret)
+ break;
+ }
+
+ perf_evsel__close(evsel);
+ perf_evsel__delete(evsel);
+ return ret;
+}
--
2.27.0

2021-03-11 00:10:37

by Rob Herring (Arm)

Subject: [PATCH v6 09/10] perf: arm64: Add tests for 32-bit and 64-bit counter size userspace access

Add arm64 specific tests for 32-bit and 64-bit counter user access. On
arm64, counters default to 32-bit unless attr.config1:0 is set to 1. In
order to enable user access for 64-bit counters, attr.config1 must be set
to 3.

Signed-off-by: Rob Herring <[email protected]>
---
v6:
- New patch
---
tools/perf/arch/arm64/include/arch-tests.h | 5 +++
tools/perf/arch/arm64/tests/arch-tests.c | 8 +++++
tools/perf/arch/arm64/tests/user-events.c | 38 ++++++++++++++++++++++
3 files changed, 51 insertions(+)

diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
index 380ad34a3f09..ddfa7460e1e1 100644
--- a/tools/perf/arch/arm64/include/arch-tests.h
+++ b/tools/perf/arch/arm64/include/arch-tests.h
@@ -15,5 +15,10 @@ extern struct test arch_tests[];
int test__rd_pinned(struct test __maybe_unused *test,
int __maybe_unused subtest);

+int test__rd_64bit(struct test __maybe_unused *test,
+ int __maybe_unused subtest);
+
+int test__rd_32bit(struct test __maybe_unused *test,
+ int __maybe_unused subtest);

#endif
diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
index 80ce7bd3c16d..bbdb81aa3229 100644
--- a/tools/perf/arch/arm64/tests/arch-tests.c
+++ b/tools/perf/arch/arm64/tests/arch-tests.c
@@ -14,6 +14,14 @@ struct test arch_tests[] = {
.desc = "Pinned CPU user counter access",
.func = test__rd_pinned,
},
+ {
+ .desc = "User 64-bit counter access",
+ .func = test__rd_64bit,
+ },
+ {
+ .desc = "User 32-bit counter access",
+ .func = test__rd_32bit,
+ },
{
.func = NULL,
},
diff --git a/tools/perf/arch/arm64/tests/user-events.c b/tools/perf/arch/arm64/tests/user-events.c
index c8efc6b369e6..546323b5242c 100644
--- a/tools/perf/arch/arm64/tests/user-events.c
+++ b/tools/perf/arch/arm64/tests/user-events.c
@@ -175,3 +175,41 @@ int test__rd_pinned(struct test __maybe_unused *test,
perf_evsel__delete(evsel);
return ret;
}
+
+static int test__rd_counter_size(struct test __maybe_unused *test,
+ int config1)
+{
+ int ret;
+ struct perf_evsel *evsel;
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_INSTRUCTIONS,
+ .config1 = config1,
+ .exclude_kernel = 1,
+ };
+
+ if (!pmu_is_homogeneous())
+ return TEST_SKIP;
+
+ evsel = perf_init(&attr);
+ if (!evsel)
+ return -1;
+
+ ret = run_test(evsel);
+
+ perf_evsel__close(evsel);
+ perf_evsel__delete(evsel);
+ return ret;
+}
+
+int test__rd_64bit(struct test __maybe_unused *test,
+ int __maybe_unused subtest)
+{
+ return test__rd_counter_size(test, 0x3);
+}
+
+int test__rd_32bit(struct test __maybe_unused *test,
+ int __maybe_unused subtest)
+{
+ return test__rd_counter_size(test, 0x2);
+}
--
2.27.0

2021-03-11 00:10:37

by Rob Herring (Arm)

Subject: [PATCH v6 10/10] Documentation: arm64: Document PMU counters access from userspace

From: Raphael Gault <[email protected]>

Add a documentation file to describe access to the PMU hardware
counters from userspace.

Signed-off-by: Raphael Gault <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
---
v6:
- Update the chained event section with attr.config1 details
v2:
- Update links to test examples

Changes from Raphael's v4:
- Convert to rSt
- Update chained event status
- Add section for heterogeneous systems
---
Documentation/arm64/index.rst | 1 +
.../arm64/perf_counter_user_access.rst | 60 +++++++++++++++++++
2 files changed, 61 insertions(+)
create mode 100644 Documentation/arm64/perf_counter_user_access.rst

diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 97d65ba12a35..eb7b1cabbf08 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -18,6 +18,7 @@ ARM64 Architecture
memory
memory-tagging-extension
perf
+ perf_counter_user_access
pointer-authentication
silicon-errata
sve
diff --git a/Documentation/arm64/perf_counter_user_access.rst b/Documentation/arm64/perf_counter_user_access.rst
new file mode 100644
index 000000000000..a42800e72458
--- /dev/null
+++ b/Documentation/arm64/perf_counter_user_access.rst
@@ -0,0 +1,60 @@
+=============================================
+Access to PMU hardware counter from userspace
+=============================================
+
+Overview
+--------
+The perf userspace tool relies on the PMU to monitor events. It offers an
+abstraction layer over the hardware counters since the underlying
+implementation is cpu-dependent.
+Arm64 allows userspace tools to have access to the registers storing the
+hardware counters' values directly.
+
+This targets specifically self-monitoring tasks in order to reduce the overhead
+by directly accessing the registers without having to go through the kernel.
+
+How-to
+------
+The focus is set on the armv8 pmuv3 which makes sure that the access to the pmu
+registers is enabled and that the userspace has access to the relevant
+information in order to use them.
+
+In order to have access to the hardware counter it is necessary to open the event
+using the perf tool interface: the sys_perf_event_open syscall returns a fd which
+can subsequently be used with the mmap syscall in order to retrieve a page of
+memory containing information about the event.
+The PMU driver uses this page to expose to the user the hardware counter's
+index and other necessary data. Using this index enables the user to access the
+PMU registers using the `mrs` instruction.
+
+The userspace access is supported in libperf using the perf_evsel__mmap()
+and perf_evsel__read() functions. See `tools/lib/perf/tests/test-evsel.c`_ for
+an example.
+
+About heterogeneous systems
+---------------------------
+On heterogeneous systems such as big.LITTLE, userspace PMU counter access can
+only be enabled when the tasks are pinned to a homogeneous subset of cores and
+the corresponding PMU instance is opened by specifying the 'type' attribute.
+The use of generic event types is not supported in this case.
+
+Have a look at `tools/perf/arch/arm64/tests/user-events.c`_ for an example. It
+can be run using the perf tool to check that the access to the registers works
+correctly from userspace:
+
+.. code-block:: sh
+
+ perf test -v user
+
+About chained events and 64-bit counters
+----------------------------------------
+Chained events are not supported in conjunction with userspace counter
+access. If a 64-bit counter is requested (attr.config1:0), then
+userspace access must also be requested with attr.config1:1 set. This
+will disable counter chaining. The 'pmc_width' in the user page will
+indicate the actual width of the counter which could be only 32-bits
+depending on event and PMU features.
+
+.. Links
+.. _tools/perf/arch/arm64/tests/user-events.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
--
2.27.0

2021-03-12 14:01:10

by Jiri Olsa

Subject: Re: [PATCH v6 04/10] libperf: Add evsel mmap support

On Wed, Mar 10, 2021 at 05:08:31PM -0700, Rob Herring wrote:

SNIP

> +
> static int
> sys_perf_event_open(struct perf_event_attr *attr,
> pid_t pid, int cpu, int group_fd,
> @@ -137,6 +147,8 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
> {
> xyarray__delete(evsel->fd);
> evsel->fd = NULL;
> + xyarray__delete(evsel->mmap);
> + evsel->mmap = NULL;
> }
>
> void perf_evsel__close(struct perf_evsel *evsel)
> @@ -156,6 +168,45 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
> perf_evsel__close_fd_cpu(evsel, cpu);
> }
>
> +int perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> +{
> + int ret, cpu, thread;
> + struct perf_mmap_param mp = {
> + .prot = PROT_READ | PROT_WRITE,
> + .mask = (pages * page_size) - 1,
> + };

I don't mind using evsel->fd for dimensions below,
but we need to check in here that it's defined,
that perf_evsel__open was called

jirka

> +
> + if (evsel->mmap == NULL &&
> + perf_evsel__alloc_mmap(evsel, xyarray__max_x(evsel->fd), xyarray__max_y(evsel->fd)) < 0)
> + return -ENOMEM;
> +
> + for (cpu = 0; cpu < xyarray__max_x(evsel->fd); cpu++) {
> + for (thread = 0; thread < xyarray__max_y(evsel->fd); thread++) {
> + int fd = FD(evsel, cpu, thread);
> + struct perf_mmap *map = MMAP(evsel, cpu, thread);
> +
> + if (fd < 0)
> + continue;
> +
> + perf_mmap__init(map, NULL, false, NULL);
> +
> + ret = perf_mmap__mmap(map, &mp, fd, cpu);
> + if (ret)
> + return -1;
> + }
> + }
> +
> + return 0;
> +}
> +

SNIP

2021-03-12 14:36:58

by Rob Herring (Arm)

Subject: Re: [PATCH v6 04/10] libperf: Add evsel mmap support

On Fri, Mar 12, 2021 at 6:59 AM Jiri Olsa <[email protected]> wrote:
>
> On Wed, Mar 10, 2021 at 05:08:31PM -0700, Rob Herring wrote:
>
> SNIP
>
> > +
> > static int
> > sys_perf_event_open(struct perf_event_attr *attr,
> > pid_t pid, int cpu, int group_fd,
> > @@ -137,6 +147,8 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
> > {
> > xyarray__delete(evsel->fd);
> > evsel->fd = NULL;
> > + xyarray__delete(evsel->mmap);
> > + evsel->mmap = NULL;
> > }
> >
> > void perf_evsel__close(struct perf_evsel *evsel)
> > @@ -156,6 +168,45 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
> > perf_evsel__close_fd_cpu(evsel, cpu);
> > }
> >
> > +int perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > +{
> > + int ret, cpu, thread;
> > + struct perf_mmap_param mp = {
> > + .prot = PROT_READ | PROT_WRITE,
> > + .mask = (pages * page_size) - 1,
> > + };
>
> I don't mind using evsel->fd for dimensions below,
> but we need to check in here that it's defined,
> that perf_evsel__open was called

Right, so I'll add this here:

if (evsel->fd == NULL)
return -EINVAL;

Note that struct evsel has dimensions in it, but they are only set in
the evlist code. I couldn't tell if that was by design or mistake.

BTW, I just noticed perf_evsel__open is leaking memory on most of its
error paths.

>
> jirka
>
> > +
> > + if (evsel->mmap == NULL &&
> > + perf_evsel__alloc_mmap(evsel, xyarray__max_x(evsel->fd), xyarray__max_y(evsel->fd)) < 0)
> > + return -ENOMEM;
> > +
> > + for (cpu = 0; cpu < xyarray__max_x(evsel->fd); cpu++) {
> > + for (thread = 0; thread < xyarray__max_y(evsel->fd); thread++) {
> > + int fd = FD(evsel, cpu, thread);
> > + struct perf_mmap *map = MMAP(evsel, cpu, thread);
> > +
> > + if (fd < 0)
> > + continue;
> > +
> > + perf_mmap__init(map, NULL, false, NULL);
> > +
> > + ret = perf_mmap__mmap(map, &mp, fd, cpu);
> > + if (ret)
> > + return -1;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
>
> SNIP
>

2021-03-12 18:31:35

by Jiri Olsa

Subject: Re: [PATCH v6 04/10] libperf: Add evsel mmap support

On Fri, Mar 12, 2021 at 07:34:39AM -0700, Rob Herring wrote:
> On Fri, Mar 12, 2021 at 6:59 AM Jiri Olsa <[email protected]> wrote:
> >
> > On Wed, Mar 10, 2021 at 05:08:31PM -0700, Rob Herring wrote:
> >
> > SNIP
> >
> > > +
> > > static int
> > > sys_perf_event_open(struct perf_event_attr *attr,
> > > pid_t pid, int cpu, int group_fd,
> > > @@ -137,6 +147,8 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
> > > {
> > > xyarray__delete(evsel->fd);
> > > evsel->fd = NULL;
> > > + xyarray__delete(evsel->mmap);
> > > + evsel->mmap = NULL;
> > > }
> > >
> > > void perf_evsel__close(struct perf_evsel *evsel)
> > > @@ -156,6 +168,45 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
> > > perf_evsel__close_fd_cpu(evsel, cpu);
> > > }
> > >
> > > +int perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > +{
> > > + int ret, cpu, thread;
> > > + struct perf_mmap_param mp = {
> > > + .prot = PROT_READ | PROT_WRITE,
> > > + .mask = (pages * page_size) - 1,
> > > + };
> >
> > I don't mind using evsel->fd for dimensions below,
> > but we need to check in here that it's defined,
> > that perf_evsel__open was called
>
> Right, so I'll add this here:
>
> if (evsel->fd == NULL)
> return -EINVAL;
>
> Note that struct evsel has dimensions in it, but they are only set in
> the evlist code. I couldn't tell if that was by design or mistake.

we do? we have the cpus and threads in perf_evsel no?
also you'd need perf_evsel not evsel, right?

>
> BTW, I just noticed perf_evsel__open is leaking memory on most of its
> error paths.

nice.. let's fix it ;-)

thanks,
jirka

>
> >
> > jirka
> >
> > > +
> > > + if (evsel->mmap == NULL &&
> > > + perf_evsel__alloc_mmap(evsel, xyarray__max_x(evsel->fd), xyarray__max_y(evsel->fd)) < 0)
> > > + return -ENOMEM;
> > > +
> > > + for (cpu = 0; cpu < xyarray__max_x(evsel->fd); cpu++) {
> > > + for (thread = 0; thread < xyarray__max_y(evsel->fd); thread++) {
> > > + int fd = FD(evsel, cpu, thread);
> > > + struct perf_mmap *map = MMAP(evsel, cpu, thread);
> > > +
> > > + if (fd < 0)
> > > + continue;
> > > +
> > > + perf_mmap__init(map, NULL, false, NULL);
> > > +
> > > + ret = perf_mmap__mmap(map, &mp, fd, cpu);
> > > + if (ret)
> > > + return -1;
> > > + }
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> >
> > SNIP
> >
>

2021-03-30 11:33:00

by Zachary Leaf

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On 11/03/2021 00:08, Rob Herring wrote:
>
> In order to enable 64-bit counters for userspace when available, a new
> config1 bit is added for userspace to indicate it wants userspace counter
> access. This bit allows the kernel to decide if chaining should be
> disabled, as chaining and userspace access are incompatible.
> The modes for config1 are as follows:
>
> config1 = 0 or 2 : user access enabled and always 32-bit
> config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> config1 = 3 : user access enabled and counter size matches underlying counter.
>

Thanks for this Rob. That makes it extremely easy to request 64 bit
userspace counters without having to worry about the underlying bit
width supported on your system.

The underlying PMUv3 bit width is otherwise not accessible from
userspace as far as I can tell (e.g. the relevant PMUVer bits [11:8] of
ID_AA64DFR0_EL1 are masked off when reading from EL0 [1]), and the
workaround of requesting 64 bit, checking cap_user_rdpmc for userspace
access, and retrying with 32 bit was not that user friendly. I think it
makes a lot of sense for the kernel to handle/expose it here rather than
handling it in the application code.

It's worth mentioning here, in case anyone searches for this, that a 32
bit counter, when added to perf_event_mmap_page->offset, behaves
effectively the same as a single 64 bit counter, because the offset is
incremented on each overflow. Using a true 64 bit counter can avoid the
overhead of handling an interrupt for each overflow (and obviously has a
lot more headroom before it overflows, if you require very long running
perf stats).
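
As a rough illustration of what the simpler request looks like from
userspace (a sketch only: the event type/config and error handling are
illustrative, and the config1 bit meanings are taken from the 'long' and
'rdpmc' format bits in the patch):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	struct perf_event_mmap_page *pc;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;	/* use the specific PMU type on heterogeneous systems */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.config1 = 0x3;	/* bit 0: 64-bit ("long"), bit 1: userspace access ("rdpmc") */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
	if (pc == MAP_FAILED)
		return 1;

	/* cap_user_rdpmc says whether direct reads are allowed; pmc_width
	 * says how wide the counter we actually got is. */
	printf("rdpmc: %u, width: %u\n",
	       (unsigned)pc->cap_user_rdpmc, (unsigned)pc->pmc_width);
	return 0;
}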

I have tested the new config1 flags on N1-SDP and the behaviour is as
expected.

[1]
https://github.com/torvalds/linux/blob/master/Documentation/arm64/cpu-feature-registers.rst

2021-03-30 11:35:35

by Zachary Leaf

[permalink] [raw]
Subject: Re: [PATCH v6 00/10] libperf and arm64 userspace counter access support

On 11/03/2021 00:08, Rob Herring wrote:
> Changes in v6:
> ...

Thanks Rob.

I haven't tested the libperf patches but the rest of the patches are
working well here for userspace access, based on my (fairly limited)
testing on N1-SDP.

Tested-by: Zach Leaf <[email protected]>

2021-03-30 15:33:09

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 01/10] arm64: pmu: Add function implementation to update event index in userpage

On Wed, Mar 10, 2021 at 05:08:28PM -0700, Rob Herring wrote:
> From: Raphael Gault <[email protected]>
>
> In order to be able to access the counter directly from userspace,
> we need to provide the index of the counter using the userpage.
> We thus need to override the event_idx function to retrieve and
> convert the perf_event index to the armv8 hardware index.
>
> Since the arm_pmu driver can be used by any implementation, even
> if not armv8, two components play a role in making sure the
> behaviour is correct and consistent with the PMU capabilities:
>
> * the ARMPMU_EL0_RD_CNTR flag, which denotes the capability to access
> counters from userspace.
> * the event_idx callback, which is implemented and initialized by
> the PMU implementation: if no callback is provided, the default
> behaviour applies, returning 0 as the index value.
>
> Signed-off-by: Raphael Gault <[email protected]>
> Signed-off-by: Rob Herring <[email protected]>
> ---
> arch/arm64/kernel/perf_event.c | 18 ++++++++++++++++++
> include/linux/perf/arm_pmu.h | 2 ++
> 2 files changed, 20 insertions(+)

Acked-by: Will Deacon <[email protected]>

Will

2021-03-30 15:33:31

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> From: Raphael Gault <[email protected]>
>
> Keep track of event opened with direct access to the hardware counters
> and modify permissions while they are open.
>
> The strategy used here is the same which x86 uses: every time an event
> is mapped, the permissions are set if required. The atomic field added
> in the mm_context helps keep track of the different event opened and
> de-activate the permissions when all are unmapped.
> We also need to update the permissions in the context switch code so
> that tasks keep the right permissions.
>
> In order to enable 64-bit counters for userspace when available, a new
> config1 bit is added for userspace to indicate it wants userspace counter
> access. This bit allows the kernel to decide if chaining should be
> disabled, as chaining and userspace access are incompatible.
> The modes for config1 are as follows:
>
> config1 = 0 or 2 : user access enabled and always 32-bit
> config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> config1 = 3 : user access enabled and counter size matches underlying counter.

In this last case, how does userspace know whether it got a 32-bit or a
64-bit counter?

> User access is enabled with config1 == 0 so that we match x86 behavior
> and don't need Arm specific code (outside the counter read).
>
> Signed-off-by: Raphael Gault <[email protected]>
> Signed-off-by: Rob Herring <[email protected]>
> ---
> I'm not completely sure if using current->active_mm in an IPI is okay?
> It seems to work in my testing.
>
> Peter Z says (event->oncpu == smp_processor_id()) in the user page
> update is always true, but my testing says otherwise[1].

Peter? Sounds like there's either a misunderstanding here or we have some
fundamental issue elsewhere.

> v6:
> - Add new attr.config1 rdpmc bit for userspace to hint it wants
> userspace access when also requesting 64-bit counters.
>
> v5:
> - Only set cap_user_rdpmc if event is on current cpu
> - Limit enabling/disabling access to CPUs associated with the PMU
> (supported_cpus) and with the mm_struct matching current->active_mm.
>
> v2:
> - Move mapped/unmapped into arm64 code. Fixes arm32.
> - Rebase on cap_user_time_short changes
>
> Changes from Raphael's v4:
> - Drop homogeneous check
> - Disable access for chained counters
> - Set pmc_width in user page
>
> [1] https://lore.kernel.org/lkml/CAL_JsqK+eKef5NaVnBfARCjRE3MYhfBfe54F9YHKbsTnWqLmLw@mail.gmail.com/
>
> user fix
> ---
> arch/arm64/include/asm/mmu.h | 5 ++
> arch/arm64/include/asm/mmu_context.h | 2 +
> arch/arm64/include/asm/perf_event.h | 14 +++++
> arch/arm64/kernel/perf_event.c | 86 +++++++++++++++++++++++++++-
> 4 files changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 75beffe2ee8a..ee08447455da 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -18,6 +18,11 @@
>
> typedef struct {
> atomic64_t id;
> + /*
> + * non-zero if userspace have access to hardware
> + * counters directly.
> + */
> + atomic_t pmu_direct_access;
> #ifdef CONFIG_COMPAT
> void *sigpage;
> #endif
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 70ce8c1d2b07..ccb5ff417b42 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -21,6 +21,7 @@
> #include <asm/proc-fns.h>
> #include <asm-generic/mm_hooks.h>
> #include <asm/cputype.h>
> +#include <asm/perf_event.h>
> #include <asm/sysreg.h>
> #include <asm/tlbflush.h>
>
> @@ -230,6 +231,7 @@ static inline void __switch_mm(struct mm_struct *next)
> }
>
> check_and_switch_context(next);
> + perf_switch_user_access(next);
> }
>
> static inline void
> diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> index 60731f602d3e..112f3f63b79e 100644
> --- a/arch/arm64/include/asm/perf_event.h
> +++ b/arch/arm64/include/asm/perf_event.h
> @@ -8,6 +8,7 @@
>
> #include <asm/stack_pointer.h>
> #include <asm/ptrace.h>
> +#include <linux/mm_types.h>
>
> #define ARMV8_PMU_MAX_COUNTERS 32
> #define ARMV8_PMU_COUNTER_MASK (ARMV8_PMU_MAX_COUNTERS - 1)
> @@ -254,4 +255,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
> (regs)->pstate = PSR_MODE_EL1h; \
> }
>
> +static inline void perf_switch_user_access(struct mm_struct *mm)
> +{
> + if (!IS_ENABLED(CONFIG_PERF_EVENTS))
> + return;

CONFIG_HW_PERF_EVENTS might be a better fit here.

> +
> + if (atomic_read(&mm->context.pmu_direct_access)) {
> + write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
> + pmuserenr_el0);
> + } else {
> + write_sysreg(0, pmuserenr_el0);
> + }
> +}
> +
> #endif
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index 387838496955..9ad3cc523ef4 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -288,15 +288,22 @@ static const struct attribute_group armv8_pmuv3_events_attr_group = {
>
> PMU_FORMAT_ATTR(event, "config:0-15");
> PMU_FORMAT_ATTR(long, "config1:0");
> +PMU_FORMAT_ATTR(rdpmc, "config1:1");
>
> static inline bool armv8pmu_event_is_64bit(struct perf_event *event)
> {
> return event->attr.config1 & 0x1;
> }
>
> +static inline bool armv8pmu_event_want_user_access(struct perf_event *event)
> +{
> + return event->attr.config1 & 0x2;
> +}
> +
> static struct attribute *armv8_pmuv3_format_attrs[] = {
> &format_attr_event.attr,
> &format_attr_long.attr,
> + &format_attr_rdpmc.attr,
> NULL,
> };
>
> @@ -356,6 +363,7 @@ static inline bool armv8pmu_event_is_chained(struct perf_event *event)
> struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
>
> return !WARN_ON(idx < 0) &&
> + !armv8pmu_event_want_user_access(event) &&
> armv8pmu_event_is_64bit(event) &&
> !armv8pmu_has_long_event(cpu_pmu) &&
> (idx != ARMV8_IDX_CYCLE_COUNTER);
> @@ -849,13 +857,17 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
> if (evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) {
> if (!test_and_set_bit(ARMV8_IDX_CYCLE_COUNTER, cpuc->used_mask))
> return ARMV8_IDX_CYCLE_COUNTER;
> + else if (armv8pmu_event_is_64bit(event) &&
> + armv8pmu_event_want_user_access(event) &&
> + !armv8pmu_has_long_event(cpu_pmu))
> + return -EAGAIN;
> }
>
> /*
> * Otherwise use events counters
> */
> - if (armv8pmu_event_is_64bit(event) &&
> - !armv8pmu_has_long_event(cpu_pmu))
> + if (armv8pmu_event_is_64bit(event) && !armv8pmu_has_long_event(cpu_pmu) &&
> + !armv8pmu_event_want_user_access(event))

This logic is duplicated in armv8pmu_event_is_chained(); can you split it
into a helper, please?

> return armv8pmu_get_chain_idx(cpuc, cpu_pmu);
> else
> return armv8pmu_get_single_idx(cpuc, cpu_pmu);
> @@ -887,6 +899,46 @@ static int armv8pmu_access_event_idx(struct perf_event *event)
> return event->hw.idx;
> }
>
> +static void refresh_pmuserenr(void *mm)
> +{
> + if (mm == current->active_mm)
> + perf_switch_user_access(mm);
> +}
> +
> +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> +
> + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> + return;
> +
> + /*
> + * This function relies on not being called concurrently in two
> + * tasks in the same mm. Otherwise one task could observe
> + * pmu_direct_access > 1 and return all the way back to
> + * userspace with user access disabled while another task is still
> + * doing on_each_cpu_mask() to enable user access.
> + *
> + * For now, this can't happen because all callers hold mmap_lock
> + * for write. If this changes, we'll need a different solution.
> + */
> + lockdep_assert_held_write(&mm->mmap_lock);
> +
> + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> +}

Why do we need to cross-call here? Seems like it would be a tonne simpler to
handle the trap. Is there a reason not to do that?

> +
> +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> +
> + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> + return;
> +
> + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);

Given that the pmu_direct_access field is global per-mm, won't this go
wrong if multiple PMUs are opened by the same process but only a subset
are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
rather than a counter, so that we can 'and' it with the supported_cpus for
the PMU we're dealing with.

> +}
> +
> /*
> * Add an event filter to a given event.
> */
> @@ -980,9 +1032,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
> &armv8_pmuv3_perf_cache_map,
> ARMV8_PMU_EVTYPE_EVENT);
>
> - if (armv8pmu_event_is_64bit(event))
> + if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event)) {
> + event->hw.flags |= ARMPMU_EL0_RD_CNTR;

Why do you set this for all 32-bit events? The logic here feels like it
could do with a bit of untangling.

> + /*
> + * At this point, the counter is not assigned. If a 64-bit
> + * counter is requested, we must make sure the h/w has 64-bit
> + * counters if we set the event size to 64-bit because chaining
> + * is not supported with userspace access. This may still fail
> + * later on if the CPU cycle counter is in use.
> + */
> + if (armv8pmu_event_is_64bit(event) &&
> + (armv8pmu_has_long_event(armpmu) ||
> + hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES))
> + event->hw.flags |= ARMPMU_EVT_64BIT;
> + } else if (armv8pmu_event_is_64bit(event))
> event->hw.flags |= ARMPMU_EVT_64BIT;
>
> +
> /* Only expose micro/arch events supported by this PMU */
> if ((hw_event_id > 0) && (hw_event_id < ARMV8_PMUV3_MAX_COMMON_EVENTS)
> && test_bit(hw_event_id, armpmu->pmceid_bitmap)) {
> @@ -1115,6 +1181,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
> cpu_pmu->filter_match = armv8pmu_filter_match;
>
> cpu_pmu->pmu.event_idx = armv8pmu_access_event_idx;
> + cpu_pmu->pmu.event_mapped = armv8pmu_event_mapped;
> + cpu_pmu->pmu.event_unmapped = armv8pmu_event_unmapped;
>
> cpu_pmu->name = name;
> cpu_pmu->map_event = map_event;
> @@ -1290,6 +1358,18 @@ void arch_perf_update_userpage(struct perf_event *event,
> userpg->cap_user_time = 0;
> userpg->cap_user_time_zero = 0;
> userpg->cap_user_time_short = 0;
> + userpg->cap_user_rdpmc = !!(event->hw.flags & ARMPMU_EL0_RD_CNTR) &&
> + (event->oncpu == smp_processor_id());
> +
> + if (userpg->cap_user_rdpmc) {
> + struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
> +
> + if (armv8pmu_event_is_64bit(event) &&
> + (armv8pmu_has_long_event(cpu_pmu) || (userpg->index == 32)))

The '32' here is the fake index for the cycle counter, right? I think that
was introduced in the previous patch, so let's add a #define for it.
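
For illustration, a sketch of what that might look like in
arch_perf_update_userpage() (the macro name and the pmc_width assignment
are assumptions, not code from the series):

/* placeholder name for the suggested #define */
#define ARMV8_IDX_CYCLE_COUNTER_USER	32

	if (userpg->cap_user_rdpmc) {
		struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);

		if (armv8pmu_event_is_64bit(event) &&
		    (armv8pmu_has_long_event(cpu_pmu) ||
		     (userpg->index == ARMV8_IDX_CYCLE_COUNTER_USER)))
			userpg->pmc_width = 64;
		else
			userpg->pmc_width = 32;
	}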

Will

2021-03-30 17:13:17

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
>
> On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > From: Raphael Gault <[email protected]>
> >
> > Keep track of event opened with direct access to the hardware counters
> > and modify permissions while they are open.
> >
> > The strategy used here is the same which x86 uses: every time an event
> > is mapped, the permissions are set if required. The atomic field added
> > in the mm_context helps keep track of the different event opened and
> > de-activate the permissions when all are unmapped.
> > We also need to update the permissions in the context switch code so
> > that tasks keep the right permissions.
> >
> > In order to enable 64-bit counters for userspace when available, a new
> > config1 bit is added for userspace to indicate it wants userspace counter
> > access. This bit allows the kernel to decide if chaining should be
> > disabled, as chaining and userspace access are incompatible.
> > The modes for config1 are as follows:
> >
> > config1 = 0 or 2 : user access enabled and always 32-bit
> > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > config1 = 3 : user access enabled and counter size matches underlying counter.
>
> In this last case, how does userspace know whether it got a 32-bit or a
> 64-bit counter?

pmc_width in the user page. If the read loop is correct, then it
doesn't matter to the user.
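
For reference, a sketch of that read loop, following the sequence
documented for perf_event_mmap_page (the index-to-register mapping and
the raw counter reads below are assumptions based on this thread, not
code from the series):

#include <stdint.h>
#include <linux/perf_event.h>

/* Assumed mapping: index 32 is the cycle counter, 1..31 select
 * PMEVCNTR<index-1>_EL0. EL0 access must already have been granted
 * via PMUSERENR_EL0 by the kernel. */
static uint64_t read_counter(uint32_t idx)
{
	uint64_t val;

	if (idx == 32) {
		__asm__ volatile("mrs %0, pmccntr_el0" : "=r" (val));
	} else {
		__asm__ volatile("msr pmselr_el0, %0" : : "r" ((uint64_t)(idx - 1)));
		__asm__ volatile("isb" : : : "memory");
		__asm__ volatile("mrs %0, pmxevcntr_el0" : "=r" (val));
	}
	return val;
}

static uint64_t read_event_count(struct perf_event_mmap_page *pc)
{
	uint32_t seq, idx, width;
	uint64_t count;
	int64_t pmc;

	do {
		seq = pc->lock;
		__asm__ volatile("" : : : "memory");	/* barrier */

		idx = pc->index;
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx) {
			width = pc->pmc_width;
			pmc = read_counter(idx);
			/* sign-extend from pmc_width so deltas wrap correctly;
			 * this is why the counter width doesn't matter to the
			 * reader as long as this loop is used */
			pmc <<= 64 - width;
			pmc >>= 64 - width;
			count += pmc;
		}

		__asm__ volatile("" : : : "memory");	/* barrier */
	} while (pc->lock != seq);

	return count;
}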

>
> > User access is enabled with config1 == 0 so that we match x86 behavior
> > and don't need Arm specific code (outside the counter read).
> >
> > Signed-off-by: Raphael Gault <[email protected]>
> > Signed-off-by: Rob Herring <[email protected]>
> > ---
> > I'm not completely sure if using current->active_mm in an IPI is okay?
> > It seems to work in my testing.
> >
> > Peter Z says (event->oncpu == smp_processor_id()) in the user page
> > update is always true, but my testing says otherwise[1].
>
> Peter? Sounds like there's either a misunderstanding here or we have some
> fundamental issue elsewhere.
>
> > v6:
> > - Add new attr.config1 rdpmc bit for userspace to hint it wants
> > userspace access when also requesting 64-bit counters.
> >
> > v5:
> > - Only set cap_user_rdpmc if event is on current cpu
> > - Limit enabling/disabling access to CPUs associated with the PMU
> > (supported_cpus) and with the mm_struct matching current->active_mm.
> >
> > v2:
> > - Move mapped/unmapped into arm64 code. Fixes arm32.
> > - Rebase on cap_user_time_short changes
> >
> > Changes from Raphael's v4:
> > - Drop homogeneous check
> > - Disable access for chained counters
> > - Set pmc_width in user page
> >
> > [1] https://lore.kernel.org/lkml/CAL_JsqK+eKef5NaVnBfARCjRE3MYhfBfe54F9YHKbsTnWqLmLw@mail.gmail.com/
> >
> > user fix
> > ---
> > arch/arm64/include/asm/mmu.h | 5 ++
> > arch/arm64/include/asm/mmu_context.h | 2 +
> > arch/arm64/include/asm/perf_event.h | 14 +++++
> > arch/arm64/kernel/perf_event.c | 86 +++++++++++++++++++++++++++-
> > 4 files changed, 104 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > index 75beffe2ee8a..ee08447455da 100644
> > --- a/arch/arm64/include/asm/mmu.h
> > +++ b/arch/arm64/include/asm/mmu.h
> > @@ -18,6 +18,11 @@
> >
> > typedef struct {
> > atomic64_t id;
> > + /*
> > + * non-zero if userspace have access to hardware
> > + * counters directly.
> > + */
> > + atomic_t pmu_direct_access;
> > #ifdef CONFIG_COMPAT
> > void *sigpage;
> > #endif
> > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> > index 70ce8c1d2b07..ccb5ff417b42 100644
> > --- a/arch/arm64/include/asm/mmu_context.h
> > +++ b/arch/arm64/include/asm/mmu_context.h
> > @@ -21,6 +21,7 @@
> > #include <asm/proc-fns.h>
> > #include <asm-generic/mm_hooks.h>
> > #include <asm/cputype.h>
> > +#include <asm/perf_event.h>
> > #include <asm/sysreg.h>
> > #include <asm/tlbflush.h>
> >
> > @@ -230,6 +231,7 @@ static inline void __switch_mm(struct mm_struct *next)
> > }
> >
> > check_and_switch_context(next);
> > + perf_switch_user_access(next);
> > }
> >
> > static inline void
> > diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> > index 60731f602d3e..112f3f63b79e 100644
> > --- a/arch/arm64/include/asm/perf_event.h
> > +++ b/arch/arm64/include/asm/perf_event.h
> > @@ -8,6 +8,7 @@
> >
> > #include <asm/stack_pointer.h>
> > #include <asm/ptrace.h>
> > +#include <linux/mm_types.h>
> >
> > #define ARMV8_PMU_MAX_COUNTERS 32
> > #define ARMV8_PMU_COUNTER_MASK (ARMV8_PMU_MAX_COUNTERS - 1)
> > @@ -254,4 +255,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
> > (regs)->pstate = PSR_MODE_EL1h; \
> > }
> >
> > +static inline void perf_switch_user_access(struct mm_struct *mm)
> > +{
> > + if (!IS_ENABLED(CONFIG_PERF_EVENTS))
> > + return;
>
> CONFIG_HW_PERF_EVENTS might be a better fit here.
>
> > +
> > + if (atomic_read(&mm->context.pmu_direct_access)) {
> > + write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
> > + pmuserenr_el0);
> > + } else {
> > + write_sysreg(0, pmuserenr_el0);
> > + }
> > +}
> > +
> > #endif
> > diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> > index 387838496955..9ad3cc523ef4 100644
> > --- a/arch/arm64/kernel/perf_event.c
> > +++ b/arch/arm64/kernel/perf_event.c
> > @@ -288,15 +288,22 @@ static const struct attribute_group armv8_pmuv3_events_attr_group = {
> >
> > PMU_FORMAT_ATTR(event, "config:0-15");
> > PMU_FORMAT_ATTR(long, "config1:0");
> > +PMU_FORMAT_ATTR(rdpmc, "config1:1");
> >
> > static inline bool armv8pmu_event_is_64bit(struct perf_event *event)
> > {
> > return event->attr.config1 & 0x1;
> > }
> >
> > +static inline bool armv8pmu_event_want_user_access(struct perf_event *event)
> > +{
> > + return event->attr.config1 & 0x2;
> > +}
> > +
> > static struct attribute *armv8_pmuv3_format_attrs[] = {
> > &format_attr_event.attr,
> > &format_attr_long.attr,
> > + &format_attr_rdpmc.attr,
> > NULL,
> > };
> >
> > @@ -356,6 +363,7 @@ static inline bool armv8pmu_event_is_chained(struct perf_event *event)
> > struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
> >
> > return !WARN_ON(idx < 0) &&
> > + !armv8pmu_event_want_user_access(event) &&
> > armv8pmu_event_is_64bit(event) &&
> > !armv8pmu_has_long_event(cpu_pmu) &&
> > (idx != ARMV8_IDX_CYCLE_COUNTER);
> > @@ -849,13 +857,17 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
> > if (evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) {
> > if (!test_and_set_bit(ARMV8_IDX_CYCLE_COUNTER, cpuc->used_mask))
> > return ARMV8_IDX_CYCLE_COUNTER;
> > + else if (armv8pmu_event_is_64bit(event) &&
> > + armv8pmu_event_want_user_access(event) &&
> > + !armv8pmu_has_long_event(cpu_pmu))
> > + return -EAGAIN;
> > }
> >
> > /*
> > * Otherwise use events counters
> > */
> > - if (armv8pmu_event_is_64bit(event) &&
> > - !armv8pmu_has_long_event(cpu_pmu))
> > + if (armv8pmu_event_is_64bit(event) && !armv8pmu_has_long_event(cpu_pmu) &&
> > + !armv8pmu_event_want_user_access(event))
>
> This logic is duplicated in armv8pmu_event_is_chained(); can you split it
> into a helper, please?
>
> > return armv8pmu_get_chain_idx(cpuc, cpu_pmu);
> > else
> > return armv8pmu_get_single_idx(cpuc, cpu_pmu);
> > @@ -887,6 +899,46 @@ static int armv8pmu_access_event_idx(struct perf_event *event)
> > return event->hw.idx;
> > }
> >
> > +static void refresh_pmuserenr(void *mm)
> > +{
> > + if (mm == current->active_mm)
> > + perf_switch_user_access(mm);
> > +}
> > +
> > +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > +
> > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > + return;
> > +
> > + /*
> > + * This function relies on not being called concurrently in two
> > + * tasks in the same mm. Otherwise one task could observe
> > + * pmu_direct_access > 1 and return all the way back to
> > + * userspace with user access disabled while another task is still
> > + * doing on_each_cpu_mask() to enable user access.
> > + *
> > + * For now, this can't happen because all callers hold mmap_lock
> > + * for write. If this changes, we'll need a different solution.
> > + */
> > + lockdep_assert_held_write(&mm->mmap_lock);
> > +
> > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > +}
>
> Why do we need to cross-call here? Seems like it would be a tonne simpler to
> handle the trap. Is there a reason not to do that?

Alignment with x86? You didn't like the trap handler either:

> Hmm... this feels pretty fragile since, although we may expect userspace only
> to trigger this in the context of the specific perf use-case, we don't have
> a way to detect that, so the ABI we're exposing is that EL0 accesses to
> non-existent counters will return 0. I don't really think that's something
> we want to commit to.

Full mail: https://lore.kernel.org/r/20200928182601.GA11974@willie-the-truck

The handler would be different, but we'd still have the problem of not
being able to detect the use case, right?

> > +
> > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > +
> > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > + return;
> > +
> > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
>
> Given that the pmu_direct_access field is global per-mm, won't this go
> wrong if multiple PMUs are opened by the same process but only a subset
> are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> rather than a counter, so that we can 'and' it with the supported_cpus for
> the PMU we're dealing with.

It needs to be a count to support multiple events on the same PMU. If
the event is not enabled for EL0, then we'd exit out on the
ARMPMU_EL0_RD_CNTR check. So I think we're fine.

> > @@ -980,9 +1032,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
> > &armv8_pmuv3_perf_cache_map,
> > ARMV8_PMU_EVTYPE_EVENT);
> >
> > - if (armv8pmu_event_is_64bit(event))
> > + if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event)) {
> > + event->hw.flags |= ARMPMU_EL0_RD_CNTR;
>
> Why do you set this for all 32-bit events?

It goes back to the config1 bits as explained in the commit msg. We
can always support user access for 32-bit counters, but for 64-bit
counters the user has to request both user access and 64-bit counters.
We could require explicit user access request for 32-bit access, but I
thought it was better to not require userspace to do something Arm
specific on open.

> The logic here feels like it
> could do with a bit of untangling.

Yes, I don't love it, but couldn't come up with anything better. It is
complicated by the fact that flags have to be set before we assign the
counter and can't set/change them when we assign the counter. It would
take a lot of refactoring with armpmu code to fix that.

>
> > + /*
> > + * At this point, the counter is not assigned. If a 64-bit
> > + * counter is requested, we must make sure the h/w has 64-bit
> > + * counters if we set the event size to 64-bit because chaining
> > + * is not supported with userspace access. This may still fail
> > + * later on if the CPU cycle counter is in use.
> > + */
> > + if (armv8pmu_event_is_64bit(event) &&
> > + (armv8pmu_has_long_event(armpmu) ||
> > + hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES))
> > + event->hw.flags |= ARMPMU_EVT_64BIT;
> > + } else if (armv8pmu_event_is_64bit(event))
> > event->hw.flags |= ARMPMU_EVT_64BIT;

2021-03-30 21:13:32

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Tue, Mar 30, 2021 at 12:09 PM Rob Herring <[email protected]> wrote:
>
> On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> >
> > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > From: Raphael Gault <[email protected]>
> > >
> > > Keep track of event opened with direct access to the hardware counters
> > > and modify permissions while they are open.
> > >
> > > The strategy used here is the same which x86 uses: every time an event
> > > is mapped, the permissions are set if required. The atomic field added
> > > in the mm_context helps keep track of the different event opened and
> > > de-activate the permissions when all are unmapped.
> > > We also need to update the permissions in the context switch code so
> > > that tasks keep the right permissions.
> > >
> > > In order to enable 64-bit counters for userspace when available, a new
> > > config1 bit is added for userspace to indicate it wants userspace counter
> > > access. This bit allows the kernel to decide if chaining should be
> > > disabled and chaining and userspace access are incompatible.
> > > The modes for config1 are as follows:
> > >
> > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > > config1 = 3 : user access enabled and counter size matches underlying counter.

[...]

> > > @@ -980,9 +1032,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
> > > &armv8_pmuv3_perf_cache_map,
> > > ARMV8_PMU_EVTYPE_EVENT);
> > >
> > > - if (armv8pmu_event_is_64bit(event))
> > > + if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event)) {
> > > + event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> >
> > Why do you set this for all 32-bit events?
>
> It goes back to the config1 bits as explained in the commit msg. We
> can always support user access for 32-bit counters, but for 64-bit
> counters the user has to request both user access and 64-bit counters.
> We could require explicit user access request for 32-bit access, but I
> thought it was better to not require userspace to do something Arm
> specific on open.
>
> > The logic here feels like it
> > could do with a bit of untangling.
>
> Yes, I don't love it, but couldn't come up with anything better. It is
> complicated by the fact that flags have to be set before we assign the
> counter and can't set/change them when we assign the counter. It would
> take a lot of refactoring with armpmu code to fix that.

How's this instead?:

	if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event))
		event->hw.flags |= ARMPMU_EL0_RD_CNTR;

	/*
	 * At this point, the counter is not assigned. If a 64-bit counter is
	 * requested, we must make sure the h/w has 64-bit counters if we set
	 * the event size to 64-bit because chaining is not supported with
	 * userspace access. This may still fail later on if the CPU cycle
	 * counter is in use.
	 */
	if (armv8pmu_event_is_64bit(event) &&
	    (!armv8pmu_event_want_user_access(event) ||
	     armv8pmu_has_long_event(cpu_pmu) ||
	     (hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
		event->hw.flags |= ARMPMU_EVT_64BIT;

> > > + /*
> > > + * At this point, the counter is not assigned. If a 64-bit
> > > + * counter is requested, we must make sure the h/w has 64-bit
> > > + * counters if we set the event size to 64-bit because chaining
> > > + * is not supported with userspace access. This may still fail
> > > + * later on if the CPU cycle counter is in use.
> > > + */
> > > + if (armv8pmu_event_is_64bit(event) &&
> > > + (armv8pmu_has_long_event(armpmu) ||
> > > + hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES))
> > > + event->hw.flags |= ARMPMU_EVT_64BIT;
> > > + } else if (armv8pmu_event_is_64bit(event))
> > > event->hw.flags |= ARMPMU_EVT_64BIT;

2021-03-31 15:40:22

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Tue, Mar 30, 2021 at 04:08:11PM -0500, Rob Herring wrote:
> On Tue, Mar 30, 2021 at 12:09 PM Rob Herring <[email protected]> wrote:
> > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > The logic here feels like it
> > > could do with a bit of untangling.
> >
> > Yes, I don't love it, but couldn't come up with anything better. It is
> > complicated by the fact that flags have to be set before we assign the
> > counter and can't set/change them when we assign the counter. It would
> > take a lot of refactoring with armpmu code to fix that.
>
> How's this instead?:
>
> if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event))
> event->hw.flags |= ARMPMU_EL0_RD_CNTR;
>
> /*
> * At this point, the counter is not assigned. If a 64-bit counter is
> * requested, we must make sure the h/w has 64-bit counters if we set
> * the event size to 64-bit because chaining is not supported with
> * userspace access. This may still fail later on if the CPU cycle
> * counter is in use.
> */
> if (armv8pmu_event_is_64bit(event) &&
> (!armv8pmu_event_want_user_access(event) ||
> armv8pmu_has_long_event(cpu_pmu) || (hw_event_id ==
> ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> event->hw.flags |= ARMPMU_EVT_64BIT;

I thought there were some cases where we could assign cycles event to an
event counter; does that not happen anymore?

Will

2021-03-31 16:02:06

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 10/10] Documentation: arm64: Document PMU counters access from userspace

On Wed, Mar 10, 2021 at 05:08:37PM -0700, Rob Herring wrote:
> From: Raphael Gault <[email protected]>
>
> Add a documentation file to describe access to the PMU hardware
> counters from userspace
>
> Signed-off-by: Raphael Gault <[email protected]>
> Signed-off-by: Rob Herring <[email protected]>
> ---
> v6:
> - Update the chained event section with attr.config1 details
> v2:
> - Update links to test examples
>
> Changes from Raphael's v4:
> - Convert to rSt
> - Update chained event status
> - Add section for heterogeneous systems
> ---
> Documentation/arm64/index.rst | 1 +
> .../arm64/perf_counter_user_access.rst | 60 +++++++++++++++++++

We already have Documentation/arm64/perf.rst so I think you can add this
in there as a new section.

Will

2021-03-31 16:03:00

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> >
> > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > From: Raphael Gault <[email protected]>
> > >
> > > Keep track of event opened with direct access to the hardware counters
> > > and modify permissions while they are open.
> > >
> > > The strategy used here is the same which x86 uses: every time an event
> > > is mapped, the permissions are set if required. The atomic field added
> > > in the mm_context helps keep track of the different event opened and
> > > de-activate the permissions when all are unmapped.
> > > We also need to update the permissions in the context switch code so
> > > that tasks keep the right permissions.
> > >
> > > In order to enable 64-bit counters for userspace when available, a new
> > > config1 bit is added for userspace to indicate it wants userspace counter
> > > access. This bit allows the kernel to decide if chaining should be
> > > disabled, as chaining and userspace access are incompatible.
> > > The modes for config1 are as follows:
> > >
> > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > > config1 = 3 : user access enabled and counter size matches underlying counter.
> >
> > In this last case, how does userspace know whether it got a 32-bit or a
> > 64-bit counter?
>
> pmc_width in the user page. If the read loop is correct, then it
> doesn't matter to the user.

Gotcha; please mention that in this comment,

> > > +static void refresh_pmuserenr(void *mm)
> > > +{
> > > + if (mm == current->active_mm)
> > > + perf_switch_user_access(mm);
> > > +}
> > > +
> > > +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > +
> > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > + return;
> > > +
> > > + /*
> > > + * This function relies on not being called concurrently in two
> > > + * tasks in the same mm. Otherwise one task could observe
> > > + * pmu_direct_access > 1 and return all the way back to
> > > + * userspace with user access disabled while another task is still
> > > + * doing on_each_cpu_mask() to enable user access.
> > > + *
> > > + * For now, this can't happen because all callers hold mmap_lock
> > > + * for write. If this changes, we'll need a different solution.
> > > + */
> > > + lockdep_assert_held_write(&mm->mmap_lock);
> > > +
> > > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > +}
> >
> > Why do we need to cross-call here? Seems like it would be a tonne simpler to
> > handle the trap. Is there a reason not to do that?
>
> Alignment with x86? You didn't like the trap handler either:
>
> > Hmm... this feels pretty fragile since, although we may expect userspace only
> > to trigger this in the context of the specific perf use-case, we don't have
> > a way to detect that, so the ABI we're exposing is that EL0 accesses to
> > non-existent counters will return 0. I don't really think that's something
> > we want to commit to.
>
> Full mail: https://lore.kernel.org/r/20200928182601.GA11974@willie-the-truck
>
> The handler would be different, but we'd still have the problem of not
> being able to detect the use case, right?

Well hang on, the thing you link to was about _emulating_ the access. I'm
not saying we should do that, but rather just enable EL0 PMU access lazily
instead of using the IPI. We're going to enable it when we context-switch
anyway.

The alternative is enabling EL0 access unconditionally and leaving it
enabled.

> > > +
> > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > +
> > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > + return;
> > > +
> > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> >
> > Given that the pmu_direct_access field is global per-mm, won't this go
> > wrong if multiple PMUs are opened by the same process but only a subset
> > are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> > rather than a counter, so that we can 'and' it with the supported_cpus for
> > the PMU we're dealing with.
>
> It needs to be a count to support multiple events on the same PMU. If
> the event is not enabled for EL0, then we'd exit out on the
> ARMPMU_EL0_RD_CNTR check. So I think we're fine.

I'm still not convinced; pmu_direct_access is shared between PMUs, so
testing the result of atomic_dec_and_test() just doesn't make sense to
me, as another PMU could be playing with the count.

> > > @@ -980,9 +1032,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
> > > &armv8_pmuv3_perf_cache_map,
> > > ARMV8_PMU_EVTYPE_EVENT);
> > >
> > > - if (armv8pmu_event_is_64bit(event))
> > > + if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event)) {
> > > + event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> >
> > Why do you set this for all 32-bit events?
>
> It goes back to the config1 bits as explained in the commit msg. We
> can always support user access for 32-bit counters, but for 64-bit
> counters the user has to request both user access and 64-bit counters.
> We could require explicit user access request for 32-bit access, but I
> thought it was better to not require userspace to do something Arm
> specific on open.

Got it, thanks.

Will

2021-03-31 17:54:14

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Wed, Mar 31, 2021 at 10:38 AM Will Deacon <[email protected]> wrote:
>
> On Tue, Mar 30, 2021 at 04:08:11PM -0500, Rob Herring wrote:
> > On Tue, Mar 30, 2021 at 12:09 PM Rob Herring <[email protected]> wrote:
> > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > > The logic here feels like it
> > > > could do with a bit of untangling.
> > >
> > > Yes, I don't love it, but couldn't come up with anything better. It is
> > > complicated by the fact that flags have to be set before we assign the
> > > counter and can't set/change them when we assign the counter. It would
> > > take a lot of refactoring with armpmu code to fix that.
> >
> > How's this instead?:
> >
> > if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event))
> > event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> >
> > /*
> > * At this point, the counter is not assigned. If a 64-bit counter is
> > * requested, we must make sure the h/w has 64-bit counters if we set
> > * the event size to 64-bit because chaining is not supported with
> > * userspace access. This may still fail later on if the CPU cycle
> > * counter is in use.
> > */
> > if (armv8pmu_event_is_64bit(event) &&
> > (!armv8pmu_event_want_user_access(event) ||
> > armv8pmu_has_long_event(cpu_pmu) || (hw_event_id ==
> > ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> > event->hw.flags |= ARMPMU_EVT_64BIT;
>
> I thought there were some cases where we could assign cycles event to an
> event counter; does that not happen anymore?

Yes, but if we hit that scenario when the user has asked for 64-bit
user access, then we return an error later when assigning the counter.
I think we can assume if users have gone to the trouble of requesting
64-bit counters, then they can deal with ensuring they don't have
multiple users.

Otherwise, the only way I see to simplify this is we only support
64-bit counters in userspace when we have v8.5 PMU.

Rob

2021-03-31 22:09:07

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 04/10] libperf: Add evsel mmap support

On Fri, Mar 12, 2021 at 12:29 PM Jiri Olsa <[email protected]> wrote:
>
> On Fri, Mar 12, 2021 at 07:34:39AM -0700, Rob Herring wrote:
> > On Fri, Mar 12, 2021 at 6:59 AM Jiri Olsa <[email protected]> wrote:
> > >
> > > On Wed, Mar 10, 2021 at 05:08:31PM -0700, Rob Herring wrote:
> > >
> > > SNIP
> > >
> > > > +
> > > > static int
> > > > sys_perf_event_open(struct perf_event_attr *attr,
> > > > pid_t pid, int cpu, int group_fd,
> > > > @@ -137,6 +147,8 @@ void perf_evsel__free_fd(struct perf_evsel *evsel)
> > > > {
> > > > xyarray__delete(evsel->fd);
> > > > evsel->fd = NULL;
> > > > + xyarray__delete(evsel->mmap);
> > > > + evsel->mmap = NULL;
> > > > }
> > > >
> > > > void perf_evsel__close(struct perf_evsel *evsel)
> > > > @@ -156,6 +168,45 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
> > > > perf_evsel__close_fd_cpu(evsel, cpu);
> > > > }
> > > >
> > > > +int perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > > +{
> > > > + int ret, cpu, thread;
> > > > + struct perf_mmap_param mp = {
> > > > + .prot = PROT_READ | PROT_WRITE,
> > > > + .mask = (pages * page_size) - 1,
> > > > + };
> > >
> > > I don't mind using evsel->fd for dimensions below,
> > > but we need to check in here that it's defined,
> > > that perf_evsel__open was called
> >
> > Right, so I'll add this here:
> >
> > if (evsel->fd == NULL)
> > return -EINVAL;

Actually, pretty much none of the API checks this.
perf_evsel__run_ioctl(), perf_evsel__enable(), perf_evsel__disable(),
perf_evsel__read() will all just dereference evsel->fd. So fix all of
these or follow existing behavior?
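
If the choice is to add the check everywhere, one option is a tiny shared
helper, sketched here (the helper is hypothetical, not existing libperf
API):

/* hypothetical helper -- each accessor would call this before
 * touching evsel->fd */
static int perf_evsel__check_open(struct perf_evsel *evsel)
{
	return evsel->fd != NULL ? 0 : -EINVAL;
}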


> > Note that struct evsel has dimensions in it, but they are only set in
> > the evlist code. I couldn't tell if that was by design or mistake.
>
> we do? we have the cpus and threads in perf_evsel no?

Right, perf_evsel has cpus and threads pointers which in turn have the
sizes, but those pointers are not set within the evsel code.

> also you'd need perf_evsel not evsel, right?
>
> >
> > BTW, I just noticed perf_evsel__open is leaking memory on most of its
> > error paths.
>
> nice.. let's fix it ;-)

NM, I missed that they are static...

Rob

2021-04-01 09:08:47

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Wed, Mar 31, 2021 at 12:52:11PM -0500, Rob Herring wrote:
> On Wed, Mar 31, 2021 at 10:38 AM Will Deacon <[email protected]> wrote:
> >
> > On Tue, Mar 30, 2021 at 04:08:11PM -0500, Rob Herring wrote:
> > > On Tue, Mar 30, 2021 at 12:09 PM Rob Herring <[email protected]> wrote:
> > > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > > > The logic here feels like it
> > > > > could do with a bit of untangling.
> > > >
> > > > Yes, I don't love it, but couldn't come up with anything better. It is
> > > > complicated by the fact that flags have to be set before we assign the
> > > > counter and can't set/change them when we assign the counter. It would
> > > > take a lot of refactoring with armpmu code to fix that.
> > >
> > > How's this instead?:
> > >
> > > if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event))
> > > event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> > >
> > > /*
> > > * At this point, the counter is not assigned. If a 64-bit counter is
> > > * requested, we must make sure the h/w has 64-bit counters if we set
> > > * the event size to 64-bit because chaining is not supported with
> > > * userspace access. This may still fail later on if the CPU cycle
> > > * counter is in use.
> > > */
> > > if (armv8pmu_event_is_64bit(event) &&
> > > (!armv8pmu_event_want_user_access(event) ||
> > > armv8pmu_has_long_event(cpu_pmu) || (hw_event_id ==
> > > ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> > > event->hw.flags |= ARMPMU_EVT_64BIT;
> >
> > I thought there were some cases where we could assign cycles event to an
> > event counter; does that not happen anymore?
>
> Yes, but if we hit that scenario when the user has asked for 64-bit
> user access, then we return an error later when assigning the counter.
> I think we can assume if users have gone to the trouble of requesting
> 64-bit counters, then they can deal with ensuring they don't have
> multiple users.
>
> Otherwise, the only way I see to simplify this is we only support
> 64-bit counters in userspace when we have v8.5 PMU.

I'm happy to start from that position, and then we can extend it later if
there's a need.

Will

2021-04-01 19:46:40

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
>
> On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > >
> > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > From: Raphael Gault <[email protected]>
> > > >
> > > > Keep track of event opened with direct access to the hardware counters
> > > > and modify permissions while they are open.
> > > >
> > > > The strategy used here is the same which x86 uses: every time an event
> > > > is mapped, the permissions are set if required. The atomic field added
> > > > in the mm_context helps keep track of the different event opened and
> > > > de-activate the permissions when all are unmapped.
> > > > We also need to update the permissions in the context switch code so
> > > > that tasks keep the right permissions.
> > > >
> > > > In order to enable 64-bit counters for userspace when available, a new
> > > > config1 bit is added for userspace to indicate it wants userspace counter
> > > > access. This bit allows the kernel to decide if chaining should be
> > > > disabled, as chaining and userspace access are incompatible.
> > > > The modes for config1 are as follows:
> > > >
> > > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > > > config1 = 3 : user access enabled and counter size matches underlying counter.
> > >
> > > In this last case, how does userspace know whether it got a 32-bit or a
> > > 64-bit counter?
> >
> > pmc_width in the user page. If the read loop is correct, then it
> > doesn't matter to the user.
>
> Gotcha; please mention that in this comment,
>
> > > > +static void refresh_pmuserenr(void *mm)
> > > > +{
> > > > + if (mm == current->active_mm)
> > > > + perf_switch_user_access(mm);
> > > > +}
> > > > +
> > > > +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > > > +{
> > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > +
> > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > + return;
> > > > +
> > > > + /*
> > > > + * This function relies on not being called concurrently in two
> > > > + * tasks in the same mm. Otherwise one task could observe
> > > > + * pmu_direct_access > 1 and return all the way back to
> > > > + * userspace with user access disabled while another task is still
> > > > + * doing on_each_cpu_mask() to enable user access.
> > > > + *
> > > > + * For now, this can't happen because all callers hold mmap_lock
> > > > + * for write. If this changes, we'll need a different solution.
> > > > + */
> > > > + lockdep_assert_held_write(&mm->mmap_lock);
> > > > +
> > > > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > > +}
> > >
> > > Why do we need to cross-call here? Seems like it would be a tonne simpler to
> > > handle the trap. Is there a reason not to do that?
> >
> > Alignment with x86? You didn't like the trap handler either:
> >
> > > Hmm... this feels pretty fragile since, although we may expect userspace only
> > > to trigger this in the context of the specific perf use-case, we don't have
> > > a way to detect that, so the ABI we're exposing is that EL0 accesses to
> > > non-existent counters will return 0. I don't really think that's something
> > > we want to commit to.
> >
> > Full mail: https://lore.kernel.org/r/20200928182601.GA11974@willie-the-truck
> >
> > The handler would be different, but we'd still have the problem of not
> > being able to detect the use case, right?
>
> Well hang on, the thing you link to was about _emulating_ the access. I'm
> not saying we should do that, but rather just enable EL0 PMU access lazily
> instead of using the IPI. We're going to enable it when we context-switch
> anyway.
>
> The alternative is enabling EL0 access unconditionally and leaving it
> enabled.

That BTW is what x86 originally did. This pmu_direct_access stuff came
about as part of locking down access a bit.

In any case, I've modified this to trap. It avoids the mmap lock
dependency too, so that's nice.

> > > > +
> > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > +{
> > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > +
> > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > + return;
> > > > +
> > > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > >
> > > Given that the pmu_direct_access field is global per-mm, won't this go
> > > wrong if multiple PMUs are opened by the same process but only a subset
> > > are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> > > rather than a counter, so that we can 'and' it with the supported_cpus for
> > > the PMU we're dealing with.
> >
> > It needs to be a count to support multiple events on the same PMU. If
> > the event is not enabled for EL0, then we'd exit out on the
> > ARMPMU_EL0_RD_CNTR check. So I think we're fine.
>
> I'm still not convinced; pmu_direct_access is shared between PMUs, so
> testing the result of atomic_dec_and_test() just doesn't make sense to
> me, as another PMU could be playing with the count.

How is that a problem? Let's make a concrete example:

map PMU1:event2 -> pmu_direct_access = 1 -> enable access
map PMU2:event3 -> pmu_direct_access = 2
map PMU1:event4 -> pmu_direct_access = 3
unmap PMU2:event3 -> pmu_direct_access = 2
unmap PMU1:event2 -> pmu_direct_access = 1
unmap PMU1:event4 -> pmu_direct_access = 0 -> disable access

The only issue I can see is PMU2 remains enabled for user access until
the last unmap. But we're sharing the mm, so who cares? Also, in this
scenario it is the user's problem to pin themselves to cores sharing a
PMU. If the user doesn't do that, they get to keep the pieces.

We have to have something that says 'we have mmapped user accessible
events for the mm context'. I don't think there is any way around that
if we gate access on mmapping events.

Rob

2021-04-07 20:58:10

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

[Moving Mark to To: since I'd like his view on this]

On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
> >
> > On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > >
> > > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > > From: Raphael Gault <[email protected]>
> > > > >
> > > > > Keep track of event opened with direct access to the hardware counters
> > > > > and modify permissions while they are open.
> > > > >
> > > > > The strategy used here is the same which x86 uses: every time an event
> > > > > is mapped, the permissions are set if required. The atomic field added
> > > > > in the mm_context helps keep track of the different event opened and
> > > > > de-activate the permissions when all are unmapped.
> > > > > We also need to update the permissions in the context switch code so
> > > > > that tasks keep the right permissions.
> > > > >
> > > > > In order to enable 64-bit counters for userspace when available, a new
> > > > > config1 bit is added for userspace to indicate it wants userspace counter
> > > > > access. This bit allows the kernel to decide if chaining should be
> > > > > disabled, as chaining and userspace access are incompatible.
> > > > > The modes for config1 are as follows:
> > > > >
> > > > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > > > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > > > > config1 = 3 : user access enabled and counter size matches underlying counter.
> > > >
> > > > In this last case, how does userspace know whether it got a 32-bit or a
> > > > 64-bit counter?
> > >
> > > pmc_width in the user page. If the read loop is correct, then it
> > > doesn't matter to the user.
> >
> > Gotcha; please mention that in this comment,
> >
> > > > > +static void refresh_pmuserenr(void *mm)
> > > > > +{
> > > > > + if (mm == current->active_mm)
> > > > > + perf_switch_user_access(mm);
> > > > > +}
> > > > > +
> > > > > +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > > > > +{
> > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > +
> > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > + return;
> > > > > +
> > > > > + /*
> > > > > + * This function relies on not being called concurrently in two
> > > > > + * tasks in the same mm. Otherwise one task could observe
> > > > > + * pmu_direct_access > 1 and return all the way back to
> > > > > + * userspace with user access disabled while another task is still
> > > > > + * doing on_each_cpu_mask() to enable user access.
> > > > > + *
> > > > > + * For now, this can't happen because all callers hold mmap_lock
> > > > > + * for write. If this changes, we'll need a different solution.
> > > > > + */
> > > > > + lockdep_assert_held_write(&mm->mmap_lock);
> > > > > +
> > > > > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > > > +}
> > > >
> > > > Why do we need to cross-call here? Seems like it would be a tonne simpler to
> > > > handle the trap. Is there a reason not to do that?
> > >
> > > Alignment with x86? You didn't like the trap handler either:
> > >
> > > > Hmm... this feels pretty fragile since, although we may expect userspace only
> > > > to trigger this in the context of the specific perf use-case, we don't have
> > > > a way to detect that, so the ABI we're exposing is that EL0 accesses to
> > > > non-existent counters will return 0. I don't really think that's something
> > > > we want to commit to.
> > >
> > > Full mail: https://lore.kernel.org/r/20200928182601.GA11974@willie-the-truck
> > >
> > > The handler would be different, but we'd still have the problem of not
> > > being able to detect the use case, right?
> >
> > Well hang on, the thing you link to was about _emulating_ the access. I'm
> > not saying we should do that, but rather just enable EL0 PMU access lazily
> > instead of using the IPI. We're going to enable it when we context-switch
> > anyway.
> >
> > The alternative is enabling EL0 access unconditionally and leaving it
> > enabled.
>
> That BTW is what x86 originally did. This pmu_direct_access stuff came
> about as part of locking down access a bit.
>
> In any case, I've modified this to trap. It avoids the mmap lock
> dependency too, so that's nice.
>
> > > > > +
> > > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > > +{
> > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > +
> > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > + return;
> > > > > +
> > > > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > >
> > > > Given that the pmu_direct_access field is global per-mm, won't this go
> > > > wrong if multiple PMUs are opened by the same process but only a subset
> > > > are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> > > > rather than a counter, so that we can 'and' it with the supported_cpus for
> > > > the PMU we're dealing with.
> > >
> > > It needs to be a count to support multiple events on the same PMU. If
> > > the event is not enabled for EL0, then we'd exit out on the
> > > ARMPMU_EL0_RD_CNTR check. So I think we're fine.
> >
> > I'm still not convinced; pmu_direct_access is shared between PMUs, so
> > testing the result of atomic_dec_and_test() just doesn't make sense to
> > me, as another PMU could be playing with the count.
>
> How is that a problem? Let's make a concrete example:
>
> map PMU1:event2 -> pmu_direct_access = 1 -> enable access
> map PMU2:event3 -> pmu_direct_access = 2
> map PMU1:event4 -> pmu_direct_access = 3
> unmap PMU2:event3 -> pmu_direct_access = 2
> unmap PMU1:event2 -> pmu_direct_access = 1
> unmap PMU1:event4 -> pmu_direct_access = 0 -> disable access
>
> The only issue I can see is PMU2 remains enabled for user access until
> the last unmap. But we're sharing the mm, so who cares? Also, in this
> scenario it is the user's problem to pin themselves to cores sharing a
> PMU. If the user doesn't do that, they get to keep the pieces.

I guess I'm just worried about exposing the counters to userspace after
the PMU driver (or perf core?) thinks that they're no longer exposed in
case we leak other events. However, I'm not sure how this is supposed to
work normally: what happens if e.g. a privileged user has a per-cpu counter
for a kernel event while a task has a counter with direct access -- can that
task read the kernel event out of the PMU registers from userspace?

Will

2021-04-08 11:10:05

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> [Moving Mark to To: since I'd like his view on this]
>
> On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > > >
> > > > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > > > From: Raphael Gault <[email protected]>

> > > > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > > > +{
> > > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > > +
> > > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > > + return;
> > > > > > +
> > > > > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > > >
> > > > > Given that the pmu_direct_access field is global per-mm, won't this go
> > > > > wrong if multiple PMUs are opened by the same process but only a subset
> > > > > are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> > > > > rather than a counter, so that we can 'and' it with the supported_cpus for
> > > > > the PMU we're dealing with.
> > > >
> > > > It needs to be a count to support multiple events on the same PMU. If
> > > > the event is not enabled for EL0, then we'd exit out on the
> > > > ARMPMU_EL0_RD_CNTR check. So I think we're fine.
> > >
> > > I'm still not convinced; pmu_direct_access is shared between PMUs, so
> > > testing the result of atomic_dec_and_test() just doesn't make sense to
> > > me, as another PMU could be playing with the count.
> >
> > How is that a problem? Let's make a concrete example:
> >
> > map PMU1:event2 -> pmu_direct_access = 1 -> enable access
> > map PMU2:event3 -> pmu_direct_access = 2
> > map PMU1:event4 -> pmu_direct_access = 3
> > unmap PMU2:event3 -> pmu_direct_access = 2
> > unmap PMU1:event2 -> pmu_direct_access = 1
> > unmap PMU1:event4 -> pmu_direct_access = 0 -> disable access
> >
> > The only issue I can see is PMU2 remains enabled for user access until
> > the last unmap. But we're sharing the mm, so who cares? Also, in this
> > scenario it is the user's problem to pin themselves to cores sharing a
> > PMU. If the user doesn't do that, they get to keep the pieces.
>
> I guess I'm just worried about exposing the counters to userspace after
> the PMU driver (or perf core?) thinks that they're no longer exposed in
> case we leak other events.

IMO that's not practically different from the single-PMU case (i.e.
multi-PMU isn't material, either we have a concern with leaking or we
don't); more on that below.

While it looks odd to place this on the mm, I don't think it's the end
of the world.

> However, I'm not sure how this is supposed to work normally: what
> happens if e.g. a privileged user has a per-cpu counter for a kernel
> event while a task has a counter with direct access -- can that task
> read the kernel event out of the PMU registers from userspace?

Yes -- userspace could go read any counters even though it isn't
supposed to, and could potentially infer information from those. It
won't have access to the config registers or kernel data structures, so
it isn't guaranteed to know what the event is or when it is
context-switched/reprogrammed/etc.

If we believe that's a problem, then it's difficult to do anything
robust other than denying userspace access entirely, since disabling
userspace access while in use would surprise applications, and denying
privileged events would need some global state that we consult at event
creation time (in addition to being an inversion of privilege).

IIRC there was some fuss about this a while back on x86; I'll go dig and
see what I can find, unless Peter has a memory...

Thanks,
Mark.

2021-04-08 18:39:49

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Thu, Apr 8, 2021 at 6:08 AM Mark Rutland <[email protected]> wrote:
>
> On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> > [Moving Mark to To: since I'd like his view on this]
> >
> > On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > > > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > > > > From: Raphael Gault <[email protected]>
>
> > > > > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > > > > +{
> > > > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > > > +
> > > > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > > > + return;
> > > > > > > +
> > > > > > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > > > >
> > > > > > Given that the pmu_direct_access field is global per-mm, won't this go
> > > > > > wrong if multiple PMUs are opened by the same process but only a subset
> > > > > > are exposed to EL0? Perhaps pmu_direct_access should be treated as a mask
> > > > > > rather than a counter, so that we can 'and' it with the supported_cpus for
> > > > > > the PMU we're dealing with.
> > > > >
> > > > > It needs to be a count to support multiple events on the same PMU. If
> > > > > the event is not enabled for EL0, then we'd exit out on the
> > > > > ARMPMU_EL0_RD_CNTR check. So I think we're fine.
> > > >
> > > > I'm still not convinced; pmu_direct_access is shared between PMUs, so
> > > > testing the result of atomic_dec_and_test() just doesn't make sense to
> > > > me, as another PMU could be playing with the count.
> > >
> > > How is that a problem? Let's make a concrete example:
> > >
> > > map PMU1:event2 -> pmu_direct_access = 1 -> enable access
> > > map PMU2:event3 -> pmu_direct_access = 2
> > > map PMU1:event4 -> pmu_direct_access = 3
> > > unmap PMU2:event3 -> pmu_direct_access = 2
> > > unmap PMU1:event2 -> pmu_direct_access = 1
> > > unmap PMU1:event4 -> pmu_direct_access = 0 -> disable access
> > >
> > > The only issue I can see is PMU2 remains enabled for user access until
> > > the last unmap. But we're sharing the mm, so who cares? Also, in this
> > > scenario it is the user's problem to pin themselves to cores sharing a
> > > PMU. If the user doesn't do that, they get to keep the pieces.
> >
> > I guess I'm just worried about exposing the counters to userspace after
> > the PMU driver (or perf core?) thinks that they're no longer exposed in
> > case we leak other events.
>
> IMO that's not practically different from the single-PMU case (i.e.
> multi-PMU isn't material, either we have a concern with leaking or we
> don't); more on that below.
>
> While it looks odd to place this on the mm, I don't think it's the end
> of the world.
>
> > However, I'm not sure how this is supposed to work normally: what
> > happens if e.g. a privileged user has a per-cpu counter for a kernel
> > event while a task has a counter with direct access -- can that task
> > read the kernel event out of the PMU registers from userspace?
>
> Yes -- userspace could go read any counters even though it isn't
> supposed to, and could potentially infer information from those. It
> won't have access to the config registers or kernel data structures, so
> it isn't guaranteed to know what the event is or when it is
> context-switched/reprogrammed/etc.
>
> If we believe that's a problem, then it's difficult to do anything
> robust other than denying userspace access entirely, since disabling
> userspace access while in use would surprise applications, and denying
> privileged events would need some global state that we consult at event
> creation time (in addition to being an inversion of privilege).
>
> IIRC there was some fuss about this a while back on x86; I'll go dig and
> see what I can find, unless Peter has a memory...

Maybe this one[1].

Rob

[1] https://lore.kernel.org/lkml/[email protected]/

2021-04-19 19:59:45

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Thu, Apr 08, 2021 at 01:38:17PM -0500, Rob Herring wrote:
> On Thu, Apr 8, 2021 at 6:08 AM Mark Rutland <[email protected]> wrote:
> > On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> > > On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > > > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
> > > I guess I'm just worried about exposing the counters to userspace after
> > > the PMU driver (or perf core?) thinks that they're no longer exposed in
> > > case we leak other events.
> >
> > IMO that's not practically different from the single-PMU case (i.e.
> > multi-PMU isn't material, either we have a concern with leaking or we
> > don't); more on that below.

Well, maybe. It looks like the single-PMU case is exposed to the same issue,
but I think a solution needs to take into account the multi-PMU situation.

> > While it looks odd to place this on the mm, I don't think it's the end
> > of the world.
> >
> > > However, I'm not sure how this is supposed to work normally: what
> > > happens if e.g. a privileged user has a per-cpu counter for a kernel
> > > event while a task has a counter with direct access -- can that task
> > > read the kernel event out of the PMU registers from userspace?
> >
> > Yes -- userspace could go read any counters even though it isn't
> > supposed to, and could potentially infer information from those. It
> > won't have access to the config registers or kernel data structures, so
> > it isn't guaranteed to know what the event is or when it is
> > context-switched/reprogrammed/etc.
> >
> > If we believe that's a problem, then it's difficult to do anything
> > robust other than denying userspace access entirely, since disabling
> > userspace access while in use would surprise applications, and denying
> > privileged events would need some global state that we consult at event
> > creation time (in addition to being an inversion of privilege).
> >
> > IIRC there was some fuss about this a while back on x86; I'll go dig and
> > see what I can find, unless Peter has a memory...
>
> Maybe this one[1].
>
> Rob
>
> [1] https://lore.kernel.org/lkml/[email protected]/

Going through the archives and talking to Peter, it looks like this is still
an active area of concern:

- There are patches to clear "dirty" counters on context-switch. They were
queued for 5.13 but broke -tip on Friday:

https://lore.kernel.org/lkml/YHm%[email protected]/

- Per-cpu events cannot be protected in software:

https://lore.kernel.org/lkml/CALCETrVVPzUd_hQ8xoomHn_wWRQJUvROeCt2do4_D4ROZoAVMg@mail.gmail.com/

so without hardware support, we need a way to disable user access for
people who care about this leakage.

x86 has an "rdpmc" file exposed for the PMU device in sysfs which allows
access to be disabled. I don't think these patches add such a thing, and
that's where the fun with multi-PMU machines would come into play.
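
For reference, the x86 control is the per-PMU "rdpmc" attribute in sysfs,
so the sketch below is enough to flip it off. The path and the 0/1/2
semantics here are the x86 ones; an arm64 equivalent would presumably be
one such file per PMU, which is where the multi-PMU fun comes from.

#include <stdio.h>

/*
 * Sketch: disable userspace counter access for the x86 "cpu" PMU.
 * 0 = disabled, 1 = enabled for tasks with an mmapped event (default),
 * 2 = always enabled.
 */
int main(void)
{
	FILE *f = fopen("/sys/bus/event_source/devices/cpu/rdpmc", "w");

	if (!f) {
		perror("rdpmc");
		return 1;
	}
	fputs("0\n", f);

	return fclose(f) ? 1 : 0;
}
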

Will

2021-04-19 22:16:11

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

On Mon, Apr 19, 2021 at 11:14 AM Will Deacon <[email protected]> wrote:
>
> On Thu, Apr 08, 2021 at 01:38:17PM -0500, Rob Herring wrote:
> > On Thu, Apr 8, 2021 at 6:08 AM Mark Rutland <[email protected]> wrote:
> > > On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> > > > On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > > > > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon <[email protected]> wrote:
> > > > I guess I'm just worried about exposing the counters to userspace after
> > > > the PMU driver (or perf core?) thinks that they're no longer exposed in
> > > > case we leak other events.
> > >
> > > IMO that's not practically different from the single-PMU case (i.e.
> > > multi-PMU isn't material, either we have a concern with leaking or we
> > > don't); more on that below.
>
> Well, maybe. It looks like the single-PMU case is exposed to the same issue,
> but I think a solution needs to take into account the multi-PMU situation.
>
> > > While it looks odd to place this on the mm, I don't think it's the end
> > > of the world.
> > >
> > > > However, I'm not sure how this is supposed to work normally: what
> > > > happens if e.g. a privileged user has a per-cpu counter for a kernel
> > > > event while a task has a counter with direct access -- can that task
> > > > read the kernel event out of the PMU registers from userspace?
> > >
> > > Yes -- userspace could go read any counters even though it isn't
> > > supposed to, and could potentially infer information from those. It
> > > won't have access to the config registers or kernel data structures, so
> > > it isn't guaranteed to know what the event is or when it is
> > > context-switched/reprogrammed/etc.
> > >
> > > If we believe that's a problem, then it's difficult to do anything
> > > robust other than denying userspace access entirely, since disabling
> > > userspace access while in use would surprise applications, and denying
> > > privileged events would need some global state that we consult at event
> > > creation time (in addition to being an inversion of privilege).
> > >
> > > IIRC there was some fuss about this a while back on x86; I'll go dig and
> > > see what I can find, unless Peter has a memory...
> >
> > Maybe this one[1].
> >
> > Rob
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
>
> Going through the archives and talking to Peter, it looks like this is still
> an active area of concern:
>
> - There are patches to clear "dirty" counters on context-switch. They were
> queued for 5.13 but broke -tip on Friday:
>
> https://lore.kernel.org/lkml/YHm%[email protected]/

Yes, nice timing. I've reworked the arm64 support to do the same
thing (minus the breakage). It also looks like we can simplify things
a bit by moving all the context switch handling into .sched_task() and
out of switch_mm, unless there's some case I'm not aware of where that
wouldn't work (entirely likely).
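
Roughly what I have in mind, as a sketch only (the callback body and the
exact PMUSERENR bits are illustrative, not the final patch):

static void armv8pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
{
	if (!sched_in)
		return;

	/*
	 * On switch-in, grant or revoke EL0 counter access based on the
	 * mm's pmu_direct_access count, instead of doing it from
	 * switch_mm.
	 */
	if (atomic_read(&current->active_mm->context.pmu_direct_access))
		write_sysreg(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR,
			     pmuserenr_el0);
	else
		write_sysreg(0, pmuserenr_el0);
}

with the arm_pmu's pmu.sched_task pointing at it.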

> - Per-cpu events cannot be protected in software:
>
> https://lore.kernel.org/lkml/CALCETrVVPzUd_hQ8xoomHn_wWRQJUvROeCt2do4_D4ROZoAVMg@mail.gmail.com/
>
> so without hardware support, we need a way to disable user access for
> people who care about this leakage.
>
> x86 has an "rdpmc" file exposed for the PMU device in sysfs which allows
> access to be disabled. I don't think these patches add such a thing, and
> that's where the fun with multi-PMU machines would come into play.

Is the fun because sysfs would end up with multiple 'rdpmc' files, or
something else?

Rob

2021-05-04 23:37:46

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH v6 06/10] libperf: Add support for user space counter access

On Wed, Mar 10, 2021 at 4:08 PM Rob Herring <[email protected]> wrote:
>
> x86 and arm64 can both support direct access of event counters in
> userspace. The access sequence is less than trivial and currently exists
> in perf test code (tools/perf/arch/x86/tests/rdpmc.c) with copies in
> projects such as PAPI and libpfm4.
>
> In order to support userspace access, an event must be mmapped first
> with perf_evsel__mmap(). Then subsequent calls to perf_evsel__read()
> will use the fast path (assuming the arch supports it).
>
> Signed-off-by: Rob Herring <[email protected]>
> ---
> v6:
> - Adapt to mmap changes adding MMAP NULL check
> v5:
> - Make raw count s64 instead of u64 so that counter width shifting
> works
> - Adapt to mmap changes
> v4:
> - Update perf_evsel__mmap size to pages
> v3:
> - Split out perf_evsel__mmap() to separate patch
> ---
> tools/lib/perf/evsel.c | 4 ++
> tools/lib/perf/include/internal/mmap.h | 3 +
> tools/lib/perf/mmap.c | 88 ++++++++++++++++++++++++++
> tools/lib/perf/tests/test-evsel.c | 65 +++++++++++++++++++
> 4 files changed, 160 insertions(+)
>
> diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
> index 1057e9b15528..4d67343d36c9 100644
> --- a/tools/lib/perf/evsel.c
> +++ b/tools/lib/perf/evsel.c
> @@ -242,6 +242,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> if (FD(evsel, cpu, thread) < 0)
> return -EINVAL;
>
> + if (MMAP(evsel, cpu, thread) &&
> + !perf_mmap__read_self(MMAP(evsel, cpu, thread), count))
> + return 0;
> +
> if (readn(FD(evsel, cpu, thread), count->values, size) <= 0)
> return -errno;
>
> diff --git a/tools/lib/perf/include/internal/mmap.h b/tools/lib/perf/include/internal/mmap.h
> index be7556e0a2b2..5e3422f40ed5 100644
> --- a/tools/lib/perf/include/internal/mmap.h
> +++ b/tools/lib/perf/include/internal/mmap.h
> @@ -11,6 +11,7 @@
> #define PERF_SAMPLE_MAX_SIZE (1 << 16)
>
> struct perf_mmap;
> +struct perf_counts_values;
>
> typedef void (*libperf_unmap_cb_t)(struct perf_mmap *map);
>
> @@ -52,4 +53,6 @@ void perf_mmap__put(struct perf_mmap *map);
>
> u64 perf_mmap__read_head(struct perf_mmap *map);
>
> +int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count);
> +
> #endif /* __LIBPERF_INTERNAL_MMAP_H */
> diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> index 79d5ed6c38cc..915469f00cf4 100644
> --- a/tools/lib/perf/mmap.c
> +++ b/tools/lib/perf/mmap.c
> @@ -8,9 +8,11 @@
> #include <linux/perf_event.h>
> #include <perf/mmap.h>
> #include <perf/event.h>
> +#include <perf/evsel.h>
> #include <internal/mmap.h>
> #include <internal/lib.h>
> #include <linux/kernel.h>
> +#include <linux/math64.h>
> #include "internal.h"
>
> void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
> @@ -273,3 +275,89 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map)
>
> return event;
> }
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +static u64 read_perf_counter(unsigned int counter)
> +{
> + unsigned int low, high;
> +
> + asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
> +
> + return low | ((u64)high) << 32;
> +}
> +
> +static u64 read_timestamp(void)
> +{
> + unsigned int low, high;
> +
> + asm volatile("rdtsc" : "=a" (low), "=d" (high));
> +
> + return low | ((u64)high) << 32;
> +}
> +#else
> +static u64 read_perf_counter(unsigned int counter) { return 0; }
> +static u64 read_timestamp(void) { return 0; }
> +#endif
> +
> +int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count)
> +{
> + struct perf_event_mmap_page *pc = map->base;
> + u32 seq, idx, time_mult = 0, time_shift = 0;
> + u64 cnt, cyc = 0, time_offset = 0, time_cycles = 0, time_mask = ~0ULL;
> +
> + if (!pc || !pc->cap_user_rdpmc)
> + return -1;
> +
> + do {
> + seq = READ_ONCE(pc->lock);
> + barrier();
> +
> + count->ena = READ_ONCE(pc->time_enabled);
> + count->run = READ_ONCE(pc->time_running);
> +
> + if (pc->cap_user_time && count->ena != count->run) {
> + cyc = read_timestamp();
> + time_mult = READ_ONCE(pc->time_mult);
> + time_shift = READ_ONCE(pc->time_shift);
> + time_offset = READ_ONCE(pc->time_offset);
> +
> + if (pc->cap_user_time_short) {
> + time_cycles = READ_ONCE(pc->time_cycles);
> + time_mask = READ_ONCE(pc->time_mask);
> + }

Nit: this is now out of sync with the example code in the comment in perf_event.h.

> + }
> +
> + idx = READ_ONCE(pc->index);
> + cnt = READ_ONCE(pc->offset);
> + if (pc->cap_user_rdpmc && idx) {
> + s64 evcnt = read_perf_counter(idx - 1);
> + u16 width = READ_ONCE(pc->pmc_width);
> +
> + evcnt <<= 64 - width;
> + evcnt >>= 64 - width;
> + cnt += evcnt;
> + } else
> + return -1;
> +
> + barrier();
> + } while (READ_ONCE(pc->lock) != seq);
> +
> + if (count->ena != count->run) {
> + u64 delta;
> +
> + /* Adjust for cap_usr_time_short, a nop if not */
> + cyc = time_cycles + ((cyc - time_cycles) & time_mask);
> +
> + delta = time_offset + mul_u64_u32_shr(cyc, time_mult, time_shift);
> +
> + count->ena += delta;
> + if (idx)
> + count->run += delta;
> +
> + cnt = mul_u64_u64_div64(cnt, count->ena, count->run);

Does this still suffer the divide by zero if multiplexing hasn't run
the counter? If so, we still need to add something like:
https://lore.kernel.org/lkml/CAP-5=fVRdqvswtyQMg5cB+ntTGda+SAYskjTQednEH-AeZo13g@mail.gmail.com/

> + }
> +
> + count->val = cnt;
> +
> + return 0;
> +}
> diff --git a/tools/lib/perf/tests/test-evsel.c b/tools/lib/perf/tests/test-evsel.c
> index 0ad82d7a2a51..54fb4809b9ee 100644
> --- a/tools/lib/perf/tests/test-evsel.c
> +++ b/tools/lib/perf/tests/test-evsel.c
> @@ -120,6 +120,69 @@ static int test_stat_thread_enable(void)
> return 0;
> }
>
> +static int test_stat_user_read(int event)
> +{
> + struct perf_counts_values counts = { .val = 0 };
> + struct perf_thread_map *threads;
> + struct perf_evsel *evsel;
> + struct perf_event_mmap_page *pc;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_HARDWARE,
> + .config = event,
> + };

A nit: previously test-evsel was able to run and pass on a hypervisor.
Now that there is a reliance on hardware events, the evsel open fails on
a hypervisor. It'd be nice if we could detect running on a hypervisor
and test software events in that case.

Thanks,
Ian

> + int err, i;
> +
> + threads = perf_thread_map__new_dummy();
> + __T("failed to create threads", threads);
> +
> + perf_thread_map__set_pid(threads, 0, 0);
> +
> + evsel = perf_evsel__new(&attr);
> + __T("failed to create evsel", evsel);
> +
> + err = perf_evsel__open(evsel, NULL, threads);
> + __T("failed to open evsel", err == 0);
> +
> + err = perf_evsel__mmap(evsel, 0);
> + __T("failed to mmap evsel", err == 0);
> +
> + pc = perf_evsel__mmap_base(evsel, 0, 0);
> +
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> + __T("userspace counter access not supported", pc->cap_user_rdpmc);
> + __T("userspace counter access not enabled", pc->index);
> + __T("userspace counter width not set", pc->pmc_width >= 32);
> +#endif
> +
> + perf_evsel__read(evsel, 0, 0, &counts);
> + __T("failed to read value for evsel", counts.val != 0);
> +
> + for (i = 0; i < 5; i++) {
> + volatile int count = 0x10000 << i;
> + __u64 start, end, last = 0;
> +
> + __T_VERBOSE("\tloop = %u, ", count);
> +
> + perf_evsel__read(evsel, 0, 0, &counts);
> + start = counts.val;
> +
> + while (count--) ;
> +
> + perf_evsel__read(evsel, 0, 0, &counts);
> + end = counts.val;
> +
> + __T("invalid counter data", (end - start) > last);
> + last = end - start;
> + __T_VERBOSE("count = %llu\n", end - start);
> + }
> +
> + perf_evsel__close(evsel);
> + perf_evsel__delete(evsel);
> +
> + perf_thread_map__put(threads);
> + return 0;
> +}
> +
> int main(int argc, char **argv)
> {
> __T_START;
> @@ -129,6 +192,8 @@ int main(int argc, char **argv)
> test_stat_cpu();
> test_stat_thread();
> test_stat_thread_enable();
> + test_stat_user_read(PERF_COUNT_HW_INSTRUCTIONS);
> + test_stat_user_read(PERF_COUNT_HW_CPU_CYCLES);
>
> __T_END;
> return tests_failed == 0 ? 0 : -1;
> --
> 2.27.0
>

2021-05-05 02:52:01

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH v6 06/10] libperf: Add support for user space counter access

On Tue, May 4, 2021 at 4:41 PM Ian Rogers <[email protected]> wrote:
>
> On Wed, Mar 10, 2021 at 4:08 PM Rob Herring <[email protected]> wrote:
> >
> > x86 and arm64 can both support direct access of event counters in
> > userspace. The access sequence is less than trivial and currently exists
> > in perf test code (tools/perf/arch/x86/tests/rdpmc.c) with copies in
> > projects such as PAPI and libpfm4.
> >
> > In order to support userspace access, an event must be mmapped first
> > with perf_evsel__mmap(). Then subsequent calls to perf_evsel__read()
> > will use the fast path (assuming the arch supports it).
> >
> > Signed-off-by: Rob Herring <[email protected]>
> > ---
> > v6:
> > - Adapt to mmap changes adding MMAP NULL check
> > v5:
> > - Make raw count s64 instead of u64 so that counter width shifting
> > works
> > - Adapt to mmap changes
> > v4:
> > - Update perf_evsel__mmap size to pages
> > v3:
> > - Split out perf_evsel__mmap() to separate patch
> > ---
> > tools/lib/perf/evsel.c | 4 ++
> > tools/lib/perf/include/internal/mmap.h | 3 +
> > tools/lib/perf/mmap.c | 88 ++++++++++++++++++++++++++
> > tools/lib/perf/tests/test-evsel.c | 65 +++++++++++++++++++
> > 4 files changed, 160 insertions(+)
> >
> > diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
> > index 1057e9b15528..4d67343d36c9 100644
> > --- a/tools/lib/perf/evsel.c
> > +++ b/tools/lib/perf/evsel.c
> > @@ -242,6 +242,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> > if (FD(evsel, cpu, thread) < 0)
> > return -EINVAL;
> >
> > + if (MMAP(evsel, cpu, thread) &&
> > + !perf_mmap__read_self(MMAP(evsel, cpu, thread), count))
> > + return 0;
> > +
> > if (readn(FD(evsel, cpu, thread), count->values, size) <= 0)
> > return -errno;
> >
> > diff --git a/tools/lib/perf/include/internal/mmap.h b/tools/lib/perf/include/internal/mmap.h
> > index be7556e0a2b2..5e3422f40ed5 100644
> > --- a/tools/lib/perf/include/internal/mmap.h
> > +++ b/tools/lib/perf/include/internal/mmap.h
> > @@ -11,6 +11,7 @@
> > #define PERF_SAMPLE_MAX_SIZE (1 << 16)
> >
> > struct perf_mmap;
> > +struct perf_counts_values;
> >
> > typedef void (*libperf_unmap_cb_t)(struct perf_mmap *map);
> >
> > @@ -52,4 +53,6 @@ void perf_mmap__put(struct perf_mmap *map);
> >
> > u64 perf_mmap__read_head(struct perf_mmap *map);
> >
> > +int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count);
> > +
> > #endif /* __LIBPERF_INTERNAL_MMAP_H */
> > diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> > index 79d5ed6c38cc..915469f00cf4 100644
> > --- a/tools/lib/perf/mmap.c
> > +++ b/tools/lib/perf/mmap.c
> > @@ -8,9 +8,11 @@
> > #include <linux/perf_event.h>
> > #include <perf/mmap.h>
> > #include <perf/event.h>
> > +#include <perf/evsel.h>
> > #include <internal/mmap.h>
> > #include <internal/lib.h>
> > #include <linux/kernel.h>
> > +#include <linux/math64.h>
> > #include "internal.h"
> >
> > void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
> > @@ -273,3 +275,89 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map)
> >
> > return event;
> > }
> > +
> > +#if defined(__i386__) || defined(__x86_64__)
> > +static u64 read_perf_counter(unsigned int counter)
> > +{
> > + unsigned int low, high;
> > +
> > + asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
> > +
> > + return low | ((u64)high) << 32;
> > +}
> > +
> > +static u64 read_timestamp(void)
> > +{
> > + unsigned int low, high;
> > +
> > + asm volatile("rdtsc" : "=a" (low), "=d" (high));
> > +
> > + return low | ((u64)high) << 32;
> > +}
> > +#else
> > +static u64 read_perf_counter(unsigned int counter) { return 0; }
> > +static u64 read_timestamp(void) { return 0; }
> > +#endif
> > +
> > +int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count)
> > +{
> > + struct perf_event_mmap_page *pc = map->base;
> > + u32 seq, idx, time_mult = 0, time_shift = 0;
> > + u64 cnt, cyc = 0, time_offset = 0, time_cycles = 0, time_mask = ~0ULL;
> > +
> > + if (!pc || !pc->cap_user_rdpmc)
> > + return -1;
> > +
> > + do {
> > + seq = READ_ONCE(pc->lock);
> > + barrier();
> > +
> > + count->ena = READ_ONCE(pc->time_enabled);
> > + count->run = READ_ONCE(pc->time_running);
> > +
> > + if (pc->cap_user_time && count->ena != count->run) {
> > + cyc = read_timestamp();
> > + time_mult = READ_ONCE(pc->time_mult);
> > + time_shift = READ_ONCE(pc->time_shift);
> > + time_offset = READ_ONCE(pc->time_offset);
> > +
> > + if (pc->cap_user_time_short) {
> > + time_cycles = READ_ONCE(pc->time_cycles);
> > + time_mask = READ_ONCE(pc->time_mask);
> > + }
>
> Nit: this is now out of sync with the example code in the comment in perf_event.h.

IMO, we should just delete that version. One less slightly wrong version...

> > + }
> > +
> > + idx = READ_ONCE(pc->index);
> > + cnt = READ_ONCE(pc->offset);
> > + if (pc->cap_user_rdpmc && idx) {
> > + s64 evcnt = read_perf_counter(idx - 1);
> > + u16 width = READ_ONCE(pc->pmc_width);
> > +
> > + evcnt <<= 64 - width;
> > + evcnt >>= 64 - width;
> > + cnt += evcnt;
> > + } else
> > + return -1;
> > +
> > + barrier();
> > + } while (READ_ONCE(pc->lock) != seq);
> > +
> > + if (count->ena != count->run) {
> > + u64 delta;
> > +
> > + /* Adjust for cap_usr_time_short, a nop if not */
> > + cyc = time_cycles + ((cyc - time_cycles) & time_mask);
> > +
> > + delta = time_offset + mul_u64_u32_shr(cyc, time_mult, time_shift);
> > +
> > + count->ena += delta;
> > + if (idx)
> > + count->run += delta;
> > +
> > + cnt = mul_u64_u64_div64(cnt, count->ena, count->run);
>
> Does this still suffer the divide by zero if multiplexing hasn't run
> the counter? If so, we still need to add something like:
> https://lore.kernel.org/lkml/CAP-5=fVRdqvswtyQMg5cB+ntTGda+SAYskjTQednEH-AeZo13g@mail.gmail.com/

I don't think so, because if we don't have a valid counter index, we
exit before reaching this 'if'.
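
If we did want a belt-and-braces check anyway, it would just be something
like the below ahead of the scaling (a sketch only, not part of the patch;
the fallback behaviour is my assumption):

	/* If the counter never ran, don't divide by zero; fall back
	 * to the read() syscall path instead. */
	if (count->run == 0)
		return -1;

	cnt = mul_u64_u64_div64(cnt, count->ena, count->run);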

>
> > + }
> > +
> > + count->val = cnt;
> > +
> > + return 0;
> > +}
> > diff --git a/tools/lib/perf/tests/test-evsel.c b/tools/lib/perf/tests/test-evsel.c
> > index 0ad82d7a2a51..54fb4809b9ee 100644
> > --- a/tools/lib/perf/tests/test-evsel.c
> > +++ b/tools/lib/perf/tests/test-evsel.c
> > @@ -120,6 +120,69 @@ static int test_stat_thread_enable(void)
> > return 0;
> > }
> >
> > +static int test_stat_user_read(int event)
> > +{
> > + struct perf_counts_values counts = { .val = 0 };
> > + struct perf_thread_map *threads;
> > + struct perf_evsel *evsel;
> > + struct perf_event_mmap_page *pc;
> > + struct perf_event_attr attr = {
> > + .type = PERF_TYPE_HARDWARE,
> > + .config = event,
> > + };
>
> A nit: previously test-evsel was able to run and pass on a hypervisor.
> Now that there is a reliance on hardware events, the evsel open fails on
> a hypervisor. It'd be nice if we could detect running on a hypervisor
> and test software events in that case.

I suppose we can just exit if open fails on this test.
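
Something along these lines in test_stat_user_read(), as a sketch (the
errno values are a guess at what a missing hardware PMU reports, e.g.
under a hypervisor without a vPMU):

	err = perf_evsel__open(evsel, NULL, threads);
	if (err == -ENOENT || err == -ENODEV) {
		/* No hardware PMU available: skip rather than fail. */
		__T_VERBOSE("skipping: hardware events not supported\n");
		perf_evsel__delete(evsel);
		perf_thread_map__put(threads);
		return 0;
	}
	__T("failed to open evsel", err == 0);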

Rob