2023-05-12 08:56:35

by Alexandre Ghiti

Subject: [PATCH v2 00/10] riscv: Allow userspace to directly access perf counters

riscv used to allow direct access to the cycle/time/instret counters,
bypassing the perf framework. This patchset intends to allow the user to
mmap any counter when accessed through perf. But we can't break the
existing behaviour, so we introduce a sysctl, perf_user_access, like
arm64 does, which defaults to the legacy mode described above.

*Note* that there are still ongoing discussions with distro people about
which mode should be the default.
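
For reference, a rough sketch of the direct read sequence this series
enables from userspace, following the seqlock protocol documented in
include/uapi/linux/perf_event.h (read_perf_counter() stands for the
riscv CSR read helper that patch 9 adds to libperf; error handling and
pmc_width sign-extension omitted):

#include <linux/perf_event.h>
#include <linux/types.h>

#define barrier() asm volatile("" ::: "memory")

static __u64 mmap_read_self(struct perf_event_mmap_page *pc)
{
	__u32 seq, idx;
	__u64 count;

	do {
		seq = pc->lock;
		barrier();

		idx = pc->index;
		count = pc->offset;
		/* index == 0 means userspace access is not allowed */
		if (pc->cap_user_rdpmc && idx)
			count += read_perf_counter(idx - 1);

		barrier();
	} while (pc->lock != seq);

	return count;
}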

base-commit-tag: v6.3

Changes in v2:
- Split into smaller patches, way better!
- Add RB from Conor
- Simplify the riscv architecture check
- Fix race between mmap and threads running on other cpus
- Use hwc when available
- Set all userspace access flags in event_init, as it is too cumbersome
  to handle sysctl changes afterwards
- Fix arch_perf_update_userpage for pmus other than riscv-pmu by
  renaming the pmu driver
- Fix kernel test robot build error
- Fix documentation (Andrew and Bagas)
- perf testsuite passes the mmap tests in all 3 modes

Alexandre Ghiti (10):
perf: Fix wrong comment about default event_idx
include: riscv: Fix wrong include guard in riscv_pmu.h
riscv: Make legacy counter enum match the HW numbering
drivers: perf: Rename riscv pmu driver
riscv: Prepare for user-space perf event mmap support
drivers: perf: Implement perf event mmap support in the legacy backend
drivers: perf: Implement perf event mmap support in the SBI backend
Documentation: admin-guide: Add riscv sysctl_perf_user_access
tools: lib: perf: Implement riscv mmap support
perf: tests: Adapt mmap-basic.c for riscv

Documentation/admin-guide/sysctl/kernel.rst | 24 ++-
arch/riscv/kernel/Makefile | 2 +-
arch/riscv/kernel/perf_event.c | 74 ++++++++
drivers/perf/riscv_pmu.c | 41 ++++
drivers/perf/riscv_pmu_legacy.c | 37 +++-
drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
include/linux/perf/riscv_pmu.h | 10 +-
include/linux/perf_event.h | 3 +-
tools/lib/perf/mmap.c | 65 +++++++
tools/perf/tests/mmap-basic.c | 4 +-
10 files changed, 435 insertions(+), 21 deletions(-)
create mode 100644 arch/riscv/kernel/perf_event.c

--
2.37.2



2023-05-12 08:58:37

by Alexandre Ghiti

Subject: [PATCH v2 02/10] include: riscv: Fix wrong include guard in riscv_pmu.h

The current include guard prevents the inclusion of asm/perf_event.h
which uses the same include guard: fix the one in riscv_pmu.h so that it
matches the file name.

Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
---
include/linux/perf/riscv_pmu.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 43fc892aa7d9..9f70d94942e0 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -6,8 +6,8 @@
*
*/

-#ifndef _ASM_RISCV_PERF_EVENT_H
-#define _ASM_RISCV_PERF_EVENT_H
+#ifndef _RISCV_PMU_H
+#define _RISCV_PMU_H

#include <linux/perf_event.h>
#include <linux/ptrace.h>
@@ -81,4 +81,4 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr);

#endif /* CONFIG_RISCV_PMU */

-#endif /* _ASM_RISCV_PERF_EVENT_H */
+#endif /* _RISCV_PMU_H */
--
2.37.2


2023-05-12 09:02:33

by Alexandre Ghiti

Subject: [PATCH v2 04/10] drivers: perf: Rename riscv pmu driver

In addition to being prettier, the new names will be useful in upcoming
commits to distinguish these pmu drivers from the other pmu drivers.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
drivers/perf/riscv_pmu_legacy.c | 2 +-
drivers/perf/riscv_pmu_sbi.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index 0d8c9d8849ee..ffe09d857366 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -95,7 +95,7 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
pmu->ctr_clear_idx = NULL;
pmu->ctr_read = pmu_legacy_read_ctr;

- perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
+ perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
}

static int pmu_legacy_device_probe(struct platform_device *pdev)
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 70cb50fd41c2..3b0ee2148054 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -897,7 +897,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
if (ret)
goto out_unregister;

- ret = perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
+ ret = perf_pmu_register(&pmu->pmu, RISCV_PMU_PDEV_NAME, PERF_TYPE_RAW);
if (ret)
goto out_unregister;

--
2.37.2


2023-05-12 09:02:48

by Alexandre Ghiti

Subject: [PATCH v2 03/10] riscv: Make legacy counter enum match the HW numbering

RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
hardware counter from CSR_CYCLE is actually 2: make this offset match the
real hw offset so that we can directly expose those values to userspace.
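
For reference, the user counter CSRs defined by the privileged spec are
contiguous, which is the numbering the enum now mirrors:

#define CSR_CYCLE	0xc00	/* CSR_CYCLE + 0 */
#define CSR_TIME	0xc01	/* CSR_CYCLE + 1 */
#define CSR_INSTRET	0xc02	/* CSR_CYCLE + 2, hence INSTRET == 2 */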

Signed-off-by: Alexandre Ghiti <[email protected]>
---
drivers/perf/riscv_pmu_legacy.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index ca9e20bfc7ac..0d8c9d8849ee 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -12,8 +12,11 @@
#include <linux/perf/riscv_pmu.h>
#include <linux/platform_device.h>

-#define RISCV_PMU_LEGACY_CYCLE 0
-#define RISCV_PMU_LEGACY_INSTRET 1
+enum {
+ RISCV_PMU_LEGACY_CYCLE,
+ RISCV_PMU_LEGACY_TIME,
+ RISCV_PMU_LEGACY_INSTRET
+};

static bool pmu_init_done;

--
2.37.2


2023-05-12 09:03:44

by Alexandre Ghiti

Subject: [PATCH v2 01/10] perf: Fix wrong comment about default event_idx

event_idx default implementation returns 0, not idx + 1.
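
For reference, the default implementation in kernel/events/core.c is
simply:

static int perf_event_idx_default(struct perf_event *event)
{
	return 0;
}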

Signed-off-by: Alexandre Ghiti <[email protected]>
---
include/linux/perf_event.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..56fe43b20966 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -442,7 +442,8 @@ struct pmu {

/*
* Will return the value for perf_event_mmap_page::index for this event,
- * if no implementation is provided it will default to: event->hw.idx + 1.
+ * if no implementation is provided it will default to 0 (see
+ * perf_event_idx_default).
*/
int (*event_idx) (struct perf_event *event); /*optional */

--
2.37.2


2023-05-12 09:04:05

by Alexandre Ghiti

Subject: [PATCH v2 05/10] riscv: Prepare for user-space perf event mmap support

Provide all the necessary bits in the generic riscv pmu driver to be
able to mmap perf events in userspace: the heavy lifting lies in the
driver backends, namely the legacy and SBI implementations.

Note that arch_perf_update_userpage is almost a copy of arm64 code.
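
The time_* fields written here let userspace convert a raw counter value
to sched_clock time, roughly as described in the perf_event_mmap_page
documentation in include/uapi/linux/perf_event.h:

	__u64 cyc = read_timestamp();	/* raw time CSR value, see patch 9 */
	__u64 quot, rem, timestamp;

	if (pc->cap_user_time_short)
		cyc = pc->time_cycles +
			((cyc - pc->time_cycles) & pc->time_mask);
	quot = cyc >> pc->time_shift;
	rem  = cyc & (((__u64)1 << pc->time_shift) - 1);
	timestamp = pc->time_zero + quot * pc->time_mult +
		    ((rem * pc->time_mult) >> pc->time_shift);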

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/kernel/Makefile | 2 +-
arch/riscv/kernel/perf_event.c | 58 ++++++++++++++++++++++++++++++++++
drivers/perf/riscv_pmu.c | 41 ++++++++++++++++++++++++
include/linux/perf/riscv_pmu.h | 4 +++
4 files changed, 104 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/kernel/perf_event.c

diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 4cf303a779ab..0d215fd9860d 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += mcount-dyn.o

obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o

-obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o
+obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o perf_event.o
obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
obj-$(CONFIG_RISCV_SBI) += sbi.o
ifeq ($(CONFIG_RISCV_SBI), y)
diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
new file mode 100644
index 000000000000..94174a0fc251
--- /dev/null
+++ b/arch/riscv/kernel/perf_event.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/sched_clock.h>
+
+void arch_perf_update_userpage(struct perf_event *event,
+ struct perf_event_mmap_page *userpg, u64 now)
+{
+ struct clock_read_data *rd;
+ unsigned int seq;
+ u64 ns;
+
+ userpg->cap_user_time = 0;
+ userpg->cap_user_time_zero = 0;
+ userpg->cap_user_time_short = 0;
+ userpg->cap_user_rdpmc =
+ !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
+
+ userpg->pmc_width = 64;
+
+ do {
+ rd = sched_clock_read_begin(&seq);
+
+ userpg->time_mult = rd->mult;
+ userpg->time_shift = rd->shift;
+ userpg->time_zero = rd->epoch_ns;
+ userpg->time_cycles = rd->epoch_cyc;
+ userpg->time_mask = rd->sched_clock_mask;
+
+ /*
+ * Subtract the cycle base, such that software that
+ * doesn't know about cap_user_time_short still 'works'
+ * assuming no wraps.
+ */
+ ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
+ userpg->time_zero -= ns;
+
+ } while (sched_clock_read_retry(seq));
+
+ userpg->time_offset = userpg->time_zero - now;
+
+ /*
+ * time_shift is not expected to be greater than 31 due to
+ * the original published conversion algorithm shifting a
+ * 32-bit value (now specifies a 64-bit value) - refer
+ * perf_event_mmap_page documentation in perf_event.h.
+ */
+ if (userpg->time_shift == 32) {
+ userpg->time_shift = 31;
+ userpg->time_mult >>= 1;
+ }
+
+ /*
+ * Internal timekeeping for enabled/running/stopped times
+ * is always computed with the sched_clock.
+ */
+ userpg->cap_user_time = 1;
+ userpg->cap_user_time_zero = 1;
+ userpg->cap_user_time_short = 1;
+}
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index ebca5eab9c9b..af69da268246 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)

local64_set(&hwc->prev_count, (u64)-left);

+ perf_event_update_userpage(event);
+
return overflow;
}

@@ -267,6 +269,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
hwc->idx = -1;
hwc->event_base = mapped_event;

+ if (rvpmu->event_init)
+ rvpmu->event_init(event);
+
if (!is_sampling_event(event)) {
/*
* For non-sampling runs, limit the sample_period to half
@@ -283,6 +288,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
return 0;
}

+static int riscv_pmu_event_idx(struct perf_event *event)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ return 0;
+
+ if (rvpmu->csr_index)
+ return rvpmu->csr_index(event) + 1;
+
+ return 0;
+}
+
+static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (rvpmu->event_mapped) {
+ rvpmu->event_mapped(event, mm);
+ perf_event_update_userpage(event);
+ }
+}
+
+static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (rvpmu->event_unmapped) {
+ rvpmu->event_unmapped(event, mm);
+ perf_event_update_userpage(event);
+ }
+}
+
struct riscv_pmu *riscv_pmu_alloc(void)
{
struct riscv_pmu *pmu;
@@ -307,6 +345,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
}
pmu->pmu = (struct pmu) {
.event_init = riscv_pmu_event_init,
+ .event_mapped = riscv_pmu_event_mapped,
+ .event_unmapped = riscv_pmu_event_unmapped,
+ .event_idx = riscv_pmu_event_idx,
.add = riscv_pmu_add,
.del = riscv_pmu_del,
.start = riscv_pmu_start,
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 9f70d94942e0..1452c8af3b67 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -55,6 +55,10 @@ struct riscv_pmu {
void (*ctr_start)(struct perf_event *event, u64 init_val);
void (*ctr_stop)(struct perf_event *event, unsigned long flag);
int (*event_map)(struct perf_event *event, u64 *config);
+ void (*event_init)(struct perf_event *event);
+ void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
+ void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
+ uint8_t (*csr_index)(struct perf_event *event);

struct cpu_hw_events __percpu *hw_events;
struct hlist_node node;
--
2.37.2


2023-05-12 09:10:39

by Alexandre Ghiti

Subject: [PATCH v2 06/10] drivers: perf: Implement perf event mmap support in the legacy backend

Implement the needed callbacks in the legacy driver so that we can
directly access the counters through perf in userspace.
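
With these callbacks in place, a task that has mmapped a cycle or
instret event can read the counter CSR directly; the mmap page's index
field is the CSR offset from CSR_CYCLE plus one, so a (hypothetical)
direct cycle read boils down to:

static inline unsigned long read_cycle(void)
{
	unsigned long val;

	/* legacy mode pins cycle events to the cycle CSR (index 1) */
	asm volatile ("csrr %0, cycle" : "=r" (val) : : "memory");
	return val;
}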

Signed-off-by: Alexandre Ghiti <[email protected]>
---
drivers/perf/riscv_pmu_legacy.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index ffe09d857366..f0f5bd856f66 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -74,6 +74,31 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
local64_set(&hwc->prev_count, initial_val);
}

+static uint8_t pmu_legacy_csr_index(struct perf_event *event)
+{
+ return event->hw.idx;
+}
+
+static void pmu_legacy_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ /* In legacy mode, the first 3 CSRs are available. */
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
+ return;
+
+ event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
+static void pmu_legacy_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ /* In legacy mode, the first 3 CSRs are available. */
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
+ return;
+
+ event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
/*
* This is just a simple implementation to allow legacy implementations
* compatible with new RISC-V PMU driver framework.
@@ -94,6 +119,9 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
pmu->ctr_get_width = NULL;
pmu->ctr_clear_idx = NULL;
pmu->ctr_read = pmu_legacy_read_ctr;
+ pmu->event_mapped = pmu_legacy_event_mapped;
+ pmu->event_unmapped = pmu_legacy_event_unmapped;
+ pmu->csr_index = pmu_legacy_csr_index;

perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
}
--
2.37.2


2023-05-12 09:10:57

by Alexandre Ghiti

Subject: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv

riscv now supports mmapping hardware counters to userspace, so adapt the
test to run on this architecture.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
tools/perf/tests/mmap-basic.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
index e68ca6229756..f5075ca774f8 100644
--- a/tools/perf/tests/mmap-basic.c
+++ b/tools/perf/tests/mmap-basic.c
@@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
"permissions"),
TEST_CASE_REASON("User space counter reading of instructions",
mmap_user_read_instr,
-#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
"permissions"
#else
"unsupported"
@@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
),
TEST_CASE_REASON("User space counter reading of cycles",
mmap_user_read_cycles,
-#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
"permissions"
#else
"unsupported"
--
2.37.2


2023-05-12 09:16:22

by Alexandre Ghiti

Subject: [PATCH v2 09/10] tools: lib: perf: Implement riscv mmap support

riscv now supports mmapping hardware counters, so add what's needed to
take advantage of that in libperf.
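
A sketch of the resulting libperf usage, mirroring the existing libperf
tests (error handling omitted; see tools/lib/perf for the exact API):

	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.config	= PERF_COUNT_HW_INSTRUCTIONS,
	};
	struct perf_thread_map *threads = perf_thread_map__new_dummy();
	struct perf_evsel *evsel = perf_evsel__new(&attr);
	struct perf_counts_values counts = { .val = 0 };

	perf_evsel__open(evsel, NULL, threads);
	perf_evsel__mmap(evsel, 0);		/* map the counter's user page */
	perf_evsel__read(evsel, 0, 0, &counts);	/* now reads the CSR directly */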

Signed-off-by: Alexandre Ghiti <[email protected]>
---
tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)

diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 0d1634cedf44..65f250e0ef92 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)

static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }

+#elif __riscv_xlen == 64
+
+#define CSR_CYCLE 0xc00
+#define CSR_TIME 0xc01
+#define CSR_CYCLEH 0xc80
+
+#define csr_read(csr) \
+({ \
+ register unsigned long __v; \
+ __asm__ __volatile__ ("csrr %0, " #csr \
+ : "=r" (__v) : \
+ : "memory"); \
+ __v; \
+})
+
+static unsigned long csr_read_num(int csr_num)
+{
+#define switchcase_csr_read(__csr_num, __val) {\
+ case __csr_num: \
+ __val = csr_read(__csr_num); \
+ break; }
+#define switchcase_csr_read_2(__csr_num, __val) {\
+ switchcase_csr_read(__csr_num + 0, __val) \
+ switchcase_csr_read(__csr_num + 1, __val)}
+#define switchcase_csr_read_4(__csr_num, __val) {\
+ switchcase_csr_read_2(__csr_num + 0, __val) \
+ switchcase_csr_read_2(__csr_num + 2, __val)}
+#define switchcase_csr_read_8(__csr_num, __val) {\
+ switchcase_csr_read_4(__csr_num + 0, __val) \
+ switchcase_csr_read_4(__csr_num + 4, __val)}
+#define switchcase_csr_read_16(__csr_num, __val) {\
+ switchcase_csr_read_8(__csr_num + 0, __val) \
+ switchcase_csr_read_8(__csr_num + 8, __val)}
+#define switchcase_csr_read_32(__csr_num, __val) {\
+ switchcase_csr_read_16(__csr_num + 0, __val) \
+ switchcase_csr_read_16(__csr_num + 16, __val)}
+
+ unsigned long ret = 0;
+
+ switch (csr_num) {
+ switchcase_csr_read_32(CSR_CYCLE, ret)
+ switchcase_csr_read_32(CSR_CYCLEH, ret)
+ default:
+ break;
+ }
+
+ return ret;
+#undef switchcase_csr_read_32
+#undef switchcase_csr_read_16
+#undef switchcase_csr_read_8
+#undef switchcase_csr_read_4
+#undef switchcase_csr_read_2
+#undef switchcase_csr_read
+}
+
+static u64 read_perf_counter(unsigned int counter)
+{
+ return csr_read_num(CSR_CYCLE + counter);
+}
+
+static u64 read_timestamp(void)
+{
+ return csr_read_num(CSR_TIME);
+}
+
#else
static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
static u64 read_timestamp(void) { return 0; }
--
2.37.2


2023-05-12 09:18:44

by Alexandre Ghiti

Subject: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend

We used to unconditionally expose the cycle and instret CSRs to
userspace, which gives rise to security concerns.

So now we only allow userspace access to hw counters through the perf
framework, which handles context switches, per-task events, etc. But as
we cannot break userspace, we give the user the choice to go back to the
previous behaviour by setting the sysctl perf_user_access.
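
For example, an administrator who wants the secure behaviour can switch
modes at runtime with something along these lines (0: no user access,
1: user access through perf only, 2: legacy):

#include <fcntl.h>
#include <unistd.h>

static void set_perf_user_access(const char *val)
{
	int fd = open("/proc/sys/kernel/perf_user_access", O_WRONLY);

	if (fd >= 0) {
		write(fd, val, 1);
		close(fd);
	}
}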

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/kernel/perf_event.c | 18 ++-
drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
2 files changed, 205 insertions(+), 7 deletions(-)

diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
index 94174a0fc251..3af9ca45b43f 100644
--- a/arch/riscv/kernel/perf_event.c
+++ b/arch/riscv/kernel/perf_event.c
@@ -1,9 +1,13 @@
// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/perf/riscv_pmu.h>
#include <linux/sched_clock.h>

void arch_perf_update_userpage(struct perf_event *event,
struct perf_event_mmap_page *userpg, u64 now)
{
+#ifdef CONFIG_RISCV_PMU_SBI
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+#endif
struct clock_read_data *rd;
unsigned int seq;
u64 ns;
@@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
userpg->cap_user_rdpmc =
!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);

- userpg->pmc_width = 64;
+#ifdef CONFIG_RISCV_PMU_SBI
+ /*
+ * The counters are 64-bit but the priv spec doesn't mandate all the
+ * bits to be implemented: that's why the counter width can vary based on
+ * the cpu vendor.
+ */
+ if (event->pmu->name &&
+ !strncmp(event->pmu->name,
+ RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
+ userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
+ else
+#endif
+ userpg->pmc_width = 64;

do {
rd = sched_clock_read_begin(&seq);
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 3b0ee2148054..d9bcc5cc6df5 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -24,6 +24,14 @@
#include <asm/sbi.h>
#include <asm/hwcap.h>

+#define SYSCTL_NO_USER_ACCESS 0
+#define SYSCTL_USER_ACCESS 1
+#define SYSCTL_LEGACY 2
+
+#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
+#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
+#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
+
PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");

@@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
NULL,
};

+/* Allow legacy access by default */
+static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
+
/*
* RISC-V doesn't have heterogeneous harts yet. This need to be part of
* per_cpu in case of harts with different pmu counters
@@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
}
EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);

+static uint8_t pmu_sbi_csr_index(struct perf_event *event)
+{
+ return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
+}
+
static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
{
unsigned long cflags = 0;
@@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
struct sbiret ret;
int idx;
- uint64_t cbase = 0;
+ uint64_t cbase = 0, cmask = rvpmu->cmask;
unsigned long cflags = 0;

cflags = pmu_sbi_get_filter_flags(event);
+
+ /*
+ * In legacy mode, we have to force the fixed counters for those events
+ * but not in the user access mode as we want to use the other counters
+ * that support sampling/filtering.
+ */
+ if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
+ cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+ cmask = 1;
+ } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
+ cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+ cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
+ }
+ }
+
/* retrieve the available counter index */
#if defined(CONFIG_32BIT)
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
- rvpmu->cmask, cflags, hwc->event_base, hwc->config,
+ cmask, cflags, hwc->event_base, hwc->config,
hwc->config >> 32);
#else
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
- rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
+ cmask, cflags, hwc->event_base, hwc->config, 0);
#endif
if (ret.error) {
pr_debug("Not able to find a counter for event %lx config %llx\n",
@@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
return val;
}

+static void pmu_sbi_set_scounteren(void *arg)
+{
+ struct perf_event *event = (struct perf_event *)arg;
+
+ csr_write(CSR_SCOUNTEREN,
+ csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
+}
+
static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
{
struct sbiret ret;
@@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
pr_err("Starting counter idx %d failed with error %d\n",
hwc->idx, sbi_err_map_linux_errno(ret.error));
+
+ if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
+ hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
+ pmu_sbi_set_scounteren((void *)event);
+}
+
+static void pmu_sbi_reset_scounteren(void *arg)
+{
+ struct perf_event *event = (struct perf_event *)arg;
+
+ csr_write(CSR_SCOUNTEREN,
+ csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
}

static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
@@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
struct sbiret ret;
struct hw_perf_event *hwc = &event->hw;

+ if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
+ hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
+ pmu_sbi_reset_scounteren((void *)event);
+
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
flag != SBI_PMU_STOP_FLAG_RESET)
@@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);

/*
- * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
- * as is necessary to maintain uABI compatibility.
+ * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
+ * legacy option but that will be removed in the future.
*/
- csr_write(CSR_SCOUNTEREN, 0x7);
+ if (sysctl_perf_user_access == SYSCTL_LEGACY)
+ csr_write(CSR_SCOUNTEREN, 0x7);
+ else
+ csr_write(CSR_SCOUNTEREN, 0x2);

/* Stop all the counters so that they can be enabled from perf */
pmu_sbi_stop_all(pmu);
@@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
}

+static void pmu_sbi_event_init(struct perf_event *event)
+{
+ /*
+ * The permissions are set at event_init so that we do not depend
+ * on the sysctl value that can change.
+ */
+ if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
+ event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
+ else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
+ event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
+ else
+ event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
+}
+
+static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
+ return;
+
+ /* In legacy mode, the first 3 CSRs are available. */
+ if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
+ return;
+ }
+ }
+
+ /*
+ * The user mmapped the event to directly access it: this is where
+ * we determine based on sysctl_perf_user_access if we grant userspace
+ * the direct access to this event. That means that within the same
+ * task, some events may be directly accessible and some others may not,
+ * if the user changes the value of sysctl_perf_user_access in the
+ * meantime.
+ */
+
+ event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
+
+ /*
+ * We must enable userspace access *before* advertising in the user page
+ * that it is possible to do so to avoid any race.
+ * And we must notify all cpus here because threads that currently run
+ * on other cpus will try to directly access the counter too without
+ * calling pmu_sbi_ctr_start.
+ */
+ if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
+ on_each_cpu_mask(mm_cpumask(mm),
+ pmu_sbi_set_scounteren, (void *)event, 1);
+}
+
+static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
+ return;
+
+ /* In legacy mode, the first 3 CSRs are available. */
+ if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
+ return;
+ }
+ }
+
+ /*
+ * Here we can directly remove user access since the user does not have
+ * access to the user page anymore, so we avoid the racy window where the
+ * user could have read cap_user_rdpmc as true right before we disable
+ * it.
+ */
+ event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
+
+ if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
+ on_each_cpu_mask(mm_cpumask(mm),
+ pmu_sbi_reset_scounteren, (void *)event, 1);
+}
+
+static void riscv_pmu_update_counter_access(void *info)
+{
+ if (sysctl_perf_user_access == SYSCTL_LEGACY)
+ csr_write(CSR_SCOUNTEREN, 0x7);
+ else
+ csr_write(CSR_SCOUNTEREN, 0x2);
+}
+
+static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
+ int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int prev = sysctl_perf_user_access;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ /*
+ * Test against the previous value since we clear SCOUNTEREN when
+ * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
+ * not do that if that was already the case.
+ */
+ if (ret || !write || prev == sysctl_perf_user_access)
+ return ret;
+
+ on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
+
+ return 0;
+}
+
+static struct ctl_table sbi_pmu_sysctl_table[] = {
+ {
+ .procname = "perf_user_access",
+ .data = &sysctl_perf_user_access,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = riscv_pmu_proc_user_access_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_TWO,
+ },
+ { }
+};
+
static int pmu_sbi_device_probe(struct platform_device *pdev)
{
struct riscv_pmu *pmu = NULL;
@@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
pmu->ctr_get_width = pmu_sbi_ctr_get_width;
pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
pmu->ctr_read = pmu_sbi_ctr_read;
+ pmu->event_init = pmu_sbi_event_init;
+ pmu->event_mapped = pmu_sbi_event_mapped;
+ pmu->event_unmapped = pmu_sbi_event_unmapped;
+ pmu->csr_index = pmu_sbi_csr_index;

ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
if (ret)
@@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
if (ret)
goto out_unregister;

+ register_sysctl("kernel", sbi_pmu_sysctl_table);
+
return 0;

out_unregister:
--
2.37.2


2023-05-12 09:21:50

by Alexandre Ghiti

Subject: [PATCH v2 08/10] Documentation: admin-guide: Add riscv sysctl_perf_user_access

riscv now uses this sysctl, so document its usage for this architecture.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 24 +++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 4b7bfea28cd7..93cd518ca94b 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -941,16 +941,32 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.


-perf_user_access (arm64 only)
-=================================
+perf_user_access (arm64 and riscv only)
+=======================================
+
+Controls user space access for reading perf event counters.

-Controls user space access for reading perf event counters. When set to 1,
-user space can read performance monitor counter registers directly.
+arm64
+=====

The default value is 0 (access disabled).
+When set to 1, user space can read performance monitor counter registers
+directly.

See Documentation/arm64/perf.rst for more information.

+riscv
+=====
+
+When set to 0, user access is disabled.
+
+When set to 1, user space can read performance monitor counter registers
+directly only through perf; any direct access without perf intervention will
+trigger an illegal instruction.
+
+The default value is 2, which enables legacy mode (user space has direct
+access to the cycle, time and instret CSRs only). Note that this legacy value
+is deprecated and will be removed once all userspace applications are fixed.

pid_max
=======
--
2.37.2


2023-05-15 18:02:19

by Conor Dooley

Subject: Re: [PATCH v2 00/10] riscv: Allow userspace to directly access perf counters

On Fri, May 12, 2023 at 10:53:11AM +0200, Alexandre Ghiti wrote:

> base-commit-tag: v6.3


BTW Alex, in the future it'd be great if you could pick a more recent
base (eg -rc1 or riscv/for-next) so that the patchwork automation
doesn't run into a bunch of conflicts while trying to apply patches.

Cheers,
Conor.



2023-05-31 14:49:29

by Andrew Jones

Subject: Re: [PATCH v2 03/10] riscv: Make legacy counter enum match the HW numbering

On Fri, May 12, 2023 at 10:53:14AM +0200, Alexandre Ghiti wrote:
> RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
> hardware counter from CSR_CYCLE is actually 2: make this offset match the
> real hw offset so that we can directly expose those values to userspace.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> drivers/perf/riscv_pmu_legacy.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index ca9e20bfc7ac..0d8c9d8849ee 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -12,8 +12,11 @@
> #include <linux/perf/riscv_pmu.h>
> #include <linux/platform_device.h>
>
> -#define RISCV_PMU_LEGACY_CYCLE 0
> -#define RISCV_PMU_LEGACY_INSTRET 1
> +enum {
> + RISCV_PMU_LEGACY_CYCLE,
> + RISCV_PMU_LEGACY_TIME,
> + RISCV_PMU_LEGACY_INSTRET
> +};

I guess this doesn't hurt, since these are just indices internal to this
driver, but it's a bit odd to also have a RISCV_PMU_LEGACY_TIME, when
the driver is only for cycle and instret, as its Kconfig help text says.

Thanks,
drew

2023-05-31 15:01:22

by Andrew Jones

Subject: Re: [PATCH v2 04/10] drivers: perf: Rename riscv pmu driver

On Fri, May 12, 2023 at 10:53:15AM +0200, Alexandre Ghiti wrote:
> In addition to being prettier, the new names will be useful in upcoming
> commits to distinguish these pmu drivers from the other pmu drivers.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> drivers/perf/riscv_pmu_legacy.c | 2 +-
> drivers/perf/riscv_pmu_sbi.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index 0d8c9d8849ee..ffe09d857366 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -95,7 +95,7 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> pmu->ctr_clear_idx = NULL;
> pmu->ctr_read = pmu_legacy_read_ctr;
>
> - perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> + perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
> }
>
> static int pmu_legacy_device_probe(struct platform_device *pdev)
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 70cb50fd41c2..3b0ee2148054 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -897,7 +897,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> if (ret)
> goto out_unregister;
>
> - ret = perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> + ret = perf_pmu_register(&pmu->pmu, RISCV_PMU_PDEV_NAME, PERF_TYPE_RAW);

Should we include "sbi" in this name?

> if (ret)
> goto out_unregister;
>
> --
> 2.37.2
>

Otherwise,

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:02:24

by Andrew Jones

Subject: Re: [PATCH v2 06/10] drivers: perf: Implement perf event mmap support in the legacy backend

On Fri, May 12, 2023 at 10:53:17AM +0200, Alexandre Ghiti wrote:
> Implement the needed callbacks in the legacy driver so that we can
> directly access the counters through perf in userspace.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> drivers/perf/riscv_pmu_legacy.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index ffe09d857366..f0f5bd856f66 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -74,6 +74,31 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
> local64_set(&hwc->prev_count, initial_val);
> }
>
> +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
> +{
> + return event->hw.idx;
> +}
> +
> +static void pmu_legacy_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + /* In legacy mode, the first 3 CSRs are available. */

Shouldn't this be

/* In legacy mode, the first and third CSR are available. */

?

> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> + return;
> +
> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
> +static void pmu_legacy_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + /* In legacy mode, the first 3 CSRs are available. */

same comment

> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> + return;
> +
> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
> /*
> * This is just a simple implementation to allow legacy implementations
> * compatible with new RISC-V PMU driver framework.
> @@ -94,6 +119,9 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> pmu->ctr_get_width = NULL;
> pmu->ctr_clear_idx = NULL;
> pmu->ctr_read = pmu_legacy_read_ctr;
> + pmu->event_mapped = pmu_legacy_event_mapped;
> + pmu->event_unmapped = pmu_legacy_event_unmapped;
> + pmu->csr_index = pmu_legacy_csr_index;
>
> perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
> }
> --
> 2.37.2
>

Otherwise,

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:03:01

by Andrew Jones

Subject: Re: [PATCH v2 05/10] riscv: Prepare for user-space perf event mmap support

On Fri, May 12, 2023 at 10:53:16AM +0200, Alexandre Ghiti wrote:
> Provide all the necessary bits in the generic riscv pmu driver to be
> able to mmap perf events in userspace: the heavy lifting lies in the
> driver backends, namely the legacy and SBI implementations.
>
> Note that arch_perf_update_userpage is almost a copy of arm64 code.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/kernel/Makefile | 2 +-
> arch/riscv/kernel/perf_event.c | 58 ++++++++++++++++++++++++++++++++++
> drivers/perf/riscv_pmu.c | 41 ++++++++++++++++++++++++
> include/linux/perf/riscv_pmu.h | 4 +++
> 4 files changed, 104 insertions(+), 1 deletion(-)
> create mode 100644 arch/riscv/kernel/perf_event.c
>

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:03:02

by Andrew Jones

Subject: Re: [PATCH v2 02/10] include: riscv: Fix wrong include guard in riscv_pmu.h

On Fri, May 12, 2023 at 10:53:13AM +0200, Alexandre Ghiti wrote:
> The current include guard prevents the inclusion of asm/perf_event.h
> which uses the same include guard: fix the one in riscv_pmu.h so that it
> matches the file name.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Conor Dooley <[email protected]>
> ---
> include/linux/perf/riscv_pmu.h | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 43fc892aa7d9..9f70d94942e0 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -6,8 +6,8 @@
> *
> */
>
> -#ifndef _ASM_RISCV_PERF_EVENT_H
> -#define _ASM_RISCV_PERF_EVENT_H
> +#ifndef _RISCV_PMU_H
> +#define _RISCV_PMU_H
>
> #include <linux/perf_event.h>
> #include <linux/ptrace.h>
> @@ -81,4 +81,4 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr);
>
> #endif /* CONFIG_RISCV_PMU */
>
> -#endif /* _ASM_RISCV_PERF_EVENT_H */
> +#endif /* _RISCV_PMU_H */
> --
> 2.37.2
>

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:04:28

by Andrew Jones

Subject: Re: [PATCH v2 01/10] perf: Fix wrong comment about default event_idx

On Fri, May 12, 2023 at 10:53:12AM +0200, Alexandre Ghiti wrote:
> event_idx default implementation returns 0, not idx + 1.

The comment was correct until commit c719f56092ad ("perf: Fix and clean
up initialization of pmu::event_idx"). I'm not sure that warrants a fixes
tag, but maybe a reference in the commit message.

>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> include/linux/perf_event.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..56fe43b20966 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -442,7 +442,8 @@ struct pmu {
>
> /*
> * Will return the value for perf_event_mmap_page::index for this event,
> - * if no implementation is provided it will default to: event->hw.idx + 1.
> + * if no implementation is provided it will default to 0 (see
> + * perf_event_idx_default).
> */
> int (*event_idx) (struct perf_event *event); /*optional */
>
> --
> 2.37.2
>

Otherwise,

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:06:59

by Andrew Jones

Subject: Re: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend

On Fri, May 12, 2023 at 10:53:18AM +0200, Alexandre Ghiti wrote:
> We used to unconditionally expose the cycle and instret CSRs to
> userspace, which gives rise to security concerns.
>
> So now we only allow userspace access to hw counters through the perf
> framework, which handles context switches, per-task events, etc. But as
> we cannot break userspace, we give the user the choice to go back to the
> previous behaviour by setting the sysctl perf_user_access.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/kernel/perf_event.c | 18 ++-
> drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
> 2 files changed, 205 insertions(+), 7 deletions(-)
>
> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> index 94174a0fc251..3af9ca45b43f 100644
> --- a/arch/riscv/kernel/perf_event.c
> +++ b/arch/riscv/kernel/perf_event.c
> @@ -1,9 +1,13 @@
> // SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/perf/riscv_pmu.h>
> #include <linux/sched_clock.h>
>
> void arch_perf_update_userpage(struct perf_event *event,
> struct perf_event_mmap_page *userpg, u64 now)
> {
> +#ifdef CONFIG_RISCV_PMU_SBI
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +#endif

Can avoid this pair of #ifdef/#endif's by just declaring rvpmu inside the
if block below where it's needed. Or even just using to_riscv_pmu()
directly in place of rvpmu.
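
i.e., something along these lines (untested):

	if (event->pmu->name &&
	    !strncmp(event->pmu->name, RISCV_PMU_PDEV_NAME,
		     sizeof(RISCV_PMU_PDEV_NAME))) {
		struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);

		userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
	}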

> struct clock_read_data *rd;
> unsigned int seq;
> u64 ns;
> @@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
> userpg->cap_user_rdpmc =
> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>
> - userpg->pmc_width = 64;
> +#ifdef CONFIG_RISCV_PMU_SBI
> + /*
> + * The counters are 64-bit but the priv spec doesn't mandate all the
> + * bits to be implemented: that's why the counter width can vary based on
> + * the cpu vendor.
> + */
> + if (event->pmu->name &&
> + !strncmp(event->pmu->name,
> + RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
> + userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> + else
> +#endif
> + userpg->pmc_width = 64;

Can leave the initialization to 64 above the #ifdef CONFIG_RISCV_PMU_SBI
as is and drop the else.

>
> do {
> rd = sched_clock_read_begin(&seq);
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 3b0ee2148054..d9bcc5cc6df5 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -24,6 +24,14 @@
> #include <asm/sbi.h>
> #include <asm/hwcap.h>
>
> +#define SYSCTL_NO_USER_ACCESS 0
> +#define SYSCTL_USER_ACCESS 1
> +#define SYSCTL_LEGACY 2
> +
> +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
> +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
> +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
> +
> PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> NULL,
> };
>
> +/* Allow legacy access by default */
> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> +

I'm still not in favor of this. Hopefully the distro discussions result in
it being changed.

> /*
> * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> * per_cpu in case of harts with different pmu counters
> @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> }
> EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>
> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> +{
> + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> +}
> +
> static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> {
> unsigned long cflags = 0;
> @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> struct sbiret ret;
> int idx;
> - uint64_t cbase = 0;
> + uint64_t cbase = 0, cmask = rvpmu->cmask;
> unsigned long cflags = 0;
>
> cflags = pmu_sbi_get_filter_flags(event);
> +
> + /*
> + * In legacy mode, we have to force the fixed counters for those events
> + * but not in the user access mode as we want to use the other counters
> + * that support sampling/filtering.
> + */
> + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> + cmask = 1;
> + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> + }
> + }
> +
> /* retrieve the available counter index */
> #if defined(CONFIG_32BIT)
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> + cmask, cflags, hwc->event_base, hwc->config,
> hwc->config >> 32);
> #else
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> + cmask, cflags, hwc->event_base, hwc->config, 0);
> #endif
> if (ret.error) {
> pr_debug("Not able to find a counter for event %lx config %llx\n",
> @@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> return val;
> }
>
> +static void pmu_sbi_set_scounteren(void *arg)
> +{
> + struct perf_event *event = (struct perf_event *)arg;
> +
> + csr_write(CSR_SCOUNTEREN,
> + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> +}
> +
> static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> {
> struct sbiret ret;
> @@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> pr_err("Starting counter idx %d failed with error %d\n",
> hwc->idx, sbi_err_map_linux_errno(ret.error));
> +
> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> + pmu_sbi_set_scounteren((void *)event);
> +}
> +
> +static void pmu_sbi_reset_scounteren(void *arg)
> +{
> + struct perf_event *event = (struct perf_event *)arg;
> +
> + csr_write(CSR_SCOUNTEREN,
> + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> }
>
> static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> struct sbiret ret;
> struct hw_perf_event *hwc = &event->hw;
>
> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> + pmu_sbi_reset_scounteren((void *)event);
> +
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> flag != SBI_PMU_STOP_FLAG_RESET)
> @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>
> /*
> - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> - * as is necessary to maintain uABI compatibility.
> + * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> + * legacy option but that will be removed in the future.
> */
> - csr_write(CSR_SCOUNTEREN, 0x7);
> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> + csr_write(CSR_SCOUNTEREN, 0x7);
> + else
> + csr_write(CSR_SCOUNTEREN, 0x2);
>
> /* Stop all the counters so that they can be enabled from perf */
> pmu_sbi_stop_all(pmu);
> @@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> }
>
> +static void pmu_sbi_event_init(struct perf_event *event)
> +{
> + /*
> + * The permissions are set at event_init so that we do not depend
> + * on the sysctl value that can change.
> + */
> + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
> + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
> + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
> + else
> + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
> +}
> +
> +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> + return;
> +
> + /* In legacy mode, the first 3 CSRs are available. */

first and third

> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> + return;
> + }
> + }
> +
> + /*
> + * The user mmapped the event to directly access it: this is where
> + * we determine based on sysctl_perf_user_access if we grant userspace
> + * the direct access to this event. That means that within the same
> + * task, some events may be directly accessible and some others may not,
> + * if the user changes the value of sysctl_perf_user_access in the
> + * meantime.
> + */
> +
> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> +
> + /*
> + * We must enable userspace access *before* advertising in the user page
> + * that it is possible to do so to avoid any race.
> + * And we must notify all cpus here because threads that currently run
> + * on other cpus will try to directly access the counter too without
> + * calling pmu_sbi_ctr_start.
> + */
> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> + on_each_cpu_mask(mm_cpumask(mm),
> + pmu_sbi_set_scounteren, (void *)event, 1);
> +}
> +
> +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> + return;
> +
> + /* In legacy mode, the first 3 CSRs are available. */

first and third

> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> + return;
> + }
> + }
> +
> + /*
> + * Here we can directly remove user access since the user does not have
> + * access to the user page anymore, so we avoid the racy window where the
> + * user could have read cap_user_rdpmc as true right before we disable
> + * it.
> + */
> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> +
> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> + on_each_cpu_mask(mm_cpumask(mm),
> + pmu_sbi_reset_scounteren, (void *)event, 1);
> +}
> +
> +static void riscv_pmu_update_counter_access(void *info)
> +{
> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> + csr_write(CSR_SCOUNTEREN, 0x7);
> + else
> + csr_write(CSR_SCOUNTEREN, 0x2);
> +}
> +
> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> + int write, void *buffer,
> + size_t *lenp, loff_t *ppos)
> +{
> + int prev = sysctl_perf_user_access;
> + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +
> + /*
> + * Test against the previous value since we clear SCOUNTEREN when
> + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> + * not do that if that was already the case.
> + */
> + if (ret || !write || prev == sysctl_perf_user_access)
> + return ret;
> +
> + on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);

Instead of passing prev shouldn't we pass NULL, as it's not used?

> +
> + return 0;
> +}
> +
> +static struct ctl_table sbi_pmu_sysctl_table[] = {
> + {
> + .procname = "perf_user_access",
> + .data = &sysctl_perf_user_access,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = riscv_pmu_proc_user_access_handler,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_TWO,
> + },
> + { }
> +};
> +
> static int pmu_sbi_device_probe(struct platform_device *pdev)
> {
> struct riscv_pmu *pmu = NULL;
> @@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> pmu->ctr_read = pmu_sbi_ctr_read;
> + pmu->event_init = pmu_sbi_event_init;
> + pmu->event_mapped = pmu_sbi_event_mapped;
> + pmu->event_unmapped = pmu_sbi_event_unmapped;
> + pmu->csr_index = pmu_sbi_csr_index;
>
> ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> if (ret)
> @@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> if (ret)
> goto out_unregister;
>
> + register_sysctl("kernel", sbi_pmu_sysctl_table);
> +
> return 0;
>
> out_unregister:
> --
> 2.37.2
>

Thanks,
drew

2023-05-31 15:18:01

by Andrew Jones

Subject: Re: [PATCH v2 09/10] tools: lib: perf: Implement riscv mmap support

On Fri, May 12, 2023 at 10:53:20AM +0200, Alexandre Ghiti wrote:
> riscv now supports mmapping hardware counters, so add what's needed to
> take advantage of that in libperf.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 65 insertions(+)
>

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:24:32

by Andrew Jones

Subject: Re: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv

On Fri, May 12, 2023 at 10:53:21AM +0200, Alexandre Ghiti wrote:
> riscv now supports mmapping hardware counters to userspace, so adapt the
> test to run on this architecture.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> tools/perf/tests/mmap-basic.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
> index e68ca6229756..f5075ca774f8 100644
> --- a/tools/perf/tests/mmap-basic.c
> +++ b/tools/perf/tests/mmap-basic.c
> @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
> "permissions"),
> TEST_CASE_REASON("User space counter reading of instructions",
> mmap_user_read_instr,
> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> "permissions"
> #else
> "unsupported"
> @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
> ),
> TEST_CASE_REASON("User space counter reading of cycles",
> mmap_user_read_cycles,
> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> "permissions"
> #else
> "unsupported"
> --
> 2.37.2
>

Reviewed-by: Andrew Jones <[email protected]>

2023-05-31 15:24:47

by Andrew Jones

Subject: Re: [PATCH v2 08/10] Documentation: admin-guide: Add riscv sysctl_perf_user_access

On Fri, May 12, 2023 at 10:53:19AM +0200, Alexandre Ghiti wrote:
> riscv now uses this sysctl, so document its usage for this architecture.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 24 +++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 4b7bfea28cd7..93cd518ca94b 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -941,16 +941,32 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> The default value is 8.
>
>
> -perf_user_access (arm64 only)
> -=================================
> +perf_user_access (arm64 and riscv only)
> +=======================================
> +
> +Controls user space access for reading perf event counters.
>
> -Controls user space access for reading perf event counters. When set to 1,
> -user space can read performance monitor counter registers directly.
> +arm64
> +=====
>
> The default value is 0 (access disabled).
> +When set to 1, user space can read performance monitor counter registers
> +directly.
>
> See Documentation/arm64/perf.rst for more information.
>
> +riscv
> +=====
> +
> +When set to 0, user access is disabled.
> +
> +When set to 1, user space can read performance monitor counter registers
> +directly only through perf; any direct access without perf intervention will
> +trigger an illegal instruction.
> +
> +The default value is 2, which enables legacy mode (user space has direct
> +access to the cycle, time and instret CSRs only). Note that this legacy value
> +is deprecated and will be removed once all userspace applications are fixed.

All modes can access the time CSR so I'm not sure if it should be pointed
out here as if it's an exception. Maybe we shouldn't point it out at all
or we should point it out for all three?

Thanks,
drew

2023-05-31 17:27:07

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] Documentation: admin-guide: Add riscv sysctl_perf_user_access

On Wed, May 31, 2023 at 8:07 AM Andrew Jones <[email protected]> wrote:
>
> On Fri, May 12, 2023 at 10:53:19AM +0200, Alexandre Ghiti wrote:
> > riscv now uses this sysctl so document its usage for this architecture.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 24 +++++++++++++++++----
> > 1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > index 4b7bfea28cd7..93cd518ca94b 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -941,16 +941,32 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> > The default value is 8.
> >
> >
> > -perf_user_access (arm64 only)
> > -=================================
> > +perf_user_access (arm64 and riscv only)
> > +=======================================
> > +
> > +Controls user space access for reading perf event counters.
> >
> > -Controls user space access for reading perf event counters. When set to 1,
> > -user space can read performance monitor counter registers directly.
> > +arm64
> > +=====
> >
> > The default value is 0 (access disabled).
> > +When set to 1, user space can read performance monitor counter registers
> > +directly.
> >
> > See Documentation/arm64/perf.rst for more information.
> >
> > +riscv
> > +=====
> > +
> > +When set to 0, user access is disabled.
> > +
> > +When set to 1, user space can read performance monitor counter registers
> > +directly, but only through perf; any direct access without perf intervention
> > +will trigger an illegal instruction.
> > +
> > +The default value is 2, which enables legacy mode (user space has direct
> > +access to cycle, time and instret CSRs only). Note that this legacy value
> > +is deprecated and will be removed once all userspace applications are fixed.
>
> All modes can access the time CSR so I'm not sure if it should be pointed
> out here as if it's an exception. Maybe we shouldn't point it out at all
> or we should point it out for all three?
>

Valid point. Thanks Drew.
In the future, we probably want to support prctl
(PR_SET_TSC/SECCOMP_MODE_STRICT) to disable even
time CSR access. I don't think there is any use case for it right now.
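
For reference, the x86 knob referred to above is used like this today (a
sketch; whether a riscv analogue would reuse these exact constants is an open
question):

  #include <stdio.h>
  #include <sys/prctl.h>

  int main(void)
  {
          /*
           * x86-only today: make direct TSC reads fault with SIGSEGV; a
           * riscv equivalent could do the same for the time CSR.
           */
          if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0))
                  perror("prctl(PR_SET_TSC)");
          return 0;
  }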

> Thanks,
> drew



--
Regards,
Atish

2023-06-05 14:03:56

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv

Em Wed, May 31, 2023 at 05:15:15PM +0200, Andrew Jones escreveu:
> On Fri, May 12, 2023 at 10:53:21AM +0200, Alexandre Ghiti wrote:
> > riscv now supports mmapping hardware counters to userspace so adapt the test
> > to run on this architecture.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > tools/perf/tests/mmap-basic.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
> > index e68ca6229756..f5075ca774f8 100644
> > --- a/tools/perf/tests/mmap-basic.c
> > +++ b/tools/perf/tests/mmap-basic.c
> > @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
> > "permissions"),
> > TEST_CASE_REASON("User space counter reading of instructions",
> > mmap_user_read_instr,
> > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > "permissions"
> > #else
> > "unsupported"
> > @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
> > ),
> > TEST_CASE_REASON("User space counter reading of cycles",
> > mmap_user_read_cycles,
> > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > "permissions"
> > #else
> > "unsupported"
> > --
> > 2.37.2
> >
>
> Reviewed-by: Andrew Jones <[email protected]>

Was the kernel part merged upstream?

- Arnaldo

2023-06-05 14:45:44

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv

(sorry, my mail client switched to HTML on its own)

On Mon, Jun 5, 2023 at 3:53 PM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Wed, May 31, 2023 at 05:15:15PM +0200, Andrew Jones escreveu:
> > On Fri, May 12, 2023 at 10:53:21AM +0200, Alexandre Ghiti wrote:
> > > riscv now supports mmapping hardware counters to userspace so adapt the test
> > > to run on this architecture.
> > >
> > > Signed-off-by: Alexandre Ghiti <[email protected]>
> > > ---
> > > tools/perf/tests/mmap-basic.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
> > > index e68ca6229756..f5075ca774f8 100644
> > > --- a/tools/perf/tests/mmap-basic.c
> > > +++ b/tools/perf/tests/mmap-basic.c
> > > @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
> > > "permissions"),
> > > TEST_CASE_REASON("User space counter reading of instructions",
> > > mmap_user_read_instr,
> > > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > > "permissions"
> > > #else
> > > "unsupported"
> > > @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
> > > ),
> > > TEST_CASE_REASON("User space counter reading of cycles",
> > > mmap_user_read_cycles,
> > > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > > "permissions"
> > > #else
> > > "unsupported"
> > > --
> > > 2.37.2
> > >
> >
> > Reviewed-by: Andrew Jones <[email protected]>
>
> Was the kernel part merged upstream?

No, I still haven't answered the reviews :)

>
> - Arnaldo

2023-06-05 14:48:53

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv

Em Mon, Jun 05, 2023 at 04:00:12PM +0200, Alexandre Ghiti escreveu:
> On Mon, Jun 5, 2023 at 3:53 PM Arnaldo Carvalho de Melo <[email protected]>
> wrote:
>
> > Em Wed, May 31, 2023 at 05:15:15PM +0200, Andrew Jones escreveu:
> > > On Fri, May 12, 2023 at 10:53:21AM +0200, Alexandre Ghiti wrote:
> > > > riscv now supports mmapping hardware counters to userspace so adapt the
> > > > test to run on this architecture.
> > > >
> > > > Signed-off-by: Alexandre Ghiti <[email protected]>
> > > > ---
> > > > tools/perf/tests/mmap-basic.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
> > > > index e68ca6229756..f5075ca774f8 100644
> > > > --- a/tools/perf/tests/mmap-basic.c
> > > > +++ b/tools/perf/tests/mmap-basic.c
> > > > @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
> > > > "permissions"),
> > > > TEST_CASE_REASON("User space counter reading of instructions",
> > > > mmap_user_read_instr,
> > > > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > > > "permissions"
> > > > #else
> > > > "unsupported"
> > > > @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
> > > > ),
> > > > TEST_CASE_REASON("User space counter reading of cycles",
> > > > mmap_user_read_cycles,
> > > > -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> > > > "permissions"
> > > > #else
> > > > "unsupported"
> > > > --
> > > > 2.37.2
> > > >
> > >
> > > Reviewed-by: Andrew Jones <[email protected]>
> >
> > Was the kernel part merged upstream?
> >
>
> No, I still haven't answered the reviews :)

Ok, I'll wait for that to happen before considering merging the userland
parts,

Thanks,

- Arnaldo

2023-06-15 07:32:45

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] riscv: Make legacy counter enum match the HW numbering

On 31/05/2023 16:01, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:14AM +0200, Alexandre Ghiti wrote:
>> RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
>> hardware counter from CSR_CYCLE is actually 2: make this offset match the
>> real hw offset so that we can directly expose those values to userspace.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> drivers/perf/riscv_pmu_legacy.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
>> index ca9e20bfc7ac..0d8c9d8849ee 100644
>> --- a/drivers/perf/riscv_pmu_legacy.c
>> +++ b/drivers/perf/riscv_pmu_legacy.c
>> @@ -12,8 +12,11 @@
>> #include <linux/perf/riscv_pmu.h>
>> #include <linux/platform_device.h>
>>
>> -#define RISCV_PMU_LEGACY_CYCLE 0
>> -#define RISCV_PMU_LEGACY_INSTRET 1
>> +enum {
>> + RISCV_PMU_LEGACY_CYCLE,
>> + RISCV_PMU_LEGACY_TIME,
>> + RISCV_PMU_LEGACY_INSTRET
>> +};
> I guess this doesn't hurt, since these are just indices internal to this
> driver, but it's a bit odd to also have a RISCV_PMU_LEGACY_TIME, when
> the driver is only for cycle and instret, as its Kconfig help text says.


I understand and you're right, that's weird, so I'll change that with
the following:

diff --git a/drivers/perf/riscv_pmu_legacy.c
b/drivers/perf/riscv_pmu_legacy.c
index ca9e20bfc7ac..6a000abc28bb 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -13,7 +13,7 @@
 #include <linux/platform_device.h>

 #define RISCV_PMU_LEGACY_CYCLE         0
-#define RISCV_PMU_LEGACY_INSTRET       1
+#define RISCV_PMU_LEGACY_INSTRET       2

 static bool pmu_init_done;

Thanks!
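
For the record, the numbering that makes the offset 2 comes straight from the
privileged spec's CSR layout:

  /* The unprivileged counter CSRs have consecutive numbers: */
  #define CSR_CYCLE       0xc00
  #define CSR_TIME        0xc01
  #define CSR_INSTRET     0xc02

  /* so exposing them directly requires the offset from CSR_CYCLE: */
  #define RISCV_PMU_LEGACY_CYCLE          0       /* CSR_CYCLE   - CSR_CYCLE */
  #define RISCV_PMU_LEGACY_INSTRET        2       /* CSR_INSTRET - CSR_CYCLE */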


> Thanks,
> drew
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2023-06-15 07:34:32

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] perf: Fix wrong comment about default event_idx


On 31/05/2023 15:54, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:12AM +0200, Alexandre Ghiti wrote:
>> event_idx default implementation returns 0, not idx + 1.
> The comment was correct until commit c719f56092ad ("perf: Fix and clean
> up initialization of pmu::event_idx"). I'm not sure that warrants a fixes
> tag, but maybe a reference in the commit message.


You're right, I'll add a reference, I don't think it deserves a
backport, that's just a comment.


>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> include/linux/perf_event.h | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index d5628a7b5eaa..56fe43b20966 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -442,7 +442,8 @@ struct pmu {
>>
>> /*
>> * Will return the value for perf_event_mmap_page::index for this event,
>> - * if no implementation is provided it will default to: event->hw.idx + 1.
>> + * if no implementation is provided it will default to 0 (see
>> + * perf_event_idx_default).
>> */
>> int (*event_idx) (struct perf_event *event); /*optional */
>>
>> --
>> 2.37.2
>>
> Otherwise,
>
> Reviewed-by: Andrew Jones <[email protected]>


Thanks!


2023-06-15 07:53:14

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] drivers: perf: Rename riscv pmu driver


On 31/05/2023 16:09, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:15AM +0200, Alexandre Ghiti wrote:
>> In addition to being more pretty, it will be useful in upcoming commits
>> to distinguish those pmu drivers from the other pmu drivers.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> drivers/perf/riscv_pmu_legacy.c | 2 +-
>> drivers/perf/riscv_pmu_sbi.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
>> index 0d8c9d8849ee..ffe09d857366 100644
>> --- a/drivers/perf/riscv_pmu_legacy.c
>> +++ b/drivers/perf/riscv_pmu_legacy.c
>> @@ -95,7 +95,7 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
>> pmu->ctr_clear_idx = NULL;
>> pmu->ctr_read = pmu_legacy_read_ctr;
>>
>> - perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
>> + perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
>> }
>>
>> static int pmu_legacy_device_probe(struct platform_device *pdev)
>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
>> index 70cb50fd41c2..3b0ee2148054 100644
>> --- a/drivers/perf/riscv_pmu_sbi.c
>> +++ b/drivers/perf/riscv_pmu_sbi.c
>> @@ -897,7 +897,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>> if (ret)
>> goto out_unregister;
>>
>> - ret = perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
>> + ret = perf_pmu_register(&pmu->pmu, RISCV_PMU_PDEV_NAME, PERF_TYPE_RAW);
> Should we include "sbi" in this name?


I'd say that it is safe to do so and I understand your point, @Atish WDYT?


>
>> if (ret)
>> goto out_unregister;
>>
>> --
>> 2.37.2
>>
> Otherwise,
>
> Reviewed-by: Andrew Jones <[email protected]>

2023-06-15 08:21:36

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 06/10] drivers: perf: Implement perf event mmap support in the legacy backend


On 31/05/2023 16:27, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:17AM +0200, Alexandre Ghiti wrote:
>> Implement the needed callbacks in the legacy driver so that we can
>> directly access the counters through perf in userspace.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> drivers/perf/riscv_pmu_legacy.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
>> index ffe09d857366..f0f5bd856f66 100644
>> --- a/drivers/perf/riscv_pmu_legacy.c
>> +++ b/drivers/perf/riscv_pmu_legacy.c
>> @@ -74,6 +74,31 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
>> local64_set(&hwc->prev_count, initial_val);
>> }
>>
>> +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
>> +{
>> + return event->hw.idx;
>> +}
>> +
>> +static void pmu_legacy_event_mapped(struct perf_event *event, struct mm_struct *mm)
>> +{
>> + /* In legacy mode, the first 3 CSRs are available. */
> Shouldn't this be
>
> /* In legacy mode, the first and third CSR are available. */
>
> ?


Yes, I guess this comment is not right in this context, so I'll remove
the comment entirely as it doesn't bring much.
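
If it helps, the two-event restriction could also be made self-documenting
with a small helper instead of a comment (purely illustrative, not part of
the patch):

  /*
   * The legacy backend only handles cycle and instret; the time CSR is
   * not a perf counter here and stays readable via rdtime anyway.
   */
  static bool pmu_legacy_event_supported(struct perf_event *event)
  {
          return event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
                 event->attr.config == PERF_COUNT_HW_INSTRUCTIONS;
  }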


>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
>> + return;
>> +
>> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
>> +}
>> +
>> +static void pmu_legacy_event_unmapped(struct perf_event *event, struct mm_struct *mm)
>> +{
>> + /* In legacy mode, the first 3 CSRs are available. */
> same comment
>
>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
>> + return;
>> +
>> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
>> +}
>> +
>> /*
>> * This is just a simple implementation to allow legacy implementations
>> * compatible with new RISC-V PMU driver framework.
>> @@ -94,6 +119,9 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
>> pmu->ctr_get_width = NULL;
>> pmu->ctr_clear_idx = NULL;
>> pmu->ctr_read = pmu_legacy_read_ctr;
>> + pmu->event_mapped = pmu_legacy_event_mapped;
>> + pmu->event_unmapped = pmu_legacy_event_unmapped;
>> + pmu->csr_index = pmu_legacy_csr_index;
>>
>> perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
>> }
>> --
>> 2.37.2
>>
> Otherwise,
>
> Reviewed-by: Andrew Jones <[email protected]>


Thanks!


2023-06-15 08:53:29

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] drivers: perf: Rename riscv pmu driver

On Thu, Jun 15, 2023 at 12:25 AM Alexandre Ghiti <[email protected]> wrote:
>
>
> On 31/05/2023 16:09, Andrew Jones wrote:
> > On Fri, May 12, 2023 at 10:53:15AM +0200, Alexandre Ghiti wrote:
> >> In addition to being more pretty, it will be useful in upcoming commits
> >> to distinguish those pmu drivers from the other pmu drivers.
> >>
> >> Signed-off-by: Alexandre Ghiti <[email protected]>
> >> ---
> >> drivers/perf/riscv_pmu_legacy.c | 2 +-
> >> drivers/perf/riscv_pmu_sbi.c | 2 +-
> >> 2 files changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> >> index 0d8c9d8849ee..ffe09d857366 100644
> >> --- a/drivers/perf/riscv_pmu_legacy.c
> >> +++ b/drivers/perf/riscv_pmu_legacy.c
> >> @@ -95,7 +95,7 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> >> pmu->ctr_clear_idx = NULL;
> >> pmu->ctr_read = pmu_legacy_read_ctr;
> >>
> >> - perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> >> + perf_pmu_register(&pmu->pmu, RISCV_PMU_LEGACY_PDEV_NAME, PERF_TYPE_RAW);
> >> }
> >>
> >> static int pmu_legacy_device_probe(struct platform_device *pdev)
> >> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> >> index 70cb50fd41c2..3b0ee2148054 100644
> >> --- a/drivers/perf/riscv_pmu_sbi.c
> >> +++ b/drivers/perf/riscv_pmu_sbi.c
> >> @@ -897,7 +897,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> >> if (ret)
> >> goto out_unregister;
> >>
> >> - ret = perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> >> + ret = perf_pmu_register(&pmu->pmu, RISCV_PMU_PDEV_NAME, PERF_TYPE_RAW);
> > Should we include "sbi" in this name?
>
>
> I'd say that it is safe to do so and I understand your point, @Atish WDYT?
>

Actually, the argument in perf_pmu_register is about the pmu instance
name rather than the driver name.
Both legacy & SBI PMU drivers are just ways to access the "cpu" pmu instance.

In the future we may have separate drivers for the counter delegation
extensions that won't use the SBI PMU extension
at all on supported hardware. However, the PMU would still be the cpu PMU.

There will be different SoC PMU drivers with different names because
they will represent an SoC PMU instead of the cpu PMU.
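
To make the distinction concrete, a sketch (the helper wrapper is
hypothetical; only the second argument to perf_pmu_register() is the point):

  /*
   * The platform-driver name says which backend binds to the device;
   * that is what the rename in this patch is really about.
   */
  static struct platform_driver pmu_sbi_driver = {
          .driver = {
                  .name = RISCV_PMU_PDEV_NAME,    /* backend/driver name */
          },
  };

  /*
   * The name passed to perf_pmu_register() is the PMU *instance* that
   * perf tooling and sysfs expose; both backends drive the same logical
   * cpu PMU, so it would stay "cpu":
   */
  static int pmu_sbi_register(struct riscv_pmu *pmu)
  {
          return perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
  }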

>
> >
> >> if (ret)
> >> goto out_unregister;
> >>
> >> --
> >> 2.37.2
> >>
> > Otherwise,
> >
> > Reviewed-by: Andrew Jones <[email protected]>



--
Regards,
Atish

2023-06-15 08:54:56

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] include: riscv: Fix wrong include guard in riscv_pmu.h

On Wed, May 31, 2023 at 6:56 AM Andrew Jones <[email protected]> wrote:
>
> On Fri, May 12, 2023 at 10:53:13AM +0200, Alexandre Ghiti wrote:
> > The current include guard prevents the inclusion of asm/perf_event.h
> > which uses the same include guard: fix the one in riscv_pmu.h so that it
> > matches the file name.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > Reviewed-by: Conor Dooley <[email protected]>
> > ---
> > include/linux/perf/riscv_pmu.h | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > index 43fc892aa7d9..9f70d94942e0 100644
> > --- a/include/linux/perf/riscv_pmu.h
> > +++ b/include/linux/perf/riscv_pmu.h
> > @@ -6,8 +6,8 @@
> > *
> > */
> >
> > -#ifndef _ASM_RISCV_PERF_EVENT_H
> > -#define _ASM_RISCV_PERF_EVENT_H
> > +#ifndef _RISCV_PMU_H
> > +#define _RISCV_PMU_H
> >
> > #include <linux/perf_event.h>
> > #include <linux/ptrace.h>
> > @@ -81,4 +81,4 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr);
> >
> > #endif /* CONFIG_RISCV_PMU */
> >
> > -#endif /* _ASM_RISCV_PERF_EVENT_H */
> > +#endif /* _RISCV_PMU_H */
> > --
> > 2.37.2
> >
>
> Reviewed-by: Andrew Jones <[email protected]>


Reviewed-by: Atish Patra <[email protected]>

--
Regards,
Atish

2023-06-15 09:03:54

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] perf: Fix wrong comment about default event_idx

On Thu, Jun 15, 2023 at 12:10 AM Alexandre Ghiti <[email protected]> wrote:
>
>
> On 31/05/2023 15:54, Andrew Jones wrote:
> > On Fri, May 12, 2023 at 10:53:12AM +0200, Alexandre Ghiti wrote:
> >> event_idx default implementation returns 0, not idx + 1.
> > The comment was correct until commit c719f56092ad ("perf: Fix and clean
> > up initialization of pmu::event_idx"). I'm not sure that warrants a fixes
> > tag, but maybe a reference in the commit message.
>
>
> You're right, I'll add a reference, I don't think it deserves a
> backport, that's just a comment.
>
>

With that done:

Reviewed-by: Atish Patra <[email protected]>

> >> Signed-off-by: Alexandre Ghiti <[email protected]>
> >> ---
> >> include/linux/perf_event.h | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index d5628a7b5eaa..56fe43b20966 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -442,7 +442,8 @@ struct pmu {
> >>
> >> /*
> >> * Will return the value for perf_event_mmap_page::index for this event,
> >> - * if no implementation is provided it will default to: event->hw.idx + 1.
> >> + * if no implementation is provided it will default to 0 (see
> >> + * perf_event_idx_default).
> >> */
> >> int (*event_idx) (struct perf_event *event); /*optional */
> >>
> >> --
> >> 2.37.2
> >>
> > Otherwise,
> >
> > Reviewed-by: Andrew Jones <[email protected]>
>
>
> Thanks!
>


--
Regards,
Atish

2023-06-15 09:29:16

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend

On Wed, May 31, 2023 at 8:02 AM Andrew Jones <[email protected]> wrote:
>
> On Fri, May 12, 2023 at 10:53:18AM +0200, Alexandre Ghiti wrote:
> > We used to unconditionally expose the cycle and instret csrs to
> > userspace, which gives rise to security concerns.
> >
> > So now we only allow access to hw counters from userspace through the perf
> > framework which will handle context switches, per-task events...etc. But
> > as we cannot break userspace, we give the user the choice to go back to
> > the previous behaviour by setting the sysctl perf_user_access.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > arch/riscv/kernel/perf_event.c | 18 ++-
> > drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
> > 2 files changed, 205 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> > index 94174a0fc251..3af9ca45b43f 100644
> > --- a/arch/riscv/kernel/perf_event.c
> > +++ b/arch/riscv/kernel/perf_event.c
> > @@ -1,9 +1,13 @@
> > // SPDX-License-Identifier: GPL-2.0-only
> > +#include <linux/perf/riscv_pmu.h>
> > #include <linux/sched_clock.h>
> >
> > void arch_perf_update_userpage(struct perf_event *event,
> > struct perf_event_mmap_page *userpg, u64 now)
> > {
> > +#ifdef CONFIG_RISCV_PMU_SBI
> > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +#endif
>
> Can avoid this pair of #ifdef/#endif's by just declaring rvpmu inside the
> if block below where it's needed. Or even just using to_riscv_pmu()
> directly in place of rvpmu.
>
> > struct clock_read_data *rd;
> > unsigned int seq;
> > u64 ns;
> > @@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
> > userpg->cap_user_rdpmc =
> > !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> >
> > - userpg->pmc_width = 64;
> > +#ifdef CONFIG_RISCV_PMU_SBI
> > + /*
> > + * The counters are 64-bit but the priv spec doesn't mandate all the
> > + * bits to be implemented: that's why, counter width can vary based on
> > + * the cpu vendor.
> > + */
> > + if (event->pmu->name &&
> > + !strncmp(event->pmu->name,
> > + RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
> > + userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> > + else
> > +#endif
> > + userpg->pmc_width = 64;
>
> Can leave the initialization to 64 above the #ifdef CONFIG_RISCV_PMU_SBI
> as is and drop the else.
>
> >
> > do {
> > rd = sched_clock_read_begin(&seq);
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 3b0ee2148054..d9bcc5cc6df5 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -24,6 +24,14 @@
> > #include <asm/sbi.h>
> > #include <asm/hwcap.h>
> >
> > +#define SYSCTL_NO_USER_ACCESS 0
> > +#define SYSCTL_USER_ACCESS 1
> > +#define SYSCTL_LEGACY 2
> > +
> > +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
> > +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
> > +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
> > +
> > PMU_FORMAT_ATTR(event, "config:0-47");
> > PMU_FORMAT_ATTR(firmware, "config:63");
> >
> > @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> > NULL,
> > };
> >
> > +/* Allow legacy access by default */
> > +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> > +
>
> I'm still not in favor of this. Hopefully the distro discussions result in
> it being changed.
>

I did not see any feedback from distro guys. I talked to David (fedora
maintainer) and he is even okay with
SYSCTL_NO_USER_ACCESS :). I would love to hear back from others (cc'd
a few distro folks to this thread).

> > /*
> > * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> > * per_cpu in case of harts with different pmu counters
> > @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> > }
> > EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
> >
> > +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> > +{
> > + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> > +}
> > +
> > static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> > {
> > unsigned long cflags = 0;
> > @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> > struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> > struct sbiret ret;
> > int idx;
> > - uint64_t cbase = 0;
> > + uint64_t cbase = 0, cmask = rvpmu->cmask;
> > unsigned long cflags = 0;
> >
> > cflags = pmu_sbi_get_filter_flags(event);
> > +
> > + /*
> > + * In legacy mode, we have to force the fixed counters for those events
> > + * but not in the user access mode as we want to use the other counters
> > + * that support sampling/filtering.
> > + */
> > + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> > + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > + cmask = 1;
> > + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> > + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> > + }
> > + }
> > +
> > /* retrieve the available counter index */
> > #if defined(CONFIG_32BIT)
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> > + cmask, cflags, hwc->event_base, hwc->config,
> > hwc->config >> 32);
> > #else
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> > + cmask, cflags, hwc->event_base, hwc->config, 0);
> > #endif
> > if (ret.error) {
> > pr_debug("Not able to find a counter for event %lx config %llx\n",
> > @@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > return val;
> > }
> >
> > +static void pmu_sbi_set_scounteren(void *arg)
> > +{
> > + struct perf_event *event = (struct perf_event *)arg;
> > +
> > + csr_write(CSR_SCOUNTEREN,
> > + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> > +}
> > +
> > static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > {
> > struct sbiret ret;
> > @@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> > pr_err("Starting counter idx %d failed with error %d\n",
> > hwc->idx, sbi_err_map_linux_errno(ret.error));
> > +
> > + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> > + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > + pmu_sbi_set_scounteren((void *)event);
> > +}
> > +
> > +static void pmu_sbi_reset_scounteren(void *arg)
> > +{
> > + struct perf_event *event = (struct perf_event *)arg;
> > +
> > + csr_write(CSR_SCOUNTEREN,
> > + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> > }
> >
> > static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > struct sbiret ret;
> > struct hw_perf_event *hwc = &event->hw;
> >
> > + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> > + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > + pmu_sbi_reset_scounteren((void *)event);
> > +
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> > if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> > flag != SBI_PMU_STOP_FLAG_RESET)
> > @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> > struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> >
> > /*
> > - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> > - * as is necessary to maintain uABI compatibility.
> > + * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> > + * legacy option but that will be removed in the future.
> > */
> > - csr_write(CSR_SCOUNTEREN, 0x7);
> > + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > + csr_write(CSR_SCOUNTEREN, 0x7);
> > + else
> > + csr_write(CSR_SCOUNTEREN, 0x2);
> >
> > /* Stop all the counters so that they can be enabled from perf */
> > pmu_sbi_stop_all(pmu);
> > @@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> > cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > }
> >
> > +static void pmu_sbi_event_init(struct perf_event *event)
> > +{
> > + /*
> > + * The permissions are set at event_init so that we do not depend
> > + * on the sysctl value that can change.
> > + */
> > + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> > + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
> > + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
> > + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
> > + else
> > + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
> > +}
> > +
> > +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> > + return;
> > +
> > + /* In legacy mode, the first 3 CSRs are available. */
>
> first and third
>
> > + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * The user mmapped the event to directly access it: this is where
> > + * we determine based on sysctl_perf_user_access if we grant userspace
> > + * the direct access to this event. That means that within the same
> > + * task, some events may be directly accessible and some other may not,
> > + * if the user changes the value of sysctl_perf_user_accesss in the
> > + * meantime.
> > + */
> > +
> > + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> > +
> > + /*
> > + * We must enable userspace access *before* advertising in the user page
> > + * that it is possible to do so to avoid any race.
> > + * And we must notify all cpus here because threads that currently run
> > + * on other cpus will try to directly access the counter too without
> > + * calling pmu_sbi_ctr_start.
> > + */
> > + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> > + on_each_cpu_mask(mm_cpumask(mm),
> > + pmu_sbi_set_scounteren, (void *)event, 1);
> > +}
> > +
> > +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> > + return;
> > +
> > + /* In legacy mode, the first 3 CSRs are available. */
>
> first and third
>
> > + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * Here we can directly remove user access since the user does not have
> > + * access to the user page anymore so we avoid the racy window where the
> > + * user could have read cap_user_rdpmc to true right before we disable
> > + * it.
> > + */
> > + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> > +
> > + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> > + on_each_cpu_mask(mm_cpumask(mm),
> > + pmu_sbi_reset_scounteren, (void *)event, 1);
> > +}
> > +
> > +static void riscv_pmu_update_counter_access(void *info)
> > +{
> > + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > + csr_write(CSR_SCOUNTEREN, 0x7);
> > + else
> > + csr_write(CSR_SCOUNTEREN, 0x2);
> > +}
> > +
> > +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> > + int write, void *buffer,
> > + size_t *lenp, loff_t *ppos)
> > +{
> > + int prev = sysctl_perf_user_access;
> > + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > +
> > + /*
> > + * Test against the previous value since we clear SCOUNTEREN when
> > + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> > + * not do that if that was already the case.
> > + */
> > + if (ret || !write || prev == sysctl_perf_user_access)
> > + return ret;
> > +
> > + on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
>
> Instead of passing prev shouldn't we pass NULL, as it's not used?
>
> > +
> > + return 0;
> > +}
> > +
> > +static struct ctl_table sbi_pmu_sysctl_table[] = {
> > + {
> > + .procname = "perf_user_access",
> > + .data = &sysctl_perf_user_access,
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = riscv_pmu_proc_user_access_handler,
> > + .extra1 = SYSCTL_ZERO,
> > + .extra2 = SYSCTL_TWO,
> > + },
> > + { }
> > +};
> > +
> > static int pmu_sbi_device_probe(struct platform_device *pdev)
> > {
> > struct riscv_pmu *pmu = NULL;
> > @@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> > pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> > pmu->ctr_read = pmu_sbi_ctr_read;
> > + pmu->event_init = pmu_sbi_event_init;
> > + pmu->event_mapped = pmu_sbi_event_mapped;
> > + pmu->event_unmapped = pmu_sbi_event_unmapped;
> > + pmu->csr_index = pmu_sbi_csr_index;
> >
> > ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > if (ret)
> > @@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > if (ret)
> > goto out_unregister;
> >
> > + register_sysctl("kernel", sbi_pmu_sysctl_table);
> > +
> > return 0;
> >
> > out_unregister:
> > --
> > 2.37.2
> >
>
> Thanks,
> drew



--
Regards,
Atish

2023-06-15 10:15:37

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend


On 31/05/2023 17:02, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:18AM +0200, Alexandre Ghiti wrote:
>> We used to unconditionally expose the cycle and instret csrs to
>> userspace, which gives rise to security concerns.
>>
>> So now we only allow access to hw counters from userspace through the perf
>> framework which will handle context switches, per-task events...etc. But
>> as we cannot break userspace, we give the user the choice to go back to
>> the previous behaviour by setting the sysctl perf_user_access.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> arch/riscv/kernel/perf_event.c | 18 ++-
>> drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
>> 2 files changed, 205 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
>> index 94174a0fc251..3af9ca45b43f 100644
>> --- a/arch/riscv/kernel/perf_event.c
>> +++ b/arch/riscv/kernel/perf_event.c
>> @@ -1,9 +1,13 @@
>> // SPDX-License-Identifier: GPL-2.0-only
>> +#include <linux/perf/riscv_pmu.h>
>> #include <linux/sched_clock.h>
>>
>> void arch_perf_update_userpage(struct perf_event *event,
>> struct perf_event_mmap_page *userpg, u64 now)
>> {
>> +#ifdef CONFIG_RISCV_PMU_SBI
>> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
>> +#endif
> Can avoid this pair of #ifdef/#endif's by just declaring rvpmu inside the
> if block below where it's needed. Or even just using to_riscv_pmu()
> directly in place of rvpmu.


Great, thanks


>
>> struct clock_read_data *rd;
>> unsigned int seq;
>> u64 ns;
>> @@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
>> userpg->cap_user_rdpmc =
>> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>>
>> - userpg->pmc_width = 64;
>> +#ifdef CONFIG_RISCV_PMU_SBI
>> + /*
>> + * The counters are 64-bit but the priv spec doesn't mandate all the
>> + * bits to be implemented: that's why, counter width can vary based on
>> + * the cpu vendor.
>> + */
>> + if (event->pmu->name &&
>> + !strncmp(event->pmu->name,
>> + RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
>> + userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
>> + else
>> +#endif
>> + userpg->pmc_width = 64;
> Can leave the initialization to 64 above the #ifdef CONFIG_RISCV_PMU_SBI
> as is and drop the else.


Fine by me, thanks
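
Putting both suggestions together, the hunk could end up roughly as below
(a sketch of the reworked logic, not the final patch):

  userpg->pmc_width = 64;

  #ifdef CONFIG_RISCV_PMU_SBI
          /*
           * The counters are 64-bit but the priv spec doesn't mandate all
           * the bits to be implemented, so ask the riscv-pmu backend for
           * the real width when the event belongs to it.
           */
          if (event->pmu->name &&
              !strncmp(event->pmu->name, RISCV_PMU_PDEV_NAME,
                       sizeof(RISCV_PMU_PDEV_NAME)))
                  userpg->pmc_width =
                          to_riscv_pmu(event->pmu)->ctr_get_width(event->hw.idx) + 1;
  #endif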


>
>>
>> do {
>> rd = sched_clock_read_begin(&seq);
>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
>> index 3b0ee2148054..d9bcc5cc6df5 100644
>> --- a/drivers/perf/riscv_pmu_sbi.c
>> +++ b/drivers/perf/riscv_pmu_sbi.c
>> @@ -24,6 +24,14 @@
>> #include <asm/sbi.h>
>> #include <asm/hwcap.h>
>>
>> +#define SYSCTL_NO_USER_ACCESS 0
>> +#define SYSCTL_USER_ACCESS 1
>> +#define SYSCTL_LEGACY 2
>> +
>> +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
>> +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
>> +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
>> +
>> PMU_FORMAT_ATTR(event, "config:0-47");
>> PMU_FORMAT_ATTR(firmware, "config:63");
>>
>> @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
>> NULL,
>> };
>>
>> +/* Allow legacy access by default */
>> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
>> +
> I'm still not in favor of this. Hopefully the distro discussions result in
> it being changed.


No progress on this front...


>> /*
>> * RISC-V doesn't have heterogeneous harts yet. This need to be part of
>> * per_cpu in case of harts with different pmu counters
>> @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
>> }
>> EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>>
>> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
>> +{
>> + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
>> +}
>> +
>> static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
>> {
>> unsigned long cflags = 0;
>> @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
>> struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
>> struct sbiret ret;
>> int idx;
>> - uint64_t cbase = 0;
>> + uint64_t cbase = 0, cmask = rvpmu->cmask;
>> unsigned long cflags = 0;
>>
>> cflags = pmu_sbi_get_filter_flags(event);
>> +
>> + /*
>> + * In legacy mode, we have to force the fixed counters for those events
>> + * but not in the user access mode as we want to use the other counters
>> + * that support sampling/filtering.
>> + */
>> + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
>> + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>> + cmask = 1;
>> + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>> + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
>> + }
>> + }
>> +
>> /* retrieve the available counter index */
>> #if defined(CONFIG_32BIT)
>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
>> + cmask, cflags, hwc->event_base, hwc->config,
>> hwc->config >> 32);
>> #else
>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
>> + cmask, cflags, hwc->event_base, hwc->config, 0);
>> #endif
>> if (ret.error) {
>> pr_debug("Not able to find a counter for event %lx config %llx\n",
>> @@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
>> return val;
>> }
>>
>> +static void pmu_sbi_set_scounteren(void *arg)
>> +{
>> + struct perf_event *event = (struct perf_event *)arg;
>> +
>> + csr_write(CSR_SCOUNTEREN,
>> + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
>> +}
>> +
>> static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>> {
>> struct sbiret ret;
>> @@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
>> pr_err("Starting counter idx %d failed with error %d\n",
>> hwc->idx, sbi_err_map_linux_errno(ret.error));
>> +
>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
>> + pmu_sbi_set_scounteren((void *)event);
>> +}
>> +
>> +static void pmu_sbi_reset_scounteren(void *arg)
>> +{
>> + struct perf_event *event = (struct perf_event *)arg;
>> +
>> + csr_write(CSR_SCOUNTEREN,
>> + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
>> }
>>
>> static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>> @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>> struct sbiret ret;
>> struct hw_perf_event *hwc = &event->hw;
>>
>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
>> + pmu_sbi_reset_scounteren((void *)event);
>> +
>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
>> flag != SBI_PMU_STOP_FLAG_RESET)
>> @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
>> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>>
>> /*
>> - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
>> - * as is necessary to maintain uABI compatibility.
>> + * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
>> + * legacy option but that will be removed in the future.
>> */
>> - csr_write(CSR_SCOUNTEREN, 0x7);
>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
>> + csr_write(CSR_SCOUNTEREN, 0x7);
>> + else
>> + csr_write(CSR_SCOUNTEREN, 0x2);
>>
>> /* Stop all the counters so that they can be enabled from perf */
>> pmu_sbi_stop_all(pmu);
>> @@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
>> cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>> }
>>
>> +static void pmu_sbi_event_init(struct perf_event *event)
>> +{
>> + /*
>> + * The permissions are set at event_init so that we do not depend
>> + * on the sysctl value that can change.
>> + */
>> + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
>> + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
>> + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
>> + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
>> + else
>> + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
>> +}
>> +
>> +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
>> +{
>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
>> + return;
>> +
>> + /* In legacy mode, the first 3 CSRs are available. */
> first and third


Same comment as before, I'll remove this comment.


>
>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
>> + return;
>> + }
>> + }
>> +
>> + /*
>> + * The user mmapped the event to directly access it: this is where
>> + * we determine based on sysctl_perf_user_access if we grant userspace
>> + * the direct access to this event. That means that within the same
>> + * task, some events may be directly accessible and some other may not,
>> + * if the user changes the value of sysctl_perf_user_accesss in the
>> + * meantime.
>> + */
>> +
>> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
>> +
>> + /*
>> + * We must enable userspace access *before* advertising in the user page
>> + * that it is possible to do so to avoid any race.
>> + * And we must notify all cpus here because threads that currently run
>> + * on other cpus will try to directly access the counter too without
>> + * calling pmu_sbi_ctr_start.
>> + */
>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
>> + on_each_cpu_mask(mm_cpumask(mm),
>> + pmu_sbi_set_scounteren, (void *)event, 1);
>> +}
>> +
>> +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
>> +{
>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
>> + return;
>> +
>> + /* In legacy mode, the first 3 CSRs are available. */
> first and third
>
>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
>> + return;
>> + }
>> + }
>> +
>> + /*
>> + * Here we can directly remove user access since the user does not have
>> + * access to the user page anymore so we avoid the racy window where the
>> + * user could have read cap_user_rdpmc to true right before we disable
>> + * it.
>> + */
>> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
>> +
>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
>> + on_each_cpu_mask(mm_cpumask(mm),
>> + pmu_sbi_reset_scounteren, (void *)event, 1);
>> +}
>> +
>> +static void riscv_pmu_update_counter_access(void *info)
>> +{
>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
>> + csr_write(CSR_SCOUNTEREN, 0x7);
>> + else
>> + csr_write(CSR_SCOUNTEREN, 0x2);
>> +}
>> +
>> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
>> + int write, void *buffer,
>> + size_t *lenp, loff_t *ppos)
>> +{
>> + int prev = sysctl_perf_user_access;
>> + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +
>> + /*
>> + * Test against the previous value since we clear SCOUNTEREN when
>> + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
>> + * not do that if that was already the case.
>> + */
>> + if (ret || !write || prev == sysctl_perf_user_access)
>> + return ret;
>> +
>> + on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
> Instead of passing prev shouldn't we pass NULL, as it's not used?


Yes, that's a leftover from a previous version, thanks!
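
i.e. the call then simply becomes:

  on_each_cpu(riscv_pmu_update_counter_access, NULL, 1);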


>
>> +
>> + return 0;
>> +}
>> +
>> +static struct ctl_table sbi_pmu_sysctl_table[] = {
>> + {
>> + .procname = "perf_user_access",
>> + .data = &sysctl_perf_user_access,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = riscv_pmu_proc_user_access_handler,
>> + .extra1 = SYSCTL_ZERO,
>> + .extra2 = SYSCTL_TWO,
>> + },
>> + { }
>> +};
>> +
>> static int pmu_sbi_device_probe(struct platform_device *pdev)
>> {
>> struct riscv_pmu *pmu = NULL;
>> @@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>> pmu->ctr_get_width = pmu_sbi_ctr_get_width;
>> pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
>> pmu->ctr_read = pmu_sbi_ctr_read;
>> + pmu->event_init = pmu_sbi_event_init;
>> + pmu->event_mapped = pmu_sbi_event_mapped;
>> + pmu->event_unmapped = pmu_sbi_event_unmapped;
>> + pmu->csr_index = pmu_sbi_csr_index;
>>
>> ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>> if (ret)
>> @@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>> if (ret)
>> goto out_unregister;
>>
>> + register_sysctl("kernel", sbi_pmu_sysctl_table);
>> +
>> return 0;
>>
>> out_unregister:
>> --
>> 2.37.2
>>
> Thanks,
> drew

2023-06-15 10:27:52

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] perf: tests: Adapt mmap-basic.c for riscv


On 31/05/2023 17:15, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:21AM +0200, Alexandre Ghiti wrote:
>> riscv now supports mmapping hardware counters to userspace so adapt the test
>> to run on this architecture.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> tools/perf/tests/mmap-basic.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
>> index e68ca6229756..f5075ca774f8 100644
>> --- a/tools/perf/tests/mmap-basic.c
>> +++ b/tools/perf/tests/mmap-basic.c
>> @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
>> "permissions"),
>> TEST_CASE_REASON("User space counter reading of instructions",
>> mmap_user_read_instr,
>> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
>> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
>> "permissions"
>> #else
>> "unsupported"
>> @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
>> ),
>> TEST_CASE_REASON("User space counter reading of cycles",
>> mmap_user_read_cycles,
>> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
>> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
>> "permissions"
>> #else
>> "unsupported"
>> --
>> 2.37.2
>>
> Reviewed-by: Andrew Jones <[email protected]>


Thanks for your review, Andrew; helpful as usual.

And sorry for the delay!

Alex


>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2023-06-15 10:28:00

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] Documentation: admin-guide: Add riscv sysctl_perf_user_access


On 31/05/2023 17:07, Andrew Jones wrote:
> On Fri, May 12, 2023 at 10:53:19AM +0200, Alexandre Ghiti wrote:
>> riscv now uses this sysctl so document its usage for this architecture.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> Documentation/admin-guide/sysctl/kernel.rst | 24 +++++++++++++++++----
>> 1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>> index 4b7bfea28cd7..93cd518ca94b 100644
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -941,16 +941,32 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
>> The default value is 8.
>>
>>
>> -perf_user_access (arm64 only)
>> -=================================
>> +perf_user_access (arm64 and riscv only)
>> +=======================================
>> +
>> +Controls user space access for reading perf event counters.
>>
>> -Controls user space access for reading perf event counters. When set to 1,
>> -user space can read performance monitor counter registers directly.
>> +arm64
>> +=====
>>
>> The default value is 0 (access disabled).
>> +When set to 1, user space can read performance monitor counter registers
>> +directly.
>>
>> See Documentation/arm64/perf.rst for more information.
>>
>> +riscv
>> +=====
>> +
>> +When set to 0, user access is disabled.
>> +
>> +When set to 1, user space can read performance monitor counter registers
>> +directly, but only through perf; any direct access without perf intervention
>> +will trigger an illegal instruction.
>> +
>> +The default value is 2, which enables legacy mode (user space has direct
>> +access to cycle, time and instret CSRs only). Note that this legacy value
>> +is deprecated and will be removed once all userspace applications are fixed.
> All modes can access the time CSR so I'm not sure if it should be pointed
> out here as if it's an exception. Maybe we shouldn't point it out at all
> or we should point it out for all three?


Ok, I removed the reference to the time CSR for the legacy mode and
instead added a note stating that this CSR is always accessible regardless
of the mode.
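
For reference, that note boils down to the SCOUNTEREN bit assignment from the
privileged spec, which the driver's 0x7/0x2 constants encode:

  /* scounteren bit assignment (RISC-V privileged spec): */
  #define SCOUNTEREN_CY   (1 << 0)        /* cycle   */
  #define SCOUNTEREN_TM   (1 << 1)        /* time    */
  #define SCOUNTEREN_IR   (1 << 2)        /* instret */

  /*
   * Legacy mode writes 0x7 (CY | TM | IR); every other mode writes 0x2
   * (TM only), so rdtime keeps working whatever the sysctl says.
   */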

Thanks!


>
> Thanks,
> drew

2023-06-15 13:52:05

by Heinrich Schuchardt

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend

On 6/15/23 10:41, Atish Patra wrote:
> On Wed, May 31, 2023 at 8:02 AM Andrew Jones <[email protected]> wrote:
>>
>> On Fri, May 12, 2023 at 10:53:18AM +0200, Alexandre Ghiti wrote:
>>> We used to unconditionally expose the cycle and instret csrs to
>>> userspace, which gives rise to security concerns.
>>>
>>> So now we only allow access to hw counters from userspace through the perf
>>> framework which will handle context switches, per-task events...etc. But
>>> as we cannot break userspace, we give the user the choice to go back to
>>> the previous behaviour by setting the sysctl perf_user_access.
>>>
>>> Signed-off-by: Alexandre Ghiti <[email protected]>
>>> ---
>>> arch/riscv/kernel/perf_event.c | 18 ++-
>>> drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
>>> 2 files changed, 205 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
>>> index 94174a0fc251..3af9ca45b43f 100644
>>> --- a/arch/riscv/kernel/perf_event.c
>>> +++ b/arch/riscv/kernel/perf_event.c
>>> @@ -1,9 +1,13 @@
>>> // SPDX-License-Identifier: GPL-2.0-only
>>> +#include <linux/perf/riscv_pmu.h>
>>> #include <linux/sched_clock.h>
>>>
>>> void arch_perf_update_userpage(struct perf_event *event,
>>> struct perf_event_mmap_page *userpg, u64 now)
>>> {
>>> +#ifdef CONFIG_RISCV_PMU_SBI
>>> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
>>> +#endif
>>
>> Can avoid this pair of #ifdef/#endif's by just declaring rvpmu inside the
>> if block below where it's needed. Or even just using to_riscv_pmu()
>> directly in place of rvpmu.
>>
>>> struct clock_read_data *rd;
>>> unsigned int seq;
>>> u64 ns;
>>> @@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
>>> userpg->cap_user_rdpmc =
>>> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>>>
>>> - userpg->pmc_width = 64;
>>> +#ifdef CONFIG_RISCV_PMU_SBI
>>> + /*
>>> + * The counters are 64-bit but the priv spec doesn't mandate all the
>>> + * bits to be implemented: that's why, counter width can vary based on
>>> + * the cpu vendor.
>>> + */
>>> + if (event->pmu->name &&
>>> + !strncmp(event->pmu->name,
>>> + RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
>>> + userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
>>> + else
>>> +#endif
>>> + userpg->pmc_width = 64;
>>
>> Can leave the initialization to 64 above the #ifdef CONFIG_RISCV_PMU_SBI
>> as is and drop the else.
>>
>>>
>>> do {
>>> rd = sched_clock_read_begin(&seq);
>>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
>>> index 3b0ee2148054..d9bcc5cc6df5 100644
>>> --- a/drivers/perf/riscv_pmu_sbi.c
>>> +++ b/drivers/perf/riscv_pmu_sbi.c
>>> @@ -24,6 +24,14 @@
>>> #include <asm/sbi.h>
>>> #include <asm/hwcap.h>
>>>
>>> +#define SYSCTL_NO_USER_ACCESS 0
>>> +#define SYSCTL_USER_ACCESS 1
>>> +#define SYSCTL_LEGACY 2
>>> +
>>> +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
>>> +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
>>> +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
>>> +
>>> PMU_FORMAT_ATTR(event, "config:0-47");
>>> PMU_FORMAT_ATTR(firmware, "config:63");
>>>
>>> @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
>>> NULL,
>>> };
>>>
>>> +/* Allow legacy access by default */
>>> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
>>> +
>>
>> I'm still not in favor of this. Hopefully the distro discussions result in
>> it being changed.
>>
>
> I did not see any feedback from distro guys. I talked to David (fedora
> maintainer) and he is even okay with
> SYSCTL_NO_USER_ACCESS :). I would love to hear back from others (cc'd
> a few distro folks to this thread).

In future we will have to support scenarios for confidential compute.
Here we have to avoid timing attacks against virtual machines. If access
to the cycle register is critical, this implies that hiding it behind a
sysctl setting is not good enough.

If we keep this sysctl-based access at all, it should be behind a
Kconfig setting that is disabled by default and whose help text
clearly indicates the security implications.
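
Concretely, such a gate could look like the following sketch (the
Kconfig symbol below is hypothetical, not something this series adds):

    #ifdef CONFIG_RISCV_PMU_LEGACY_USER_ACCESS
    static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
    #else
    static int sysctl_perf_user_access __read_mostly = SYSCTL_NO_USER_ACCESS;
    #endif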

Best regards

Heinrich

>
>>> /*
>>> * RISC-V doesn't have heterogeneous harts yet. This need to be part of
>>> * per_cpu in case of harts with different pmu counters
>>> @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
>>> }
>>> EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>>>
>>> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
>>> +{
>>> + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
>>> +}
>>> +
>>> static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
>>> {
>>> unsigned long cflags = 0;
>>> @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
>>> struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
>>> struct sbiret ret;
>>> int idx;
>>> - uint64_t cbase = 0;
>>> + uint64_t cbase = 0, cmask = rvpmu->cmask;
>>> unsigned long cflags = 0;
>>>
>>> cflags = pmu_sbi_get_filter_flags(event);
>>> +
>>> + /*
>>> + * In legacy mode, we have to force the fixed counters for those events
>>> + * but not in the user access mode as we want to use the other counters
>>> + * that support sampling/filtering.
>>> + */
>>> + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
>>> + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
>>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>>> + cmask = 1;
>>> + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
>>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>>> + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
>>> + }
>>> + }
>>> +
>>> /* retrieve the available counter index */
>>> #if defined(CONFIG_32BIT)
>>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
>>> + cmask, cflags, hwc->event_base, hwc->config,
>>> hwc->config >> 32);
>>> #else
>>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
>>> + cmask, cflags, hwc->event_base, hwc->config, 0);
>>> #endif
>>> if (ret.error) {
>>> pr_debug("Not able to find a counter for event %lx config %llx\n",
>>> @@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
>>> return val;
>>> }
>>>
>>> +static void pmu_sbi_set_scounteren(void *arg)
>>> +{
>>> + struct perf_event *event = (struct perf_event *)arg;
>>> +
>>> + csr_write(CSR_SCOUNTEREN,
>>> + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
>>> +}
>>> +
>>> static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>>> {
>>> struct sbiret ret;
>>> @@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
>>> pr_err("Starting counter idx %d failed with error %d\n",
>>> hwc->idx, sbi_err_map_linux_errno(ret.error));
>>> +
>>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
>>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
>>> + pmu_sbi_set_scounteren((void *)event);
>>> +}
>>> +
>>> +static void pmu_sbi_reset_scounteren(void *arg)
>>> +{
>>> + struct perf_event *event = (struct perf_event *)arg;
>>> +
>>> + csr_write(CSR_SCOUNTEREN,
>>> + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
>>> }
>>>
>>> static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>>> @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>>> struct sbiret ret;
>>> struct hw_perf_event *hwc = &event->hw;
>>>
>>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
>>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
>>> + pmu_sbi_reset_scounteren((void *)event);
>>> +
>>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
>>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
>>> flag != SBI_PMU_STOP_FLAG_RESET)
>>> @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
>>> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>>>
>>> /*
>>> - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
>>> - * as is necessary to maintain uABI compatibility.
>>> + * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
>>> + * legacy option but that will be removed in the future.
>>> */
>>> - csr_write(CSR_SCOUNTEREN, 0x7);
>>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
>>> + csr_write(CSR_SCOUNTEREN, 0x7);
>>> + else
>>> + csr_write(CSR_SCOUNTEREN, 0x2);
>>>
>>> /* Stop all the counters so that they can be enabled from perf */
>>> pmu_sbi_stop_all(pmu);
>>> @@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
>>> cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>>> }
>>>
>>> +static void pmu_sbi_event_init(struct perf_event *event)
>>> +{
>>> + /*
>>> + * The permissions are set at event_init so that we do not depend
>>> + * on the sysctl value that can change.
>>> + */
>>> + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
>>> + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
>>> + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
>>> + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
>>> + else
>>> + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
>>> +}
>>> +
>>> +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
>>> +{
>>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
>>> + return;
>>> +
>>> + /* In legacy mode, the first 3 CSRs are available. */
>>
>> first and third
>>
>>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
>>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
>>> + return;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * The user mmapped the event to directly access it: this is where
>>> + * we determine based on sysctl_perf_user_access if we grant userspace
>>> + * the direct access to this event. That means that within the same
>>> + * task, some events may be directly accessible and some others may not,
>>> + * if the user changes the value of sysctl_perf_user_access in the
>>> + * meantime.
>>> + */
>>> +
>>> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
>>> +
>>> + /*
>>> + * We must enable userspace access *before* advertising in the user page
>>> + * that it is possible to do so to avoid any race.
>>> + * And we must notify all cpus here because threads that currently run
>>> + * on other cpus will try to directly access the counter too without
>>> + * calling pmu_sbi_ctr_start.
>>> + */
>>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
>>> + on_each_cpu_mask(mm_cpumask(mm),
>>> + pmu_sbi_set_scounteren, (void *)event, 1);
>>> +}
>>> +
>>> +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
>>> +{
>>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
>>> + return;
>>> +
>>> + /* In legacy mode, the first 3 CSRs are available. */
>>
>> first and third
>>
>>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
>>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
>>> + return;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * Here we can directly remove user access since the user does not have
>>> + * access to the user page anymore so we avoid the racy window where the
>>> + * user could have read cap_user_rdpmc as true right before we disable
>>> + * it.
>>> + */
>>> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
>>> +
>>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
>>> + on_each_cpu_mask(mm_cpumask(mm),
>>> + pmu_sbi_reset_scounteren, (void *)event, 1);
>>> +}
>>> +
>>> +static void riscv_pmu_update_counter_access(void *info)
>>> +{
>>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
>>> + csr_write(CSR_SCOUNTEREN, 0x7);
>>> + else
>>> + csr_write(CSR_SCOUNTEREN, 0x2);
>>> +}
>>> +
>>> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
>>> + int write, void *buffer,
>>> + size_t *lenp, loff_t *ppos)
>>> +{
>>> + int prev = sysctl_perf_user_access;
>>> + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>> +
>>> + /*
>>> + * Test against the previous value since we clear SCOUNTEREN when
>>> + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
>>> + * not do that if that was already the case.
>>> + */
>>> + if (ret || !write || prev == sysctl_perf_user_access)
>>> + return ret;
>>> +
>>> + on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
>>
>> Instead of passing prev shouldn't we pass NULL, as it's not used?
>>
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static struct ctl_table sbi_pmu_sysctl_table[] = {
>>> + {
>>> + .procname = "perf_user_access",
>>> + .data = &sysctl_perf_user_access,
>>> + .maxlen = sizeof(unsigned int),
>>> + .mode = 0644,
>>> + .proc_handler = riscv_pmu_proc_user_access_handler,
>>> + .extra1 = SYSCTL_ZERO,
>>> + .extra2 = SYSCTL_TWO,
>>> + },
>>> + { }
>>> +};
>>> +
>>> static int pmu_sbi_device_probe(struct platform_device *pdev)
>>> {
>>> struct riscv_pmu *pmu = NULL;
>>> @@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>>> pmu->ctr_get_width = pmu_sbi_ctr_get_width;
>>> pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
>>> pmu->ctr_read = pmu_sbi_ctr_read;
>>> + pmu->event_init = pmu_sbi_event_init;
>>> + pmu->event_mapped = pmu_sbi_event_mapped;
>>> + pmu->event_unmapped = pmu_sbi_event_unmapped;
>>> + pmu->csr_index = pmu_sbi_csr_index;
>>>
>>> ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>>> if (ret)
>>> @@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>>> if (ret)
>>> goto out_unregister;
>>>
>>> + register_sysctl("kernel", sbi_pmu_sysctl_table);
>>> +
>>> return 0;
>>>
>>> out_unregister:
>>> --
>>> 2.37.2
>>>
>>
>> Thanks,
>> drew
>
>
>
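
For reference, the user-visible flow this series enables is roughly the
following sketch (error handling elided; kernel.perf_user_access must
permit user access for cap_user_rdpmc to end up set):

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a cycle event and map its read-only user page; the mmap is
     * what triggers the driver's .event_mapped() callback. */
    static struct perf_event_mmap_page *open_cycle_counter(void)
    {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;

            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0)
                    return NULL;

            return mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ,
                        MAP_SHARED, fd, 0);
    }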


2023-06-16 08:04:35

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] drivers: perf: Implement perf event mmap support in the SBI backend

On Thu, Jun 15, 2023 at 6:28 AM Heinrich Schuchardt
<[email protected]> wrote:
>
> On 6/15/23 10:41, Atish Patra wrote:
> > On Wed, May 31, 2023 at 8:02 AM Andrew Jones <[email protected]> wrote:
> >>
> >> On Fri, May 12, 2023 at 10:53:18AM +0200, Alexandre Ghiti wrote:
> >>> We used to unconditionally expose the cycle and instret csrs to
> >>> userspace, which gives rise to security concerns.
> >>>
> >>> So now we only allow access to hw counters from userspace through the perf
> >>> framework, which will handle context switches, per-task events, etc. But
> >>> as we cannot break userspace, we give the user the choice to go back to
> >>> the previous behaviour by setting the sysctl perf_user_access.
> >>>
> >>> Signed-off-by: Alexandre Ghiti <[email protected]>
> >>> ---
> >>> arch/riscv/kernel/perf_event.c | 18 ++-
> >>> drivers/perf/riscv_pmu_sbi.c | 194 ++++++++++++++++++++++++++++++++-
> >>> 2 files changed, 205 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> >>> index 94174a0fc251..3af9ca45b43f 100644
> >>> --- a/arch/riscv/kernel/perf_event.c
> >>> +++ b/arch/riscv/kernel/perf_event.c
> >>> @@ -1,9 +1,13 @@
> >>> // SPDX-License-Identifier: GPL-2.0-only
> >>> +#include <linux/perf/riscv_pmu.h>
> >>> #include <linux/sched_clock.h>
> >>>
> >>> void arch_perf_update_userpage(struct perf_event *event,
> >>> struct perf_event_mmap_page *userpg, u64 now)
> >>> {
> >>> +#ifdef CONFIG_RISCV_PMU_SBI
> >>> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> >>> +#endif
> >>
> >> Can avoid this pair of #ifdef/#endif's by just declaring rvpmu inside the
> >> if block below where it's needed. Or even just using to_riscv_pmu()
> >> directly in place of rvpmu.
> >>
> >>> struct clock_read_data *rd;
> >>> unsigned int seq;
> >>> u64 ns;
> >>> @@ -14,7 +18,19 @@ void arch_perf_update_userpage(struct perf_event *event,
> >>> userpg->cap_user_rdpmc =
> >>> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> >>>
> >>> - userpg->pmc_width = 64;
> >>> +#ifdef CONFIG_RISCV_PMU_SBI
> >>> + /*
> >>> + * The counters are 64-bit but the priv spec doesn't mandate all the
> >>> + * bits to be implemented: that's why the counter width can vary based
> >>> + * on the cpu vendor.
> >>> + */
> >>> + if (event->pmu->name &&
> >>> + !strncmp(event->pmu->name,
> >>> + RISCV_PMU_PDEV_NAME, sizeof(RISCV_PMU_PDEV_NAME)))
> >>> + userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> >>> + else
> >>> +#endif
> >>> + userpg->pmc_width = 64;
> >>
> >> Can leave the initialization to 64 above the #ifdef CONFIG_RISCV_PMU_SBI
> >> as is and drop the else.
> >>
> >>>
> >>> do {
> >>> rd = sched_clock_read_begin(&seq);
> >>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> >>> index 3b0ee2148054..d9bcc5cc6df5 100644
> >>> --- a/drivers/perf/riscv_pmu_sbi.c
> >>> +++ b/drivers/perf/riscv_pmu_sbi.c
> >>> @@ -24,6 +24,14 @@
> >>> #include <asm/sbi.h>
> >>> #include <asm/hwcap.h>
> >>>
> >>> +#define SYSCTL_NO_USER_ACCESS 0
> >>> +#define SYSCTL_USER_ACCESS 1
> >>> +#define SYSCTL_LEGACY 2
> >>> +
> >>> +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
> >>> +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
> >>> +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
> >>> +
> >>> PMU_FORMAT_ATTR(event, "config:0-47");
> >>> PMU_FORMAT_ATTR(firmware, "config:63");
> >>>
> >>> @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> >>> NULL,
> >>> };
> >>>
> >>> +/* Allow legacy access by default */
> >>> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> >>> +
> >>
> >> I'm still not in favor of this. Hopefully the distro discussions result in
> >> it being changed.
> >>
> >
> > I did not see any feedback from distro guys. I talked to David (fedora
> > maintainer) and he is even okay with
> > SYSCTL_NO_USER_ACCESS :). I would love to hear back from others (cc'd
> > a few distro folks to this thread).
>
> In the future we will have to support confidential-compute scenarios,
> where we have to avoid timing attacks against virtual machines. If
> access to the cycle register is security-critical, hiding it behind a
> sysctl setting is not good enough.
>

For virtualization, all counter usage is virtualized by KVM, so the
hypervisor remains in complete control.

> If we keep this sysctl-based access at all, it should be behind a
> Kconfig setting that is disabled by default and whose help text
> clearly indicates the security implications.
>

That would require a kernel rebuild/reinstallation for users with root
privileges to enable these. Distros can disable the access by setting
the default to SYSCTL_NO_USER_ACCESS or SYSCTL_USER_ACCESS based on the
use case, but root-privileged users should have a way to opt in.

As per my understanding, that's the intention behind the x86/ARM64
implementations as well.
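
A minimal sketch of that run-time opt-in, using the sysctl file this
patch registers under /proc/sys/kernel (error handling kept short):

    #include <fcntl.h>
    #include <unistd.h>

    /* mode is one of the SYSCTL_* values: 0, 1 or 2 */
    static int set_perf_user_access(int mode)
    {
            int fd = open("/proc/sys/kernel/perf_user_access", O_WRONLY);
            char c = '0' + mode;
            int ret = -1;

            if (fd < 0)
                    return -1;
            if (write(fd, &c, 1) == 1)
                    ret = 0;
            close(fd);
            return ret;
    }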

> Best regards
>
> Heinrich
>
> >
> >>> /*
> >>> * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> >>> * per_cpu in case of harts with different pmu counters
> >>> @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> >>> }
> >>> EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
> >>>
> >>> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> >>> +{
> >>> + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> >>> +}
> >>> +
> >>> static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> >>> {
> >>> unsigned long cflags = 0;
> >>> @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> >>> struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> >>> struct sbiret ret;
> >>> int idx;
> >>> - uint64_t cbase = 0;
> >>> + uint64_t cbase = 0, cmask = rvpmu->cmask;
> >>> unsigned long cflags = 0;
> >>>
> >>> cflags = pmu_sbi_get_filter_flags(event);
> >>> +
> >>> + /*
> >>> + * In legacy mode, we have to force the fixed counters for those events
> >>> + * but not in the user access mode as we want to use the other counters
> >>> + * that support sampling/filtering.
> >>> + */
> >>> + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> >>> + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> >>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> >>> + cmask = 1;
> >>> + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> >>> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> >>> + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> >>> + }
> >>> + }
> >>> +
> >>> /* retrieve the available counter index */
> >>> #if defined(CONFIG_32BIT)
> >>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> >>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> >>> + cmask, cflags, hwc->event_base, hwc->config,
> >>> hwc->config >> 32);
> >>> #else
> >>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> >>> - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> >>> + cmask, cflags, hwc->event_base, hwc->config, 0);
> >>> #endif
> >>> if (ret.error) {
> >>> pr_debug("Not able to find a counter for event %lx config %llx\n",
> >>> @@ -474,6 +506,14 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> >>> return val;
> >>> }
> >>>
> >>> +static void pmu_sbi_set_scounteren(void *arg)
> >>> +{
> >>> + struct perf_event *event = (struct perf_event *)arg;
> >>> +
> >>> + csr_write(CSR_SCOUNTEREN,
> >>> + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> >>> +}
> >>> +
> >>> static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> >>> {
> >>> struct sbiret ret;
> >>> @@ -490,6 +530,18 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> >>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> >>> pr_err("Starting counter idx %d failed with error %d\n",
> >>> hwc->idx, sbi_err_map_linux_errno(ret.error));
> >>> +
> >>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> >>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> >>> + pmu_sbi_set_scounteren((void *)event);
> >>> +}
> >>> +
> >>> +static void pmu_sbi_reset_scounteren(void *arg)
> >>> +{
> >>> + struct perf_event *event = (struct perf_event *)arg;
> >>> +
> >>> + csr_write(CSR_SCOUNTEREN,
> >>> + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> >>> }
> >>>
> >>> static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> >>> @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> >>> struct sbiret ret;
> >>> struct hw_perf_event *hwc = &event->hw;
> >>>
> >>> + if (hwc->flags & PERF_EVENT_FLAG_USER_ACCESS &&
> >>> + hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)
> >>> + pmu_sbi_reset_scounteren((void *)event);
> >>> +
> >>> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> >>> if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> >>> flag != SBI_PMU_STOP_FLAG_RESET)
> >>> @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> >>> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> >>>
> >>> /*
> >>> - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> >>> - * as is necessary to maintain uABI compatibility.
> >>> + * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> >>> + * legacy option but that will be removed in the future.
> >>> */
> >>> - csr_write(CSR_SCOUNTEREN, 0x7);
> >>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> >>> + csr_write(CSR_SCOUNTEREN, 0x7);
> >>> + else
> >>> + csr_write(CSR_SCOUNTEREN, 0x2);
> >>>
> >>> /* Stop all the counters so that they can be enabled from perf */
> >>> pmu_sbi_stop_all(pmu);
> >>> @@ -851,6 +910,123 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> >>> cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> >>> }
> >>>
> >>> +static void pmu_sbi_event_init(struct perf_event *event)
> >>> +{
> >>> + /*
> >>> + * The permissions are set at event_init so that we do not depend
> >>> + * on the sysctl value that can change.
> >>> + */
> >>> + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> >>> + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
> >>> + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
> >>> + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
> >>> + else
> >>> + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
> >>> +}
> >>> +
> >>> +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
> >>> +{
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> >>> + return;
> >>> +
> >>> + /* In legacy mode, the first 3 CSRs are available. */
> >>
> >> first and third
> >>
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> >>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> >>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> >>> + return;
> >>> + }
> >>> + }
> >>> +
> >>> + /*
> >>> + * The user mmapped the event to directly access it: this is where
> >>> + * we determine based on sysctl_perf_user_access if we grant userspace
> >>> + * the direct access to this event. That means that within the same
> >>> + * task, some events may be directly accessible and some others may not,
> >>> + * if the user changes the value of sysctl_perf_user_access in the
> >>> + * meantime.
> >>> + */
> >>> +
> >>> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> >>> +
> >>> + /*
> >>> + * We must enable userspace access *before* advertising in the user page
> >>> + * that it is possible to do so to avoid any race.
> >>> + * And we must notify all cpus here because threads that currently run
> >>> + * on other cpus will try to directly access the counter too without
> >>> + * calling pmu_sbi_ctr_start.
> >>> + */
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> >>> + on_each_cpu_mask(mm_cpumask(mm),
> >>> + pmu_sbi_set_scounteren, (void *)event, 1);
> >>> +}
> >>> +
> >>> +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> >>> +{
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> >>> + return;
> >>> +
> >>> + /* In legacy mode, the first 3 CSRs are available. */
> >>
> >> first and third
> >>
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> >>> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> >>> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> >>> + return;
> >>> + }
> >>> + }
> >>> +
> >>> + /*
> >>> + * Here we can directly remove user access since the user does not have
> >>> + * access to the user page anymore so we avoid the racy window where the
> >>> + * user could have read cap_user_rdpmc as true right before we disable
> >>> + * it.
> >>> + */
> >>> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> >>> +
> >>> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> >>> + on_each_cpu_mask(mm_cpumask(mm),
> >>> + pmu_sbi_reset_scounteren, (void *)event, 1);
> >>> +}
> >>> +
> >>> +static void riscv_pmu_update_counter_access(void *info)
> >>> +{
> >>> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> >>> + csr_write(CSR_SCOUNTEREN, 0x7);
> >>> + else
> >>> + csr_write(CSR_SCOUNTEREN, 0x2);
> >>> +}
> >>> +
> >>> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> >>> + int write, void *buffer,
> >>> + size_t *lenp, loff_t *ppos)
> >>> +{
> >>> + int prev = sysctl_perf_user_access;
> >>> + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> >>> +
> >>> + /*
> >>> + * Test against the previous value since we clear SCOUNTEREN when
> >>> + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> >>> + * not do that if that was already the case.
> >>> + */
> >>> + if (ret || !write || prev == sysctl_perf_user_access)
> >>> + return ret;
> >>> +
> >>> + on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
> >>
> >> Instead of passing prev shouldn't we pass NULL, as it's not used?
> >>
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static struct ctl_table sbi_pmu_sysctl_table[] = {
> >>> + {
> >>> + .procname = "perf_user_access",
> >>> + .data = &sysctl_perf_user_access,
> >>> + .maxlen = sizeof(unsigned int),
> >>> + .mode = 0644,
> >>> + .proc_handler = riscv_pmu_proc_user_access_handler,
> >>> + .extra1 = SYSCTL_ZERO,
> >>> + .extra2 = SYSCTL_TWO,
> >>> + },
> >>> + { }
> >>> +};
> >>> +
> >>> static int pmu_sbi_device_probe(struct platform_device *pdev)
> >>> {
> >>> struct riscv_pmu *pmu = NULL;
> >>> @@ -888,6 +1064,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> >>> pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> >>> pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> >>> pmu->ctr_read = pmu_sbi_ctr_read;
> >>> + pmu->event_init = pmu_sbi_event_init;
> >>> + pmu->event_mapped = pmu_sbi_event_mapped;
> >>> + pmu->event_unmapped = pmu_sbi_event_unmapped;
> >>> + pmu->csr_index = pmu_sbi_csr_index;
> >>>
> >>> ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> >>> if (ret)
> >>> @@ -901,6 +1081,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> >>> if (ret)
> >>> goto out_unregister;
> >>>
> >>> + register_sysctl("kernel", sbi_pmu_sysctl_table);
> >>> +
> >>> return 0;
> >>>
> >>> out_unregister:
> >>> --
> >>> 2.37.2
> >>>
> >>
> >> Thanks,
> >> drew
> >
> >
> >
>


--
Regards,
Atish

2023-06-16 08:53:02

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] riscv: Prepare for user-space perf event mmap support

On Fri, May 12, 2023 at 1:58 AM Alexandre Ghiti <[email protected]> wrote:
>
> Provide all the necessary bits in the generic riscv pmu driver to be
> able to mmap perf events in userspace: the heavy lifting lies in the
> driver backend, namely the legacy and sbi implementations.
>
> Note that arch_perf_update_userpage is almost a copy of arm64 code.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/kernel/Makefile | 2 +-
> arch/riscv/kernel/perf_event.c | 58 ++++++++++++++++++++++++++++++++++
> drivers/perf/riscv_pmu.c | 41 ++++++++++++++++++++++++
> include/linux/perf/riscv_pmu.h | 4 +++
> 4 files changed, 104 insertions(+), 1 deletion(-)
> create mode 100644 arch/riscv/kernel/perf_event.c
>
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index 4cf303a779ab..0d215fd9860d 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += mcount-dyn.o
>
> obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
>
> -obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o
> +obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o perf_event.o
> obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
> obj-$(CONFIG_RISCV_SBI) += sbi.o
> ifeq ($(CONFIG_RISCV_SBI), y)
> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> new file mode 100644
> index 000000000000..94174a0fc251
> --- /dev/null
> +++ b/arch/riscv/kernel/perf_event.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/sched_clock.h>
> +
> +void arch_perf_update_userpage(struct perf_event *event,
> + struct perf_event_mmap_page *userpg, u64 now)
> +{
> + struct clock_read_data *rd;
> + unsigned int seq;
> + u64 ns;
> +
> + userpg->cap_user_time = 0;
> + userpg->cap_user_time_zero = 0;
> + userpg->cap_user_time_short = 0;
> + userpg->cap_user_rdpmc =
> + !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> +
> + userpg->pmc_width = 64;
> +

The counter width is 64 for cycle & instret. Other hpmcounters can
have different widths.
This information should be retrieved from the counter info.
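
For reference, the SBI PMU extension encodes the width in each
counter's counter_info; a decode sketch (field layout assumed from the
SBI spec, not taken from this patch):

    /* counter_info carries the CSR number in bits [11:0] and one less
     * than the number of implemented bits in bits [17:12]. */
    static unsigned int ctr_width_from_info(unsigned long info)
    {
            return ((info >> 12) & 0x3f) + 1;
    }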

> + do {
> + rd = sched_clock_read_begin(&seq);
> +
> + userpg->time_mult = rd->mult;
> + userpg->time_shift = rd->shift;
> + userpg->time_zero = rd->epoch_ns;
> + userpg->time_cycles = rd->epoch_cyc;
> + userpg->time_mask = rd->sched_clock_mask;
> +
> + /*
> + * Subtract the cycle base, such that software that
> + * doesn't know about cap_user_time_short still 'works'
> + * assuming no wraps.
> + */
> + ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> + userpg->time_zero -= ns;
> +
> + } while (sched_clock_read_retry(seq));
> +
> + userpg->time_offset = userpg->time_zero - now;
> +
> + /*
> + * time_shift is not expected to be greater than 31 due to
> + * the original published conversion algorithm shifting a
> + * 32-bit value (now specifies a 64-bit value) - refer
> + * perf_event_mmap_page documentation in perf_event.h.
> + */
> + if (userpg->time_shift == 32) {
> + userpg->time_shift = 31;
> + userpg->time_mult >>= 1;
> + }
> +
> + /*
> + * Internal timekeeping for enabled/running/stopped times
> + * is always computed with the sched_clock.
> + */
> + userpg->cap_user_time = 1;
> + userpg->cap_user_time_zero = 1;
> + userpg->cap_user_time_short = 1;
> +}
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index ebca5eab9c9b..af69da268246 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
>
> local64_set(&hwc->prev_count, (u64)-left);
>
> + perf_event_update_userpage(event);
> +
> return overflow;
> }
>
> @@ -267,6 +269,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
> hwc->idx = -1;
> hwc->event_base = mapped_event;
>
> + if (rvpmu->event_init)
> + rvpmu->event_init(event);
> +
> if (!is_sampling_event(event)) {
> /*
> * For non-sampling runs, limit the sample_period to half
> @@ -283,6 +288,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
> return 0;
> }
>
> +static int riscv_pmu_event_idx(struct perf_event *event)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> + return 0;
> +
> + if (rvpmu->csr_index)
> + return rvpmu->csr_index(event) + 1;
> +
> + return 0;
> +}
> +
> +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (rvpmu->event_mapped) {
> + rvpmu->event_mapped(event, mm);
> + perf_event_update_userpage(event);
> + }
> +}
> +
> +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (rvpmu->event_unmapped) {
> + rvpmu->event_unmapped(event, mm);
> + perf_event_update_userpage(event);
> + }
> +}
> +
> struct riscv_pmu *riscv_pmu_alloc(void)
> {
> struct riscv_pmu *pmu;
> @@ -307,6 +345,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> }
> pmu->pmu = (struct pmu) {
> .event_init = riscv_pmu_event_init,
> + .event_mapped = riscv_pmu_event_mapped,
> + .event_unmapped = riscv_pmu_event_unmapped,
> + .event_idx = riscv_pmu_event_idx,
> .add = riscv_pmu_add,
> .del = riscv_pmu_del,
> .start = riscv_pmu_start,
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 9f70d94942e0..1452c8af3b67 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -55,6 +55,10 @@ struct riscv_pmu {
> void (*ctr_start)(struct perf_event *event, u64 init_val);
> void (*ctr_stop)(struct perf_event *event, unsigned long flag);
> int (*event_map)(struct perf_event *event, u64 *config);
> + void (*event_init)(struct perf_event *event);
> + void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
> + void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
> + uint8_t (*csr_index)(struct perf_event *event);
>
> struct cpu_hw_events __percpu *hw_events;
> struct hlist_node node;
> --
> 2.37.2
>


--
Regards,
Atish

2023-06-16 08:54:09

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] tools: lib: perf: Implement riscv mmap support

On Fri, May 12, 2023 at 2:03 AM Alexandre Ghiti <[email protected]> wrote:
>
> riscv now supports mmapping hardware counters, so add what's needed to
> take advantage of that in libperf.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 65 insertions(+)
>
> diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> index 0d1634cedf44..65f250e0ef92 100644
> --- a/tools/lib/perf/mmap.c
> +++ b/tools/lib/perf/mmap.c
> @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
>
> static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
>
> +#elif __riscv_xlen == 64
> +

This is applicable to RV32 as well, no?
Otherwise, you won't need CSR_CYCLEH.

> +#define CSR_CYCLE 0xc00
> +#define CSR_TIME 0xc01
> +#define CSR_CYCLEH 0xc80
> +
> +#define csr_read(csr) \
> +({ \
> + register unsigned long __v; \
> + __asm__ __volatile__ ("csrr %0, " #csr \
> + : "=r" (__v) : \
> + : "memory"); \
> + __v; \
> +})
> +
> +static unsigned long csr_read_num(int csr_num)
> +{
> +#define switchcase_csr_read(__csr_num, __val) {\
> + case __csr_num: \
> + __val = csr_read(__csr_num); \
> + break; }
> +#define switchcase_csr_read_2(__csr_num, __val) {\
> + switchcase_csr_read(__csr_num + 0, __val) \
> + switchcase_csr_read(__csr_num + 1, __val)}
> +#define switchcase_csr_read_4(__csr_num, __val) {\
> + switchcase_csr_read_2(__csr_num + 0, __val) \
> + switchcase_csr_read_2(__csr_num + 2, __val)}
> +#define switchcase_csr_read_8(__csr_num, __val) {\
> + switchcase_csr_read_4(__csr_num + 0, __val) \
> + switchcase_csr_read_4(__csr_num + 4, __val)}
> +#define switchcase_csr_read_16(__csr_num, __val) {\
> + switchcase_csr_read_8(__csr_num + 0, __val) \
> + switchcase_csr_read_8(__csr_num + 8, __val)}
> +#define switchcase_csr_read_32(__csr_num, __val) {\
> + switchcase_csr_read_16(__csr_num + 0, __val) \
> + switchcase_csr_read_16(__csr_num + 16, __val)}
> +
> + unsigned long ret = 0;
> +
> + switch (csr_num) {
> + switchcase_csr_read_32(CSR_CYCLE, ret)
> + switchcase_csr_read_32(CSR_CYCLEH, ret)
> + default:
> + break;
> + }
> +
> + return ret;
> +#undef switchcase_csr_read_32
> +#undef switchcase_csr_read_16
> +#undef switchcase_csr_read_8
> +#undef switchcase_csr_read_4
> +#undef switchcase_csr_read_2
> +#undef switchcase_csr_read
> +}
> +
> +static u64 read_perf_counter(unsigned int counter)
> +{
> + return csr_read_num(CSR_CYCLE + counter);
> +}
> +
> +static u64 read_timestamp(void)
> +{
> + return csr_read_num(CSR_TIME);
> +}
> +
> #else
> static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> static u64 read_timestamp(void) { return 0; }
> --
> 2.37.2
>


--
Regards,
Atish

2023-06-16 09:01:52

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] riscv: Prepare for user-space perf event mmap support

On Fri, Jun 16, 2023 at 10:28 AM Atish Patra <[email protected]> wrote:
>
> On Fri, May 12, 2023 at 1:58 AM Alexandre Ghiti <[email protected]> wrote:
> >
> > Provide all the necessary bits in the generic riscv pmu driver to be
> > able to mmap perf events in userspace: the heavy lifting lies in the
> > driver backend, namely the legacy and sbi implementations.
> >
> > Note that arch_perf_update_userpage is almost a copy of arm64 code.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > arch/riscv/kernel/Makefile | 2 +-
> > arch/riscv/kernel/perf_event.c | 58 ++++++++++++++++++++++++++++++++++
> > drivers/perf/riscv_pmu.c | 41 ++++++++++++++++++++++++
> > include/linux/perf/riscv_pmu.h | 4 +++
> > 4 files changed, 104 insertions(+), 1 deletion(-)
> > create mode 100644 arch/riscv/kernel/perf_event.c
> >
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 4cf303a779ab..0d215fd9860d 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += mcount-dyn.o
> >
> > obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
> >
> > -obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o
> > +obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o perf_event.o
> > obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
> > obj-$(CONFIG_RISCV_SBI) += sbi.o
> > ifeq ($(CONFIG_RISCV_SBI), y)
> > diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> > new file mode 100644
> > index 000000000000..94174a0fc251
> > --- /dev/null
> > +++ b/arch/riscv/kernel/perf_event.c
> > @@ -0,0 +1,58 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#include <linux/sched_clock.h>
> > +
> > +void arch_perf_update_userpage(struct perf_event *event,
> > + struct perf_event_mmap_page *userpg, u64 now)
> > +{
> > + struct clock_read_data *rd;
> > + unsigned int seq;
> > + u64 ns;
> > +
> > + userpg->cap_user_time = 0;
> > + userpg->cap_user_time_zero = 0;
> > + userpg->cap_user_time_short = 0;
> > + userpg->cap_user_rdpmc =
> > + !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> > +
> > + userpg->pmc_width = 64;
> > +
>
> The counter width is 64 for cycle & instret. Other hpmcounters can
> have different widths.
> This information should be retrieved from the counter info.

Yes, this is done in patch 7 when I adapt the perf SBI backend to
allow the user access.
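
For context, userspace then consumes pmc_width roughly as in this
sketch, modeled on the libperf self-read loop (read_perf_counter() is
the helper patch 9 adds; barrier() stands for a compiler barrier):

    #include <linux/perf_event.h>
    #include <linux/types.h>

    #define barrier() __asm__ __volatile__("" ::: "memory")

    static __u64 mmap_read_self(struct perf_event_mmap_page *pc)
    {
            __u32 seq, idx;
            __u64 count;

            do {
                    seq = pc->lock;
                    barrier();
                    idx = pc->index;        /* csr_index + 1, or 0 */
                    count = pc->offset;
                    if (pc->cap_user_rdpmc && idx) {
                            __u64 pmc = read_perf_counter(idx - 1);

                            /* keep only the implemented counter bits */
                            pmc <<= 64 - pc->pmc_width;
                            pmc >>= 64 - pc->pmc_width;
                            count += pmc;
                    }
                    barrier();
            } while (pc->lock != seq);

            return count;
    }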

>
> > + do {
> > + rd = sched_clock_read_begin(&seq);
> > +
> > + userpg->time_mult = rd->mult;
> > + userpg->time_shift = rd->shift;
> > + userpg->time_zero = rd->epoch_ns;
> > + userpg->time_cycles = rd->epoch_cyc;
> > + userpg->time_mask = rd->sched_clock_mask;
> > +
> > + /*
> > + * Subtract the cycle base, such that software that
> > + * doesn't know about cap_user_time_short still 'works'
> > + * assuming no wraps.
> > + */
> > + ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> > + userpg->time_zero -= ns;
> > +
> > + } while (sched_clock_read_retry(seq));
> > +
> > + userpg->time_offset = userpg->time_zero - now;
> > +
> > + /*
> > + * time_shift is not expected to be greater than 31 due to
> > + * the original published conversion algorithm shifting a
> > + * 32-bit value (now specifies a 64-bit value) - refer
> > + * perf_event_mmap_page documentation in perf_event.h.
> > + */
> > + if (userpg->time_shift == 32) {
> > + userpg->time_shift = 31;
> > + userpg->time_mult >>= 1;
> > + }
> > +
> > + /*
> > + * Internal timekeeping for enabled/running/stopped times
> > + * is always computed with the sched_clock.
> > + */
> > + userpg->cap_user_time = 1;
> > + userpg->cap_user_time_zero = 1;
> > + userpg->cap_user_time_short = 1;
> > +}
> > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > index ebca5eab9c9b..af69da268246 100644
> > --- a/drivers/perf/riscv_pmu.c
> > +++ b/drivers/perf/riscv_pmu.c
> > @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
> >
> > local64_set(&hwc->prev_count, (u64)-left);
> >
> > + perf_event_update_userpage(event);
> > +
> > return overflow;
> > }
> >
> > @@ -267,6 +269,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
> > hwc->idx = -1;
> > hwc->event_base = mapped_event;
> >
> > + if (rvpmu->event_init)
> > + rvpmu->event_init(event);
> > +
> > if (!is_sampling_event(event)) {
> > /*
> > * For non-sampling runs, limit the sample_period to half
> > @@ -283,6 +288,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
> > return 0;
> > }
> >
> > +static int riscv_pmu_event_idx(struct perf_event *event)
> > +{
> > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +
> > + if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > + return 0;
> > +
> > + if (rvpmu->csr_index)
> > + return rvpmu->csr_index(event) + 1;
> > +
> > + return 0;
> > +}
> > +
> > +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +
> > + if (rvpmu->event_mapped) {
> > + rvpmu->event_mapped(event, mm);
> > + perf_event_update_userpage(event);
> > + }
> > +}
> > +
> > +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +
> > + if (rvpmu->event_unmapped) {
> > + rvpmu->event_unmapped(event, mm);
> > + perf_event_update_userpage(event);
> > + }
> > +}
> > +
> > struct riscv_pmu *riscv_pmu_alloc(void)
> > {
> > struct riscv_pmu *pmu;
> > @@ -307,6 +345,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> > }
> > pmu->pmu = (struct pmu) {
> > .event_init = riscv_pmu_event_init,
> > + .event_mapped = riscv_pmu_event_mapped,
> > + .event_unmapped = riscv_pmu_event_unmapped,
> > + .event_idx = riscv_pmu_event_idx,
> > .add = riscv_pmu_add,
> > .del = riscv_pmu_del,
> > .start = riscv_pmu_start,
> > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > index 9f70d94942e0..1452c8af3b67 100644
> > --- a/include/linux/perf/riscv_pmu.h
> > +++ b/include/linux/perf/riscv_pmu.h
> > @@ -55,6 +55,10 @@ struct riscv_pmu {
> > void (*ctr_start)(struct perf_event *event, u64 init_val);
> > void (*ctr_stop)(struct perf_event *event, unsigned long flag);
> > int (*event_map)(struct perf_event *event, u64 *config);
> > + void (*event_init)(struct perf_event *event);
> > + void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
> > + void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
> > + uint8_t (*csr_index)(struct perf_event *event);
> >
> > struct cpu_hw_events __percpu *hw_events;
> > struct hlist_node node;
> > --
> > 2.37.2
> >
>
>
> --
> Regards,
> Atish

2023-06-16 09:19:48

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] tools: lib: perf: Implement riscv mmap support

On Fri, Jun 16, 2023 at 10:43 AM Atish Patra <[email protected]> wrote:
>
> On Fri, May 12, 2023 at 2:03 AM Alexandre Ghiti <[email protected]> wrote:
> >
> > riscv now supports mmapping hardware counters, so add what's needed to
> > take advantage of that in libperf.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 65 insertions(+)
> >
> > diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> > index 0d1634cedf44..65f250e0ef92 100644
> > --- a/tools/lib/perf/mmap.c
> > +++ b/tools/lib/perf/mmap.c
> > @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
> >
> > static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
> >
> > +#elif __riscv_xlen == 64
> > +
>
> This is applicable to RV32 as well, no?
> Otherwise, you won't need CSR_CYCLEH.

Admittedly, I have not checked rv32 at all in this series and the code
below is a copy-paste. I'd say that rv32 support is out of scope for
this series; is that OK with you?

>
> > +#define CSR_CYCLE 0xc00
> > +#define CSR_TIME 0xc01
> > +#define CSR_CYCLEH 0xc80
> > +
> > +#define csr_read(csr) \
> > +({ \
> > + register unsigned long __v; \
> > + __asm__ __volatile__ ("csrr %0, " #csr \
> > + : "=r" (__v) : \
> > + : "memory"); \
> > + __v; \
> > +})
> > +
> > +static unsigned long csr_read_num(int csr_num)
> > +{
> > +#define switchcase_csr_read(__csr_num, __val) {\
> > + case __csr_num: \
> > + __val = csr_read(__csr_num); \
> > + break; }
> > +#define switchcase_csr_read_2(__csr_num, __val) {\
> > + switchcase_csr_read(__csr_num + 0, __val) \
> > + switchcase_csr_read(__csr_num + 1, __val)}
> > +#define switchcase_csr_read_4(__csr_num, __val) {\
> > + switchcase_csr_read_2(__csr_num + 0, __val) \
> > + switchcase_csr_read_2(__csr_num + 2, __val)}
> > +#define switchcase_csr_read_8(__csr_num, __val) {\
> > + switchcase_csr_read_4(__csr_num + 0, __val) \
> > + switchcase_csr_read_4(__csr_num + 4, __val)}
> > +#define switchcase_csr_read_16(__csr_num, __val) {\
> > + switchcase_csr_read_8(__csr_num + 0, __val) \
> > + switchcase_csr_read_8(__csr_num + 8, __val)}
> > +#define switchcase_csr_read_32(__csr_num, __val) {\
> > + switchcase_csr_read_16(__csr_num + 0, __val) \
> > + switchcase_csr_read_16(__csr_num + 16, __val)}
> > +
> > + unsigned long ret = 0;
> > +
> > + switch (csr_num) {
> > + switchcase_csr_read_32(CSR_CYCLE, ret)
> > + switchcase_csr_read_32(CSR_CYCLEH, ret)
> > + default:
> > + break;
> > + }
> > +
> > + return ret;
> > +#undef switchcase_csr_read_32
> > +#undef switchcase_csr_read_16
> > +#undef switchcase_csr_read_8
> > +#undef switchcase_csr_read_4
> > +#undef switchcase_csr_read_2
> > +#undef switchcase_csr_read
> > +}
> > +
> > +static u64 read_perf_counter(unsigned int counter)
> > +{
> > + return csr_read_num(CSR_CYCLE + counter);
> > +}
> > +
> > +static u64 read_timestamp(void)
> > +{
> > + return csr_read_num(CSR_TIME);
> > +}
> > +
> > #else
> > static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> > static u64 read_timestamp(void) { return 0; }
> > --
> > 2.37.2
> >
>
>
> --
> Regards,
> Atish

2023-06-19 19:33:01

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] tools: lib: perf: Implement riscv mmap support

On Fri, Jun 16, 2023 at 2:06 AM Alexandre Ghiti <[email protected]> wrote:
>
> On Fri, Jun 16, 2023 at 10:43 AM Atish Patra <[email protected]> wrote:
> >
> > On Fri, May 12, 2023 at 2:03 AM Alexandre Ghiti <[email protected]> wrote:
> > >
> > > riscv now supports mmapping hardware counters, so add what's needed to
> > > take advantage of that in libperf.
> > >
> > > Signed-off-by: Alexandre Ghiti <[email protected]>
> > > ---
> > > tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 65 insertions(+)
> > >
> > > diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> > > index 0d1634cedf44..65f250e0ef92 100644
> > > --- a/tools/lib/perf/mmap.c
> > > +++ b/tools/lib/perf/mmap.c
> > > @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
> > >
> > > static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
> > >
> > > +#elif __riscv_xlen == 64
> > > +
> >
> > This is applicable to RV32 as well, no?
> > Otherwise, you won't need CSR_CYCLEH.
>
> Admittedly, I have not checked rv32 at all in this series and the code
> below is a copy-paste. I'd say that rv32 support is out of scope for
> this series; is that OK with you?
>

That's fine. Let's just remove the CYCLEH and leave a TODO comment for RV32.
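
Such a TODO would eventually amount to something like this sketch
(reusing csr_read_num() from the patch, with u32/u64 as used elsewhere
in tools/lib/perf; a 64-bit counter on RV32 is read as two halves,
retrying if the high half rolled over in between):

    static u64 read_perf_counter_rv32(unsigned int counter)
    {
            u32 hi, lo;

            do {
                    hi = csr_read_num(CSR_CYCLEH + counter);
                    lo = csr_read_num(CSR_CYCLE + counter);
            } while (hi != (u32)csr_read_num(CSR_CYCLEH + counter));

            return ((u64)hi << 32) | lo;
    }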

> >
> > > +#define CSR_CYCLE 0xc00
> > > +#define CSR_TIME 0xc01
> > > +#define CSR_CYCLEH 0xc80
> > > +
> > > +#define csr_read(csr) \
> > > +({ \
> > > + register unsigned long __v; \
> > > + __asm__ __volatile__ ("csrr %0, " #csr \
> > > + : "=r" (__v) : \
> > > + : "memory"); \
> > > + __v; \
> > > +})
> > > +
> > > +static unsigned long csr_read_num(int csr_num)
> > > +{
> > > +#define switchcase_csr_read(__csr_num, __val) {\
> > > + case __csr_num: \
> > > + __val = csr_read(__csr_num); \
> > > + break; }
> > > +#define switchcase_csr_read_2(__csr_num, __val) {\
> > > + switchcase_csr_read(__csr_num + 0, __val) \
> > > + switchcase_csr_read(__csr_num + 1, __val)}
> > > +#define switchcase_csr_read_4(__csr_num, __val) {\
> > > + switchcase_csr_read_2(__csr_num + 0, __val) \
> > > + switchcase_csr_read_2(__csr_num + 2, __val)}
> > > +#define switchcase_csr_read_8(__csr_num, __val) {\
> > > + switchcase_csr_read_4(__csr_num + 0, __val) \
> > > + switchcase_csr_read_4(__csr_num + 4, __val)}
> > > +#define switchcase_csr_read_16(__csr_num, __val) {\
> > > + switchcase_csr_read_8(__csr_num + 0, __val) \
> > > + switchcase_csr_read_8(__csr_num + 8, __val)}
> > > +#define switchcase_csr_read_32(__csr_num, __val) {\
> > > + switchcase_csr_read_16(__csr_num + 0, __val) \
> > > + switchcase_csr_read_16(__csr_num + 16, __val)}
> > > +
> > > + unsigned long ret = 0;
> > > +
> > > + switch (csr_num) {
> > > + switchcase_csr_read_32(CSR_CYCLE, ret)
> > > + switchcase_csr_read_32(CSR_CYCLEH, ret)
> > > + default:
> > > + break;
> > > + }
> > > +
> > > + return ret;
> > > +#undef switchcase_csr_read_32
> > > +#undef switchcase_csr_read_16
> > > +#undef switchcase_csr_read_8
> > > +#undef switchcase_csr_read_4
> > > +#undef switchcase_csr_read_2
> > > +#undef switchcase_csr_read
> > > +}
> > > +
> > > +static u64 read_perf_counter(unsigned int counter)
> > > +{
> > > + return csr_read_num(CSR_CYCLE + counter);
> > > +}
> > > +
> > > +static u64 read_timestamp(void)
> > > +{
> > > + return csr_read_num(CSR_TIME);
> > > +}
> > > +
> > > #else
> > > static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> > > static u64 read_timestamp(void) { return 0; }
> > > --
> > > 2.37.2
> > >
> >
> >
> > --
> > Regards,
> > Atish



--
Regards,
Atish

2023-06-20 15:36:25

by Atish Patra

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] riscv: Prepare for user-space perf event mmap support

On Fri, Jun 16, 2023 at 1:57 AM Alexandre Ghiti <[email protected]> wrote:
>
> On Fri, Jun 16, 2023 at 10:28 AM Atish Patra <[email protected]> wrote:
> >
> > On Fri, May 12, 2023 at 1:58 AM Alexandre Ghiti <[email protected]> wrote:
> > >
> > > Provide all the necessary bits in the generic riscv pmu driver to be
> > > able to mmap perf events in userspace: the heavy lifting lies in the
> > > driver backend, namely the legacy and sbi implementations.
> > >
> > > Note that arch_perf_update_userpage is almost a copy of arm64 code.
> > >
> > > Signed-off-by: Alexandre Ghiti <[email protected]>
> > > ---
> > > arch/riscv/kernel/Makefile | 2 +-
> > > arch/riscv/kernel/perf_event.c | 58 ++++++++++++++++++++++++++++++++++
> > > drivers/perf/riscv_pmu.c | 41 ++++++++++++++++++++++++
> > > include/linux/perf/riscv_pmu.h | 4 +++
> > > 4 files changed, 104 insertions(+), 1 deletion(-)
> > > create mode 100644 arch/riscv/kernel/perf_event.c
> > >
> > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > index 4cf303a779ab..0d215fd9860d 100644
> > > --- a/arch/riscv/kernel/Makefile
> > > +++ b/arch/riscv/kernel/Makefile
> > > @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += mcount-dyn.o
> > >
> > > obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
> > >
> > > -obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o
> > > +obj-$(CONFIG_PERF_EVENTS) += perf_callchain.o perf_event.o
> > > obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
> > > obj-$(CONFIG_RISCV_SBI) += sbi.o
> > > ifeq ($(CONFIG_RISCV_SBI), y)
> > > diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> > > new file mode 100644
> > > index 000000000000..94174a0fc251
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/perf_event.c
> > > @@ -0,0 +1,58 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +#include <linux/sched_clock.h>
> > > +
> > > +void arch_perf_update_userpage(struct perf_event *event,
> > > + struct perf_event_mmap_page *userpg, u64 now)
> > > +{
> > > + struct clock_read_data *rd;
> > > + unsigned int seq;
> > > + u64 ns;
> > > +
> > > + userpg->cap_user_time = 0;
> > > + userpg->cap_user_time_zero = 0;
> > > + userpg->cap_user_time_short = 0;
> > > + userpg->cap_user_rdpmc =
> > > + !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> > > +
> > > + userpg->pmc_width = 64;
> > > +
> >
> > The counter width is 64 for cycle & instret. Other hpmcounters can
> > have different widths.
> > This information should be retrieved from the counter info.
>
> Yes, this is done in patch 7 when I adapt the perf SBI backend to
> allow the user access.
>

Yes. I missed that earlier. Thanks.
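
As an aside, the time_mult/time_shift/time_zero values published in the
loop quoted below are consumed by userspace roughly as in this sketch
(mul_u64_u32_shr() is the kernel helper, shown here for clarity):

    static u64 user_time_ns(struct perf_event_mmap_page *pc, u64 cyc)
    {
            /* handle short/wrapping clocks (cap_user_time_short) */
            cyc = pc->time_cycles + ((cyc - pc->time_cycles) & pc->time_mask);

            return pc->time_zero +
                   mul_u64_u32_shr(cyc, pc->time_mult, pc->time_shift);
    }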

> >
> > > + do {
> > > + rd = sched_clock_read_begin(&seq);
> > > +
> > > + userpg->time_mult = rd->mult;
> > > + userpg->time_shift = rd->shift;
> > > + userpg->time_zero = rd->epoch_ns;
> > > + userpg->time_cycles = rd->epoch_cyc;
> > > + userpg->time_mask = rd->sched_clock_mask;
> > > +
> > > + /*
> > > + * Subtract the cycle base, such that software that
> > > + * doesn't know about cap_user_time_short still 'works'
> > > + * assuming no wraps.
> > > + */
> > > + ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> > > + userpg->time_zero -= ns;
> > > +
> > > + } while (sched_clock_read_retry(seq));
> > > +
> > > + userpg->time_offset = userpg->time_zero - now;
> > > +
> > > + /*
> > > + * time_shift is not expected to be greater than 31 due to
> > > + * the original published conversion algorithm shifting a
> > > + * 32-bit value (now specifies a 64-bit value) - refer
> > > + * perf_event_mmap_page documentation in perf_event.h.
> > > + */
> > > + if (userpg->time_shift == 32) {
> > > + userpg->time_shift = 31;
> > > + userpg->time_mult >>= 1;
> > > + }
> > > +
> > > + /*
> > > + * Internal timekeeping for enabled/running/stopped times
> > > + * is always computed with the sched_clock.
> > > + */
> > > + userpg->cap_user_time = 1;
> > > + userpg->cap_user_time_zero = 1;
> > > + userpg->cap_user_time_short = 1;
> > > +}
> > > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > > index ebca5eab9c9b..af69da268246 100644
> > > --- a/drivers/perf/riscv_pmu.c
> > > +++ b/drivers/perf/riscv_pmu.c
> > > @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
> > >
> > > local64_set(&hwc->prev_count, (u64)-left);
> > >
> > > + perf_event_update_userpage(event);
> > > +
> > > return overflow;
> > > }
> > >
> > > @@ -267,6 +269,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
> > > hwc->idx = -1;
> > > hwc->event_base = mapped_event;
> > >
> > > + if (rvpmu->event_init)
> > > + rvpmu->event_init(event);
> > > +
> > > if (!is_sampling_event(event)) {
> > > /*
> > > * For non-sampling runs, limit the sample_period to half
> > > @@ -283,6 +288,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
> > > return 0;
> > > }
> > >
> > > +static int riscv_pmu_event_idx(struct perf_event *event)
> > > +{
> > > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +
> > > + if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > > + return 0;
> > > +
> > > + if (rvpmu->csr_index)
> > > + return rvpmu->csr_index(event) + 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +
> > > + if (rvpmu->event_mapped) {
> > > + rvpmu->event_mapped(event, mm);
> > > + perf_event_update_userpage(event);
> > > + }
> > > +}
> > > +
> > > +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +
> > > + if (rvpmu->event_unmapped) {
> > > + rvpmu->event_unmapped(event, mm);
> > > + perf_event_update_userpage(event);
> > > + }
> > > +}
> > > +
> > > struct riscv_pmu *riscv_pmu_alloc(void)
> > > {
> > > struct riscv_pmu *pmu;
> > > @@ -307,6 +345,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> > > }
> > > pmu->pmu = (struct pmu) {
> > > .event_init = riscv_pmu_event_init,
> > > + .event_mapped = riscv_pmu_event_mapped,
> > > + .event_unmapped = riscv_pmu_event_unmapped,
> > > + .event_idx = riscv_pmu_event_idx,
> > > .add = riscv_pmu_add,
> > > .del = riscv_pmu_del,
> > > .start = riscv_pmu_start,
> > > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > > index 9f70d94942e0..1452c8af3b67 100644
> > > --- a/include/linux/perf/riscv_pmu.h
> > > +++ b/include/linux/perf/riscv_pmu.h
> > > @@ -55,6 +55,10 @@ struct riscv_pmu {
> > > void (*ctr_start)(struct perf_event *event, u64 init_val);
> > > void (*ctr_stop)(struct perf_event *event, unsigned long flag);
> > > int (*event_map)(struct perf_event *event, u64 *config);
> > > + void (*event_init)(struct perf_event *event);
> > > + void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
> > > + void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
> > > + uint8_t (*csr_index)(struct perf_event *event);
> > >
> > > struct cpu_hw_events __percpu *hw_events;
> > > struct hlist_node node;
> > > --
> > > 2.37.2
> > >
> >
> >
> > --
> > Regards,
> > Atish



--
Regards,
Atish

2023-06-21 23:58:34

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] riscv: Allow userspace to directly access perf counters

On Fri, 12 May 2023 01:53:11 PDT (-0700), [email protected] wrote:
> riscv used to allow direct access to cycle/time/instret counters,
> bypassing the perf framework, this patchset intends to allow the user to
> mmap any counter when accessed through perf. But we can't break the
> existing behaviour so we introduce a sysctl perf_user_access like arm64
> does, which defaults to the legacy mode described above.
>
> *Note* that there are still ongoing discussions around which mode should
> be the default mode with distro people.
>
> base-commit-tag: v6.3
>
> Changes in v2:
> - Split into smaller patches, way better!
> - Add RB from Conor
> - Simplify the way we checked riscv architecture
> - Fix race mmap and other thread running on other cpus
> - Use hwc when available
> - Set all userspace access flags in event_init, too cumbersome to handle sysctl changes
> - Fix arch_perf_update_userpage for pmu other than riscv-pmu by renaming pmu driver
> - Fixed kernel test robot build error
> - Fixed documentation (Andrew and Bagas)
> - perf testsuite passes mmap tests in all 3 modes
>
> Alexandre Ghiti (10):
> perf: Fix wrong comment about default event_idx
> include: riscv: Fix wrong include guard in riscv_pmu.h
> riscv: Make legacy counter enum match the HW numbering
> drivers: perf: Rename riscv pmu driver
> riscv: Prepare for user-space perf event mmap support
> drivers: perf: Implement perf event mmap support in the legacy backend
> drivers: perf: Implement perf event mmap support in the SBI backend
> Documentation: admin-guide: Add riscv sysctl_perf_user_access
> tools: lib: perf: Implement riscv mmap support
> perf: tests: Adapt mmap-basic.c for riscv
>
> Documentation/admin-guide/sysctl/kernel.rst | 24 ++-
> arch/riscv/kernel/Makefile | 2 +-
> arch/riscv/kernel/perf_event.c | 74 ++++++++
> drivers/perf/riscv_pmu.c | 41 ++++
> drivers/perf/riscv_pmu_legacy.c | 37 +++-
> drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
> include/linux/perf/riscv_pmu.h | 10 +-
> include/linux/perf_event.h | 3 +-
> tools/lib/perf/mmap.c | 65 +++++++
> tools/perf/tests/mmap-basic.c | 4 +-
> 10 files changed, 435 insertions(+), 21 deletions(-)
> create mode 100644 arch/riscv/kernel/perf_event.c

It looks like there's a handful of outstanding review comments, so I'm
going to mark this as changes requested in patchwork -- sorry if I
missed a v3.