riscv used to allow direct access to cycle/time/instret counters,
bypassing the perf framework, this patchset intends to allow the user to
mmap any counter when accessed through perf. But we can't break the
existing behaviour so we introduce a sysctl perf_user_access like arm64
does, which defaults to the legacy mode described above.
This version needs openSBI v1.3 *and* a kernel fix that went upstream lately
(https://lore.kernel.org/lkml/[email protected]/T/).
**Important**: In this version, the default mode is now user access, not
the legacy so some applications will break.
base-commit-tag: v6.4-rc6
Changes in v4:
- Fixed some nits in riscv_pmu_sbi.c thanks to Andrew
- Fixed the documentation thanks to Andrew
- Added RB from Andrew \o/
Changes in v3:
- patch 1 now contains the ref to the faulty commit (no Fixes tag as it is only a comment), as Andrew suggested
- Removed RISCV_PMU_LEGACY_TIME from patch 3, as Andrew suggested
- Rename RISCV_PMU_PDEV_NAME to "riscv-pmu-sbi", patch4 is just cosmetic now, as Andrew suggested
- Removed a few useless (and wrong) comments, as Andrew suggested
- Simplify arch_perf_update_userpage code, as Andrew suggested
- Documentation now mentions that time CSR is *always* accessible, whatever the mode, as suggested by Andrew
- Removed CYCLEH reference and add TODO for rv32 support, as suggested by Atish
- Do not rename the pmu instance as Atish suggested
- Set pmc_width only if rdpmc is enabled and CONFIG_RISCV_PMU is set and the event is a hw event
- Move arch_perf_update_userpage https://lore.kernel.org/lkml/[email protected]/T/
- **Switch to user mode access by default**
Changes in v2:
- Split into smaller patches, way better!
- Add RB from Conor
- Simplify the way we checked riscv architecture
- Fix race mmap and other thread running on other cpus
- Use hwc when available
- Set all userspace access flags in event_init, too cumbersome to handle sysctl changes
- Fix arch_perf_update_userpage for pmu other than riscv-pmu by renaming pmu driver
- Fixed kernel test robot build error
- Fixed documentation (Andrew and Bagas)
- perf testsuite passes mmap tests in all 3 modes
Alexandre Ghiti (10):
perf: Fix wrong comment about default event_idx
include: riscv: Fix wrong include guard in riscv_pmu.h
riscv: Make legacy counter enum match the HW numbering
drivers: perf: Rename riscv pmu sbi driver
riscv: Prepare for user-space perf event mmap support
drivers: perf: Implement perf event mmap support in the legacy backend
drivers: perf: Implement perf event mmap support in the SBI backend
Documentation: admin-guide: Add riscv sysctl_perf_user_access
tools: lib: perf: Implement riscv mmap support
perf: tests: Adapt mmap-basic.c for riscv
Documentation/admin-guide/sysctl/kernel.rst | 27 ++-
drivers/perf/riscv_pmu.c | 113 +++++++++++
drivers/perf/riscv_pmu_legacy.c | 28 ++-
drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
include/linux/perf/riscv_pmu.h | 12 +-
include/linux/perf_event.h | 3 +-
tools/lib/perf/mmap.c | 65 +++++++
tools/perf/tests/mmap-basic.c | 4 +-
8 files changed, 428 insertions(+), 20 deletions(-)
--
2.39.2
That's just cosmetic, no functional changes.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 4 ++--
include/linux/perf/riscv_pmu.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 4f3ac296b3e2..83c3f1c4d2f1 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -914,7 +914,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
static struct platform_driver pmu_sbi_driver = {
.probe = pmu_sbi_device_probe,
.driver = {
- .name = RISCV_PMU_PDEV_NAME,
+ .name = RISCV_PMU_SBI_PDEV_NAME,
},
};
@@ -941,7 +941,7 @@ static int __init pmu_sbi_devinit(void)
if (ret)
return ret;
- pdev = platform_device_register_simple(RISCV_PMU_PDEV_NAME, -1, NULL, 0);
+ pdev = platform_device_register_simple(RISCV_PMU_SBI_PDEV_NAME, -1, NULL, 0);
if (IS_ERR(pdev)) {
platform_driver_unregister(&pmu_sbi_driver);
return PTR_ERR(pdev);
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 9f70d94942e0..5deeea0be7cb 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -21,7 +21,7 @@
#define RISCV_MAX_COUNTERS 64
#define RISCV_OP_UNSUPP (-EOPNOTSUPP)
-#define RISCV_PMU_PDEV_NAME "riscv-pmu"
+#define RISCV_PMU_SBI_PDEV_NAME "riscv-pmu-sbi"
#define RISCV_PMU_LEGACY_PDEV_NAME "riscv-pmu-legacy"
#define RISCV_PMU_STOP_FLAG_RESET 1
--
2.39.2
Since commit c719f56092ad ("perf: Fix and clean up initialization of
pmu::event_idx"), event_idx default implementation has returned 0, not
idx + 1, so fix the comment that can be misleading.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
---
include/linux/perf_event.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..56fe43b20966 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -442,7 +442,8 @@ struct pmu {
/*
* Will return the value for perf_event_mmap_page::index for this event,
- * if no implementation is provided it will default to: event->hw.idx + 1.
+ * if no implementation is provided it will default to 0 (see
+ * perf_event_idx_default).
*/
int (*event_idx) (struct perf_event *event); /*optional */
--
2.39.2
The current include guard prevents the inclusion of asm/perf_event.h
which uses the same include guard: fix the one in riscv_pmu.h so that it
matches the file name.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
---
include/linux/perf/riscv_pmu.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 43fc892aa7d9..9f70d94942e0 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -6,8 +6,8 @@
*
*/
-#ifndef _ASM_RISCV_PERF_EVENT_H
-#define _ASM_RISCV_PERF_EVENT_H
+#ifndef _RISCV_PMU_H
+#define _RISCV_PMU_H
#include <linux/perf_event.h>
#include <linux/ptrace.h>
@@ -81,4 +81,4 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr);
#endif /* CONFIG_RISCV_PMU */
-#endif /* _ASM_RISCV_PERF_EVENT_H */
+#endif /* _RISCV_PMU_H */
--
2.39.2
RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
hardware counter from CSR_CYCLE is actually 2: make this offset match the
real hw offset so that we can directly expose those values to userspace.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
drivers/perf/riscv_pmu_legacy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index ca9e20bfc7ac..6a000abc28bb 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -13,7 +13,7 @@
#include <linux/platform_device.h>
#define RISCV_PMU_LEGACY_CYCLE 0
-#define RISCV_PMU_LEGACY_INSTRET 1
+#define RISCV_PMU_LEGACY_INSTRET 2
static bool pmu_init_done;
--
2.39.2
We used to unconditionnally expose the cycle and instret csrs to
userspace, which gives rise to security concerns.
So now we only allow access to hw counters from userspace through the perf
framework which will handle context switches, per-task events...etc. But
as we cannot break userspace, we give the user the choice to go back to
the previous behaviour by setting the sysctl perf_user_access.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
drivers/perf/riscv_pmu.c | 10 +-
drivers/perf/riscv_pmu_sbi.c | 192 +++++++++++++++++++++++++++++++++--
2 files changed, 195 insertions(+), 7 deletions(-)
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index 432ad2e80ce3..80c052e93f9e 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -38,7 +38,15 @@ void arch_perf_update_userpage(struct perf_event *event,
userpg->cap_user_time_short = 0;
userpg->cap_user_rdpmc = riscv_perf_user_access(event);
- userpg->pmc_width = 64;
+#ifdef CONFIG_RISCV_PMU
+ /*
+ * The counters are 64-bit but the priv spec doesn't mandate all the
+ * bits to be implemented: that's why, counter width can vary based on
+ * the cpu vendor.
+ */
+ if (userpg->cap_user_rdpmc)
+ userpg->pmc_width = to_riscv_pmu(event->pmu)->ctr_get_width(event->hw.idx) + 1;
+#endif
do {
rd = sched_clock_read_begin(&seq);
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 83c3f1c4d2f1..2236cc9aa4b8 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -24,6 +24,14 @@
#include <asm/sbi.h>
#include <asm/hwcap.h>
+#define SYSCTL_NO_USER_ACCESS 0
+#define SYSCTL_USER_ACCESS 1
+#define SYSCTL_LEGACY 2
+
+#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
+#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
+#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
+
PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");
@@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
NULL,
};
+/* Allow user mode access by default */
+static int sysctl_perf_user_access __read_mostly = SYSCTL_USER_ACCESS;
+
/*
* RISC-V doesn't have heterogeneous harts yet. This need to be part of
* per_cpu in case of harts with different pmu counters
@@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
}
EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
+static uint8_t pmu_sbi_csr_index(struct perf_event *event)
+{
+ return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
+}
+
static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
{
unsigned long cflags = 0;
@@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
struct sbiret ret;
int idx;
- uint64_t cbase = 0;
+ uint64_t cbase = 0, cmask = rvpmu->cmask;
unsigned long cflags = 0;
cflags = pmu_sbi_get_filter_flags(event);
+
+ /*
+ * In legacy mode, we have to force the fixed counters for those events
+ * but not in the user access mode as we want to use the other counters
+ * that support sampling/filtering.
+ */
+ if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
+ cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+ cmask = 1;
+ } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
+ cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+ cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
+ }
+ }
+
/* retrieve the available counter index */
#if defined(CONFIG_32BIT)
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
- rvpmu->cmask, cflags, hwc->event_base, hwc->config,
+ cmask, cflags, hwc->event_base, hwc->config,
hwc->config >> 32);
#else
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
- rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
+ cmask, cflags, hwc->event_base, hwc->config, 0);
#endif
if (ret.error) {
pr_debug("Not able to find a counter for event %lx config %llx\n",
@@ -474,6 +506,22 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
return val;
}
+static void pmu_sbi_set_scounteren(void *arg)
+{
+ struct perf_event *event = (struct perf_event *)arg;
+
+ csr_write(CSR_SCOUNTEREN,
+ csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
+}
+
+static void pmu_sbi_reset_scounteren(void *arg)
+{
+ struct perf_event *event = (struct perf_event *)arg;
+
+ csr_write(CSR_SCOUNTEREN,
+ csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
+}
+
static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
{
struct sbiret ret;
@@ -490,6 +538,10 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
pr_err("Starting counter idx %d failed with error %d\n",
hwc->idx, sbi_err_map_linux_errno(ret.error));
+
+ if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
+ (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ pmu_sbi_set_scounteren((void *)event);
}
static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
@@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
struct sbiret ret;
struct hw_perf_event *hwc = &event->hw;
+ if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
+ (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ pmu_sbi_reset_scounteren((void *)event);
+
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
flag != SBI_PMU_STOP_FLAG_RESET)
@@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
/*
- * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
- * as is necessary to maintain uABI compatibility.
+ * We keep enabling userspace access to CYCLE, TIME and INSRET via the
+ * legacy option but that will be removed in the future.
*/
- csr_write(CSR_SCOUNTEREN, 0x7);
+ if (sysctl_perf_user_access == SYSCTL_LEGACY)
+ csr_write(CSR_SCOUNTEREN, 0x7);
+ else
+ csr_write(CSR_SCOUNTEREN, 0x2);
/* Stop all the counters so that they can be enabled from perf */
pmu_sbi_stop_all(pmu);
@@ -851,6 +910,121 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
}
+static void pmu_sbi_event_init(struct perf_event *event)
+{
+ /*
+ * The permissions are set at event_init so that we do not depend
+ * on the sysctl value that can change.
+ */
+ if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
+ event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
+ else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
+ event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
+ else
+ event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
+}
+
+static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
+ return;
+
+ if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
+ return;
+ }
+ }
+
+ /*
+ * The user mmapped the event to directly access it: this is where
+ * we determine based on sysctl_perf_user_access if we grant userspace
+ * the direct access to this event. That means that within the same
+ * task, some events may be directly accessible and some other may not,
+ * if the user changes the value of sysctl_perf_user_accesss in the
+ * meantime.
+ */
+
+ event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
+
+ /*
+ * We must enable userspace access *before* advertising in the user page
+ * that it is possible to do so to avoid any race.
+ * And we must notify all cpus here because threads that currently run
+ * on other cpus will try to directly access the counter too without
+ * calling pmu_sbi_ctr_start.
+ */
+ if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
+ on_each_cpu_mask(mm_cpumask(mm),
+ pmu_sbi_set_scounteren, (void *)event, 1);
+}
+
+static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
+ return;
+
+ if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
+ return;
+ }
+ }
+
+ /*
+ * Here we can directly remove user access since the user does not have
+ * access to the user page anymore so we avoid the racy window where the
+ * user could have read cap_user_rdpmc to true right before we disable
+ * it.
+ */
+ event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
+
+ if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
+ on_each_cpu_mask(mm_cpumask(mm),
+ pmu_sbi_reset_scounteren, (void *)event, 1);
+}
+
+static void riscv_pmu_update_counter_access(void *info)
+{
+ if (sysctl_perf_user_access == SYSCTL_LEGACY)
+ csr_write(CSR_SCOUNTEREN, 0x7);
+ else
+ csr_write(CSR_SCOUNTEREN, 0x2);
+}
+
+static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
+ int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int prev = sysctl_perf_user_access;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ /*
+ * Test against the previous value since we clear SCOUNTEREN when
+ * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
+ * not do that if that was already the case.
+ */
+ if (ret || !write || prev == sysctl_perf_user_access)
+ return ret;
+
+ on_each_cpu(riscv_pmu_update_counter_access, NULL, 1);
+
+ return 0;
+}
+
+static struct ctl_table sbi_pmu_sysctl_table[] = {
+ {
+ .procname = "perf_user_access",
+ .data = &sysctl_perf_user_access,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = riscv_pmu_proc_user_access_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_TWO,
+ },
+ { }
+};
+
static int pmu_sbi_device_probe(struct platform_device *pdev)
{
struct riscv_pmu *pmu = NULL;
@@ -888,6 +1062,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
pmu->ctr_get_width = pmu_sbi_ctr_get_width;
pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
pmu->ctr_read = pmu_sbi_ctr_read;
+ pmu->event_init = pmu_sbi_event_init;
+ pmu->event_mapped = pmu_sbi_event_mapped;
+ pmu->event_unmapped = pmu_sbi_event_unmapped;
+ pmu->csr_index = pmu_sbi_csr_index;
ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
if (ret)
@@ -901,6 +1079,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
if (ret)
goto out_unregister;
+ register_sysctl("kernel", sbi_pmu_sysctl_table);
+
return 0;
out_unregister:
--
2.39.2
Implement the needed callbacks in the legacy driver so that we can
directly access the counters through perf in userspace.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
drivers/perf/riscv_pmu_legacy.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index 6a000abc28bb..79fdd667922e 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -71,6 +71,29 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
local64_set(&hwc->prev_count, initial_val);
}
+static uint8_t pmu_legacy_csr_index(struct perf_event *event)
+{
+ return event->hw.idx;
+}
+
+static void pmu_legacy_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
+ return;
+
+ event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
+static void pmu_legacy_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+ event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
+ return;
+
+ event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
/*
* This is just a simple implementation to allow legacy implementations
* compatible with new RISC-V PMU driver framework.
@@ -91,6 +114,9 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
pmu->ctr_get_width = NULL;
pmu->ctr_clear_idx = NULL;
pmu->ctr_read = pmu_legacy_read_ctr;
+ pmu->event_mapped = pmu_legacy_event_mapped;
+ pmu->event_unmapped = pmu_legacy_event_unmapped;
+ pmu->csr_index = pmu_legacy_csr_index;
perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
}
--
2.39.2
riscv now supports mmaping hardware counters so add what's needed to
take advantage of that in libperf.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 0d1634cedf44..378a163f0554 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
+#elif __riscv_xlen == 64
+
+/* TODO: implement rv32 support */
+
+#define CSR_CYCLE 0xc00
+#define CSR_TIME 0xc01
+
+#define csr_read(csr) \
+({ \
+ register unsigned long __v; \
+ __asm__ __volatile__ ("csrr %0, " #csr \
+ : "=r" (__v) : \
+ : "memory"); \
+ __v; \
+})
+
+static unsigned long csr_read_num(int csr_num)
+{
+#define switchcase_csr_read(__csr_num, __val) {\
+ case __csr_num: \
+ __val = csr_read(__csr_num); \
+ break; }
+#define switchcase_csr_read_2(__csr_num, __val) {\
+ switchcase_csr_read(__csr_num + 0, __val) \
+ switchcase_csr_read(__csr_num + 1, __val)}
+#define switchcase_csr_read_4(__csr_num, __val) {\
+ switchcase_csr_read_2(__csr_num + 0, __val) \
+ switchcase_csr_read_2(__csr_num + 2, __val)}
+#define switchcase_csr_read_8(__csr_num, __val) {\
+ switchcase_csr_read_4(__csr_num + 0, __val) \
+ switchcase_csr_read_4(__csr_num + 4, __val)}
+#define switchcase_csr_read_16(__csr_num, __val) {\
+ switchcase_csr_read_8(__csr_num + 0, __val) \
+ switchcase_csr_read_8(__csr_num + 8, __val)}
+#define switchcase_csr_read_32(__csr_num, __val) {\
+ switchcase_csr_read_16(__csr_num + 0, __val) \
+ switchcase_csr_read_16(__csr_num + 16, __val)}
+
+ unsigned long ret = 0;
+
+ switch (csr_num) {
+ switchcase_csr_read_32(CSR_CYCLE, ret)
+ default:
+ break;
+ }
+
+ return ret;
+#undef switchcase_csr_read_32
+#undef switchcase_csr_read_16
+#undef switchcase_csr_read_8
+#undef switchcase_csr_read_4
+#undef switchcase_csr_read_2
+#undef switchcase_csr_read
+}
+
+static u64 read_perf_counter(unsigned int counter)
+{
+ return csr_read_num(CSR_CYCLE + counter);
+}
+
+static u64 read_timestamp(void)
+{
+ return csr_read_num(CSR_TIME);
+}
+
#else
static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
static u64 read_timestamp(void) { return 0; }
--
2.39.2
riscv now supports mmaping hardware counters to userspace so adapt the test
to run on this architecture.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
tools/perf/tests/mmap-basic.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
index e68ca6229756..f5075ca774f8 100644
--- a/tools/perf/tests/mmap-basic.c
+++ b/tools/perf/tests/mmap-basic.c
@@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
"permissions"),
TEST_CASE_REASON("User space counter reading of instructions",
mmap_user_read_instr,
-#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
"permissions"
#else
"unsupported"
@@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
),
TEST_CASE_REASON("User space counter reading of cycles",
mmap_user_read_cycles,
-#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
"permissions"
#else
"unsupported"
--
2.39.2
Provide all the necessary bits in the generic riscv pmu driver to be
able to mmap perf events in userspace: the heavy lifting lies in the
driver backend, namely the legacy and sbi implementations.
Note that arch_perf_update_userpage is almost a copy of arm64 code.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
---
drivers/perf/riscv_pmu.c | 105 +++++++++++++++++++++++++++++++++
include/linux/perf/riscv_pmu.h | 4 ++
2 files changed, 109 insertions(+)
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index ebca5eab9c9b..432ad2e80ce3 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -14,9 +14,73 @@
#include <linux/perf/riscv_pmu.h>
#include <linux/printk.h>
#include <linux/smp.h>
+#include <linux/sched_clock.h>
#include <asm/sbi.h>
+static bool riscv_perf_user_access(struct perf_event *event)
+{
+ return ((event->attr.type == PERF_TYPE_HARDWARE) ||
+ (event->attr.type == PERF_TYPE_HW_CACHE) ||
+ (event->attr.type == PERF_TYPE_RAW)) &&
+ !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
+}
+
+void arch_perf_update_userpage(struct perf_event *event,
+ struct perf_event_mmap_page *userpg, u64 now)
+{
+ struct clock_read_data *rd;
+ unsigned int seq;
+ u64 ns;
+
+ userpg->cap_user_time = 0;
+ userpg->cap_user_time_zero = 0;
+ userpg->cap_user_time_short = 0;
+ userpg->cap_user_rdpmc = riscv_perf_user_access(event);
+
+ userpg->pmc_width = 64;
+
+ do {
+ rd = sched_clock_read_begin(&seq);
+
+ userpg->time_mult = rd->mult;
+ userpg->time_shift = rd->shift;
+ userpg->time_zero = rd->epoch_ns;
+ userpg->time_cycles = rd->epoch_cyc;
+ userpg->time_mask = rd->sched_clock_mask;
+
+ /*
+ * Subtract the cycle base, such that software that
+ * doesn't know about cap_user_time_short still 'works'
+ * assuming no wraps.
+ */
+ ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
+ userpg->time_zero -= ns;
+
+ } while (sched_clock_read_retry(seq));
+
+ userpg->time_offset = userpg->time_zero - now;
+
+ /*
+ * time_shift is not expected to be greater than 31 due to
+ * the original published conversion algorithm shifting a
+ * 32-bit value (now specifies a 64-bit value) - refer
+ * perf_event_mmap_page documentation in perf_event.h.
+ */
+ if (userpg->time_shift == 32) {
+ userpg->time_shift = 31;
+ userpg->time_mult >>= 1;
+ }
+
+ /*
+ * Internal timekeeping for enabled/running/stopped times
+ * is always computed with the sched_clock.
+ */
+ userpg->cap_user_time = 1;
+ userpg->cap_user_time_zero = 1;
+ userpg->cap_user_time_short = 1;
+}
+
static unsigned long csr_read_num(int csr_num)
{
#define switchcase_csr_read(__csr_num, __val) {\
@@ -171,6 +235,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
local64_set(&hwc->prev_count, (u64)-left);
+ perf_event_update_userpage(event);
+
return overflow;
}
@@ -267,6 +333,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
hwc->idx = -1;
hwc->event_base = mapped_event;
+ if (rvpmu->event_init)
+ rvpmu->event_init(event);
+
if (!is_sampling_event(event)) {
/*
* For non-sampling runs, limit the sample_period to half
@@ -283,6 +352,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
return 0;
}
+static int riscv_pmu_event_idx(struct perf_event *event)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ return 0;
+
+ if (rvpmu->csr_index)
+ return rvpmu->csr_index(event) + 1;
+
+ return 0;
+}
+
+static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (rvpmu->event_mapped) {
+ rvpmu->event_mapped(event, mm);
+ perf_event_update_userpage(event);
+ }
+}
+
+static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+ struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+ if (rvpmu->event_unmapped) {
+ rvpmu->event_unmapped(event, mm);
+ perf_event_update_userpage(event);
+ }
+}
+
struct riscv_pmu *riscv_pmu_alloc(void)
{
struct riscv_pmu *pmu;
@@ -307,6 +409,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
}
pmu->pmu = (struct pmu) {
.event_init = riscv_pmu_event_init,
+ .event_mapped = riscv_pmu_event_mapped,
+ .event_unmapped = riscv_pmu_event_unmapped,
+ .event_idx = riscv_pmu_event_idx,
.add = riscv_pmu_add,
.del = riscv_pmu_del,
.start = riscv_pmu_start,
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 5deeea0be7cb..43282e22ebe1 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -55,6 +55,10 @@ struct riscv_pmu {
void (*ctr_start)(struct perf_event *event, u64 init_val);
void (*ctr_stop)(struct perf_event *event, unsigned long flag);
int (*event_map)(struct perf_event *event, u64 *config);
+ void (*event_init)(struct perf_event *event);
+ void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
+ void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
+ uint8_t (*csr_index)(struct perf_event *event);
struct cpu_hw_events __percpu *hw_events;
struct hlist_node node;
--
2.39.2
riscv now uses this sysctl so document its usage for this architecture.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 27 ++++++++++++++++++---
1 file changed, 23 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index d85d90f5d000..19b627883313 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -941,16 +941,35 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.
-perf_user_access (arm64 only)
-=================================
+perf_user_access (arm64 and riscv only)
+=======================================
+
+Controls user space access for reading perf event counters.
-Controls user space access for reading perf event counters. When set to 1,
-user space can read performance monitor counter registers directly.
+arm64
+=====
The default value is 0 (access disabled).
+When set to 1, user space can read performance monitor counter registers
+directly.
+
See Documentation/arm64/perf.rst for more information.
+riscv
+=====
+
+When set to 0, user space access is disabled.
+
+The default value is 1, user space can read performance monitor counter
+registers through perf, any direct access without perf intervention will trigger
+an illegal instruction.
+
+When set to 2, which enables legacy mode (user space has direct access to cycle
+and insret CSRs only). Note that this legacy value is deprecated and will be
+removed once all user space applications are fixed.
+
+Note that the time CSR is always directly accessible to all modes.
pid_max
=======
--
2.39.2
On Mon, 3 Jul 2023, Alexandre Ghiti wrote:
> -=================================
> +perf_user_access (arm64 and riscv only)
> +=======================================
so I complained about this when support for this went in for arm64.
Why do we have two separate ways of getting this info, one for x86 and one
for arm64/riscv?
Could we get x86 patched to use the same interface?
It's a pain for tool users to have to maintain multiple code paths because
the various architectures can't agree on how to export this info to
userspace.
Vince
On Mon, Jul 3, 2023 at 5:49 AM Alexandre Ghiti <[email protected]> wrote:
>
> RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
> hardware counter from CSR_CYCLE is actually 2: make this offset match the
> real hw offset so that we can directly expose those values to userspace.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> drivers/perf/riscv_pmu_legacy.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index ca9e20bfc7ac..6a000abc28bb 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -13,7 +13,7 @@
> #include <linux/platform_device.h>
>
> #define RISCV_PMU_LEGACY_CYCLE 0
> -#define RISCV_PMU_LEGACY_INSTRET 1
> +#define RISCV_PMU_LEGACY_INSTRET 2
>
> static bool pmu_init_done;
>
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:50 AM Alexandre Ghiti <[email protected]> wrote:
>
> That's just cosmetic, no functional changes.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> drivers/perf/riscv_pmu_sbi.c | 4 ++--
> include/linux/perf/riscv_pmu.h | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 4f3ac296b3e2..83c3f1c4d2f1 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -914,7 +914,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> static struct platform_driver pmu_sbi_driver = {
> .probe = pmu_sbi_device_probe,
> .driver = {
> - .name = RISCV_PMU_PDEV_NAME,
> + .name = RISCV_PMU_SBI_PDEV_NAME,
> },
> };
>
> @@ -941,7 +941,7 @@ static int __init pmu_sbi_devinit(void)
> if (ret)
> return ret;
>
> - pdev = platform_device_register_simple(RISCV_PMU_PDEV_NAME, -1, NULL, 0);
> + pdev = platform_device_register_simple(RISCV_PMU_SBI_PDEV_NAME, -1, NULL, 0);
> if (IS_ERR(pdev)) {
> platform_driver_unregister(&pmu_sbi_driver);
> return PTR_ERR(pdev);
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 9f70d94942e0..5deeea0be7cb 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -21,7 +21,7 @@
>
> #define RISCV_MAX_COUNTERS 64
> #define RISCV_OP_UNSUPP (-EOPNOTSUPP)
> -#define RISCV_PMU_PDEV_NAME "riscv-pmu"
> +#define RISCV_PMU_SBI_PDEV_NAME "riscv-pmu-sbi"
> #define RISCV_PMU_LEGACY_PDEV_NAME "riscv-pmu-legacy"
>
> #define RISCV_PMU_STOP_FLAG_RESET 1
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:51 AM Alexandre Ghiti <[email protected]> wrote:
>
> Provide all the necessary bits in the generic riscv pmu driver to be
> able to mmap perf events in userspace: the heavy lifting lies in the
> driver backend, namely the legacy and sbi implementations.
>
> Note that arch_perf_update_userpage is almost a copy of arm64 code.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> drivers/perf/riscv_pmu.c | 105 +++++++++++++++++++++++++++++++++
> include/linux/perf/riscv_pmu.h | 4 ++
> 2 files changed, 109 insertions(+)
>
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index ebca5eab9c9b..432ad2e80ce3 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -14,9 +14,73 @@
> #include <linux/perf/riscv_pmu.h>
> #include <linux/printk.h>
> #include <linux/smp.h>
> +#include <linux/sched_clock.h>
>
> #include <asm/sbi.h>
>
> +static bool riscv_perf_user_access(struct perf_event *event)
> +{
> + return ((event->attr.type == PERF_TYPE_HARDWARE) ||
> + (event->attr.type == PERF_TYPE_HW_CACHE) ||
> + (event->attr.type == PERF_TYPE_RAW)) &&
> + !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> +}
> +
> +void arch_perf_update_userpage(struct perf_event *event,
> + struct perf_event_mmap_page *userpg, u64 now)
> +{
> + struct clock_read_data *rd;
> + unsigned int seq;
> + u64 ns;
> +
> + userpg->cap_user_time = 0;
> + userpg->cap_user_time_zero = 0;
> + userpg->cap_user_time_short = 0;
> + userpg->cap_user_rdpmc = riscv_perf_user_access(event);
> +
> + userpg->pmc_width = 64;
> +
> + do {
> + rd = sched_clock_read_begin(&seq);
> +
> + userpg->time_mult = rd->mult;
> + userpg->time_shift = rd->shift;
> + userpg->time_zero = rd->epoch_ns;
> + userpg->time_cycles = rd->epoch_cyc;
> + userpg->time_mask = rd->sched_clock_mask;
> +
> + /*
> + * Subtract the cycle base, such that software that
> + * doesn't know about cap_user_time_short still 'works'
> + * assuming no wraps.
> + */
> + ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> + userpg->time_zero -= ns;
> +
> + } while (sched_clock_read_retry(seq));
> +
> + userpg->time_offset = userpg->time_zero - now;
> +
> + /*
> + * time_shift is not expected to be greater than 31 due to
> + * the original published conversion algorithm shifting a
> + * 32-bit value (now specifies a 64-bit value) - refer
> + * perf_event_mmap_page documentation in perf_event.h.
> + */
> + if (userpg->time_shift == 32) {
> + userpg->time_shift = 31;
> + userpg->time_mult >>= 1;
> + }
> +
> + /*
> + * Internal timekeeping for enabled/running/stopped times
> + * is always computed with the sched_clock.
> + */
> + userpg->cap_user_time = 1;
> + userpg->cap_user_time_zero = 1;
> + userpg->cap_user_time_short = 1;
> +}
> +
> static unsigned long csr_read_num(int csr_num)
> {
> #define switchcase_csr_read(__csr_num, __val) {\
> @@ -171,6 +235,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
>
> local64_set(&hwc->prev_count, (u64)-left);
>
> + perf_event_update_userpage(event);
> +
> return overflow;
> }
>
> @@ -267,6 +333,9 @@ static int riscv_pmu_event_init(struct perf_event *event)
> hwc->idx = -1;
> hwc->event_base = mapped_event;
>
> + if (rvpmu->event_init)
> + rvpmu->event_init(event);
> +
> if (!is_sampling_event(event)) {
> /*
> * For non-sampling runs, limit the sample_period to half
> @@ -283,6 +352,39 @@ static int riscv_pmu_event_init(struct perf_event *event)
> return 0;
> }
>
> +static int riscv_pmu_event_idx(struct perf_event *event)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> + return 0;
> +
> + if (rvpmu->csr_index)
> + return rvpmu->csr_index(event) + 1;
> +
> + return 0;
> +}
> +
> +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (rvpmu->event_mapped) {
> + rvpmu->event_mapped(event, mm);
> + perf_event_update_userpage(event);
> + }
> +}
> +
> +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> + if (rvpmu->event_unmapped) {
> + rvpmu->event_unmapped(event, mm);
> + perf_event_update_userpage(event);
> + }
> +}
> +
> struct riscv_pmu *riscv_pmu_alloc(void)
> {
> struct riscv_pmu *pmu;
> @@ -307,6 +409,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> }
> pmu->pmu = (struct pmu) {
> .event_init = riscv_pmu_event_init,
> + .event_mapped = riscv_pmu_event_mapped,
> + .event_unmapped = riscv_pmu_event_unmapped,
> + .event_idx = riscv_pmu_event_idx,
> .add = riscv_pmu_add,
> .del = riscv_pmu_del,
> .start = riscv_pmu_start,
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 5deeea0be7cb..43282e22ebe1 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -55,6 +55,10 @@ struct riscv_pmu {
> void (*ctr_start)(struct perf_event *event, u64 init_val);
> void (*ctr_stop)(struct perf_event *event, unsigned long flag);
> int (*event_map)(struct perf_event *event, u64 *config);
> + void (*event_init)(struct perf_event *event);
> + void (*event_mapped)(struct perf_event *event, struct mm_struct *mm);
> + void (*event_unmapped)(struct perf_event *event, struct mm_struct *mm);
> + uint8_t (*csr_index)(struct perf_event *event);
>
> struct cpu_hw_events __percpu *hw_events;
> struct hlist_node node;
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:52 AM Alexandre Ghiti <[email protected]> wrote:
>
> Implement the needed callbacks in the legacy driver so that we can
> directly access the counters through perf in userspace.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> drivers/perf/riscv_pmu_legacy.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index 6a000abc28bb..79fdd667922e 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -71,6 +71,29 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
> local64_set(&hwc->prev_count, initial_val);
> }
>
> +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
> +{
> + return event->hw.idx;
> +}
> +
> +static void pmu_legacy_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> + return;
> +
> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
> +static void pmu_legacy_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> + return;
> +
> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
> /*
> * This is just a simple implementation to allow legacy implementations
> * compatible with new RISC-V PMU driver framework.
> @@ -91,6 +114,9 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> pmu->ctr_get_width = NULL;
> pmu->ctr_clear_idx = NULL;
> pmu->ctr_read = pmu_legacy_read_ctr;
> + pmu->event_mapped = pmu_legacy_event_mapped;
> + pmu->event_unmapped = pmu_legacy_event_unmapped;
> + pmu->csr_index = pmu_legacy_csr_index;
>
> perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> }
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:54 AM Alexandre Ghiti <[email protected]> wrote:
>
> riscv now uses this sysctl so document its usage for this architecture.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 27 ++++++++++++++++++---
> 1 file changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index d85d90f5d000..19b627883313 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -941,16 +941,35 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> The default value is 8.
>
>
> -perf_user_access (arm64 only)
> -=================================
> +perf_user_access (arm64 and riscv only)
> +=======================================
> +
> +Controls user space access for reading perf event counters.
>
> -Controls user space access for reading perf event counters. When set to 1,
> -user space can read performance monitor counter registers directly.
> +arm64
> +=====
>
> The default value is 0 (access disabled).
>
> +When set to 1, user space can read performance monitor counter registers
> +directly.
> +
For ARM64, 1 also enables reading performance counters via perf interface only.
I don't think there is any way to access the counters directly in ARM64.
> See Documentation/arm64/perf.rst for more information.
>
> +riscv
> +=====
> +
> +When set to 0, user space access is disabled.
> +
> +The default value is 1, user space can read performance monitor counter
> +registers through perf, any direct access without perf intervention will trigger
> +an illegal instruction.
> +
> +When set to 2, which enables legacy mode (user space has direct access to cycle
> +and insret CSRs only). Note that this legacy value is deprecated and will be
> +removed once all user space applications are fixed.
> +
> +Note that the time CSR is always directly accessible to all modes.
>
> pid_max
> =======
> --
> 2.39.2
>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:53 AM Alexandre Ghiti <[email protected]> wrote:
>
> We used to unconditionnally expose the cycle and instret csrs to
> userspace, which gives rise to security concerns.
>
> So now we only allow access to hw counters from userspace through the perf
> framework which will handle context switches, per-task events...etc. But
> as we cannot break userspace, we give the user the choice to go back to
> the previous behaviour by setting the sysctl perf_user_access.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> drivers/perf/riscv_pmu.c | 10 +-
> drivers/perf/riscv_pmu_sbi.c | 192 +++++++++++++++++++++++++++++++++--
> 2 files changed, 195 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index 432ad2e80ce3..80c052e93f9e 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -38,7 +38,15 @@ void arch_perf_update_userpage(struct perf_event *event,
> userpg->cap_user_time_short = 0;
> userpg->cap_user_rdpmc = riscv_perf_user_access(event);
>
> - userpg->pmc_width = 64;
> +#ifdef CONFIG_RISCV_PMU
> + /*
> + * The counters are 64-bit but the priv spec doesn't mandate all the
> + * bits to be implemented: that's why, counter width can vary based on
> + * the cpu vendor.
> + */
> + if (userpg->cap_user_rdpmc)
> + userpg->pmc_width = to_riscv_pmu(event->pmu)->ctr_get_width(event->hw.idx) + 1;
> +#endif
>
> do {
> rd = sched_clock_read_begin(&seq);
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 83c3f1c4d2f1..2236cc9aa4b8 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -24,6 +24,14 @@
> #include <asm/sbi.h>
> #include <asm/hwcap.h>
>
> +#define SYSCTL_NO_USER_ACCESS 0
> +#define SYSCTL_USER_ACCESS 1
> +#define SYSCTL_LEGACY 2
> +
> +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
> +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
> +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
> +
> PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> NULL,
> };
>
> +/* Allow user mode access by default */
> +static int sysctl_perf_user_access __read_mostly = SYSCTL_USER_ACCESS;
> +
> /*
> * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> * per_cpu in case of harts with different pmu counters
> @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> }
> EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>
> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> +{
> + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> +}
> +
> static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> {
> unsigned long cflags = 0;
> @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> struct sbiret ret;
> int idx;
> - uint64_t cbase = 0;
> + uint64_t cbase = 0, cmask = rvpmu->cmask;
> unsigned long cflags = 0;
>
> cflags = pmu_sbi_get_filter_flags(event);
> +
> + /*
> + * In legacy mode, we have to force the fixed counters for those events
> + * but not in the user access mode as we want to use the other counters
> + * that support sampling/filtering.
> + */
> + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> + cmask = 1;
> + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> + }
> + }
> +
Why do we need to do this ? If an application is using perf interface
to program the counters,
they don't need the PERF_EVENT_FLAG_LEGACY.
> /* retrieve the available counter index */
> #if defined(CONFIG_32BIT)
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> + cmask, cflags, hwc->event_base, hwc->config,
> hwc->config >> 32);
> #else
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> + cmask, cflags, hwc->event_base, hwc->config, 0);
> #endif
> if (ret.error) {
> pr_debug("Not able to find a counter for event %lx config %llx\n",
> @@ -474,6 +506,22 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> return val;
> }
>
> +static void pmu_sbi_set_scounteren(void *arg)
> +{
> + struct perf_event *event = (struct perf_event *)arg;
> +
> + csr_write(CSR_SCOUNTEREN,
> + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> +}
> +
> +static void pmu_sbi_reset_scounteren(void *arg)
> +{
> + struct perf_event *event = (struct perf_event *)arg;
> +
> + csr_write(CSR_SCOUNTEREN,
> + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> +}
> +
> static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> {
> struct sbiret ret;
> @@ -490,6 +538,10 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> pr_err("Starting counter idx %d failed with error %d\n",
> hwc->idx, sbi_err_map_linux_errno(ret.error));
> +
> + if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> + (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> + pmu_sbi_set_scounteren((void *)event);
> }
>
> static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> struct sbiret ret;
> struct hw_perf_event *hwc = &event->hw;
>
> + if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> + (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> + pmu_sbi_reset_scounteren((void *)event);
> +
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> flag != SBI_PMU_STOP_FLAG_RESET)
> @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>
> /*
> - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> - * as is necessary to maintain uABI compatibility.
> + * We keep enabling userspace access to CYCLE, TIME and INSRET via the
/s/INSRET/INSTRET
> + * legacy option but that will be removed in the future.
> */
> - csr_write(CSR_SCOUNTEREN, 0x7);
> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> + csr_write(CSR_SCOUNTEREN, 0x7);
> + else
> + csr_write(CSR_SCOUNTEREN, 0x2);
>
> /* Stop all the counters so that they can be enabled from perf */
> pmu_sbi_stop_all(pmu);
> @@ -851,6 +910,121 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> }
>
> +static void pmu_sbi_event_init(struct perf_event *event)
> +{
> + /*
> + * The permissions are set at event_init so that we do not depend
> + * on the sysctl value that can change.
> + */
> + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
> + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
> + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
> + else
> + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
> +}
> +
> +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> + return;
> +
> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> + return;
> + }
> + }
> +
> + /*
> + * The user mmapped the event to directly access it: this is where
> + * we determine based on sysctl_perf_user_access if we grant userspace
> + * the direct access to this event. That means that within the same
> + * task, some events may be directly accessible and some other may not,
> + * if the user changes the value of sysctl_perf_user_accesss in the
> + * meantime.
> + */
> +
> + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> +
> + /*
> + * We must enable userspace access *before* advertising in the user page
> + * that it is possible to do so to avoid any race.
> + * And we must notify all cpus here because threads that currently run
> + * on other cpus will try to directly access the counter too without
> + * calling pmu_sbi_ctr_start.
> + */
> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> + on_each_cpu_mask(mm_cpumask(mm),
> + pmu_sbi_set_scounteren, (void *)event, 1);
> +}
> +
> +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> + return;
> +
> + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> + return;
> + }
> + }
> +
> + /*
> + * Here we can directly remove user access since the user does not have
> + * access to the user page anymore so we avoid the racy window where the
> + * user could have read cap_user_rdpmc to true right before we disable
> + * it.
> + */
> + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> +
> + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> + on_each_cpu_mask(mm_cpumask(mm),
> + pmu_sbi_reset_scounteren, (void *)event, 1);
> +}
> +
> +static void riscv_pmu_update_counter_access(void *info)
> +{
> + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> + csr_write(CSR_SCOUNTEREN, 0x7);
> + else
> + csr_write(CSR_SCOUNTEREN, 0x2);
> +}
> +
> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> + int write, void *buffer,
> + size_t *lenp, loff_t *ppos)
> +{
> + int prev = sysctl_perf_user_access;
> + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +
> + /*
> + * Test against the previous value since we clear SCOUNTEREN when
> + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> + * not do that if that was already the case.
> + */
> + if (ret || !write || prev == sysctl_perf_user_access)
> + return ret;
> +
> + on_each_cpu(riscv_pmu_update_counter_access, NULL, 1);
> +
> + return 0;
> +}
> +
> +static struct ctl_table sbi_pmu_sysctl_table[] = {
> + {
> + .procname = "perf_user_access",
> + .data = &sysctl_perf_user_access,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = riscv_pmu_proc_user_access_handler,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_TWO,
> + },
> + { }
> +};
> +
> static int pmu_sbi_device_probe(struct platform_device *pdev)
> {
> struct riscv_pmu *pmu = NULL;
> @@ -888,6 +1062,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> pmu->ctr_read = pmu_sbi_ctr_read;
> + pmu->event_init = pmu_sbi_event_init;
> + pmu->event_mapped = pmu_sbi_event_mapped;
> + pmu->event_unmapped = pmu_sbi_event_unmapped;
> + pmu->csr_index = pmu_sbi_csr_index;
>
> ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> if (ret)
> @@ -901,6 +1079,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> if (ret)
> goto out_unregister;
>
> + register_sysctl("kernel", sbi_pmu_sysctl_table);
> +
> return 0;
>
> out_unregister:
> --
> 2.39.2
>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:46 AM Alexandre Ghiti <[email protected]> wrote:
>
> riscv used to allow direct access to cycle/time/instret counters,
> bypassing the perf framework, this patchset intends to allow the user to
> mmap any counter when accessed through perf. But we can't break the
> existing behaviour so we introduce a sysctl perf_user_access like arm64
> does, which defaults to the legacy mode described above.
>
> This version needs openSBI v1.3 *and* a kernel fix that went upstream lately
> (https://lore.kernel.org/lkml/[email protected]/T/).
>
> **Important**: In this version, the default mode is now user access, not
> the legacy so some applications will break.
>
Maybe providing a little more context to the issue will help
understanding why breaking those legacy
applications is necessary due to security reasons.
Here was the previous discussion around this[1].
Most of the legacy user applications using rdcycle should use rdtime
instead as they just want to record the time
elapsed. Allowing rdcycle/rdinstret to be read from user space can
lead to very deterministic attacks[2].
[1] https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/REWcwYnzsKE?pli=1
[2] https://www.youtube.com/watch?v=3-c4C_L2PRQ&ab_channel=IEEESymposiumonSecurityandPrivacy
Any user application that really requires to read rdcycle directly can
use this new perf interface to do that without any latency.
> base-commit-tag: v6.4-rc6
>
> Changes in v4:
> - Fixed some nits in riscv_pmu_sbi.c thanks to Andrew
> - Fixed the documentation thanks to Andrew
> - Added RB from Andrew \o/
>
> Changes in v3:
> - patch 1 now contains the ref to the faulty commit (no Fixes tag as it is only a comment), as Andrew suggested
> - Removed RISCV_PMU_LEGACY_TIME from patch 3, as Andrew suggested
> - Rename RISCV_PMU_PDEV_NAME to "riscv-pmu-sbi", patch4 is just cosmetic now, as Andrew suggested
> - Removed a few useless (and wrong) comments, as Andrew suggested
> - Simplify arch_perf_update_userpage code, as Andrew suggested
> - Documentation now mentions that time CSR is *always* accessible, whatever the mode, as suggested by Andrew
> - Removed CYCLEH reference and add TODO for rv32 support, as suggested by Atish
> - Do not rename the pmu instance as Atish suggested
> - Set pmc_width only if rdpmc is enabled and CONFIG_RISCV_PMU is set and the event is a hw event
> - Move arch_perf_update_userpage https://lore.kernel.org/lkml/[email protected]/T/
> - **Switch to user mode access by default**
>
> Changes in v2:
> - Split into smaller patches, way better!
> - Add RB from Conor
> - Simplify the way we checked riscv architecture
> - Fix race mmap and other thread running on other cpus
> - Use hwc when available
> - Set all userspace access flags in event_init, too cumbersome to handle sysctl changes
> - Fix arch_perf_update_userpage for pmu other than riscv-pmu by renaming pmu driver
> - Fixed kernel test robot build error
> - Fixed documentation (Andrew and Bagas)
> - perf testsuite passes mmap tests in all 3 modes
>
> Alexandre Ghiti (10):
> perf: Fix wrong comment about default event_idx
> include: riscv: Fix wrong include guard in riscv_pmu.h
> riscv: Make legacy counter enum match the HW numbering
> drivers: perf: Rename riscv pmu sbi driver
> riscv: Prepare for user-space perf event mmap support
> drivers: perf: Implement perf event mmap support in the legacy backend
> drivers: perf: Implement perf event mmap support in the SBI backend
> Documentation: admin-guide: Add riscv sysctl_perf_user_access
> tools: lib: perf: Implement riscv mmap support
> perf: tests: Adapt mmap-basic.c for riscv
>
> Documentation/admin-guide/sysctl/kernel.rst | 27 ++-
> drivers/perf/riscv_pmu.c | 113 +++++++++++
> drivers/perf/riscv_pmu_legacy.c | 28 ++-
> drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
> include/linux/perf/riscv_pmu.h | 12 +-
> include/linux/perf_event.h | 3 +-
> tools/lib/perf/mmap.c | 65 +++++++
> tools/perf/tests/mmap-basic.c | 4 +-
> 8 files changed, 428 insertions(+), 20 deletions(-)
>
> --
> 2.39.2
>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:57 AM Alexandre Ghiti <[email protected]> wrote:
>
> riscv now supports mmaping hardware counters to userspace so adapt the test
> to run on this architecture.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> tools/perf/tests/mmap-basic.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
> index e68ca6229756..f5075ca774f8 100644
> --- a/tools/perf/tests/mmap-basic.c
> +++ b/tools/perf/tests/mmap-basic.c
> @@ -284,7 +284,7 @@ static struct test_case tests__basic_mmap[] = {
> "permissions"),
> TEST_CASE_REASON("User space counter reading of instructions",
> mmap_user_read_instr,
> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> "permissions"
> #else
> "unsupported"
> @@ -292,7 +292,7 @@ static struct test_case tests__basic_mmap[] = {
> ),
> TEST_CASE_REASON("User space counter reading of cycles",
> mmap_user_read_cycles,
> -#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || __riscv_xlen == 64
> "permissions"
> #else
> "unsupported"
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
On Mon, Jul 3, 2023 at 5:56 AM Alexandre Ghiti <[email protected]> wrote:
>
> riscv now supports mmaping hardware counters so add what's needed to
> take advantage of that in libperf.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> Reviewed-by: Andrew Jones <[email protected]>
> ---
> tools/lib/perf/mmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 65 insertions(+)
>
> diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> index 0d1634cedf44..378a163f0554 100644
> --- a/tools/lib/perf/mmap.c
> +++ b/tools/lib/perf/mmap.c
> @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
>
> static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
>
> +#elif __riscv_xlen == 64
> +
> +/* TODO: implement rv32 support */
> +
> +#define CSR_CYCLE 0xc00
> +#define CSR_TIME 0xc01
> +
> +#define csr_read(csr) \
> +({ \
> + register unsigned long __v; \
> + __asm__ __volatile__ ("csrr %0, " #csr \
> + : "=r" (__v) : \
> + : "memory"); \
> + __v; \
> +})
> +
> +static unsigned long csr_read_num(int csr_num)
> +{
> +#define switchcase_csr_read(__csr_num, __val) {\
> + case __csr_num: \
> + __val = csr_read(__csr_num); \
> + break; }
> +#define switchcase_csr_read_2(__csr_num, __val) {\
> + switchcase_csr_read(__csr_num + 0, __val) \
> + switchcase_csr_read(__csr_num + 1, __val)}
> +#define switchcase_csr_read_4(__csr_num, __val) {\
> + switchcase_csr_read_2(__csr_num + 0, __val) \
> + switchcase_csr_read_2(__csr_num + 2, __val)}
> +#define switchcase_csr_read_8(__csr_num, __val) {\
> + switchcase_csr_read_4(__csr_num + 0, __val) \
> + switchcase_csr_read_4(__csr_num + 4, __val)}
> +#define switchcase_csr_read_16(__csr_num, __val) {\
> + switchcase_csr_read_8(__csr_num + 0, __val) \
> + switchcase_csr_read_8(__csr_num + 8, __val)}
> +#define switchcase_csr_read_32(__csr_num, __val) {\
> + switchcase_csr_read_16(__csr_num + 0, __val) \
> + switchcase_csr_read_16(__csr_num + 16, __val)}
> +
> + unsigned long ret = 0;
> +
> + switch (csr_num) {
> + switchcase_csr_read_32(CSR_CYCLE, ret)
> + default:
> + break;
> + }
> +
> + return ret;
> +#undef switchcase_csr_read_32
> +#undef switchcase_csr_read_16
> +#undef switchcase_csr_read_8
> +#undef switchcase_csr_read_4
> +#undef switchcase_csr_read_2
> +#undef switchcase_csr_read
> +}
> +
> +static u64 read_perf_counter(unsigned int counter)
> +{
> + return csr_read_num(CSR_CYCLE + counter);
> +}
> +
> +static u64 read_timestamp(void)
> +{
> + return csr_read_num(CSR_TIME);
> +}
> +
> #else
> static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> static u64 read_timestamp(void) { return 0; }
> --
> 2.39.2
>
Reviewed-by: Atish Patra <[email protected]>
--
Regards,
Atish
Hi,
Le 3 juillet 2023 15:46:37 GMT+03:00, Alexandre Ghiti <[email protected]> a écrit :
>riscv used to allow direct access to cycle/time/instret counters,
>bypassing the perf framework, this patchset intends to allow the user to
>mmap any counter when accessed through perf. But we can't break the
>existing behaviour so we introduce a sysctl perf_user_access like arm64
>does, which defaults to the legacy mode described above.
AFAIK, if the default settings breaks user space, the patchset is considered to break user space. That being the case, either this case is deemed special enough that breaking user space is OK, or it is not.
If it is not OK, then the only way out that I can think of, consists of trapping and emulating the counters, returning the same sanitised values that Linux perf would return. Then you can add a kernel config option to disable that trap-and-emulation code in the future.
Either way, I don't suppose that there should be an option to be insecurely backward compatible.
>
>This version needs openSBI v1.3 *and* a kernel fix that went upstream lately
>(https://lore.kernel.org/lkml/[email protected]/T/).
>
>**Important**: In this version, the default mode is now user access, not
>the legacy so some applications will break.
>
>base-commit-tag: v6.4-rc6
>
>Changes in v4:
>- Fixed some nits in riscv_pmu_sbi.c thanks to Andrew
>- Fixed the documentation thanks to Andrew
>- Added RB from Andrew \o/
>
>Changes in v3:
>- patch 1 now contains the ref to the faulty commit (no Fixes tag as it is only a comment), as Andrew suggested
>- Removed RISCV_PMU_LEGACY_TIME from patch 3, as Andrew suggested
>- Rename RISCV_PMU_PDEV_NAME to "riscv-pmu-sbi", patch4 is just cosmetic now, as Andrew suggested
>- Removed a few useless (and wrong) comments, as Andrew suggested
>- Simplify arch_perf_update_userpage code, as Andrew suggested
>- Documentation now mentions that time CSR is *always* accessible, whatever the mode, as suggested by Andrew
>- Removed CYCLEH reference and add TODO for rv32 support, as suggested by Atish
>- Do not rename the pmu instance as Atish suggested
>- Set pmc_width only if rdpmc is enabled and CONFIG_RISCV_PMU is set and the event is a hw event
>- Move arch_perf_update_userpage https://lore.kernel.org/lkml/[email protected]/T/
>- **Switch to user mode access by default**
>
>Changes in v2:
>- Split into smaller patches, way better!
>- Add RB from Conor
>- Simplify the way we checked riscv architecture
>- Fix race mmap and other thread running on other cpus
>- Use hwc when available
>- Set all userspace access flags in event_init, too cumbersome to handle sysctl changes
>- Fix arch_perf_update_userpage for pmu other than riscv-pmu by renaming pmu driver
>- Fixed kernel test robot build error
>- Fixed documentation (Andrew and Bagas)
>- perf testsuite passes mmap tests in all 3 modes
>
>Alexandre Ghiti (10):
> perf: Fix wrong comment about default event_idx
> include: riscv: Fix wrong include guard in riscv_pmu.h
> riscv: Make legacy counter enum match the HW numbering
> drivers: perf: Rename riscv pmu sbi driver
> riscv: Prepare for user-space perf event mmap support
> drivers: perf: Implement perf event mmap support in the legacy backend
> drivers: perf: Implement perf event mmap support in the SBI backend
> Documentation: admin-guide: Add riscv sysctl_perf_user_access
> tools: lib: perf: Implement riscv mmap support
> perf: tests: Adapt mmap-basic.c for riscv
>
> Documentation/admin-guide/sysctl/kernel.rst | 27 ++-
> drivers/perf/riscv_pmu.c | 113 +++++++++++
> drivers/perf/riscv_pmu_legacy.c | 28 ++-
> drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
> include/linux/perf/riscv_pmu.h | 12 +-
> include/linux/perf_event.h | 3 +-
> tools/lib/perf/mmap.c | 65 +++++++
> tools/perf/tests/mmap-basic.c | 4 +-
> 8 files changed, 428 insertions(+), 20 deletions(-)
>
On Mon, Jul 17, 2023 at 6:38 AM Rémi Denis-Courmont <[email protected]> wrote:
>
> Hi,
>
> Le 3 juillet 2023 15:46:37 GMT+03:00, Alexandre Ghiti <[email protected]> a écrit :
> >riscv used to allow direct access to cycle/time/instret counters,
> >bypassing the perf framework, this patchset intends to allow the user to
> >mmap any counter when accessed through perf. But we can't break the
> >existing behaviour so we introduce a sysctl perf_user_access like arm64
> >does, which defaults to the legacy mode described above.
>
> AFAIK, if the default settings breaks user space, the patchset is considered to break user space. That being the case, either this case is deemed special enough that breaking user space is OK, or it is not.
>
This case is a special case as the usage was incorrect in the first
place. Any application that genuinely requires rdcycle can always get
it now via the perf interface.
If the insecure and incorrect behavior is allowed, we suspect the user
space behavior will never be fixed as it is hard to put a future flag
date in these cases.
Given that the RISC-V eco-system is still young, it is better to
disallow this behavior to avoid further proliferation of the same
incorrect usage.
> If it is not OK, then the only way out that I can think of, consists of trapping and emulating the counters, returning the same sanitised values that Linux perf would return. Then you can add a kernel config option to disable that trap-and-emulation code in the future.
>
What do you mean by "sanitised" value ? If you mean the correct value,
it doesn't know when to stop providing that value (without reinventing
the entire context switch which the perf framework already does for
us).
If we just provide a dummy value, that doesn't help the current users
as well. They will probably get unexpected results instead of sigill.
I have cc'd a few folks who were involved in the initial discussions
when this issue was reported to collect the feedback.
> Either way, I don't suppose that there should be an option to be insecurely backward compatible.
>
> >
> >This version needs openSBI v1.3 *and* a kernel fix that went upstream lately
> >(https://lore.kernel.org/lkml/[email protected]/T/).
> >
> >**Important**: In this version, the default mode is now user access, not
> >the legacy so some applications will break.
> >
> >base-commit-tag: v6.4-rc6
> >
> >Changes in v4:
> >- Fixed some nits in riscv_pmu_sbi.c thanks to Andrew
> >- Fixed the documentation thanks to Andrew
> >- Added RB from Andrew \o/
> >
> >Changes in v3:
> >- patch 1 now contains the ref to the faulty commit (no Fixes tag as it is only a comment), as Andrew suggested
> >- Removed RISCV_PMU_LEGACY_TIME from patch 3, as Andrew suggested
> >- Rename RISCV_PMU_PDEV_NAME to "riscv-pmu-sbi", patch4 is just cosmetic now, as Andrew suggested
> >- Removed a few useless (and wrong) comments, as Andrew suggested
> >- Simplify arch_perf_update_userpage code, as Andrew suggested
> >- Documentation now mentions that time CSR is *always* accessible, whatever the mode, as suggested by Andrew
> >- Removed CYCLEH reference and add TODO for rv32 support, as suggested by Atish
> >- Do not rename the pmu instance as Atish suggested
> >- Set pmc_width only if rdpmc is enabled and CONFIG_RISCV_PMU is set and the event is a hw event
> >- Move arch_perf_update_userpage https://lore.kernel.org/lkml/[email protected]/T/
> >- **Switch to user mode access by default**
> >
> >Changes in v2:
> >- Split into smaller patches, way better!
> >- Add RB from Conor
> >- Simplify the way we checked riscv architecture
> >- Fix race mmap and other thread running on other cpus
> >- Use hwc when available
> >- Set all userspace access flags in event_init, too cumbersome to handle sysctl changes
> >- Fix arch_perf_update_userpage for pmu other than riscv-pmu by renaming pmu driver
> >- Fixed kernel test robot build error
> >- Fixed documentation (Andrew and Bagas)
> >- perf testsuite passes mmap tests in all 3 modes
> >
> >Alexandre Ghiti (10):
> > perf: Fix wrong comment about default event_idx
> > include: riscv: Fix wrong include guard in riscv_pmu.h
> > riscv: Make legacy counter enum match the HW numbering
> > drivers: perf: Rename riscv pmu sbi driver
> > riscv: Prepare for user-space perf event mmap support
> > drivers: perf: Implement perf event mmap support in the legacy backend
> > drivers: perf: Implement perf event mmap support in the SBI backend
> > Documentation: admin-guide: Add riscv sysctl_perf_user_access
> > tools: lib: perf: Implement riscv mmap support
> > perf: tests: Adapt mmap-basic.c for riscv
> >
> > Documentation/admin-guide/sysctl/kernel.rst | 27 ++-
> > drivers/perf/riscv_pmu.c | 113 +++++++++++
> > drivers/perf/riscv_pmu_legacy.c | 28 ++-
> > drivers/perf/riscv_pmu_sbi.c | 196 +++++++++++++++++++-
> > include/linux/perf/riscv_pmu.h | 12 +-
> > include/linux/perf_event.h | 3 +-
> > tools/lib/perf/mmap.c | 65 +++++++
> > tools/perf/tests/mmap-basic.c | 4 +-
> > 8 files changed, 428 insertions(+), 20 deletions(-)
> >
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
--
Regards,
Atish
Hi,
Le tiistaina 18. heinäkuuta 2023, 2.22.54 EEST Atish Patra a écrit :
> > AFAIK, if the default settings breaks user space, the patchset is
> > considered to break user space. That being the case, either this case is
> > deemed special enough that breaking user space is OK, or it is not.
> This case is a special case as the usage was incorrect in the first
> place.
I agree that it's not only insecure but also incorrect. However it mostly
works. In fact I don't disagree with the change as such, but I think that the
commit messages are misleading and confusing. For a start, in one place it
says that it is not breaking user space and in another it says basically the
opposite.
(Unfortunately, not everybody agrees with the change. I can't seem to get
FFmpeg's checkasm tool fixed:
http://ffmpeg.org/pipermail/ffmpeg-devel/2023-July/312245.html )
Also this is not the first time somebody argues that an API should be removed
because it's broken. That's not special.
> Any application that genuinely requires rdcycle can always get
> it now via the perf interface.
Sure. But the question is whether it breaks user space and if so, whether
that's acceptable. Existing apps that call RDCYCLE will now fail, presumbly
receive SIGILL(?).
> If the insecure and incorrect behavior is allowed, we suspect the user
> space behavior will never be fixed as it is hard to put a future flag
> date in these cases.
For better or worse, I can only agree there. But then adding an option to
preserve the broken behaviour is begging for people to (ab)use it, and indeed
never fix the problem, and never be able to remove the option.
> > If it is not OK, then the only way out that I can think of, consists of
> > trapping and emulating the counters, returning the same sanitised values
> > that Linux perf would return. Then you can add a kernel config option to
> > disable that trap-and-emulation code in the future.
> What do you mean by "sanitised" value ?
I mean whatever avoids creating a security issue. Presumably report the number
of cycles spent in the calling thread and in user mode since the first time
that the process called RDCYCLE?
Maybe it's not reasonable for complexity or performance reasons, but then IMO,
it deserves a little bit better explaining in the commit message.
--
雷米‧德尼-库尔蒙
http://www.remlab.net/
On Tue, Jul 18, 2023 at 10:06 AM Rémi Denis-Courmont <[email protected]> wrote:
>
> Hi,
>
> Le tiistaina 18. heinäkuuta 2023, 2.22.54 EEST Atish Patra a écrit :
> > > AFAIK, if the default settings breaks user space, the patchset is
> > > considered to break user space. That being the case, either this case is
> > > deemed special enough that breaking user space is OK, or it is not.
>
> > This case is a special case as the usage was incorrect in the first
> > place.
>
> I agree that it's not only insecure but also incorrect. However it mostly
> works. In fact I don't disagree with the change as such, but I think that the
> commit messages are misleading and confusing. For a start, in one place it
> says that it is not breaking user space and in another it says basically the
> opposite.
>
Agreed. We will improve the commit message to clarify that. That's also the
reason I started this whole thread :)
> (Unfortunately, not everybody agrees with the change. I can't seem to get
> FFmpeg's checkasm tool fixed:
> http://ffmpeg.org/pipermail/ffmpeg-devel/2023-July/312245.html )
>
Why can't rdtime(equivalent of rdtsc) be used instead of rdcycle ?
What does it use in x86 ? It also doesn't allow reading cycle counter
by default.
The perf syscall overhead is just one time setup thing during the
start of the application.
For counting the cycles before/after a loop, it still provides a
direct CSR access in user mode.
> Also this is not the first time somebody argues that an API should be removed
> because it's broken. That's not special.
>
> > Any application that genuinely requires rdcycle can always get
> > it now via the perf interface.
>
> Sure. But the question is whether it breaks user space and if so, whether
> that's acceptable. Existing apps that call RDCYCLE will now fail, presumbly
> receive SIGILL(?).
>
Yes. With this changes it will receive SIGILL if the default is NO ACCESS.
You can change the sysctl parameter to enable the direct access though
and make it work though.
> > If the insecure and incorrect behavior is allowed, we suspect the user
> > space behavior will never be fixed as it is hard to put a future flag
> > date in these cases.
>
> For better or worse, I can only agree there. But then adding an option to
> preserve the broken behaviour is begging for people to (ab)use it, and indeed
> never fix the problem, and never be able to remove the option.
>
x86 still carries that option. So I don't think once get down path, it
will very difficult to remove it.
> > > If it is not OK, then the only way out that I can think of, consists of
> > > trapping and emulating the counters, returning the same sanitised values
> > > that Linux perf would return. Then you can add a kernel config option to
> > > disable that trap-and-emulation code in the future.
> > What do you mean by "sanitised" value ?
>
> I mean whatever avoids creating a security issue. Presumably report the number
> of cycles spent in the calling thread and in user mode since the first time
> that the process called RDCYCLE?
>
> Maybe it's not reasonable for complexity or performance reasons, but then IMO,
> it deserves a little bit better explaining in the commit message.
>
Yes. I believe the complexities and throwaway code (assuming we should
stop doing that in the long run)
is not worth it given that we have a perfectly valid interface via
perf without any performance sacrifice.
RISC-V is not the first one to do it. It is disabled by default for
ARM64/x86 as well.
If the application usage was legal and we have years of software
development relying on that, it might have
made sense (e.g. x86 legacy usage). However, RISC-V is still young to
avoid those pitfalls.
> --
> 雷米‧德尼-库尔蒙
> http://www.remlab.net/
>
>
>
--
Regards,
Atish
Le tiistaina 18. heinäkuuta 2023, 21.45.15 EEST Atish Patra a écrit :
> > I agree that it's not only insecure but also incorrect. However it mostly
> > works. In fact I don't disagree with the change as such, but I think that
> > the commit messages are misleading and confusing. For a start, in one
> > place it says that it is not breaking user space and in another it says
> > basically the opposite.
>
> Agreed. We will improve the commit message to clarify that. That's also the
> reason I started this whole thread :)
>
> > (Unfortunately, not everybody agrees with the change. I can't seem to get
> > FFmpeg's checkasm tool fixed:
> > http://ffmpeg.org/pipermail/ffmpeg-devel/2023-July/312245.html )
>
> Why can't rdtime(equivalent of rdtsc) be used instead of rdcycle ?
Isn't RDTIM susceptible to interference from power management and CPU
frequency scaling? I suppose that RDCYCLE may behave differently depending on
PM in *some* designs, but that would still be way better than RDTIME for the
purpose.
As far as benchmarking is concerned (_excluding_ system security), RDTIME
seems to have all the problems of RDCYCLE, and then some more, no?
> The perf syscall overhead is just one time setup thing during the
> start of the application.
> For counting the cycles before/after a loop, it still provides a
> direct CSR access in user mode.
I suppose that you allude to mmap() here? The (dumb) FFmpeg code is using the
ioctl() interface though, but that's just laziness.
--
レミ・デニ-クールモン
http://www.remlab.net/
On Tue, Jul 18, 2023 at 12:04 PM Rémi Denis-Courmont <[email protected]> wrote:
>
> Le tiistaina 18. heinäkuuta 2023, 21.45.15 EEST Atish Patra a écrit :
> > > I agree that it's not only insecure but also incorrect. However it mostly
> > > works. In fact I don't disagree with the change as such, but I think that
> > > the commit messages are misleading and confusing. For a start, in one
> > > place it says that it is not breaking user space and in another it says
> > > basically the opposite.
> >
> > Agreed. We will improve the commit message to clarify that. That's also the
> > reason I started this whole thread :)
> >
> > > (Unfortunately, not everybody agrees with the change. I can't seem to get
> > > FFmpeg's checkasm tool fixed:
> > > http://ffmpeg.org/pipermail/ffmpeg-devel/2023-July/312245.html )
> >
> > Why can't rdtime(equivalent of rdtsc) be used instead of rdcycle ?
>
> Isn't RDTIM susceptible to interference from power management and CPU
> frequency scaling? I suppose that RDCYCLE may behave differently depending on
> PM in *some* designs, but that would still be way better than RDTIME for the
> purpose.
>
Yes. But that's what it is probably using for other ISAs ? My point
was it should
just do whatever it does for other ISA. RISC-V is no special in that regard.
> As far as benchmarking is concerned (_excluding_ system security), RDTIME
> seems to have all the problems of RDCYCLE, and then some more, no?
>
Correct. If the application can deal with noisy data, it can use
rdtime similar to
other architectures. Otherwise, it should just use the perf mmap
interface in the beginning
and read the counters whenever required.
> > The perf syscall overhead is just one time setup thing during the
> > start of the application.
> > For counting the cycles before/after a loop, it still provides a
> > direct CSR access in user mode.
>
> I suppose that you allude to mmap() here? The (dumb) FFmpeg code is using the
> ioctl() interface though, but that's just laziness.
>
Yeah. I agree it is more work but just one time effort to get more
reliable data without
compromising the security.
> --
> レミ・デニ-クールモン
> http://www.remlab.net/
>
>
>
--
Regards,
Atish
Le keskiviikkona 19. heinäkuuta 2023, 1.48.49 EEST Atish Patra a écrit :
> > Isn't RDTIM susceptible to interference from power management and CPU
> > frequency scaling? I suppose that RDCYCLE may behave differently depending
> > on PM in *some* designs, but that would still be way better than RDTIME
> > for the purpose.
>
> Yes. But that's what it is probably using for other ISAs ?
At least on AArch64, it is using either Linux perf cycle counter, or if that
is disabled at build time, the raw PMU cycle counter - which obviously leads
to SIGILL on Linux, just like this MR would do with RDCYCLE.
Again, I do not *personally* have objections to disabling RDCYCLE for
userspace (somebody else does, but that's neither my nor your problem). I do
have objections to the wording of some of the commit messages though.
> My point was it should just do whatever it does for other ISA. RISC-V is no
> special in that regard.
Sure. My point is that RDTIME may be great for, so to say, system-level
benchmarks. For FFmpeg that could something like how long it takes to
transcode a video. But it doesn't seem to make much sense for
microbenchmarking of single threaded tightly optimised loops, as opposed to
RDCYCLE (or a wrapper for RDCYCLE).
--
Rémi Denis-Courmont
http://www.remlab.net/
On Wed, Jul 19, 2023 at 7:46 AM Rémi Denis-Courmont <[email protected]> wrote:
>
> Le keskiviikkona 19. heinäkuuta 2023, 1.48.49 EEST Atish Patra a écrit :
> > > Isn't RDTIM susceptible to interference from power management and CPU
> > > frequency scaling? I suppose that RDCYCLE may behave differently depending
> > > on PM in *some* designs, but that would still be way better than RDTIME
> > > for the purpose.
> >
> > Yes. But that's what it is probably using for other ISAs ?
>
> At least on AArch64, it is using either Linux perf cycle counter, or if that
> is disabled at build time, the raw PMU cycle counter - which obviously leads
> to SIGILL on Linux, just like this MR would do with RDCYCLE.
>
Good to know. Thanks for the clarification.
> Again, I do not *personally* have objections to disabling RDCYCLE for
> userspace (somebody else does, but that's neither my nor your problem). I do
> have objections to the wording of some of the commit messages though.
>
Completely agreed. We will update the commit text with more clarification in v5.
> > My point was it should just do whatever it does for other ISA. RISC-V is no
> > special in that regard.
>
> Sure. My point is that RDTIME may be great for, so to say, system-level
> benchmarks. For FFmpeg that could something like how long it takes to
> transcode a video. But it doesn't seem to make much sense for
> microbenchmarking of single threaded tightly optimised loops, as opposed to
> RDCYCLE (or a wrapper for RDCYCLE).
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
>
>
>
--
Regards,
Atish
Hi Atish,
On Fri, Jul 14, 2023 at 10:46 AM Atish Patra <[email protected]> wrote:
>
> On Mon, Jul 3, 2023 at 5:53 AM Alexandre Ghiti <[email protected]> wrote:
> >
> > We used to unconditionnally expose the cycle and instret csrs to
> > userspace, which gives rise to security concerns.
> >
> > So now we only allow access to hw counters from userspace through the perf
> > framework which will handle context switches, per-task events...etc. But
> > as we cannot break userspace, we give the user the choice to go back to
> > the previous behaviour by setting the sysctl perf_user_access.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > Reviewed-by: Andrew Jones <[email protected]>
> > ---
> > drivers/perf/riscv_pmu.c | 10 +-
> > drivers/perf/riscv_pmu_sbi.c | 192 +++++++++++++++++++++++++++++++++--
> > 2 files changed, 195 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > index 432ad2e80ce3..80c052e93f9e 100644
> > --- a/drivers/perf/riscv_pmu.c
> > +++ b/drivers/perf/riscv_pmu.c
> > @@ -38,7 +38,15 @@ void arch_perf_update_userpage(struct perf_event *event,
> > userpg->cap_user_time_short = 0;
> > userpg->cap_user_rdpmc = riscv_perf_user_access(event);
> >
> > - userpg->pmc_width = 64;
> > +#ifdef CONFIG_RISCV_PMU
> > + /*
> > + * The counters are 64-bit but the priv spec doesn't mandate all the
> > + * bits to be implemented: that's why, counter width can vary based on
> > + * the cpu vendor.
> > + */
> > + if (userpg->cap_user_rdpmc)
> > + userpg->pmc_width = to_riscv_pmu(event->pmu)->ctr_get_width(event->hw.idx) + 1;
> > +#endif
> >
> > do {
> > rd = sched_clock_read_begin(&seq);
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 83c3f1c4d2f1..2236cc9aa4b8 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -24,6 +24,14 @@
> > #include <asm/sbi.h>
> > #include <asm/hwcap.h>
> >
> > +#define SYSCTL_NO_USER_ACCESS 0
> > +#define SYSCTL_USER_ACCESS 1
> > +#define SYSCTL_LEGACY 2
> > +
> > +#define PERF_EVENT_FLAG_NO_USER_ACCESS BIT(SYSCTL_NO_USER_ACCESS)
> > +#define PERF_EVENT_FLAG_USER_ACCESS BIT(SYSCTL_USER_ACCESS)
> > +#define PERF_EVENT_FLAG_LEGACY BIT(SYSCTL_LEGACY)
> > +
> > PMU_FORMAT_ATTR(event, "config:0-47");
> > PMU_FORMAT_ATTR(firmware, "config:63");
> >
> > @@ -43,6 +51,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> > NULL,
> > };
> >
> > +/* Allow user mode access by default */
> > +static int sysctl_perf_user_access __read_mostly = SYSCTL_USER_ACCESS;
> > +
> > /*
> > * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> > * per_cpu in case of harts with different pmu counters
> > @@ -301,6 +312,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> > }
> > EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
> >
> > +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> > +{
> > + return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> > +}
> > +
> > static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> > {
> > unsigned long cflags = 0;
> > @@ -329,18 +345,34 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> > struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> > struct sbiret ret;
> > int idx;
> > - uint64_t cbase = 0;
> > + uint64_t cbase = 0, cmask = rvpmu->cmask;
> > unsigned long cflags = 0;
> >
> > cflags = pmu_sbi_get_filter_flags(event);
> > +
> > + /*
> > + * In legacy mode, we have to force the fixed counters for those events
> > + * but not in the user access mode as we want to use the other counters
> > + * that support sampling/filtering.
> > + */
> > + if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> > + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > + cmask = 1;
> > + } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> > + cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > + cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> > + }
> > + }
> > +
>
> Why do we need to do this ? If an application is using perf interface
> to program the counters,
> they don't need the PERF_EVENT_FLAG_LEGACY.
But if the current global mode is the legacy mode and the application
is using the perf interface, we need to make sure only the direct
counters are exposed to userspace otherwise, any opensbi can choose
any other counter and that would give rise to a SIGILL (since
SCOUNTEREN is set globally).
>
> > /* retrieve the available counter index */
> > #if defined(CONFIG_32BIT)
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > - rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> > + cmask, cflags, hwc->event_base, hwc->config,
> > hwc->config >> 32);
> > #else
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > - rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> > + cmask, cflags, hwc->event_base, hwc->config, 0);
> > #endif
> > if (ret.error) {
> > pr_debug("Not able to find a counter for event %lx config %llx\n",
> > @@ -474,6 +506,22 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > return val;
> > }
> >
> > +static void pmu_sbi_set_scounteren(void *arg)
> > +{
> > + struct perf_event *event = (struct perf_event *)arg;
> > +
> > + csr_write(CSR_SCOUNTEREN,
> > + csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> > +}
> > +
> > +static void pmu_sbi_reset_scounteren(void *arg)
> > +{
> > + struct perf_event *event = (struct perf_event *)arg;
> > +
> > + csr_write(CSR_SCOUNTEREN,
> > + csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> > +}
> > +
> > static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > {
> > struct sbiret ret;
> > @@ -490,6 +538,10 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> > pr_err("Starting counter idx %d failed with error %d\n",
> > hwc->idx, sbi_err_map_linux_errno(ret.error));
> > +
> > + if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> > + (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > + pmu_sbi_set_scounteren((void *)event);
> > }
> >
> > static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > @@ -497,6 +549,10 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > struct sbiret ret;
> > struct hw_perf_event *hwc = &event->hw;
> >
> > + if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> > + (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > + pmu_sbi_reset_scounteren((void *)event);
> > +
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> > if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> > flag != SBI_PMU_STOP_FLAG_RESET)
> > @@ -704,10 +760,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> > struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> >
> > /*
> > - * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> > - * as is necessary to maintain uABI compatibility.
> > + * We keep enabling userspace access to CYCLE, TIME and INSRET via the
>
> /s/INSRET/INSTRET
Thanks!
>
> > + * legacy option but that will be removed in the future.
> > */
> > - csr_write(CSR_SCOUNTEREN, 0x7);
> > + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > + csr_write(CSR_SCOUNTEREN, 0x7);
> > + else
> > + csr_write(CSR_SCOUNTEREN, 0x2);
> >
> > /* Stop all the counters so that they can be enabled from perf */
> > pmu_sbi_stop_all(pmu);
> > @@ -851,6 +910,121 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> > cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > }
> >
> > +static void pmu_sbi_event_init(struct perf_event *event)
> > +{
> > + /*
> > + * The permissions are set at event_init so that we do not depend
> > + * on the sysctl value that can change.
> > + */
> > + if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> > + event->hw.flags |= PERF_EVENT_FLAG_NO_USER_ACCESS;
> > + else if (sysctl_perf_user_access == SYSCTL_USER_ACCESS)
> > + event->hw.flags |= PERF_EVENT_FLAG_USER_ACCESS;
> > + else
> > + event->hw.flags |= PERF_EVENT_FLAG_LEGACY;
> > +}
> > +
> > +static void pmu_sbi_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> > + return;
> > +
> > + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * The user mmapped the event to directly access it: this is where
> > + * we determine based on sysctl_perf_user_access if we grant userspace
> > + * the direct access to this event. That means that within the same
> > + * task, some events may be directly accessible and some other may not,
> > + * if the user changes the value of sysctl_perf_user_accesss in the
> > + * meantime.
> > + */
> > +
> > + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> > +
> > + /*
> > + * We must enable userspace access *before* advertising in the user page
> > + * that it is possible to do so to avoid any race.
> > + * And we must notify all cpus here because threads that currently run
> > + * on other cpus will try to directly access the counter too without
> > + * calling pmu_sbi_ctr_start.
> > + */
> > + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> > + on_each_cpu_mask(mm_cpumask(mm),
> > + pmu_sbi_set_scounteren, (void *)event, 1);
> > +}
> > +
> > +static void pmu_sbi_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > + if (event->hw.flags & PERF_EVENT_FLAG_NO_USER_ACCESS)
> > + return;
> > +
> > + if (event->hw.flags & PERF_EVENT_FLAG_LEGACY) {
> > + if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > + event->attr.config != PERF_COUNT_HW_INSTRUCTIONS) {
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * Here we can directly remove user access since the user does not have
> > + * access to the user page anymore so we avoid the racy window where the
> > + * user could have read cap_user_rdpmc to true right before we disable
> > + * it.
> > + */
> > + event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> > +
> > + if (event->hw.flags & PERF_EVENT_FLAG_USER_ACCESS)
> > + on_each_cpu_mask(mm_cpumask(mm),
> > + pmu_sbi_reset_scounteren, (void *)event, 1);
> > +}
> > +
> > +static void riscv_pmu_update_counter_access(void *info)
> > +{
> > + if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > + csr_write(CSR_SCOUNTEREN, 0x7);
> > + else
> > + csr_write(CSR_SCOUNTEREN, 0x2);
> > +}
> > +
> > +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> > + int write, void *buffer,
> > + size_t *lenp, loff_t *ppos)
> > +{
> > + int prev = sysctl_perf_user_access;
> > + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > +
> > + /*
> > + * Test against the previous value since we clear SCOUNTEREN when
> > + * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> > + * not do that if that was already the case.
> > + */
> > + if (ret || !write || prev == sysctl_perf_user_access)
> > + return ret;
> > +
> > + on_each_cpu(riscv_pmu_update_counter_access, NULL, 1);
> > +
> > + return 0;
> > +}
> > +
> > +static struct ctl_table sbi_pmu_sysctl_table[] = {
> > + {
> > + .procname = "perf_user_access",
> > + .data = &sysctl_perf_user_access,
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = riscv_pmu_proc_user_access_handler,
> > + .extra1 = SYSCTL_ZERO,
> > + .extra2 = SYSCTL_TWO,
> > + },
> > + { }
> > +};
> > +
> > static int pmu_sbi_device_probe(struct platform_device *pdev)
> > {
> > struct riscv_pmu *pmu = NULL;
> > @@ -888,6 +1062,10 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> > pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> > pmu->ctr_read = pmu_sbi_ctr_read;
> > + pmu->event_init = pmu_sbi_event_init;
> > + pmu->event_mapped = pmu_sbi_event_mapped;
> > + pmu->event_unmapped = pmu_sbi_event_unmapped;
> > + pmu->csr_index = pmu_sbi_csr_index;
> >
> > ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > if (ret)
> > @@ -901,6 +1079,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > if (ret)
> > goto out_unregister;
> >
> > + register_sysctl("kernel", sbi_pmu_sysctl_table);
> > +
> > return 0;
> >
> > out_unregister:
> > --
> > 2.39.2
> >
>
>
> --
> Regards,
> Atish
Hi Atish, Rémi,
On 19/07/2023 19:14, Atish Patra wrote:
> On Wed, Jul 19, 2023 at 7:46 AM Rémi Denis-Courmont <[email protected]> wrote:
>> Le keskiviikkona 19. heinäkuuta 2023, 1.48.49 EEST Atish Patra a écrit :
>>>> Isn't RDTIM susceptible to interference from power management and CPU
>>>> frequency scaling? I suppose that RDCYCLE may behave differently depending
>>>> on PM in *some* designs, but that would still be way better than RDTIME
>>>> for the purpose.
>>> Yes. But that's what it is probably using for other ISAs ?
>> At least on AArch64, it is using either Linux perf cycle counter, or if that
>> is disabled at build time, the raw PMU cycle counter - which obviously leads
>> to SIGILL on Linux, just like this MR would do with RDCYCLE.
>>
> Good to know. Thanks for the clarification.
>
>> Again, I do not *personally* have objections to disabling RDCYCLE for
>> userspace (somebody else does, but that's neither my nor your problem). I do
>> have objections to the wording of some of the commit messages though.
>>
> Completely agreed. We will update the commit text with more clarification in v5.
Thanks for all your comments and sorry for being slow here. I will
improve the commit logs in the next version, that's an oversight, sorry
about that.
Thanks again,
Alex
>
>>> My point was it should just do whatever it does for other ISA. RISC-V is no
>>> special in that regard.
>> Sure. My point is that RDTIME may be great for, so to say, system-level
>> benchmarks. For FFmpeg that could something like how long it takes to
>> transcode a video. But it doesn't seem to make much sense for
>> microbenchmarking of single threaded tightly optimised loops, as opposed to
>> RDCYCLE (or a wrapper for RDCYCLE).
>>
>> --
>> Rémi Denis-Courmont
>> http://www.remlab.net/
>>
>>
>>
>