This series implements SBI PMU improvements done in SBI v2.0[1] i.e. PMU snapshot
and fw_read_hi() functions.
SBI v2.0 introduced PMU snapshot feature which allows the SBI implementation
to provide counter information (i.e. values/overflow status) via a shared
memory between the SBI implementation and supervisor OS. This allows to minimize
the number of traps in when perf being used inside a kvm guest as it relies on
SBI PMU + trap/emulation of the counters.
The current set of ratified RISC-V specification also doesn't allow scountovf
to be trap/emulated by the hypervisor. The SBI PMU snapshot bridges the gap
in ISA as well and enables perf sampling in the guest. However, LCOFI in the
guest only works via IRQ filtering in AIA specification. That's why, AIA
has to be enabled in the hardware (at least the Ssaia extension) in order to
use the sampling support in the perf.
Here are the patch wise implementation details.
PATCH 1,6,7 : Generic cleanups/improvements.
PATCH 2,3,10 : FW_READ_HI function implementation
PATCH 4-5: Add PMU snapshot feature in sbi pmu driver
PATCH 6-7: KVM implementation for snapshot and sampling in kvm guests
The series is based on kvm-next and is available at:
https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v2
The kvmtool patch is also available at:
https://github.com/atishp04/kvmtool/tree/sscofpmf
It also requires Ssaia ISA extension to be present in the hardware in order to
get perf sampling support in the guest. In Qemu virt machine, it can be done
by the following config.
```
-cpu rv64,sscofpmf=true,x-ssaia=true
```
There is no other dependencies on AIA apart from that. Thus, Ssaia must be disabled
for the guest if AIA patches are not available. Here is the example command.
```
./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
```
The series has been tested only in Qemu.
Here is the snippet of the perf running inside a kvm guest.
===================================================
$ perf record -e cycles -e instructions perf bench sched messaging -g 5
...
$ Running 'sched/messaging' benchmark:
...
[ 45.928723] perf_duration_warn: 2 callbacks suppressed
[ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
$ 20 sender and receiver processes per group
$ 5 groups == 200 processes run
Total time: 14.220 [sec]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
$ perf report --stdio
$ To display the perf.data header info, please use --header/--header-only optio>
$
$
$ Total Lost Samples: 0
$
$ Samples: 943 of event 'cycles'
$ Event count (approx.): 5128976844
$
$ Overhead Command Shared Object Symbol >
$ ........ ............... ........................... .....................>
$
7.59% sched-messaging [kernel.kallsyms] [k] memcpy
5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
3.16% sched-messaging [kernel.kallsyms] [k] clear_page
3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
===================================================
[1] https://github.com/riscv-non-isa/riscv-sbi-doc
Changes from v1->v2:
1. Fixed warning/errors from patchwork CI.
2. Rebased on top of kvm-next.
3. Added Acked-by tags.
Changes from RFC->v1:
1. Addressed all the comments on RFC series.
2. Removed PATCH2 and merged into later patches.
3. Added 2 more patches for minor fixes.
4. Fixed KVM boot issue without Ssaia and made sscofpmf in guest dependent on
Ssaia in the host.
Atish Patra (10):
RISC-V: Fix the typo in Scountovf CSR name
RISC-V: Add FIRMWARE_READ_HI definition
drivers/perf: riscv: Read upper bits of a firmware counter
RISC-V: Add SBI PMU snapshot definitions
drivers/perf: riscv: Implement SBI PMU snapshot function
RISC-V: KVM: No need to update the counter value during reset
RISC-V: KVM: No need to exit to the user space if perf event failed
RISC-V: KVM: Implement SBI PMU Snapshot feature
RISC-V: KVM: Add perf sampling support for guests
RISC-V: KVM: Support 64 bit firmware counters on RV32
arch/riscv/include/asm/csr.h | 5 +-
arch/riscv/include/asm/errata_list.h | 2 +-
arch/riscv/include/asm/kvm_vcpu_pmu.h | 14 +-
arch/riscv/include/asm/sbi.h | 12 ++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/aia.c | 5 +
arch/riscv/kvm/main.c | 1 +
arch/riscv/kvm/vcpu.c | 8 +-
arch/riscv/kvm/vcpu_onereg.c | 9 +-
arch/riscv/kvm/vcpu_pmu.c | 246 ++++++++++++++++++++++++--
arch/riscv/kvm/vcpu_sbi_pmu.c | 15 +-
drivers/perf/riscv_pmu.c | 1 +
drivers/perf/riscv_pmu_sbi.c | 228 ++++++++++++++++++++++--
include/linux/perf/riscv_pmu.h | 6 +
14 files changed, 508 insertions(+), 45 deletions(-)
--
2.34.1
The counter overflow CSR name is "scountovf" not "sscountovf".
Fix the csr name.
Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support")
Reviewed-by: Conor Dooley <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/csr.h | 2 +-
arch/riscv/include/asm/errata_list.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 306a19a5509c..88cdc8a3e654 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -281,7 +281,7 @@
#define CSR_HPMCOUNTER30H 0xc9e
#define CSR_HPMCOUNTER31H 0xc9f
-#define CSR_SSCOUNTOVF 0xda0
+#define CSR_SCOUNTOVF 0xda0
#define CSR_SSTATUS 0x100
#define CSR_SIE 0x104
diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
index 83ed25e43553..7026fba12eeb 100644
--- a/arch/riscv/include/asm/errata_list.h
+++ b/arch/riscv/include/asm/errata_list.h
@@ -152,7 +152,7 @@ asm volatile(ALTERNATIVE_2( \
#define ALT_SBI_PMU_OVERFLOW(__ovl) \
asm volatile(ALTERNATIVE( \
- "csrr %0, " __stringify(CSR_SSCOUNTOVF), \
+ "csrr %0, " __stringify(CSR_SCOUNTOVF), \
"csrr %0, " __stringify(THEAD_C9XX_CSR_SCOUNTEROF), \
THEAD_VENDOR_ID, ERRATA_THEAD_PMU, \
CONFIG_ERRATA_THEAD_PMU) \
--
2.34.1
SBI v2.0 added another function to SBI PMU extension to read
the upper bits of a counter with width larger than XLEN.
Add the definition for that function.
Acked-by: Conor Dooley <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index b6f898c56940..914eacc6ba2e 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -122,6 +122,7 @@ enum sbi_ext_pmu_fid {
SBI_EXT_PMU_COUNTER_START,
SBI_EXT_PMU_COUNTER_STOP,
SBI_EXT_PMU_COUNTER_FW_READ,
+ SBI_EXT_PMU_COUNTER_FW_READ_HI,
};
union sbi_pmu_ctr_info {
--
2.34.1
SBI v2.0 introduced a explicit function to read the upper 32 bits
for any firmwar counter width that is longer than 32bits.
This is only applicable for RV32 where firmware counter can be
64 bit.
Acked-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 16acd4dcdb96..646604f8c0a5 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -35,6 +35,8 @@
PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");
+static bool sbi_v2_available;
+
static struct attribute *riscv_arch_formats_attr[] = {
&format_attr_event.attr,
&format_attr_firmware.attr,
@@ -488,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
struct sbiret ret;
- union sbi_pmu_ctr_info info;
u64 val = 0;
+ union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
if (pmu_sbi_is_fw_event(event)) {
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
hwc->idx, 0, 0, 0, 0, 0);
- if (!ret.error)
- val = ret.value;
+ if (ret.error)
+ return val;
+
+ val = ret.value;
+ if (IS_ENABLED(CONFIG_32BIT) && sbi_v2_available && info.width >= 32) {
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
+ hwc->idx, 0, 0, 0, 0, 0);
+ if (!ret.error)
+ val |= ((u64)ret.value << 32);
+ }
} else {
- info = pmu_ctr_list[idx];
val = riscv_pmu_ctr_read_csr(info.csr);
if (IS_ENABLED(CONFIG_32BIT))
val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
@@ -1108,6 +1117,9 @@ static int __init pmu_sbi_devinit(void)
return 0;
}
+ if (sbi_spec_version >= sbi_mk_version(2, 0))
+ sbi_v2_available = true;
+
ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
"perf/riscv/pmu:starting",
pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
--
2.34.1
SBI PMU Snapshot function optimizes the number of traps to
higher privilege mode by leveraging a shared memory between the S/VS-mode
and the M/HS mode. Add the definitions for that extension and new error
codes.
Reviewed-by: Anup Patel <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 914eacc6ba2e..75e95a7d9aa3 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -123,6 +123,7 @@ enum sbi_ext_pmu_fid {
SBI_EXT_PMU_COUNTER_STOP,
SBI_EXT_PMU_COUNTER_FW_READ,
SBI_EXT_PMU_COUNTER_FW_READ_HI,
+ SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
};
union sbi_pmu_ctr_info {
@@ -139,6 +140,13 @@ union sbi_pmu_ctr_info {
};
};
+/* Data structure to contain the pmu snapshot data */
+struct riscv_pmu_snapshot_data {
+ u64 ctr_overflow_mask;
+ u64 ctr_values[64];
+ u64 reserved[447];
+};
+
#define RISCV_PMU_RAW_EVENT_MASK GENMASK_ULL(47, 0)
#define RISCV_PMU_RAW_EVENT_IDX 0x20000
@@ -235,9 +243,11 @@ enum sbi_pmu_ctr_type {
/* Flags defined for counter start function */
#define SBI_PMU_START_FLAG_SET_INIT_VALUE (1 << 0)
+#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT BIT(1)
/* Flags defined for counter stop function */
#define SBI_PMU_STOP_FLAG_RESET (1 << 0)
+#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT BIT(1)
enum sbi_ext_dbcn_fid {
SBI_EXT_DBCN_CONSOLE_WRITE = 0,
@@ -276,6 +286,7 @@ struct sbi_sta_struct {
#define SBI_ERR_ALREADY_AVAILABLE -6
#define SBI_ERR_ALREADY_STARTED -7
#define SBI_ERR_ALREADY_STOPPED -8
+#define SBI_ERR_NO_SHMEM -9
extern unsigned long sbi_spec_version;
struct sbiret {
--
2.34.1
SBI v2.0 SBI introduced PMU snapshot feature which adds the following
features.
1. Read counter values directly from the shared memory instead of
csr read.
2. Start multiple counters with initial values with one SBI call.
These functionalities optimizes the number of traps to the higher
privilege mode. If the kernel is in VS mode while the hypervisor
deploy trap & emulate method, this would minimize all the hpmcounter
CSR read traps. If the kernel is running in S-mode, the benefits
reduced to CSR latency vs DRAM/cache latency as there is no trap
involved while accessing the hpmcounter CSRs.
In both modes, it does saves the number of ecalls while starting
multiple counter together with an initial values. This is a likely
scenario if multiple counters overflow at the same time.
Acked-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu.c | 1 +
drivers/perf/riscv_pmu_sbi.c | 208 +++++++++++++++++++++++++++++++--
include/linux/perf/riscv_pmu.h | 6 +
3 files changed, 203 insertions(+), 12 deletions(-)
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index 0dda70e1ef90..5b57acb770d3 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -412,6 +412,7 @@ struct riscv_pmu *riscv_pmu_alloc(void)
cpuc->n_events = 0;
for (i = 0; i < RISCV_MAX_COUNTERS; i++)
cpuc->events[i] = NULL;
+ cpuc->snapshot_addr = NULL;
}
pmu->pmu = (struct pmu) {
.event_init = riscv_pmu_event_init,
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 646604f8c0a5..1aeb8c59e78f 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -36,6 +36,9 @@ PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");
static bool sbi_v2_available;
+static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available);
+#define sbi_pmu_snapshot_available() \
+ static_branch_unlikely(&sbi_pmu_snapshot_available)
static struct attribute *riscv_arch_formats_attr[] = {
&format_attr_event.attr,
@@ -485,14 +488,100 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig)
return ret;
}
+static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+
+ if (!cpu_hw_evt->snapshot_addr)
+ continue;
+
+ free_page((unsigned long)cpu_hw_evt->snapshot_addr);
+ cpu_hw_evt->snapshot_addr = NULL;
+ cpu_hw_evt->snapshot_addr_phys = 0;
+ }
+}
+
+static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu)
+{
+ int cpu;
+ struct page *snapshot_page;
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+
+ if (cpu_hw_evt->snapshot_addr)
+ continue;
+
+ snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!snapshot_page) {
+ pmu_sbi_snapshot_free(pmu);
+ return -ENOMEM;
+ }
+ cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page);
+ cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page);
+ }
+
+ return 0;
+}
+
+static void pmu_sbi_snapshot_disable(void)
+{
+ sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, -1,
+ -1, 0, 0, 0, 0);
+}
+
+static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu)
+{
+ struct cpu_hw_events *cpu_hw_evt;
+ struct sbiret ret = {0};
+
+ cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+ if (!cpu_hw_evt->snapshot_addr_phys)
+ return -EINVAL;
+
+ if (cpu_hw_evt->snapshot_set_done)
+ return 0;
+
+ if (IS_ENABLED(CONFIG_32BIT))
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
+ cpu_hw_evt->snapshot_addr_phys,
+ (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0);
+ else
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
+ cpu_hw_evt->snapshot_addr_phys, 0, 0, 0, 0, 0);
+
+ /* Free up the snapshot area memory and fall back to SBI PMU calls without snapshot */
+ if (ret.error) {
+ if (ret.error != SBI_ERR_NOT_SUPPORTED)
+ pr_warn("pmu snapshot setup failed with error %ld\n", ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+
+ cpu_hw_evt->snapshot_set_done = true;
+
+ return 0;
+}
+
static u64 pmu_sbi_ctr_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
struct sbiret ret;
u64 val = 0;
+ struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
+ /* Read the value from the shared memory directly */
+ if (sbi_pmu_snapshot_available()) {
+ val = sdata->ctr_values[idx];
+ return val;
+ }
+
if (pmu_sbi_is_fw_event(event)) {
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
hwc->idx, 0, 0, 0, 0, 0);
@@ -539,6 +628,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
struct hw_perf_event *hwc = &event->hw;
unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
+ /* There is no benefit setting SNAPSHOT FLAG for a single counter */
#if defined(CONFIG_32BIT)
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx,
1, flag, ival, ival >> 32, 0);
@@ -559,16 +649,36 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
{
struct sbiret ret;
struct hw_perf_event *hwc = &event->hw;
+ struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
(hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
pmu_sbi_reset_scounteren((void *)event);
+ if (sbi_pmu_snapshot_available())
+ flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
+
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
- if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
- flag != SBI_PMU_STOP_FLAG_RESET)
+ if (!ret.error && sbi_pmu_snapshot_available()) {
+ /*
+ * The counter snapshot is based on the index base specified by hwc->idx.
+ * The actual counter value is updated in shared memory at index 0 when counter
+ * mask is 0x01. To ensure accurate counter values, it's necessary to transfer
+ * the counter value to shared memory. However, if hwc->idx is zero, the counter
+ * value is already correctly updated in shared memory, requiring no further
+ * adjustment.
+ */
+ if (hwc->idx > 0) {
+ sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
+ sdata->ctr_values[0] = 0;
+ }
+ } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
+ flag != SBI_PMU_STOP_FLAG_RESET) {
pr_err("Stopping counter idx %d failed with error %d\n",
hwc->idx, sbi_err_map_linux_errno(ret.error));
+ }
}
static int pmu_sbi_find_num_ctrs(void)
@@ -626,10 +736,14 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu)
static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
{
struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ unsigned long flag = 0;
+
+ if (sbi_pmu_snapshot_available())
+ flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
/* No need to check the error here as we can't do anything about the error */
sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0,
- cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0);
+ cpu_hw_evt->used_hw_ctrs[0], flag, 0, 0, 0);
}
/*
@@ -638,11 +752,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
* while the overflowed counters need to be started with updated initialization
* value.
*/
-static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
- unsigned long ctr_ovf_mask)
+static noinline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt,
+ unsigned long ctr_ovf_mask)
{
int idx = 0;
- struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
struct perf_event *event;
unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
unsigned long ctr_start_mask = 0;
@@ -677,6 +790,48 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
}
}
+static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
+ unsigned long ctr_ovf_mask)
+{
+ int idx = 0;
+ struct perf_event *event;
+ unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
+ u64 max_period, init_val = 0;
+ struct hw_perf_event *hwc;
+ unsigned long ctr_start_mask = 0;
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
+
+ for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
+ if (ctr_ovf_mask & (1 << idx)) {
+ event = cpu_hw_evt->events[idx];
+ hwc = &event->hw;
+ max_period = riscv_pmu_ctr_get_width_mask(event);
+ init_val = local64_read(&hwc->prev_count) & max_period;
+ sdata->ctr_values[idx] = init_val;
+ }
+ /* We donot need to update the non-overflow counters the previous
+ * value should have been there already.
+ */
+ }
+
+ ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0];
+
+ /* Start all the counters in a single shot */
+ sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask,
+ flag, 0, 0, 0);
+}
+
+static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
+ unsigned long ctr_ovf_mask)
+{
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+
+ if (sbi_pmu_snapshot_available())
+ pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask);
+ else
+ pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask);
+}
+
static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
{
struct perf_sample_data data;
@@ -690,6 +845,7 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
unsigned long overflowed_ctrs = 0;
struct cpu_hw_events *cpu_hw_evt = dev;
u64 start_clock = sched_clock();
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
if (WARN_ON_ONCE(!cpu_hw_evt))
return IRQ_NONE;
@@ -711,8 +867,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
pmu_sbi_stop_hw_ctrs(pmu);
/* Overflow status register should only be read after counter are stopped */
- ALT_SBI_PMU_OVERFLOW(overflow);
-
+ if (sbi_pmu_snapshot_available())
+ overflow = sdata->ctr_overflow_mask;
+ else
+ ALT_SBI_PMU_OVERFLOW(overflow);
/*
* Overflow interrupt pending bit should only be cleared after stopping
* all the counters to avoid any race condition.
@@ -794,6 +952,9 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE);
}
+ if (sbi_pmu_snapshot_available())
+ return pmu_sbi_snapshot_setup(pmu, cpu);
+
return 0;
}
@@ -807,6 +968,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
/* Disable all counters access for user mode now */
csr_write(CSR_SCOUNTEREN, 0x0);
+ if (sbi_pmu_snapshot_available())
+ pmu_sbi_snapshot_disable();
+
return 0;
}
@@ -1076,10 +1240,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
pmu->event_unmapped = pmu_sbi_event_unmapped;
pmu->csr_index = pmu_sbi_csr_index;
- ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
- if (ret)
- return ret;
-
ret = riscv_pm_pmu_register(pmu);
if (ret)
goto out_unregister;
@@ -1088,8 +1248,32 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
if (ret)
goto out_unregister;
+ /* SBI PMU Snapsphot is only available in SBI v2.0 */
+ if (sbi_v2_available) {
+ ret = pmu_sbi_snapshot_alloc(pmu);
+ if (ret)
+ goto out_unregister;
+
+ ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
+ if (!ret) {
+ pr_info("SBI PMU snapshot detected\n");
+ /*
+ * We enable it once here for the boot cpu. If snapshot shmem setup
+ * fails during cpu hotplug process, it will fail to start the cpu
+ * as we can not handle hetergenous PMUs with different snapshot
+ * capability.
+ */
+ static_branch_enable(&sbi_pmu_snapshot_available);
+ }
+ /* Snapshot is an optional feature. Continue if not available */
+ }
+
register_sysctl("kernel", sbi_pmu_sysctl_table);
+ ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
+ if (ret)
+ return ret;
+
return 0;
out_unregister:
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 43282e22ebe1..c3fa90970042 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -39,6 +39,12 @@ struct cpu_hw_events {
DECLARE_BITMAP(used_hw_ctrs, RISCV_MAX_COUNTERS);
/* currently enabled firmware counters */
DECLARE_BITMAP(used_fw_ctrs, RISCV_MAX_COUNTERS);
+ /* The virtual address of the shared memory where counter snapshot will be taken */
+ void *snapshot_addr;
+ /* The physical address of the shared memory where counter snapshot will be taken */
+ phys_addr_t snapshot_addr_phys;
+ /* Boolean flag to indicate setup is already done */
+ bool snapshot_set_done;
};
struct riscv_pmu {
--
2.34.1
The virtual counter value is updated during pmu_ctr_read. There is no need
to update it in reset case. Otherwise, it will be counted twice which is
incorrect.
Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_pmu.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 86391a5061dd..8c44f26e754d 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -432,12 +432,9 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
sbiret = SBI_ERR_ALREADY_STOPPED;
}
- if (flags & SBI_PMU_STOP_FLAG_RESET) {
- /* Relase the counter if this is a reset request */
- pmc->counter_val += perf_event_read_value(pmc->perf_event,
- &enabled, &running);
+ if (flags & SBI_PMU_STOP_FLAG_RESET)
+ /* Release the counter if this is a reset request */
kvm_pmu_release_perf_event(pmc);
- }
} else {
sbiret = SBI_ERR_INVALID_PARAM;
}
--
2.34.1
Currently, we return a linux error code if creating a perf event failed
in kvm. That shouldn't be necessary as guest can continue to operate
without perf profiling or profiling with firmware counters.
Return appropriate SBI error code to indicate that PMU configuration
failed. An error message in kvm already describes the reason for failure.
Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_pmu.c | 14 +++++++++-----
arch/riscv/kvm/vcpu_sbi_pmu.c | 6 +++---
2 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 8c44f26e754d..08f561998611 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -229,8 +229,9 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
return 0;
}
-static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
- unsigned long flags, unsigned long eidx, unsigned long evtdata)
+static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
+ unsigned long flags, unsigned long eidx,
+ unsigned long evtdata)
{
struct perf_event *event;
@@ -455,7 +456,8 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
unsigned long eidx, u64 evtdata,
struct kvm_vcpu_sbi_return *retdata)
{
- int ctr_idx, ret, sbiret = 0;
+ int ctr_idx, sbiret = 0;
+ long ret;
bool is_fevent;
unsigned long event_code;
u32 etype = kvm_pmu_get_perf_event_type(eidx);
@@ -514,8 +516,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
kvpmu->fw_event[event_code].started = true;
} else {
ret = kvm_pmu_create_perf_event(pmc, &attr, flags, eidx, evtdata);
- if (ret)
- return ret;
+ if (ret) {
+ sbiret = SBI_ERR_NOT_SUPPORTED;
+ goto out;
+ }
}
set_bit(ctr_idx, kvpmu->pmc_in_use);
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index 7eca72df2cbd..b70179e9e875 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -42,9 +42,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
#endif
/*
* This can fail if perf core framework fails to create an event.
- * Forward the error to userspace because it's an error which
- * happened within the host kernel. The other option would be
- * to convert to an SBI error and forward to the guest.
+ * No need to forward the error to userspace and exit the guest
+ * operation can continue without profiling. Forward the
+ * appropriate SBI error to the guest.
*/
ret = kvm_riscv_vcpu_pmu_ctr_cfg_match(vcpu, cp->a0, cp->a1,
cp->a2, cp->a3, temp, retdata);
--
2.34.1
PMU Snapshot function allows to minimize the number of traps when the
guest access configures/access the hpmcounters. If the snapshot feature
is enabled, the hypervisor updates the shared memory with counter
data and state of overflown counters. The guest can just read the
shared memory instead of trap & emulate done by the hypervisor.
This patch doesn't implement the counter overflow yet.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 9 ++
arch/riscv/kvm/aia.c | 5 ++
arch/riscv/kvm/vcpu_onereg.c | 7 +-
arch/riscv/kvm/vcpu_pmu.c | 120 +++++++++++++++++++++++++-
arch/riscv/kvm/vcpu_sbi_pmu.c | 3 +
5 files changed, 140 insertions(+), 4 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 395518a1664e..d56b901a61fc 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -50,6 +50,12 @@ struct kvm_pmu {
bool init_done;
/* Bit map of all the virtual counter used */
DECLARE_BITMAP(pmc_in_use, RISCV_KVM_MAX_COUNTERS);
+ /* Bit map of all the virtual counter overflown */
+ DECLARE_BITMAP(pmc_overflown, RISCV_KVM_MAX_COUNTERS);
+ /* The address of the counter snapshot area (guest physical address) */
+ gpa_t snapshot_addr;
+ /* The actual data of the snapshot */
+ struct riscv_pmu_snapshot_data *sdata;
};
#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu_context)
@@ -85,6 +91,9 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
+int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
+ unsigned long saddr_high, unsigned long flags,
+ struct kvm_vcpu_sbi_return *retdata);
void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/riscv/kvm/aia.c b/arch/riscv/kvm/aia.c
index a944294f6f23..71d161d7430d 100644
--- a/arch/riscv/kvm/aia.c
+++ b/arch/riscv/kvm/aia.c
@@ -545,6 +545,9 @@ void kvm_riscv_aia_enable(void)
enable_percpu_irq(hgei_parent_irq,
irq_get_trigger_type(hgei_parent_irq));
csr_set(CSR_HIE, BIT(IRQ_S_GEXT));
+ /* Enable IRQ filtering for overflow interrupt only if sscofpmf is present */
+ if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSCOFPMF))
+ csr_write(CSR_HVIEN, BIT(IRQ_PMU_OVF));
}
void kvm_riscv_aia_disable(void)
@@ -560,6 +563,8 @@ void kvm_riscv_aia_disable(void)
/* Disable per-CPU SGEI interrupt */
csr_clear(CSR_HIE, BIT(IRQ_S_GEXT));
+ if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSCOFPMF))
+ csr_clear(CSR_HVIEN, BIT(IRQ_PMU_OVF));
disable_percpu_irq(hgei_parent_irq);
aia_set_hvictl(false);
diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
index fc34557f5356..581568847910 100644
--- a/arch/riscv/kvm/vcpu_onereg.c
+++ b/arch/riscv/kvm/vcpu_onereg.c
@@ -117,8 +117,13 @@ void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu)
for (i = 0; i < ARRAY_SIZE(kvm_isa_ext_arr); i++) {
host_isa = kvm_isa_ext_arr[i];
if (__riscv_isa_extension_available(NULL, host_isa) &&
- kvm_riscv_vcpu_isa_enable_allowed(i))
+ kvm_riscv_vcpu_isa_enable_allowed(i)) {
+ /* Sscofpmf depends on interrupt filtering defined in ssaia */
+ if (host_isa == RISCV_ISA_EXT_SSCOFPMF &&
+ !__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSAIA))
+ continue;
set_bit(host_isa, vcpu->arch.isa);
+ }
}
}
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 08f561998611..e980235b8436 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -311,6 +311,81 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
return ret;
}
+static void kvm_pmu_clear_snapshot_area(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
+
+ if (kvpmu->sdata) {
+ memset(kvpmu->sdata, 0, snapshot_area_size);
+ if (kvpmu->snapshot_addr != INVALID_GPA)
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr,
+ kvpmu->sdata, snapshot_area_size);
+ kfree(kvpmu->sdata);
+ kvpmu->sdata = NULL;
+ }
+ kvpmu->snapshot_addr = INVALID_GPA;
+}
+
+int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
+ unsigned long saddr_high, unsigned long flags,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
+ int sbiret = 0;
+ gpa_t saddr;
+ unsigned long hva;
+ bool writable;
+
+ if (!kvpmu) {
+ sbiret = SBI_ERR_INVALID_PARAM;
+ goto out;
+ }
+
+ if (saddr_low == -1 && saddr_high == -1) {
+ kvm_pmu_clear_snapshot_area(vcpu);
+ return 0;
+ }
+
+ saddr = saddr_low;
+
+ if (saddr_high != 0) {
+ if (IS_ENABLED(CONFIG_32BIT))
+ saddr |= ((gpa_t)saddr << 32);
+ else
+ sbiret = SBI_ERR_INVALID_ADDRESS;
+ goto out;
+ }
+
+ if (kvm_is_error_gpa(vcpu->kvm, saddr)) {
+ sbiret = SBI_ERR_INVALID_PARAM;
+ goto out;
+ }
+
+ hva = kvm_vcpu_gfn_to_hva_prot(vcpu, saddr >> PAGE_SHIFT, &writable);
+ if (kvm_is_error_hva(hva) || !writable) {
+ sbiret = SBI_ERR_INVALID_ADDRESS;
+ goto out;
+ }
+
+ kvpmu->snapshot_addr = saddr;
+ kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC);
+ if (!kvpmu->sdata)
+ return -ENOMEM;
+
+ if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) {
+ kfree(kvpmu->sdata);
+ kvpmu->snapshot_addr = INVALID_GPA;
+ sbiret = SBI_ERR_FAILURE;
+ }
+
+out:
+ retdata->err_val = sbiret;
+
+ return 0;
+}
+
int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu,
struct kvm_vcpu_sbi_return *retdata)
{
@@ -344,20 +419,32 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
int i, pmc_index, sbiret = 0;
struct kvm_pmc *pmc;
int fevent_code;
+ bool snap_flag_set = flags & SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
- if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
+ if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0)) {
sbiret = SBI_ERR_INVALID_PARAM;
goto out;
}
+ if (snap_flag_set && kvpmu->snapshot_addr == INVALID_GPA) {
+ sbiret = SBI_ERR_NO_SHMEM;
+ goto out;
+ }
+
/* Start the counters that have been configured and requested by the guest */
for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
pmc_index = i + ctr_base;
if (!test_bit(pmc_index, kvpmu->pmc_in_use))
continue;
pmc = &kvpmu->pmc[pmc_index];
- if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE)
+ if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE) {
pmc->counter_val = ival;
+ } else if (snap_flag_set) {
+ kvm_vcpu_read_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ pmc->counter_val = kvpmu->sdata->ctr_values[pmc_index];
+ }
+
if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
fevent_code = get_event_code(pmc->event_idx);
if (fevent_code >= SBI_PMU_FW_MAX) {
@@ -401,12 +488,18 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
u64 enabled, running;
struct kvm_pmc *pmc;
int fevent_code;
+ bool snap_flag_set = flags & SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
- if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
+ if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0)) {
sbiret = SBI_ERR_INVALID_PARAM;
goto out;
}
+ if (snap_flag_set && kvpmu->snapshot_addr == INVALID_GPA) {
+ sbiret = SBI_ERR_NO_SHMEM;
+ goto out;
+ }
+
/* Stop the counters that have been configured and requested by the guest */
for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
pmc_index = i + ctr_base;
@@ -439,9 +532,28 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
} else {
sbiret = SBI_ERR_INVALID_PARAM;
}
+
+ if (snap_flag_set && !sbiret) {
+ if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW)
+ pmc->counter_val = kvpmu->fw_event[fevent_code].value;
+ else if (pmc->perf_event)
+ pmc->counter_val += perf_event_read_value(pmc->perf_event,
+ &enabled, &running);
+ /* TODO: Add counter overflow support when sscofpmf support is added */
+ kvpmu->sdata->ctr_values[i] = pmc->counter_val;
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ }
+
if (flags & SBI_PMU_STOP_FLAG_RESET) {
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
clear_bit(pmc_index, kvpmu->pmc_in_use);
+ if (snap_flag_set) {
+ /* Clear the snapshot area for the upcoming deletion event */
+ kvpmu->sdata->ctr_values[i] = 0;
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ }
}
}
@@ -567,6 +679,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
kvpmu->num_hw_ctrs = num_hw_ctrs + 1;
kvpmu->num_fw_ctrs = SBI_PMU_FW_MAX;
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
+ kvpmu->snapshot_addr = INVALID_GPA;
if (kvpmu->num_hw_ctrs > RISCV_KVM_MAX_HW_CTRS) {
pr_warn_once("Limiting the hardware counters to 32 as specified by the ISA");
@@ -626,6 +739,7 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
}
bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
+ kvm_pmu_clear_snapshot_area(vcpu);
}
void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index b70179e9e875..9f61136e4bb1 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -64,6 +64,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
case SBI_EXT_PMU_COUNTER_FW_READ:
ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
break;
+ case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
+ ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
+ break;
default:
retdata->err_val = SBI_ERR_NOT_SUPPORTED;
}
--
2.34.1
KVM enables perf for guest via counter virtualization. However, the
sampling can not be supported as there is no mechanism to enabled
trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
to provide the counter overflow data via the shared memory.
In case of sampling event, the host first guest the LCOFI interrupt
and injects to the guest via irq filtering mechanism defined in AIA
specification. Thus, ssaia must be enabled in the host in order to
use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
is required.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/asm/kvm_vcpu_pmu.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/main.c | 1 +
arch/riscv/kvm/vcpu.c | 8 +--
arch/riscv/kvm/vcpu_onereg.c | 2 +
arch/riscv/kvm/vcpu_pmu.c | 70 +++++++++++++++++++++++++--
7 files changed, 77 insertions(+), 9 deletions(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 88cdc8a3e654..bec09b33e2f0 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -168,7 +168,8 @@
#define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
#define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
(_AC(1, UL) << IRQ_S_TIMER) | \
- (_AC(1, UL) << IRQ_S_EXT))
+ (_AC(1, UL) << IRQ_S_EXT) | \
+ (_AC(1, UL) << IRQ_PMU_OVF))
/* AIA CSR bits */
#define TOPI_IID_SHIFT 16
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index d56b901a61fc..af6d0ff5ce41 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -36,6 +36,7 @@ struct kvm_pmc {
bool started;
/* Monitoring event ID */
unsigned long event_idx;
+ struct kvm_vcpu *vcpu;
};
/* PMU data structure per vcpu */
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index d6b7a5b95874..d5aea43bc797 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
KVM_RISCV_ISA_EXT_ZIHPM,
KVM_RISCV_ISA_EXT_SMSTATEEN,
KVM_RISCV_ISA_EXT_ZICOND,
+ KVM_RISCV_ISA_EXT_SSCOFPMF,
KVM_RISCV_ISA_EXT_MAX,
};
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 225a435d9c9a..5a3a4cee0e3d 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
csr_write(CSR_HCOUNTEREN, 0x02);
csr_write(CSR_HVIP, 0);
+ csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
kvm_riscv_aia_enable();
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index b5ca9f2e98ac..f83f0226439f 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -382,7 +382,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
if (irq < IRQ_LOCAL_MAX &&
irq != IRQ_VS_SOFT &&
irq != IRQ_VS_TIMER &&
- irq != IRQ_VS_EXT)
+ irq != IRQ_VS_EXT &&
+ irq != IRQ_PMU_OVF)
return -EINVAL;
set_bit(irq, vcpu->arch.irqs_pending);
@@ -397,14 +398,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
{
/*
- * We only allow VS-mode software, timer, and external
+ * We only allow VS-mode software, timer, counter overflow and external
* interrupts when irq is one of the local interrupts
* defined by RISC-V privilege specification.
*/
if (irq < IRQ_LOCAL_MAX &&
irq != IRQ_VS_SOFT &&
irq != IRQ_VS_TIMER &&
- irq != IRQ_VS_EXT)
+ irq != IRQ_VS_EXT &&
+ irq != IRQ_PMU_OVF)
return -EINVAL;
clear_bit(irq, vcpu->arch.irqs_pending);
diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
index 581568847910..1eaaa919aa61 100644
--- a/arch/riscv/kvm/vcpu_onereg.c
+++ b/arch/riscv/kvm/vcpu_onereg.c
@@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
/* Multi letter extensions (alphabetically sorted) */
KVM_ISA_EXT_ARR(SMSTATEEN),
KVM_ISA_EXT_ARR(SSAIA),
+ KVM_ISA_EXT_ARR(SSCOFPMF),
KVM_ISA_EXT_ARR(SSTC),
KVM_ISA_EXT_ARR(SVINVAL),
KVM_ISA_EXT_ARR(SVNAPOT),
@@ -88,6 +89,7 @@ static bool kvm_riscv_vcpu_isa_disable_allowed(unsigned long ext)
case KVM_RISCV_ISA_EXT_I:
case KVM_RISCV_ISA_EXT_M:
case KVM_RISCV_ISA_EXT_SSTC:
+ case KVM_RISCV_ISA_EXT_SSCOFPMF:
case KVM_RISCV_ISA_EXT_SVINVAL:
case KVM_RISCV_ISA_EXT_SVNAPOT:
case KVM_RISCV_ISA_EXT_ZBA:
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index e980235b8436..f2bf5b5bdd61 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
return 0;
}
+static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+ struct kvm_vcpu *vcpu = pmc->vcpu;
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
+ u64 period;
+
+ /*
+ * Stop the event counting by directly accessing the perf_event.
+ * Otherwise, this needs to deferred via a workqueue.
+ * That will introduce skew in the counter value because the actual
+ * physical counter would start after returning from this function.
+ * It will be stopped again once the workqueue is scheduled
+ */
+ rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
+
+ /*
+ * The hw counter would start automatically when this function returns.
+ * Thus, the host may continue to interrupt and inject it to the guest
+ * even without the guest configuring the next event. Depending on the hardware
+ * the host may have some sluggishness only if privilege mode filtering is not
+ * available. In an ideal world, where qemu is not the only capable hardware,
+ * this can be removed.
+ * FYI: ARM64 does this way while x86 doesn't do anything as such.
+ * TODO: Should we keep it for RISC-V ?
+ */
+ period = -(local64_read(&perf_event->count));
+
+ local64_set(&perf_event->hw.period_left, 0);
+ perf_event->attr.sample_period = period;
+ perf_event->hw.sample_period = period;
+
+ set_bit(pmc->idx, kvpmu->pmc_overflown);
+ kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
+
+ rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
+}
+
static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
unsigned long flags, unsigned long eidx,
unsigned long evtdata)
@@ -248,7 +289,7 @@ static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_att
*/
attr->sample_period = kvm_pmu_get_sample_period(pmc);
- event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
+ event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
if (IS_ERR(event)) {
pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
return PTR_ERR(event);
@@ -473,6 +514,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
}
}
+ /* The guest have serviced the interrupt and starting the counter again */
+ if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
+ clear_bit(pmc_index, kvpmu->pmc_overflown);
+ kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
+ }
+
out:
retdata->err_val = sbiret;
@@ -539,7 +586,13 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
else if (pmc->perf_event)
pmc->counter_val += perf_event_read_value(pmc->perf_event,
&enabled, &running);
- /* TODO: Add counter overflow support when sscofpmf support is added */
+ /*
+ * The counter and overflow indicies in the snapshot region are w.r.to
+ * cbase. Modify the set bit in the counter mask instead of the pmc_index
+ * which indicates the absolute counter index.
+ */
+ if (test_bit(pmc_index, kvpmu->pmc_overflown))
+ kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
kvpmu->sdata->ctr_values[i] = pmc->counter_val;
kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
sizeof(struct riscv_pmu_snapshot_data));
@@ -548,15 +601,20 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
if (flags & SBI_PMU_STOP_FLAG_RESET) {
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
clear_bit(pmc_index, kvpmu->pmc_in_use);
+ clear_bit(pmc_index, kvpmu->pmc_overflown);
if (snap_flag_set) {
/* Clear the snapshot area for the upcoming deletion event */
kvpmu->sdata->ctr_values[i] = 0;
+ /*
+ * Only clear the given counter as the caller is responsible to
+ * validate both the overflow mask and configured counters.
+ */
+ kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
sizeof(struct riscv_pmu_snapshot_data));
}
}
}
-
out:
retdata->err_val = sbiret;
@@ -699,6 +757,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc = &kvpmu->pmc[i];
pmc->idx = i;
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
+ pmc->vcpu = vcpu;
if (i < kvpmu->num_hw_ctrs) {
pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
if (i < 3)
@@ -731,13 +790,14 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
if (!kvpmu)
return;
- for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
+ for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
pmc = &kvpmu->pmc[i];
pmc->counter_val = 0;
kvm_pmu_release_perf_event(pmc);
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
}
- bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
+ bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
+ bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
kvm_pmu_clear_snapshot_area(vcpu);
}
--
2.34.1
The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware
counters for RV32 based systems.
Add infrastructure to support that.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 4 ++-
arch/riscv/kvm/vcpu_pmu.c | 37 ++++++++++++++++++++++++++-
arch/riscv/kvm/vcpu_sbi_pmu.c | 6 +++++
3 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index af6d0ff5ce41..463c349a9ea5 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -20,7 +20,7 @@ static_assert(RISCV_KVM_MAX_COUNTERS <= 64);
struct kvm_fw_event {
/* Current value of the event */
- unsigned long value;
+ u64 value;
/* Event monitoring status */
bool started;
@@ -91,6 +91,8 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
struct kvm_vcpu_sbi_return *retdata);
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata);
void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
unsigned long saddr_high, unsigned long flags,
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index f2bf5b5bdd61..e6ce37819ca2 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -196,6 +196,29 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
return kvm_pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
}
+static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int fevent_code;
+
+ if (!IS_ENABLED(CONFIG_32BIT))
+ return -EINVAL;
+
+ pmc = &kvpmu->pmc[cidx];
+
+ if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW)
+ return -EINVAL;
+
+ fevent_code = get_event_code(pmc->event_idx);
+ pmc->counter_val = kvpmu->fw_event[fevent_code].value;
+
+ *out_val = pmc->counter_val >> 32;
+
+ return 0;
+}
+
static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *out_val)
{
@@ -701,6 +724,18 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
return 0;
}
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+ int ret;
+
+ ret = pmu_fw_ctr_read_hi(vcpu, cidx, &retdata->out_val);
+ if (ret == -EINVAL)
+ retdata->err_val = SBI_ERR_INVALID_PARAM;
+
+ return 0;
+}
+
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata)
{
@@ -774,7 +809,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc->cinfo.csr = CSR_CYCLE + i;
} else {
pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
- pmc->cinfo.width = BITS_PER_LONG - 1;
+ pmc->cinfo.width = 63;
}
}
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index 9f61136e4bb1..58a0e5587e2a 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -64,6 +64,12 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
case SBI_EXT_PMU_COUNTER_FW_READ:
ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
break;
+ case SBI_EXT_PMU_COUNTER_FW_READ_HI:
+ if (IS_ENABLED(CONFIG_32BIT))
+ ret = kvm_riscv_vcpu_pmu_fw_ctr_read_hi(vcpu, cp->a0, retdata);
+ else
+ retdata->out_val = 0;
+ break;
case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
break;
--
2.34.1
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> SBI v2.0 introduced a explicit function to read the upper 32 bits
> for any firmwar counter width that is longer than 32bits.
> This is only applicable for RV32 where firmware counter can be
> 64 bit.
>
> Acked-by: Palmer Dabbelt <[email protected]>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> drivers/perf/riscv_pmu_sbi.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 16acd4dcdb96..646604f8c0a5 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -35,6 +35,8 @@
> PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> +static bool sbi_v2_available;
> +
> static struct attribute *riscv_arch_formats_attr[] = {
> &format_attr_event.attr,
> &format_attr_firmware.attr,
> @@ -488,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
> struct sbiret ret;
> - union sbi_pmu_ctr_info info;
> u64 val = 0;
> + union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
>
> if (pmu_sbi_is_fw_event(event)) {
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> hwc->idx, 0, 0, 0, 0, 0);
> - if (!ret.error)
> - val = ret.value;
> + if (ret.error)
> + return val;
> +
> + val = ret.value;
> + if (IS_ENABLED(CONFIG_32BIT) && sbi_v2_available && info.width >= 32) {
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
> + hwc->idx, 0, 0, 0, 0, 0);
> + if (!ret.error)
> + val |= ((u64)ret.value << 32);
> + }
> } else {
> - info = pmu_ctr_list[idx];
> val = riscv_pmu_ctr_read_csr(info.csr);
> if (IS_ENABLED(CONFIG_32BIT))
> val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> @@ -1108,6 +1117,9 @@ static int __init pmu_sbi_devinit(void)
> return 0;
> }
>
> + if (sbi_spec_version >= sbi_mk_version(2, 0))
> + sbi_v2_available = true;
> +
> ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
> "perf/riscv/pmu:starting",
> pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> SBI v2.0 SBI introduced PMU snapshot feature which adds the following
> features.
>
> 1. Read counter values directly from the shared memory instead of
> csr read.
> 2. Start multiple counters with initial values with one SBI call.
>
> These functionalities optimizes the number of traps to the higher
> privilege mode. If the kernel is in VS mode while the hypervisor
> deploy trap & emulate method, this would minimize all the hpmcounter
> CSR read traps. If the kernel is running in S-mode, the benefits
> reduced to CSR latency vs DRAM/cache latency as there is no trap
> involved while accessing the hpmcounter CSRs.
>
> In both modes, it does saves the number of ecalls while starting
> multiple counter together with an initial values. This is a likely
> scenario if multiple counters overflow at the same time.
>
> Acked-by: Palmer Dabbelt <[email protected]>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> drivers/perf/riscv_pmu.c | 1 +
> drivers/perf/riscv_pmu_sbi.c | 208 +++++++++++++++++++++++++++++++--
> include/linux/perf/riscv_pmu.h | 6 +
> 3 files changed, 203 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index 0dda70e1ef90..5b57acb770d3 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -412,6 +412,7 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> cpuc->n_events = 0;
> for (i = 0; i < RISCV_MAX_COUNTERS; i++)
> cpuc->events[i] = NULL;
> + cpuc->snapshot_addr = NULL;
> }
> pmu->pmu = (struct pmu) {
> .event_init = riscv_pmu_event_init,
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 646604f8c0a5..1aeb8c59e78f 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -36,6 +36,9 @@ PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> static bool sbi_v2_available;
> +static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available);
> +#define sbi_pmu_snapshot_available() \
> + static_branch_unlikely(&sbi_pmu_snapshot_available)
>
> static struct attribute *riscv_arch_formats_attr[] = {
> &format_attr_event.attr,
> @@ -485,14 +488,100 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig)
> return ret;
> }
>
> +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> +
> + if (!cpu_hw_evt->snapshot_addr)
> + continue;
> +
> + free_page((unsigned long)cpu_hw_evt->snapshot_addr);
> + cpu_hw_evt->snapshot_addr = NULL;
> + cpu_hw_evt->snapshot_addr_phys = 0;
> + }
> +}
> +
> +static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu)
> +{
> + int cpu;
> + struct page *snapshot_page;
> +
> + for_each_possible_cpu(cpu) {
> + struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> +
> + if (cpu_hw_evt->snapshot_addr)
> + continue;
> +
> + snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> + if (!snapshot_page) {
> + pmu_sbi_snapshot_free(pmu);
> + return -ENOMEM;
> + }
> + cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page);
> + cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page);
> + }
> +
> + return 0;
> +}
> +
> +static void pmu_sbi_snapshot_disable(void)
> +{
> + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, -1,
> + -1, 0, 0, 0, 0);
> +}
> +
> +static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu)
> +{
> + struct cpu_hw_events *cpu_hw_evt;
> + struct sbiret ret = {0};
> +
> + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> + if (!cpu_hw_evt->snapshot_addr_phys)
> + return -EINVAL;
> +
> + if (cpu_hw_evt->snapshot_set_done)
> + return 0;
> +
> + if (IS_ENABLED(CONFIG_32BIT))
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
> + cpu_hw_evt->snapshot_addr_phys,
> + (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0);
> + else
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
> + cpu_hw_evt->snapshot_addr_phys, 0, 0, 0, 0, 0);
> +
> + /* Free up the snapshot area memory and fall back to SBI PMU calls without snapshot */
> + if (ret.error) {
> + if (ret.error != SBI_ERR_NOT_SUPPORTED)
> + pr_warn("pmu snapshot setup failed with error %ld\n", ret.error);
> + return sbi_err_map_linux_errno(ret.error);
> + }
> +
> + cpu_hw_evt->snapshot_set_done = true;
> +
> + return 0;
> +}
> +
> static u64 pmu_sbi_ctr_read(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
> struct sbiret ret;
> u64 val = 0;
> + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
>
> + /* Read the value from the shared memory directly */
> + if (sbi_pmu_snapshot_available()) {
> + val = sdata->ctr_values[idx];
> + return val;
> + }
> +
> if (pmu_sbi_is_fw_event(event)) {
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> hwc->idx, 0, 0, 0, 0, 0);
> @@ -539,6 +628,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> struct hw_perf_event *hwc = &event->hw;
> unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
>
> + /* There is no benefit setting SNAPSHOT FLAG for a single counter */
> #if defined(CONFIG_32BIT)
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx,
> 1, flag, ival, ival >> 32, 0);
> @@ -559,16 +649,36 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> {
> struct sbiret ret;
> struct hw_perf_event *hwc = &event->hw;
> + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
>
> if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> pmu_sbi_reset_scounteren((void *)event);
>
> + if (sbi_pmu_snapshot_available())
> + flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
> +
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> - if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> - flag != SBI_PMU_STOP_FLAG_RESET)
> + if (!ret.error && sbi_pmu_snapshot_available()) {
> + /*
> + * The counter snapshot is based on the index base specified by hwc->idx.
> + * The actual counter value is updated in shared memory at index 0 when counter
> + * mask is 0x01. To ensure accurate counter values, it's necessary to transfer
> + * the counter value to shared memory. However, if hwc->idx is zero, the counter
> + * value is already correctly updated in shared memory, requiring no further
> + * adjustment.
> + */
> + if (hwc->idx > 0) {
> + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
> + sdata->ctr_values[0] = 0;
> + }
> + } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> + flag != SBI_PMU_STOP_FLAG_RESET) {
> pr_err("Stopping counter idx %d failed with error %d\n",
> hwc->idx, sbi_err_map_linux_errno(ret.error));
> + }
> }
>
> static int pmu_sbi_find_num_ctrs(void)
> @@ -626,10 +736,14 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu)
> static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> {
> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + unsigned long flag = 0;
> +
> + if (sbi_pmu_snapshot_available())
> + flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
>
> /* No need to check the error here as we can't do anything about the error */
> sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0,
> - cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0);
> + cpu_hw_evt->used_hw_ctrs[0], flag, 0, 0, 0);
> }
>
> /*
> @@ -638,11 +752,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> * while the overflowed counters need to be started with updated initialization
> * value.
> */
> -static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> - unsigned long ctr_ovf_mask)
> +static noinline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt,
> + unsigned long ctr_ovf_mask)
> {
> int idx = 0;
> - struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> struct perf_event *event;
> unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
> unsigned long ctr_start_mask = 0;
> @@ -677,6 +790,48 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> }
> }
>
> +static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
> + unsigned long ctr_ovf_mask)
> +{
> + int idx = 0;
> + struct perf_event *event;
> + unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
> + u64 max_period, init_val = 0;
> + struct hw_perf_event *hwc;
> + unsigned long ctr_start_mask = 0;
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> +
> + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
> + if (ctr_ovf_mask & (1 << idx)) {
> + event = cpu_hw_evt->events[idx];
> + hwc = &event->hw;
> + max_period = riscv_pmu_ctr_get_width_mask(event);
> + init_val = local64_read(&hwc->prev_count) & max_period;
> + sdata->ctr_values[idx] = init_val;
> + }
> + /* We donot need to update the non-overflow counters the previous
> + * value should have been there already.
> + */
> + }
> +
> + ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0];
> +
> + /* Start all the counters in a single shot */
> + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask,
> + flag, 0, 0, 0);
> +}
> +
> +static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> + unsigned long ctr_ovf_mask)
> +{
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> +
> + if (sbi_pmu_snapshot_available())
> + pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask);
> + else
> + pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask);
> +}
> +
> static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> {
> struct perf_sample_data data;
> @@ -690,6 +845,7 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> unsigned long overflowed_ctrs = 0;
> struct cpu_hw_events *cpu_hw_evt = dev;
> u64 start_clock = sched_clock();
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
>
> if (WARN_ON_ONCE(!cpu_hw_evt))
> return IRQ_NONE;
> @@ -711,8 +867,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> pmu_sbi_stop_hw_ctrs(pmu);
>
> /* Overflow status register should only be read after counter are stopped */
> - ALT_SBI_PMU_OVERFLOW(overflow);
> -
> + if (sbi_pmu_snapshot_available())
> + overflow = sdata->ctr_overflow_mask;
> + else
> + ALT_SBI_PMU_OVERFLOW(overflow);
> /*
> * Overflow interrupt pending bit should only be cleared after stopping
> * all the counters to avoid any race condition.
> @@ -794,6 +952,9 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE);
> }
>
> + if (sbi_pmu_snapshot_available())
> + return pmu_sbi_snapshot_setup(pmu, cpu);
> +
> return 0;
> }
>
> @@ -807,6 +968,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
> /* Disable all counters access for user mode now */
> csr_write(CSR_SCOUNTEREN, 0x0);
>
> + if (sbi_pmu_snapshot_available())
> + pmu_sbi_snapshot_disable();
> +
> return 0;
> }
>
> @@ -1076,10 +1240,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> pmu->event_unmapped = pmu_sbi_event_unmapped;
> pmu->csr_index = pmu_sbi_csr_index;
>
> - ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> - if (ret)
> - return ret;
> -
> ret = riscv_pm_pmu_register(pmu);
> if (ret)
> goto out_unregister;
> @@ -1088,8 +1248,32 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> if (ret)
> goto out_unregister;
>
> + /* SBI PMU Snapsphot is only available in SBI v2.0 */
> + if (sbi_v2_available) {
> + ret = pmu_sbi_snapshot_alloc(pmu);
> + if (ret)
> + goto out_unregister;
> +
> + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
> + if (!ret) {
> + pr_info("SBI PMU snapshot detected\n");
> + /*
> + * We enable it once here for the boot cpu. If snapshot shmem setup
> + * fails during cpu hotplug process, it will fail to start the cpu
> + * as we can not handle hetergenous PMUs with different snapshot
> + * capability.
> + */
> + static_branch_enable(&sbi_pmu_snapshot_available);
> + }
> + /* Snapshot is an optional feature. Continue if not available */
> + }
> +
> register_sysctl("kernel", sbi_pmu_sysctl_table);
>
> + ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> + if (ret)
> + return ret;
> +
> return 0;
>
> out_unregister:
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 43282e22ebe1..c3fa90970042 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -39,6 +39,12 @@ struct cpu_hw_events {
> DECLARE_BITMAP(used_hw_ctrs, RISCV_MAX_COUNTERS);
> /* currently enabled firmware counters */
> DECLARE_BITMAP(used_fw_ctrs, RISCV_MAX_COUNTERS);
> + /* The virtual address of the shared memory where counter snapshot will be taken */
> + void *snapshot_addr;
> + /* The physical address of the shared memory where counter snapshot will be taken */
> + phys_addr_t snapshot_addr_phys;
> + /* Boolean flag to indicate setup is already done */
> + bool snapshot_set_done;
> };
>
> struct riscv_pmu {
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> The virtual counter value is updated during pmu_ctr_read. There is no need
> to update it in reset case. Otherwise, it will be counted twice which is
> incorrect.
>
> Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/kvm/vcpu_pmu.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 86391a5061dd..8c44f26e754d 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -432,12 +432,9 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> sbiret = SBI_ERR_ALREADY_STOPPED;
> }
>
> - if (flags & SBI_PMU_STOP_FLAG_RESET) {
> - /* Relase the counter if this is a reset request */
> - pmc->counter_val += perf_event_read_value(pmc->perf_event,
> - &enabled, &running);
> + if (flags & SBI_PMU_STOP_FLAG_RESET)
> + /* Release the counter if this is a reset request */
> kvm_pmu_release_perf_event(pmc);
> - }
> } else {
> sbiret = SBI_ERR_INVALID_PARAM;
> }
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> Currently, we return a linux error code if creating a perf event failed
> in kvm. That shouldn't be necessary as guest can continue to operate
> without perf profiling or profiling with firmware counters.
>
> Return appropriate SBI error code to indicate that PMU configuration
> failed. An error message in kvm already describes the reason for failure.
>
> Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling")
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/kvm/vcpu_pmu.c | 14 +++++++++-----
> arch/riscv/kvm/vcpu_sbi_pmu.c | 6 +++---
> 2 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 8c44f26e754d..08f561998611 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -229,8 +229,9 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> return 0;
> }
>
> -static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> - unsigned long flags, unsigned long eidx, unsigned long evtdata)
> +static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> + unsigned long flags, unsigned long eidx,
> + unsigned long evtdata)
> {
> struct perf_event *event;
>
> @@ -455,7 +456,8 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> unsigned long eidx, u64 evtdata,
> struct kvm_vcpu_sbi_return *retdata)
> {
> - int ctr_idx, ret, sbiret = 0;
> + int ctr_idx, sbiret = 0;
> + long ret;
> bool is_fevent;
> unsigned long event_code;
> u32 etype = kvm_pmu_get_perf_event_type(eidx);
> @@ -514,8 +516,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> kvpmu->fw_event[event_code].started = true;
> } else {
> ret = kvm_pmu_create_perf_event(pmc, &attr, flags, eidx, evtdata);
> - if (ret)
> - return ret;
> + if (ret) {
> + sbiret = SBI_ERR_NOT_SUPPORTED;
> + goto out;
> + }
> }
>
> set_bit(ctr_idx, kvpmu->pmc_in_use);
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> index 7eca72df2cbd..b70179e9e875 100644
> --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -42,9 +42,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> #endif
> /*
> * This can fail if perf core framework fails to create an event.
> - * Forward the error to userspace because it's an error which
> - * happened within the host kernel. The other option would be
> - * to convert to an SBI error and forward to the guest.
> + * No need to forward the error to userspace and exit the guest
> + * operation can continue without profiling. Forward the
> + * appropriate SBI error to the guest.
> */
> ret = kvm_riscv_vcpu_pmu_ctr_cfg_match(vcpu, cp->a0, cp->a1,
> cp->a2, cp->a3, temp, retdata);
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> PMU Snapshot function allows to minimize the number of traps when the
> guest access configures/access the hpmcounters. If the snapshot feature
> is enabled, the hypervisor updates the shared memory with counter
> data and state of overflown counters. The guest can just read the
> shared memory instead of trap & emulate done by the hypervisor.
>
> This patch doesn't implement the counter overflow yet.
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 9 ++
> arch/riscv/kvm/aia.c | 5 ++
> arch/riscv/kvm/vcpu_onereg.c | 7 +-
> arch/riscv/kvm/vcpu_pmu.c | 120 +++++++++++++++++++++++++-
> arch/riscv/kvm/vcpu_sbi_pmu.c | 3 +
> 5 files changed, 140 insertions(+), 4 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index 395518a1664e..d56b901a61fc 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -50,6 +50,12 @@ struct kvm_pmu {
> bool init_done;
> /* Bit map of all the virtual counter used */
> DECLARE_BITMAP(pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + /* Bit map of all the virtual counter overflown */
> + DECLARE_BITMAP(pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> + /* The address of the counter snapshot area (guest physical address) */
> + gpa_t snapshot_addr;
> + /* The actual data of the snapshot */
> + struct riscv_pmu_snapshot_data *sdata;
> };
>
> #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu_context)
> @@ -85,6 +91,9 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata);
> void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> + unsigned long saddr_high, unsigned long flags,
> + struct kvm_vcpu_sbi_return *retdata);
> void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/riscv/kvm/aia.c b/arch/riscv/kvm/aia.c
> index a944294f6f23..71d161d7430d 100644
> --- a/arch/riscv/kvm/aia.c
> +++ b/arch/riscv/kvm/aia.c
> @@ -545,6 +545,9 @@ void kvm_riscv_aia_enable(void)
> enable_percpu_irq(hgei_parent_irq,
> irq_get_trigger_type(hgei_parent_irq));
> csr_set(CSR_HIE, BIT(IRQ_S_GEXT));
> + /* Enable IRQ filtering for overflow interrupt only if sscofpmf is present */
> + if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSCOFPMF))
> + csr_write(CSR_HVIEN, BIT(IRQ_PMU_OVF));
> }
>
> void kvm_riscv_aia_disable(void)
> @@ -560,6 +563,8 @@ void kvm_riscv_aia_disable(void)
>
> /* Disable per-CPU SGEI interrupt */
> csr_clear(CSR_HIE, BIT(IRQ_S_GEXT));
> + if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSCOFPMF))
> + csr_clear(CSR_HVIEN, BIT(IRQ_PMU_OVF));
> disable_percpu_irq(hgei_parent_irq);
>
> aia_set_hvictl(false);
> diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> index fc34557f5356..581568847910 100644
> --- a/arch/riscv/kvm/vcpu_onereg.c
> +++ b/arch/riscv/kvm/vcpu_onereg.c
> @@ -117,8 +117,13 @@ void kvm_riscv_vcpu_setup_isa(struct kvm_vcpu *vcpu)
> for (i = 0; i < ARRAY_SIZE(kvm_isa_ext_arr); i++) {
> host_isa = kvm_isa_ext_arr[i];
> if (__riscv_isa_extension_available(NULL, host_isa) &&
> - kvm_riscv_vcpu_isa_enable_allowed(i))
> + kvm_riscv_vcpu_isa_enable_allowed(i)) {
> + /* Sscofpmf depends on interrupt filtering defined in ssaia */
> + if (host_isa == RISCV_ISA_EXT_SSCOFPMF &&
> + !__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSAIA))
> + continue;
> set_bit(host_isa, vcpu->arch.isa);
> + }
> }
> }
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 08f561998611..e980235b8436 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -311,6 +311,81 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> return ret;
> }
>
> +static void kvm_pmu_clear_snapshot_area(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> +
> + if (kvpmu->sdata) {
> + memset(kvpmu->sdata, 0, snapshot_area_size);
> + if (kvpmu->snapshot_addr != INVALID_GPA)
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr,
> + kvpmu->sdata, snapshot_area_size);
> + kfree(kvpmu->sdata);
> + kvpmu->sdata = NULL;
> + }
> + kvpmu->snapshot_addr = INVALID_GPA;
> +}
> +
> +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> + unsigned long saddr_high, unsigned long flags,
> + struct kvm_vcpu_sbi_return *retdata)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> + int sbiret = 0;
> + gpa_t saddr;
> + unsigned long hva;
> + bool writable;
> +
> + if (!kvpmu) {
> + sbiret = SBI_ERR_INVALID_PARAM;
> + goto out;
> + }
> +
> + if (saddr_low == -1 && saddr_high == -1) {
> + kvm_pmu_clear_snapshot_area(vcpu);
> + return 0;
> + }
> +
> + saddr = saddr_low;
> +
> + if (saddr_high != 0) {
> + if (IS_ENABLED(CONFIG_32BIT))
> + saddr |= ((gpa_t)saddr << 32);
> + else
> + sbiret = SBI_ERR_INVALID_ADDRESS;
> + goto out;
> + }
> +
> + if (kvm_is_error_gpa(vcpu->kvm, saddr)) {
> + sbiret = SBI_ERR_INVALID_PARAM;
> + goto out;
> + }
> +
> + hva = kvm_vcpu_gfn_to_hva_prot(vcpu, saddr >> PAGE_SHIFT, &writable);
> + if (kvm_is_error_hva(hva) || !writable) {
> + sbiret = SBI_ERR_INVALID_ADDRESS;
> + goto out;
> + }
> +
> + kvpmu->snapshot_addr = saddr;
> + kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC);
> + if (!kvpmu->sdata)
> + return -ENOMEM;
> +
> + if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) {
> + kfree(kvpmu->sdata);
> + kvpmu->snapshot_addr = INVALID_GPA;
> + sbiret = SBI_ERR_FAILURE;
> + }
> +
> +out:
> + retdata->err_val = sbiret;
> +
> + return 0;
> +}
> +
> int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu,
> struct kvm_vcpu_sbi_return *retdata)
> {
> @@ -344,20 +419,32 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> int i, pmc_index, sbiret = 0;
> struct kvm_pmc *pmc;
> int fevent_code;
> + bool snap_flag_set = flags & SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
>
> - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0)) {
> sbiret = SBI_ERR_INVALID_PARAM;
> goto out;
> }
>
> + if (snap_flag_set && kvpmu->snapshot_addr == INVALID_GPA) {
> + sbiret = SBI_ERR_NO_SHMEM;
> + goto out;
> + }
> +
> /* Start the counters that have been configured and requested by the guest */
> for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> pmc_index = i + ctr_base;
> if (!test_bit(pmc_index, kvpmu->pmc_in_use))
> continue;
> pmc = &kvpmu->pmc[pmc_index];
> - if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> + if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE) {
> pmc->counter_val = ival;
> + } else if (snap_flag_set) {
> + kvm_vcpu_read_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + pmc->counter_val = kvpmu->sdata->ctr_values[pmc_index];
> + }
> +
> if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
> fevent_code = get_event_code(pmc->event_idx);
> if (fevent_code >= SBI_PMU_FW_MAX) {
> @@ -401,12 +488,18 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> u64 enabled, running;
> struct kvm_pmc *pmc;
> int fevent_code;
> + bool snap_flag_set = flags & SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
>
> - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0)) {
> sbiret = SBI_ERR_INVALID_PARAM;
> goto out;
> }
>
> + if (snap_flag_set && kvpmu->snapshot_addr == INVALID_GPA) {
> + sbiret = SBI_ERR_NO_SHMEM;
> + goto out;
> + }
> +
> /* Stop the counters that have been configured and requested by the guest */
> for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> pmc_index = i + ctr_base;
> @@ -439,9 +532,28 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> } else {
> sbiret = SBI_ERR_INVALID_PARAM;
> }
> +
> + if (snap_flag_set && !sbiret) {
> + if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW)
> + pmc->counter_val = kvpmu->fw_event[fevent_code].value;
> + else if (pmc->perf_event)
> + pmc->counter_val += perf_event_read_value(pmc->perf_event,
> + &enabled, &running);
> + /* TODO: Add counter overflow support when sscofpmf support is added */
> + kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + }
> +
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + if (snap_flag_set) {
> + /* Clear the snapshot area for the upcoming deletion event */
> + kvpmu->sdata->ctr_values[i] = 0;
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + }
> }
> }
>
> @@ -567,6 +679,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> kvpmu->num_hw_ctrs = num_hw_ctrs + 1;
> kvpmu->num_fw_ctrs = SBI_PMU_FW_MAX;
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> + kvpmu->snapshot_addr = INVALID_GPA;
>
> if (kvpmu->num_hw_ctrs > RISCV_KVM_MAX_HW_CTRS) {
> pr_warn_once("Limiting the hardware counters to 32 as specified by the ISA");
> @@ -626,6 +739,7 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> }
> bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> + kvm_pmu_clear_snapshot_area(vcpu);
> }
>
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> index b70179e9e875..9f61136e4bb1 100644
> --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -64,6 +64,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> case SBI_EXT_PMU_COUNTER_FW_READ:
> ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
> break;
> + case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
> + ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
> + break;
> default:
> retdata->err_val = SBI_ERR_NOT_SUPPORTED;
> }
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> KVM enables perf for guest via counter virtualization. However, the
> sampling can not be supported as there is no mechanism to enabled
> trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
> to provide the counter overflow data via the shared memory.
>
> In case of sampling event, the host first guest the LCOFI interrupt
> and injects to the guest via irq filtering mechanism defined in AIA
> specification. Thus, ssaia must be enabled in the host in order to
> use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
> is required.
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 1 +
> arch/riscv/include/uapi/asm/kvm.h | 1 +
> arch/riscv/kvm/main.c | 1 +
> arch/riscv/kvm/vcpu.c | 8 +--
> arch/riscv/kvm/vcpu_onereg.c | 2 +
> arch/riscv/kvm/vcpu_pmu.c | 70 +++++++++++++++++++++++++--
> 7 files changed, 77 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 88cdc8a3e654..bec09b33e2f0 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -168,7 +168,8 @@
> #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> (_AC(1, UL) << IRQ_S_TIMER) | \
> - (_AC(1, UL) << IRQ_S_EXT))
> + (_AC(1, UL) << IRQ_S_EXT) | \
> + (_AC(1, UL) << IRQ_PMU_OVF))
>
> /* AIA CSR bits */
> #define TOPI_IID_SHIFT 16
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index d56b901a61fc..af6d0ff5ce41 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -36,6 +36,7 @@ struct kvm_pmc {
> bool started;
> /* Monitoring event ID */
> unsigned long event_idx;
> + struct kvm_vcpu *vcpu;
> };
>
> /* PMU data structure per vcpu */
> diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> index d6b7a5b95874..d5aea43bc797 100644
> --- a/arch/riscv/include/uapi/asm/kvm.h
> +++ b/arch/riscv/include/uapi/asm/kvm.h
> @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> KVM_RISCV_ISA_EXT_ZIHPM,
> KVM_RISCV_ISA_EXT_SMSTATEEN,
> KVM_RISCV_ISA_EXT_ZICOND,
> + KVM_RISCV_ISA_EXT_SSCOFPMF,
> KVM_RISCV_ISA_EXT_MAX,
> };
>
> diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> index 225a435d9c9a..5a3a4cee0e3d 100644
> --- a/arch/riscv/kvm/main.c
> +++ b/arch/riscv/kvm/main.c
> @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> csr_write(CSR_HCOUNTEREN, 0x02);
>
> csr_write(CSR_HVIP, 0);
> + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
>
> kvm_riscv_aia_enable();
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index b5ca9f2e98ac..f83f0226439f 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -382,7 +382,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> set_bit(irq, vcpu->arch.irqs_pending);
> @@ -397,14 +398,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> {
> /*
> - * We only allow VS-mode software, timer, and external
> + * We only allow VS-mode software, timer, counter overflow and external
> * interrupts when irq is one of the local interrupts
> * defined by RISC-V privilege specification.
> */
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> clear_bit(irq, vcpu->arch.irqs_pending);
> diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> index 581568847910..1eaaa919aa61 100644
> --- a/arch/riscv/kvm/vcpu_onereg.c
> +++ b/arch/riscv/kvm/vcpu_onereg.c
> @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> /* Multi letter extensions (alphabetically sorted) */
> KVM_ISA_EXT_ARR(SMSTATEEN),
> KVM_ISA_EXT_ARR(SSAIA),
> + KVM_ISA_EXT_ARR(SSCOFPMF),
> KVM_ISA_EXT_ARR(SSTC),
> KVM_ISA_EXT_ARR(SVINVAL),
> KVM_ISA_EXT_ARR(SVNAPOT),
> @@ -88,6 +89,7 @@ static bool kvm_riscv_vcpu_isa_disable_allowed(unsigned long ext)
> case KVM_RISCV_ISA_EXT_I:
> case KVM_RISCV_ISA_EXT_M:
> case KVM_RISCV_ISA_EXT_SSTC:
> + case KVM_RISCV_ISA_EXT_SSCOFPMF:
> case KVM_RISCV_ISA_EXT_SVINVAL:
> case KVM_RISCV_ISA_EXT_SVNAPOT:
> case KVM_RISCV_ISA_EXT_ZBA:
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index e980235b8436..f2bf5b5bdd61 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> return 0;
> }
>
> +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> + u64 period;
> +
> + /*
> + * Stop the event counting by directly accessing the perf_event.
> + * Otherwise, this needs to deferred via a workqueue.
> + * That will introduce skew in the counter value because the actual
> + * physical counter would start after returning from this function.
> + * It will be stopped again once the workqueue is scheduled
> + */
> + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> +
> + /*
> + * The hw counter would start automatically when this function returns.
> + * Thus, the host may continue to interrupt and inject it to the guest
> + * even without the guest configuring the next event. Depending on the hardware
> + * the host may have some sluggishness only if privilege mode filtering is not
> + * available. In an ideal world, where qemu is not the only capable hardware,
> + * this can be removed.
> + * FYI: ARM64 does this way while x86 doesn't do anything as such.
> + * TODO: Should we keep it for RISC-V ?
> + */
> + period = -(local64_read(&perf_event->count));
> +
> + local64_set(&perf_event->hw.period_left, 0);
> + perf_event->attr.sample_period = period;
> + perf_event->hw.sample_period = period;
> +
> + set_bit(pmc->idx, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> +
> + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> +}
> +
> static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> unsigned long flags, unsigned long eidx,
> unsigned long evtdata)
> @@ -248,7 +289,7 @@ static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_att
> */
> attr->sample_period = kvm_pmu_get_sample_period(pmc);
>
> - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> if (IS_ERR(event)) {
> pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> return PTR_ERR(event);
> @@ -473,6 +514,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
> }
>
> + /* The guest have serviced the interrupt and starting the counter again */
> + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> + }
> +
> out:
> retdata->err_val = sbiret;
>
> @@ -539,7 +586,13 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> else if (pmc->perf_event)
> pmc->counter_val += perf_event_read_value(pmc->perf_event,
> &enabled, &running);
> - /* TODO: Add counter overflow support when sscofpmf support is added */
> + /*
> + * The counter and overflow indicies in the snapshot region are w.r.to
> + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> + * which indicates the absolute counter index.
> + */
> + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> @@ -548,15 +601,20 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> if (snap_flag_set) {
> /* Clear the snapshot area for the upcoming deletion event */
> kvpmu->sdata->ctr_values[i] = 0;
> + /*
> + * Only clear the given counter as the caller is responsible to
> + * validate both the overflow mask and configured counters.
> + */
> + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> }
> }
> }
> -
> out:
> retdata->err_val = sbiret;
>
> @@ -699,6 +757,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> pmc = &kvpmu->pmc[i];
> pmc->idx = i;
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> + pmc->vcpu = vcpu;
> if (i < kvpmu->num_hw_ctrs) {
> pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
> if (i < 3)
> @@ -731,13 +790,14 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> if (!kvpmu)
> return;
>
> - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> pmc = &kvpmu->pmc[i];
> pmc->counter_val = 0;
> kvm_pmu_release_perf_event(pmc);
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> }
> - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> kvm_pmu_clear_snapshot_area(vcpu);
> }
> --
> 2.34.1
>
On Sat, Dec 30, 2023 at 3:20 AM Atish Patra <[email protected]> wrote:
>
> The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware
> counters for RV32 based systems.
>
> Add infrastructure to support that.
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 4 ++-
> arch/riscv/kvm/vcpu_pmu.c | 37 ++++++++++++++++++++++++++-
> arch/riscv/kvm/vcpu_sbi_pmu.c | 6 +++++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index af6d0ff5ce41..463c349a9ea5 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -20,7 +20,7 @@ static_assert(RISCV_KVM_MAX_COUNTERS <= 64);
>
> struct kvm_fw_event {
> /* Current value of the event */
> - unsigned long value;
> + u64 value;
>
> /* Event monitoring status */
> bool started;
> @@ -91,6 +91,8 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> struct kvm_vcpu_sbi_return *retdata);
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata);
> +int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + struct kvm_vcpu_sbi_return *retdata);
> void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> unsigned long saddr_high, unsigned long flags,
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index f2bf5b5bdd61..e6ce37819ca2 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -196,6 +196,29 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
> return kvm_pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
> }
>
> +static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *out_val)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + int fevent_code;
> +
> + if (!IS_ENABLED(CONFIG_32BIT))
> + return -EINVAL;
> +
> + pmc = &kvpmu->pmc[cidx];
> +
> + if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW)
> + return -EINVAL;
> +
> + fevent_code = get_event_code(pmc->event_idx);
> + pmc->counter_val = kvpmu->fw_event[fevent_code].value;
> +
> + *out_val = pmc->counter_val >> 32;
> +
> + return 0;
> +}
> +
> static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> unsigned long *out_val)
> {
> @@ -701,6 +724,18 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> return 0;
> }
>
> +int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + struct kvm_vcpu_sbi_return *retdata)
> +{
> + int ret;
> +
> + ret = pmu_fw_ctr_read_hi(vcpu, cidx, &retdata->out_val);
> + if (ret == -EINVAL)
> + retdata->err_val = SBI_ERR_INVALID_PARAM;
> +
> + return 0;
> +}
> +
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata)
> {
> @@ -774,7 +809,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> pmc->cinfo.csr = CSR_CYCLE + i;
> } else {
> pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
> - pmc->cinfo.width = BITS_PER_LONG - 1;
> + pmc->cinfo.width = 63;
> }
> }
>
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> index 9f61136e4bb1..58a0e5587e2a 100644
> --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -64,6 +64,12 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> case SBI_EXT_PMU_COUNTER_FW_READ:
> ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
> break;
> + case SBI_EXT_PMU_COUNTER_FW_READ_HI:
> + if (IS_ENABLED(CONFIG_32BIT))
> + ret = kvm_riscv_vcpu_pmu_fw_ctr_read_hi(vcpu, cp->a0, retdata);
> + else
> + retdata->out_val = 0;
> + break;
> case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
> ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
> break;
> --
> 2.34.1
>
On Fri, Dec 29, 2023 at 01:49:43PM -0800, Atish Patra wrote:
> SBI v2.0 introduced a explicit function to read the upper 32 bits
> for any firmwar counter width that is longer than 32bits.
> This is only applicable for RV32 where firmware counter can be
> 64 bit.
>
> Acked-by: Palmer Dabbelt <[email protected]>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> drivers/perf/riscv_pmu_sbi.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 16acd4dcdb96..646604f8c0a5 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -35,6 +35,8 @@
> PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> +static bool sbi_v2_available;
> +
> static struct attribute *riscv_arch_formats_attr[] = {
> &format_attr_event.attr,
> &format_attr_firmware.attr,
> @@ -488,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
> struct sbiret ret;
> - union sbi_pmu_ctr_info info;
> u64 val = 0;
> + union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
>
> if (pmu_sbi_is_fw_event(event)) {
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> hwc->idx, 0, 0, 0, 0, 0);
> - if (!ret.error)
> - val = ret.value;
> + if (ret.error)
> + return val;
A nit perhaps, but can you just make this return 0 please? I think it
makes the code more obvious in its intent.
Otherwise I think you've satisfied the issues that I had with the
previous iteration:
Reviewed-by: Conor Dooley <[email protected]>
Cheers,
Conor.
> +
> + val = ret.value;
> + if (IS_ENABLED(CONFIG_32BIT) && sbi_v2_available && info.width >= 32) {
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
> + hwc->idx, 0, 0, 0, 0, 0);
> + if (!ret.error)
> + val |= ((u64)ret.value << 32);
> + }
> } else {
> - info = pmu_ctr_list[idx];
> val = riscv_pmu_ctr_read_csr(info.csr);
> if (IS_ENABLED(CONFIG_32BIT))
> val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> @@ -1108,6 +1117,9 @@ static int __init pmu_sbi_devinit(void)
> return 0;
> }
>
> + if (sbi_spec_version >= sbi_mk_version(2, 0))
> + sbi_v2_available = true;
> +
> ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
> "perf/riscv/pmu:starting",
> pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
> --
> 2.34.1
>
On Fri, Dec 29, 2023 at 01:49:45PM -0800, Atish Patra wrote:
> +static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
> + unsigned long ctr_ovf_mask)
> +{
> + int idx = 0;
> + struct perf_event *event;
> + unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
> + u64 max_period, init_val = 0;
> + struct hw_perf_event *hwc;
> + unsigned long ctr_start_mask = 0;
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> +
> + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
> + if (ctr_ovf_mask & (1 << idx)) {
> + event = cpu_hw_evt->events[idx];
> + hwc = &event->hw;
> + max_period = riscv_pmu_ctr_get_width_mask(event);
> + init_val = local64_read(&hwc->prev_count) & max_period;
> + sdata->ctr_values[idx] = init_val;
> + }
> + /* We donot need to update the non-overflow counters the previous
> + * value should have been there already.
> + */
One nit for if this is resent, you've got the wrong comment style here.
Otherwise, looks like the things we discussed before got addressed:
Reviewed-by: Conor Dooley <[email protected]>
Cheers,
Conor.
30.12.2023 00:49, Atish Patra wrote:
>
> KVM enables perf for guest via counter virtualization. However, the
> sampling can not be supported as there is no mechanism to enabled
> trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
> to provide the counter overflow data via the shared memory.
>
> In case of sampling event, the host first guest the LCOFI interrupt
> and injects to the guest via irq filtering mechanism defined in AIA
> specification. Thus, ssaia must be enabled in the host in order to
> use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
> is required.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 1 +
> arch/riscv/include/uapi/asm/kvm.h | 1 +
> arch/riscv/kvm/main.c | 1 +
> arch/riscv/kvm/vcpu.c | 8 +--
> arch/riscv/kvm/vcpu_onereg.c | 2 +
> arch/riscv/kvm/vcpu_pmu.c | 70 +++++++++++++++++++++++++--
> 7 files changed, 77 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 88cdc8a3e654..bec09b33e2f0 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -168,7 +168,8 @@
> #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> (_AC(1, UL) << IRQ_S_TIMER) | \
> - (_AC(1, UL) << IRQ_S_EXT))
> + (_AC(1, UL) << IRQ_S_EXT) | \
> + (_AC(1, UL) << IRQ_PMU_OVF))
>
> /* AIA CSR bits */
> #define TOPI_IID_SHIFT 16
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index d56b901a61fc..af6d0ff5ce41 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -36,6 +36,7 @@ struct kvm_pmc {
> bool started;
> /* Monitoring event ID */
> unsigned long event_idx;
> + struct kvm_vcpu *vcpu;
> };
>
> /* PMU data structure per vcpu */
> diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> index d6b7a5b95874..d5aea43bc797 100644
> --- a/arch/riscv/include/uapi/asm/kvm.h
> +++ b/arch/riscv/include/uapi/asm/kvm.h
> @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> KVM_RISCV_ISA_EXT_ZIHPM,
> KVM_RISCV_ISA_EXT_SMSTATEEN,
> KVM_RISCV_ISA_EXT_ZICOND,
> + KVM_RISCV_ISA_EXT_SSCOFPMF,
> KVM_RISCV_ISA_EXT_MAX,
> };
>
> diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> index 225a435d9c9a..5a3a4cee0e3d 100644
> --- a/arch/riscv/kvm/main.c
> +++ b/arch/riscv/kvm/main.c
> @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> csr_write(CSR_HCOUNTEREN, 0x02);
>
> csr_write(CSR_HVIP, 0);
> + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);\
Hi Atish,
Maybe I missed something, but I thought you planned to move CSR_HVIEN to kvm_riscv_aia_enable() and
add dependency for KVM sscofpmf and AIA since this register is part of AIA spec.
Thank you
Vladimir Isaev
>
> kvm_riscv_aia_enable();
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index b5ca9f2e98ac..f83f0226439f 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -382,7 +382,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> set_bit(irq, vcpu->arch.irqs_pending);
> @@ -397,14 +398,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> {
> /*
> - * We only allow VS-mode software, timer, and external
> + * We only allow VS-mode software, timer, counter overflow and external
> * interrupts when irq is one of the local interrupts
> * defined by RISC-V privilege specification.
> */
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> clear_bit(irq, vcpu->arch.irqs_pending);
> diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> index 581568847910..1eaaa919aa61 100644
> --- a/arch/riscv/kvm/vcpu_onereg.c
> +++ b/arch/riscv/kvm/vcpu_onereg.c
> @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> /* Multi letter extensions (alphabetically sorted) */
> KVM_ISA_EXT_ARR(SMSTATEEN),
> KVM_ISA_EXT_ARR(SSAIA),
> + KVM_ISA_EXT_ARR(SSCOFPMF),
> KVM_ISA_EXT_ARR(SSTC),
> KVM_ISA_EXT_ARR(SVINVAL),
> KVM_ISA_EXT_ARR(SVNAPOT),
> @@ -88,6 +89,7 @@ static bool kvm_riscv_vcpu_isa_disable_allowed(unsigned long ext)
> case KVM_RISCV_ISA_EXT_I:
> case KVM_RISCV_ISA_EXT_M:
> case KVM_RISCV_ISA_EXT_SSTC:
> + case KVM_RISCV_ISA_EXT_SSCOFPMF:
> case KVM_RISCV_ISA_EXT_SVINVAL:
> case KVM_RISCV_ISA_EXT_SVNAPOT:
> case KVM_RISCV_ISA_EXT_ZBA:
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index e980235b8436..f2bf5b5bdd61 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> return 0;
> }
>
> +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> + u64 period;
> +
> + /*
> + * Stop the event counting by directly accessing the perf_event.
> + * Otherwise, this needs to deferred via a workqueue.
> + * That will introduce skew in the counter value because the actual
> + * physical counter would start after returning from this function.
> + * It will be stopped again once the workqueue is scheduled
> + */
> + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> +
> + /*
> + * The hw counter would start automatically when this function returns.
> + * Thus, the host may continue to interrupt and inject it to the guest
> + * even without the guest configuring the next event. Depending on the hardware
> + * the host may have some sluggishness only if privilege mode filtering is not
> + * available. In an ideal world, where qemu is not the only capable hardware,
> + * this can be removed.
> + * FYI: ARM64 does this way while x86 doesn't do anything as such.
> + * TODO: Should we keep it for RISC-V ?
> + */
> + period = -(local64_read(&perf_event->count));
> +
> + local64_set(&perf_event->hw.period_left, 0);
> + perf_event->attr.sample_period = period;
> + perf_event->hw.sample_period = period;
> +
> + set_bit(pmc->idx, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> +
> + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> +}
> +
> static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> unsigned long flags, unsigned long eidx,
> unsigned long evtdata)
> @@ -248,7 +289,7 @@ static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_att
> */
> attr->sample_period = kvm_pmu_get_sample_period(pmc);
>
> - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> if (IS_ERR(event)) {
> pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> return PTR_ERR(event);
> @@ -473,6 +514,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
> }
>
> + /* The guest have serviced the interrupt and starting the counter again */
> + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> + }
> +
> out:
> retdata->err_val = sbiret;
>
> @@ -539,7 +586,13 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> else if (pmc->perf_event)
> pmc->counter_val += perf_event_read_value(pmc->perf_event,
> &enabled, &running);
> - /* TODO: Add counter overflow support when sscofpmf support is added */
> + /*
> + * The counter and overflow indicies in the snapshot region are w.r.to
> + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> + * which indicates the absolute counter index.
> + */
> + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> @@ -548,15 +601,20 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> if (snap_flag_set) {
> /* Clear the snapshot area for the upcoming deletion event */
> kvpmu->sdata->ctr_values[i] = 0;
> + /*
> + * Only clear the given counter as the caller is responsible to
> + * validate both the overflow mask and configured counters.
> + */
> + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> }
> }
> }
> -
> out:
> retdata->err_val = sbiret;
>
> @@ -699,6 +757,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> pmc = &kvpmu->pmc[i];
> pmc->idx = i;
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> + pmc->vcpu = vcpu;
> if (i < kvpmu->num_hw_ctrs) {
> pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
> if (i < 3)
> @@ -731,13 +790,14 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> if (!kvpmu)
> return;
>
> - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> pmc = &kvpmu->pmc[i];
> pmc->counter_val = 0;
> kvm_pmu_release_perf_event(pmc);
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> }
> - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> kvm_pmu_clear_snapshot_area(vcpu);
> }
> --
> 2.34.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On Wed, Jan 10, 2024 at 4:38 AM Vladimir Isaev
<[email protected]> wrote:
>
>
> 30.12.2023 00:49, Atish Patra wrote:
> >
> > KVM enables perf for guest via counter virtualization. However, the
> > sampling can not be supported as there is no mechanism to enabled
> > trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
> > to provide the counter overflow data via the shared memory.
> >
> > In case of sampling event, the host first guest the LCOFI interrupt
> > and injects to the guest via irq filtering mechanism defined in AIA
> > specification. Thus, ssaia must be enabled in the host in order to
> > use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
> > is required.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/csr.h | 3 +-
> > arch/riscv/include/asm/kvm_vcpu_pmu.h | 1 +
> > arch/riscv/include/uapi/asm/kvm.h | 1 +
> > arch/riscv/kvm/main.c | 1 +
> > arch/riscv/kvm/vcpu.c | 8 +--
> > arch/riscv/kvm/vcpu_onereg.c | 2 +
> > arch/riscv/kvm/vcpu_pmu.c | 70 +++++++++++++++++++++++++--
> > 7 files changed, 77 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> > index 88cdc8a3e654..bec09b33e2f0 100644
> > --- a/arch/riscv/include/asm/csr.h
> > +++ b/arch/riscv/include/asm/csr.h
> > @@ -168,7 +168,8 @@
> > #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> > #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> > (_AC(1, UL) << IRQ_S_TIMER) | \
> > - (_AC(1, UL) << IRQ_S_EXT))
> > + (_AC(1, UL) << IRQ_S_EXT) | \
> > + (_AC(1, UL) << IRQ_PMU_OVF))
> >
> > /* AIA CSR bits */
> > #define TOPI_IID_SHIFT 16
> > diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > index d56b901a61fc..af6d0ff5ce41 100644
> > --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > @@ -36,6 +36,7 @@ struct kvm_pmc {
> > bool started;
> > /* Monitoring event ID */
> > unsigned long event_idx;
> > + struct kvm_vcpu *vcpu;
> > };
> >
> > /* PMU data structure per vcpu */
> > diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> > index d6b7a5b95874..d5aea43bc797 100644
> > --- a/arch/riscv/include/uapi/asm/kvm.h
> > +++ b/arch/riscv/include/uapi/asm/kvm.h
> > @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> > KVM_RISCV_ISA_EXT_ZIHPM,
> > KVM_RISCV_ISA_EXT_SMSTATEEN,
> > KVM_RISCV_ISA_EXT_ZICOND,
> > + KVM_RISCV_ISA_EXT_SSCOFPMF,
> > KVM_RISCV_ISA_EXT_MAX,
> > };
> >
> > diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> > index 225a435d9c9a..5a3a4cee0e3d 100644
> > --- a/arch/riscv/kvm/main.c
> > +++ b/arch/riscv/kvm/main.c
> > @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> > csr_write(CSR_HCOUNTEREN, 0x02);
> >
> > csr_write(CSR_HVIP, 0);
> > + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);\
>
> Hi Atish,
>
> Maybe I missed something, but I thought you planned to move CSR_HVIEN to kvm_riscv_aia_enable() and
> add dependency for KVM sscofpmf and AIA since this register is part of AIA spec.
>
I already did that in the previous patch.
https://lore.kernel.org/all/[email protected]/
I am not sure how this line got into this patch. My bad! Thanks for noticing it.
I will send a fixed version right away.
> Thank you
> Vladimir Isaev
>
> >
> > kvm_riscv_aia_enable();
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index b5ca9f2e98ac..f83f0226439f 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -382,7 +382,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > set_bit(irq, vcpu->arch.irqs_pending);
> > @@ -397,14 +398,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > {
> > /*
> > - * We only allow VS-mode software, timer, and external
> > + * We only allow VS-mode software, timer, counter overflow and external
> > * interrupts when irq is one of the local interrupts
> > * defined by RISC-V privilege specification.
> > */
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > clear_bit(irq, vcpu->arch.irqs_pending);
> > diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> > index 581568847910..1eaaa919aa61 100644
> > --- a/arch/riscv/kvm/vcpu_onereg.c
> > +++ b/arch/riscv/kvm/vcpu_onereg.c
> > @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> > /* Multi letter extensions (alphabetically sorted) */
> > KVM_ISA_EXT_ARR(SMSTATEEN),
> > KVM_ISA_EXT_ARR(SSAIA),
> > + KVM_ISA_EXT_ARR(SSCOFPMF),
> > KVM_ISA_EXT_ARR(SSTC),
> > KVM_ISA_EXT_ARR(SVINVAL),
> > KVM_ISA_EXT_ARR(SVNAPOT),
> > @@ -88,6 +89,7 @@ static bool kvm_riscv_vcpu_isa_disable_allowed(unsigned long ext)
> > case KVM_RISCV_ISA_EXT_I:
> > case KVM_RISCV_ISA_EXT_M:
> > case KVM_RISCV_ISA_EXT_SSTC:
> > + case KVM_RISCV_ISA_EXT_SSCOFPMF:
> > case KVM_RISCV_ISA_EXT_SVINVAL:
> > case KVM_RISCV_ISA_EXT_SVNAPOT:
> > case KVM_RISCV_ISA_EXT_ZBA:
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index e980235b8436..f2bf5b5bdd61 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> > return 0;
> > }
> >
> > +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> > + struct perf_sample_data *data,
> > + struct pt_regs *regs)
> > +{
> > + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > + struct kvm_vcpu *vcpu = pmc->vcpu;
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> > + u64 period;
> > +
> > + /*
> > + * Stop the event counting by directly accessing the perf_event.
> > + * Otherwise, this needs to deferred via a workqueue.
> > + * That will introduce skew in the counter value because the actual
> > + * physical counter would start after returning from this function.
> > + * It will be stopped again once the workqueue is scheduled
> > + */
> > + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> > +
> > + /*
> > + * The hw counter would start automatically when this function returns.
> > + * Thus, the host may continue to interrupt and inject it to the guest
> > + * even without the guest configuring the next event. Depending on the hardware
> > + * the host may have some sluggishness only if privilege mode filtering is not
> > + * available. In an ideal world, where qemu is not the only capable hardware,
> > + * this can be removed.
> > + * FYI: ARM64 does this way while x86 doesn't do anything as such.
> > + * TODO: Should we keep it for RISC-V ?
> > + */
> > + period = -(local64_read(&perf_event->count));
> > +
> > + local64_set(&perf_event->hw.period_left, 0);
> > + perf_event->attr.sample_period = period;
> > + perf_event->hw.sample_period = period;
> > +
> > + set_bit(pmc->idx, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> > +
> > + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> > +}
> > +
> > static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> > unsigned long flags, unsigned long eidx,
> > unsigned long evtdata)
> > @@ -248,7 +289,7 @@ static long kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_att
> > */
> > attr->sample_period = kvm_pmu_get_sample_period(pmc);
> >
> > - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> > + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> > if (IS_ERR(event)) {
> > pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> > return PTR_ERR(event);
> > @@ -473,6 +514,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > }
> > }
> >
> > + /* The guest have serviced the interrupt and starting the counter again */
> > + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> > + }
> > +
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -539,7 +586,13 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > else if (pmc->perf_event)
> > pmc->counter_val += perf_event_read_value(pmc->perf_event,
> > &enabled, &running);
> > - /* TODO: Add counter overflow support when sscofpmf support is added */
> > + /*
> > + * The counter and overflow indicies in the snapshot region are w.r.to
> > + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> > + * which indicates the absolute counter index.
> > + */
> > + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> > + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> > kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > @@ -548,15 +601,20 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > if (flags & SBI_PMU_STOP_FLAG_RESET) {
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > clear_bit(pmc_index, kvpmu->pmc_in_use);
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > if (snap_flag_set) {
> > /* Clear the snapshot area for the upcoming deletion event */
> > kvpmu->sdata->ctr_values[i] = 0;
> > + /*
> > + * Only clear the given counter as the caller is responsible to
> > + * validate both the overflow mask and configured counters.
> > + */
> > + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > }
> > }
> > }
> > -
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -699,6 +757,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > pmc = &kvpmu->pmc[i];
> > pmc->idx = i;
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > + pmc->vcpu = vcpu;
> > if (i < kvpmu->num_hw_ctrs) {
> > pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
> > if (i < 3)
> > @@ -731,13 +790,14 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > if (!kvpmu)
> > return;
> >
> > - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> > + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> > pmc = &kvpmu->pmc[i];
> > pmc->counter_val = 0;
> > kvm_pmu_release_perf_event(pmc);
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > }
> > - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> > memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> > kvm_pmu_clear_snapshot_area(vcpu);
> > }
> > --
> > 2.34.1
> >
> >
> > _______________________________________________
> > linux-riscv mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/linux-riscv