This series implements the SBI PMU improvements introduced in SBI v2.0[1],
i.e. the PMU snapshot and fw_read_hi() functions.
SBI v2.0 introduced the PMU snapshot feature, which allows the SBI
implementation to provide counter information (i.e. values/overflow status)
via a shared memory between the SBI implementation and the supervisor OS.
This minimizes the number of traps when perf is used inside a KVM guest, as
it relies on SBI PMU + trap/emulation of the counters.
The current set of ratified RISC-V specifications also doesn't allow
scountovf to be trapped/emulated by the hypervisor. The SBI PMU snapshot
bridges that gap in the ISA as well and enables perf sampling in the guest.
However, LCOFI in the guest only works via the IRQ filtering defined in the
AIA specification. That's why AIA has to be enabled in the hardware (at
least the Ssaia extension) in order to use the sampling support in perf.
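As a rough illustration of the paragraph above, the sketch below shows how a
guest could find the overflowed counters from the shared page instead of
reading scountovf. The struct mirrors the layout this series adds to
asm/sbi.h; the helper name snapshot_overflowed_ctrs() is hypothetical and is
not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the snapshot layout added to asm/sbi.h by this series. */
struct riscv_pmu_snapshot_data {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/*
 * Hypothetical helper: collect the indices of all counters whose bit is
 * set in the overflow mask of the shared page. Returns how many indices
 * were written to out[] (at most 64).
 */
static int snapshot_overflowed_ctrs(const struct riscv_pmu_snapshot_data *sdata,
				    int out[64])
{
	int n = 0;

	assert(sdata != NULL && out != NULL);

	for (int idx = 0; idx < 64; idx++) {
		if (sdata->ctr_overflow_mask & (UINT64_C(1) << idx))
			out[n++] = idx;
	}

	return n;
}
```

No trap is needed on this path because the page is ordinary memory that the
SBI implementation (or the hypervisor) fills in when the counters are stopped.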
Here are the patch-wise implementation details.
PATCH 1-2: Generic cleanups/improvements.
PATCH 3,4,9: FW_READ_HI function implementation
PATCH 5-6: Add PMU snapshot feature in the SBI PMU driver
PATCH 7-8: KVM implementation for snapshot and sampling in KVM guests
The series is based on v6.7-rc3 and is available at:
https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v1
The kvmtool patch is also available at:
https://github.com/atishp04/kvmtool/tree/sscofpmf
It also requires the Ssaia ISA extension to be present in the hardware in
order to get perf sampling support in the guest. In the QEMU virt machine,
this can be done with the following config.
```
-cpu rv64,sscofpmf=true,x-ssaia=true
```
There are no other dependencies on AIA apart from that. Thus, Ssaia must be
disabled for the guest if the AIA patches are not available. Here is an
example command.
```
./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
```
The series has been tested only in QEMU.
Here is a snippet of perf running inside a KVM guest.
===================================================
# perf record -e cycles -e instructions perf bench sched messaging -g 5
...
# Running 'sched/messaging' benchmark:
...
[ 45.928723] perf_duration_warn: 2 callbacks suppressed
[ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
# 20 sender and receiver processes per group
# 5 groups == 200 processes run
Total time: 14.220 [sec]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
# perf report --stdio
# To display the perf.data header info, please use --header/--header-only optio>
#
#
# Total Lost Samples: 0
#
# Samples: 943 of event 'cycles'
# Event count (approx.): 5128976844
#
# Overhead Command Shared Object Symbol >
# ........ ............... ........................... .....................>
#
7.59% sched-messaging [kernel.kallsyms] [k] memcpy
5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
3.16% sched-messaging [kernel.kallsyms] [k] clear_page
3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
===================================================
[1] https://github.com/riscv-non-isa/riscv-sbi-doc
Atish Patra (9):
RISC-V: Fix the typo in Scountovf CSR name
drivers/perf: riscv: Add a flag to indicate SBI v2.0 support
RISC-V: Add FIRMWARE_READ_HI definition
drivers/perf: riscv: Read upper bits of a firmware counter
RISC-V: Add SBI PMU snapshot definitions
drivers/perf: riscv: Implement SBI PMU snapshot function
RISC-V: KVM: Implement SBI PMU Snapshot feature
RISC-V: KVM: Add perf sampling support for guests
RISC-V: KVM: Support 64 bit firmware counters on RV32
arch/riscv/include/asm/csr.h | 5 +-
arch/riscv/include/asm/errata_list.h | 2 +-
arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +-
arch/riscv/include/asm/sbi.h | 11 ++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/main.c | 1 +
arch/riscv/kvm/vcpu.c | 8 +-
arch/riscv/kvm/vcpu_onereg.c | 1 +
arch/riscv/kvm/vcpu_pmu.c | 232 ++++++++++++++++++++++++--
arch/riscv/kvm/vcpu_sbi_pmu.c | 10 ++
drivers/perf/riscv_pmu.c | 1 +
drivers/perf/riscv_pmu_sbi.c | 219 ++++++++++++++++++++++--
include/linux/perf/riscv_pmu.h | 6 +
13 files changed, 478 insertions(+), 35 deletions(-)
--
2.34.1
SBI v2.0 added a few functions to improve the SBI PMU extension. In order
to be backward compatible, the driver must use these functions only
if SBI v2.0 is available.
Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 5 +++++
1 file changed, 5 insertions(+)
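
For readers unfamiliar with the version check in the hunk below, here is a
minimal userspace sketch of the comparison. It assumes the SBI spec version
encoding (major in bits [30:24], minor in bits [23:0]); mk_version() and
sbi_at_least_v2() are stand-in names for illustration, not the kernel
helpers.

```c
#include <stdbool.h>

/* SBI spec version encoding: major in bits [30:24], minor in bits [23:0]. */
#define SPEC_VERSION_MAJOR_SHIFT	24
#define SPEC_VERSION_MAJOR_MASK		0x7fUL
#define SPEC_VERSION_MINOR_MASK		0xffffffUL

/* Stand-in for the kernel's version constructor (name is hypothetical). */
static unsigned long mk_version(unsigned long major, unsigned long minor)
{
	return ((major & SPEC_VERSION_MAJOR_MASK) << SPEC_VERSION_MAJOR_SHIFT) |
	       (minor & SPEC_VERSION_MINOR_MASK);
}

/* True when the reported spec version is v2.0 or newer. */
static bool sbi_at_least_v2(unsigned long spec_version)
{
	return spec_version >= mk_version(2, 0);
}
```

Because the major number sits above the minor number, a plain integer
comparison orders versions correctly, which is why a single ">=" suffices.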
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 16acd4dcdb96..40a335350d08 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -35,6 +35,8 @@
PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");
+static bool sbi_v2_available;
+
static struct attribute *riscv_arch_formats_attr[] = {
&format_attr_event.attr,
&format_attr_firmware.attr,
@@ -1108,6 +1110,9 @@ static int __init pmu_sbi_devinit(void)
return 0;
}
+ if (sbi_spec_version >= sbi_mk_version(2, 0))
+ sbi_v2_available = true;
+
ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
"perf/riscv/pmu:starting",
pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
--
2.34.1
The counter overflow CSR name is "scountovf" not "sscountovf".
Fix the csr name.
Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support")
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/csr.h | 2 +-
arch/riscv/include/asm/errata_list.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 306a19a5509c..88cdc8a3e654 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -281,7 +281,7 @@
#define CSR_HPMCOUNTER30H 0xc9e
#define CSR_HPMCOUNTER31H 0xc9f
-#define CSR_SSCOUNTOVF 0xda0
+#define CSR_SCOUNTOVF 0xda0
#define CSR_SSTATUS 0x100
#define CSR_SIE 0x104
diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
index 83ed25e43553..7026fba12eeb 100644
--- a/arch/riscv/include/asm/errata_list.h
+++ b/arch/riscv/include/asm/errata_list.h
@@ -152,7 +152,7 @@ asm volatile(ALTERNATIVE_2( \
#define ALT_SBI_PMU_OVERFLOW(__ovl) \
asm volatile(ALTERNATIVE( \
- "csrr %0, " __stringify(CSR_SSCOUNTOVF), \
+ "csrr %0, " __stringify(CSR_SCOUNTOVF), \
"csrr %0, " __stringify(THEAD_C9XX_CSR_SCOUNTEROF), \
THEAD_VENDOR_ID, ERRATA_THEAD_PMU, \
CONFIG_ERRATA_THEAD_PMU) \
--
2.34.1
The SBI PMU snapshot function reduces the number of traps to the
higher privilege mode by leveraging a shared memory between the S/VS-mode
and the M/HS-mode. Add the definitions for that extension.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 10 ++++++++++
1 file changed, 10 insertions(+)
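
A quick sanity check of the layout defined below, as a standalone sketch:
the three fields add up to 512 64-bit words, so the structure is exactly
4096 bytes, the same size as the single page the driver patch in this
series allocates per CPU (assuming 4 KiB pages). snapshot_area_size() is a
hypothetical helper used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the definition added to asm/sbi.h. */
struct riscv_pmu_snapshot_data {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/* 1 + 64 + 447 = 512 words of 8 bytes each, i.e. 4096 bytes. */
static_assert(sizeof(struct riscv_pmu_snapshot_data) == 4096,
	      "snapshot area should be 4096 bytes");
static_assert(offsetof(struct riscv_pmu_snapshot_data, ctr_values) == 8,
	      "counter values follow the overflow mask");

/* Hypothetical helper: size of the shared snapshot area in bytes. */
static size_t snapshot_area_size(void)
{
	return sizeof(struct riscv_pmu_snapshot_data);
}
```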
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index f3eeca79a02d..29821addb9b7 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -122,6 +122,7 @@ enum sbi_ext_pmu_fid {
SBI_EXT_PMU_COUNTER_STOP,
SBI_EXT_PMU_COUNTER_FW_READ,
SBI_EXT_PMU_COUNTER_FW_READ_HI,
+ SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
};
union sbi_pmu_ctr_info {
@@ -138,6 +139,13 @@ union sbi_pmu_ctr_info {
};
};
+/* Data structure to contain the pmu snapshot data */
+struct riscv_pmu_snapshot_data {
+ uint64_t ctr_overflow_mask;
+ uint64_t ctr_values[64];
+ uint64_t reserved[447];
+};
+
#define RISCV_PMU_RAW_EVENT_MASK GENMASK_ULL(47, 0)
#define RISCV_PMU_RAW_EVENT_IDX 0x20000
@@ -234,9 +242,11 @@ enum sbi_pmu_ctr_type {
/* Flags defined for counter start function */
#define SBI_PMU_START_FLAG_SET_INIT_VALUE (1 << 0)
+#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT (1 << 1)
/* Flags defined for counter stop function */
#define SBI_PMU_STOP_FLAG_RESET (1 << 0)
+#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT (1 << 1)
enum sbi_ext_dbcn_fid {
SBI_EXT_DBCN_CONSOLE_WRITE = 0,
--
2.34.1
SBI v2.0 introduced the PMU snapshot feature, which adds the following
features.
1. Read counter values directly from the shared memory instead of
a CSR read.
2. Start multiple counters with initial values with one SBI call.
These features reduce the number of traps to the higher privilege
mode. If the kernel is in VS-mode while the hypervisor deploys the
trap & emulate method, this avoids all the hpmcounter CSR read traps.
If the kernel is running in S-mode, the benefit is reduced to CSR
latency vs DRAM/cache latency, as there is no trap involved while
accessing the hpmcounter CSRs.
In both modes, it saves ecalls when starting multiple counters
together with initial values. This is a likely scenario if multiple
counters overflow at the same time.
Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu.c | 1 +
drivers/perf/riscv_pmu_sbi.c | 203 ++++++++++++++++++++++++++++++---
include/linux/perf/riscv_pmu.h | 6 +
3 files changed, 197 insertions(+), 13 deletions(-)
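
The restart path described above can be modelled in plain C as follows.
This is only a sketch under stated assumptions: snapshot_prepare_restart(),
its parameters and MAX_COUNTERS are hypothetical names, and the real driver
derives the initial value and width mask from the perf event rather than
from arrays passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COUNTERS 64

struct riscv_pmu_snapshot_data {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/*
 * Hypothetical model: for every used counter that overflowed, write its
 * re-arm value into the shared page. Counters that did not overflow keep
 * the value captured when they were stopped. The returned mask is what a
 * single COUNTER_START call with the INIT_FROM_SNAPSHOT flag would start.
 */
static uint64_t snapshot_prepare_restart(struct riscv_pmu_snapshot_data *sdata,
					 uint64_t used_mask, uint64_t ovf_mask,
					 const uint64_t prev_count[MAX_COUNTERS],
					 uint64_t width_mask)
{
	assert(sdata != NULL && prev_count != NULL);

	for (int idx = 0; idx < MAX_COUNTERS; idx++) {
		uint64_t bit = UINT64_C(1) << idx;

		if ((used_mask & bit) && (ovf_mask & bit))
			sdata->ctr_values[idx] = prev_count[idx] & width_mask;
	}

	return used_mask;
}
```

With N counters in use, the non-snapshot path may need up to N start calls
with individual initial values, while this path needs one.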
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index 0dda70e1ef90..5b57acb770d3 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -412,6 +412,7 @@ struct riscv_pmu *riscv_pmu_alloc(void)
cpuc->n_events = 0;
for (i = 0; i < RISCV_MAX_COUNTERS; i++)
cpuc->events[i] = NULL;
+ cpuc->snapshot_addr = NULL;
}
pmu->pmu = (struct pmu) {
.event_init = riscv_pmu_event_init,
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 1c9049e6b574..1b8b6de63b69 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -36,6 +36,9 @@ PMU_FORMAT_ATTR(event, "config:0-47");
PMU_FORMAT_ATTR(firmware, "config:63");
static bool sbi_v2_available;
+static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available);
+#define sbi_pmu_snapshot_available() \
+ static_branch_unlikely(&sbi_pmu_snapshot_available)
static struct attribute *riscv_arch_formats_attr[] = {
&format_attr_event.attr,
@@ -485,14 +488,101 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig)
return ret;
}
+static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
+{
+ int cpu;
+ struct cpu_hw_events *cpu_hw_evt;
+
+ for_each_possible_cpu(cpu) {
+ cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+ if (!cpu_hw_evt->snapshot_addr)
+ continue;
+ free_page((unsigned long)cpu_hw_evt->snapshot_addr);
+ cpu_hw_evt->snapshot_addr = NULL;
+ cpu_hw_evt->snapshot_addr_phys = 0;
+ }
+}
+
+static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu)
+{
+ int cpu;
+ struct page *snapshot_page;
+ struct cpu_hw_events *cpu_hw_evt;
+
+ for_each_possible_cpu(cpu) {
+ cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+ if (cpu_hw_evt->snapshot_addr)
+ continue;
+ snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!snapshot_page) {
+ pmu_sbi_snapshot_free(pmu);
+ return -ENOMEM;
+ }
+ cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page);
+ cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page);
+ }
+
+ return 0;
+}
+
+static void pmu_sbi_snapshot_disable(void)
+{
+ sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, -1,
+ -1, 0, 0, 0, 0);
+}
+
+static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu)
+{
+ struct cpu_hw_events *cpu_hw_evt;
+ struct sbiret ret = {0};
+ int rc;
+
+ cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
+ if (!cpu_hw_evt->snapshot_addr_phys)
+ return -EINVAL;
+
+ if (cpu_hw_evt->snapshot_set_done)
+ return 0;
+
+#if defined(CONFIG_32BIT)
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
+ (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0);
+#else
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
+ 0, 0, 0, 0, 0);
+#endif
+ /* Free up the snapshot area memory and fall back to default SBI */
+ if (ret.error) {
+ if (ret.error != SBI_ERR_NOT_SUPPORTED)
+ pr_warn("%s: pmu snapshot setup failed with error %ld\n", __func__,
+ ret.error);
+ rc = sbi_err_map_linux_errno(ret.error);
+ if (rc)
+ return rc;
+ }
+
+ cpu_hw_evt->snapshot_set_done = true;
+
+ return 0;
+}
+
static u64 pmu_sbi_ctr_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
struct sbiret ret;
u64 val = 0;
+ struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
+ /* Read the value from the shared memory directly */
+ if (sbi_pmu_snapshot_available()) {
+ val = sdata->ctr_values[idx];
+ goto done;
+ }
+
if (pmu_sbi_is_fw_event(event)) {
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
hwc->idx, 0, 0, 0, 0, 0);
@@ -512,6 +602,7 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
}
+done:
return val;
}
@@ -539,6 +630,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
struct hw_perf_event *hwc = &event->hw;
unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
+ /* There is no benefit setting SNAPSHOT FLAG for a single counter */
#if defined(CONFIG_32BIT)
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx,
1, flag, ival, ival >> 32, 0);
@@ -559,16 +651,29 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
{
struct sbiret ret;
struct hw_perf_event *hwc = &event->hw;
+ struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
(hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
pmu_sbi_reset_scounteren((void *)event);
+ if (sbi_pmu_snapshot_available())
+ flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
+
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
- if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
- flag != SBI_PMU_STOP_FLAG_RESET)
+ if (!ret.error && sbi_pmu_snapshot_available()) {
+ /* Snapshot is taken relative to the counter idx base. Apply a fixup. */
+ if (hwc->idx > 0) {
+ sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
+ sdata->ctr_values[0] = 0;
+ }
+ } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
+ flag != SBI_PMU_STOP_FLAG_RESET) {
pr_err("Stopping counter idx %d failed with error %d\n",
hwc->idx, sbi_err_map_linux_errno(ret.error));
+ }
}
static int pmu_sbi_find_num_ctrs(void)
@@ -626,10 +731,14 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu)
static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
{
struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ unsigned long flag = 0;
+
+ if (sbi_pmu_snapshot_available())
+ flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
/* No need to check the error here as we can't do anything about the error */
sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0,
- cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0);
+ cpu_hw_evt->used_hw_ctrs[0], flag, 0, 0, 0);
}
/*
@@ -638,11 +747,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
* while the overflowed counters need to be started with updated initialization
* value.
*/
-static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
- unsigned long ctr_ovf_mask)
+static noinline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt,
+ unsigned long ctr_ovf_mask)
{
int idx = 0;
- struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
struct perf_event *event;
unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
unsigned long ctr_start_mask = 0;
@@ -677,6 +785,49 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
}
}
+static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
+ unsigned long ctr_ovf_mask)
+{
+ int idx = 0;
+ struct perf_event *event;
+ unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
+ uint64_t max_period;
+ struct hw_perf_event *hwc;
+ u64 init_val = 0;
+ unsigned long ctr_start_mask = 0;
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
+
+ for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
+ if (ctr_ovf_mask & (1UL << idx)) {
+ event = cpu_hw_evt->events[idx];
+ hwc = &event->hw;
+ max_period = riscv_pmu_ctr_get_width_mask(event);
+ init_val = local64_read(&hwc->prev_count) & max_period;
+ sdata->ctr_values[idx] = init_val;
+ }
+ /* We do not need to update the non-overflowed counters; their
+ * previous values should already be there.
+ */
+ }
+
+ ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0];
+
+ /* Start all the counters in a single shot */
+ sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask,
+ flag, 0, 0, 0);
+}
+
+static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
+ unsigned long ctr_ovf_mask)
+{
+ struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+
+ if (sbi_pmu_snapshot_available())
+ pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask);
+ else
+ pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask);
+}
+
static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
{
struct perf_sample_data data;
@@ -690,6 +841,7 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
unsigned long overflowed_ctrs = 0;
struct cpu_hw_events *cpu_hw_evt = dev;
u64 start_clock = sched_clock();
+ struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
if (WARN_ON_ONCE(!cpu_hw_evt))
return IRQ_NONE;
@@ -711,8 +863,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
pmu_sbi_stop_hw_ctrs(pmu);
/* Overflow status register should only be read after counter are stopped */
- ALT_SBI_PMU_OVERFLOW(overflow);
-
+ if (sbi_pmu_snapshot_available())
+ overflow = sdata->ctr_overflow_mask;
+ else
+ ALT_SBI_PMU_OVERFLOW(overflow);
/*
* Overflow interrupt pending bit should only be cleared after stopping
* all the counters to avoid any race condition.
@@ -774,6 +928,7 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
{
struct riscv_pmu *pmu = hlist_entry_safe(node, struct riscv_pmu, node);
struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
+ int ret = 0;
/*
* We keep enabling userspace access to CYCLE, TIME and INSTRET via the
@@ -794,7 +949,10 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE);
}
- return 0;
+ if (sbi_pmu_snapshot_available())
+ ret = pmu_sbi_snapshot_setup(pmu, cpu);
+
+ return ret;
}
static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
@@ -807,6 +965,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
/* Disable all counters access for user mode now */
csr_write(CSR_SCOUNTEREN, 0x0);
+ if (sbi_pmu_snapshot_available())
+ pmu_sbi_snapshot_disable();
+
return 0;
}
@@ -1076,10 +1237,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
pmu->event_unmapped = pmu_sbi_event_unmapped;
pmu->csr_index = pmu_sbi_csr_index;
- ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
- if (ret)
- return ret;
-
ret = riscv_pm_pmu_register(pmu);
if (ret)
goto out_unregister;
@@ -1088,8 +1245,28 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
if (ret)
goto out_unregister;
+ /* SBI PMU Snapshot is only available in SBI v2.0 */
+ if (sbi_v2_available) {
+ ret = pmu_sbi_snapshot_alloc(pmu);
+ if (ret)
+ goto out_unregister;
+ ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
+ if (!ret) {
+ pr_info("SBI PMU snapshot is available to optimize the PMU traps\n");
+ /* We enable it once here for the boot cpu. If snapshot shmem fails during
+ * cpu hotplug on, it should bail out.
+ */
+ static_branch_enable(&sbi_pmu_snapshot_available);
+ }
+ /* Snapshot is an optional feature. Continue if not available */
+ }
+
register_sysctl("kernel", sbi_pmu_sysctl_table);
+ ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
+ if (ret)
+ return ret;
+
return 0;
out_unregister:
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 43282e22ebe1..c3fa90970042 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -39,6 +39,12 @@ struct cpu_hw_events {
DECLARE_BITMAP(used_hw_ctrs, RISCV_MAX_COUNTERS);
/* currently enabled firmware counters */
DECLARE_BITMAP(used_fw_ctrs, RISCV_MAX_COUNTERS);
+ /* The virtual address of the shared memory where counter snapshot will be taken */
+ void *snapshot_addr;
+ /* The physical address of the shared memory where counter snapshot will be taken */
+ phys_addr_t snapshot_addr_phys;
+ /* Boolean flag to indicate setup is already done */
+ bool snapshot_set_done;
};
struct riscv_pmu {
--
2.34.1
SBI v2.0 introduced an explicit function to read the upper bits
of any firmware counter whose width is larger than XLEN. Currently,
this is only applicable to RV32, where a firmware counter can be
64 bits wide.
Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
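
A minimal sketch of how the two halves are combined on RV32. The function
names and the v2_available parameter are hypothetical stand-ins for the
driver state; lo and hi represent the values returned by the FW_READ and
FW_READ_HI calls respectively.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the lower and upper 32-bit halves into one 64-bit value. */
static uint64_t fw_counter_combine(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/*
 * Hypothetical model of the read path: without SBI v2.0 only the lower
 * half is available, so the upper half is treated as zero.
 */
static uint64_t fw_counter_read(uint32_t lo, uint32_t hi, bool v2_available)
{
	return v2_available ? fw_counter_combine(lo, hi) : (uint64_t)lo;
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit value
by 32 would be undefined behaviour in C.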
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 40a335350d08..1c9049e6b574 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -490,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
struct sbiret ret;
- union sbi_pmu_ctr_info info;
u64 val = 0;
+ union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
if (pmu_sbi_is_fw_event(event)) {
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
hwc->idx, 0, 0, 0, 0, 0);
if (!ret.error)
val = ret.value;
+#if defined(CONFIG_32BIT)
+ if (sbi_v2_available && info.width >= 32) {
+ ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
+ hwc->idx, 0, 0, 0, 0, 0);
+ if (!ret.error)
+ val = val | ((u64)ret.value << 32);
+ }
+#endif
} else {
- info = pmu_ctr_list[idx];
val = riscv_pmu_ctr_read_csr(info.csr);
if (IS_ENABLED(CONFIG_32BIT))
val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
--
2.34.1
SBI v2.0 added another function to the SBI PMU extension to read
the upper bits of a counter whose width is larger than XLEN.
Add the definition for that function.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 0892f4421bc4..f3eeca79a02d 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -121,6 +121,7 @@ enum sbi_ext_pmu_fid {
SBI_EXT_PMU_COUNTER_START,
SBI_EXT_PMU_COUNTER_STOP,
SBI_EXT_PMU_COUNTER_FW_READ,
+ SBI_EXT_PMU_COUNTER_FW_READ_HI,
};
union sbi_pmu_ctr_info {
--
2.34.1
The PMU snapshot function minimizes the number of traps when the
guest configures/accesses the hpmcounters. If the snapshot feature
is enabled, the hypervisor updates the shared memory with the counter
data and the state of overflowed counters. The guest can just read the
shared memory instead of relying on trap & emulate by the hypervisor.
This patch doesn't implement the counter overflow yet.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 10 ++
arch/riscv/kvm/vcpu_pmu.c | 129 ++++++++++++++++++++++++--
arch/riscv/kvm/vcpu_sbi_pmu.c | 3 +
3 files changed, 134 insertions(+), 8 deletions(-)
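
The address handling of the set-shmem call can be sketched as below. It
models the intended behaviour only: both arguments set to all ones disable
the snapshot area, a non-zero high word is rejected on a 64-bit host, and
on a 32-bit host the high word supplies bits [63:32]. The enum, the function
name and the xlen_bits parameter are hypothetical and exist only to make the
example standalone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum shmem_request {
	SHMEM_DISABLE,	/* guest asked to tear down the snapshot area */
	SHMEM_SET,	/* *addr holds the guest physical address */
	SHMEM_INVALID,	/* address cannot be represented */
};

/*
 * Hypothetical decoder for the (low, high) address pair. lo and hi are the
 * raw register values zero-extended to 64 bits; xlen_bits is 32 or 64.
 */
static enum shmem_request decode_shmem_addr(uint64_t lo, uint64_t hi,
					    unsigned int xlen_bits,
					    uint64_t *addr)
{
	uint64_t all_ones;

	assert(xlen_bits == 32 || xlen_bits == 64);
	assert(addr != NULL);

	all_ones = (xlen_bits == 32) ? UINT32_MAX : UINT64_MAX;
	if (lo == all_ones && hi == all_ones)
		return SHMEM_DISABLE;

	if (xlen_bits == 32) {
		/* The high word carries bits [63:32] of the address. */
		*addr = (hi << 32) | lo;
		return SHMEM_SET;
	}

	if (hi != 0)
		return SHMEM_INVALID;

	*addr = lo;
	return SHMEM_SET;
}
```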
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 395518a1664e..64c75acad6ba 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -36,6 +36,7 @@ struct kvm_pmc {
bool started;
/* Monitoring event ID */
unsigned long event_idx;
+ struct kvm_vcpu *vcpu;
};
/* PMU data structure per vcpu */
@@ -50,6 +51,12 @@ struct kvm_pmu {
bool init_done;
/* Bit map of all the virtual counter used */
DECLARE_BITMAP(pmc_in_use, RISCV_KVM_MAX_COUNTERS);
+ /* Bit map of all the virtual counter overflown */
+ DECLARE_BITMAP(pmc_overflown, RISCV_KVM_MAX_COUNTERS);
+ /* The address of the counter snapshot area (guest physical address) */
+ unsigned long snapshot_addr;
+ /* The actual data of the snapshot */
+ struct riscv_pmu_snapshot_data *sdata;
};
#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu_context)
@@ -85,6 +92,9 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
+int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
+ unsigned long saddr_high, unsigned long flags,
+ struct kvm_vcpu_sbi_return *retdata);
void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 86391a5061dd..622c4ee89e7b 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -310,6 +310,79 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
return ret;
}
+static void kvm_pmu_clear_snapshot_area(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
+
+ if (kvpmu->sdata) {
+ memset(kvpmu->sdata, 0, snapshot_area_size);
+ if (kvpmu->snapshot_addr != INVALID_GPA)
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr,
+ kvpmu->sdata, snapshot_area_size);
+ }
+ kvpmu->snapshot_addr = INVALID_GPA;
+}
+
+int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
+ unsigned long saddr_high, unsigned long flags,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
+ int sbiret = 0;
+ gpa_t saddr;
+ unsigned long hva;
+ bool writable;
+
+ if (!kvpmu) {
+ sbiret = SBI_ERR_INVALID_PARAM;
+ goto out;
+ }
+
+ if (saddr_low == -1 && saddr_high == -1) {
+ kvm_pmu_clear_snapshot_area(vcpu);
+ return 0;
+ }
+
+ saddr = saddr_low;
+
+ if (saddr_high != 0) {
+#ifdef CONFIG_32BIT
+ saddr |= ((gpa_t)saddr_high << 32);
+#else
+ sbiret = SBI_ERR_INVALID_ADDRESS;
+ goto out;
+#endif
+ }
+
+ if (kvm_is_error_gpa(vcpu->kvm, saddr)) {
+ sbiret = SBI_ERR_INVALID_PARAM;
+ goto out;
+ }
+
+ hva = kvm_vcpu_gfn_to_hva_prot(vcpu, saddr >> PAGE_SHIFT, &writable);
+ if (kvm_is_error_hva(hva) || !writable) {
+ sbiret = SBI_ERR_INVALID_ADDRESS;
+ goto out;
+ }
+
+ kvpmu->snapshot_addr = saddr;
+ kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC);
+ if (!kvpmu->sdata)
+ return -ENOMEM;
+
+ if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) {
+ kfree(kvpmu->sdata);
+ kvpmu->snapshot_addr = INVALID_GPA;
+ sbiret = SBI_ERR_FAILURE;
+ }
+out:
+ retdata->err_val = sbiret;
+
+ return 0;
+}
+
int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu,
struct kvm_vcpu_sbi_return *retdata)
{
@@ -343,8 +416,10 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
int i, pmc_index, sbiret = 0;
struct kvm_pmc *pmc;
int fevent_code;
+ bool bSnapshot = flags & SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
- if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
+ if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
+ (bSnapshot && kvpmu->snapshot_addr == INVALID_GPA)) {
sbiret = SBI_ERR_INVALID_PARAM;
goto out;
}
@@ -355,8 +430,14 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
if (!test_bit(pmc_index, kvpmu->pmc_in_use))
continue;
pmc = &kvpmu->pmc[pmc_index];
- if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE)
+ if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE) {
pmc->counter_val = ival;
+ } else if (bSnapshot) {
+ kvm_vcpu_read_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ pmc->counter_val = kvpmu->sdata->ctr_values[pmc_index];
+ }
+
if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
fevent_code = get_event_code(pmc->event_idx);
if (fevent_code >= SBI_PMU_FW_MAX) {
@@ -400,8 +481,10 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
u64 enabled, running;
struct kvm_pmc *pmc;
int fevent_code;
+ bool bSnapshot = flags & SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
- if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
+ if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
+ (bSnapshot && (kvpmu->snapshot_addr == INVALID_GPA))) {
sbiret = SBI_ERR_INVALID_PARAM;
goto out;
}
@@ -423,27 +506,52 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
sbiret = SBI_ERR_ALREADY_STOPPED;
kvpmu->fw_event[fevent_code].started = false;
+ /* No need to increment the value as it is absolute for firmware events */
+ pmc->counter_val = kvpmu->fw_event[fevent_code].value;
} else if (pmc->perf_event) {
if (pmc->started) {
/* Stop counting the counter */
perf_event_disable(pmc->perf_event);
- pmc->started = false;
} else {
sbiret = SBI_ERR_ALREADY_STOPPED;
}
- if (flags & SBI_PMU_STOP_FLAG_RESET) {
- /* Relase the counter if this is a reset request */
+ /* Stop counting the counter */
+ perf_event_disable(pmc->perf_event);
+
+ /* We only update if stopped is already called. The caller may stop/reset
+ * the event in two steps.
+ */
+ if (pmc->started) {
pmc->counter_val += perf_event_read_value(pmc->perf_event,
&enabled, &running);
+ pmc->started = false;
+ }
+
+ if (flags & SBI_PMU_STOP_FLAG_RESET) {
+ /* Release the counter if this is a reset request */
kvm_pmu_release_perf_event(pmc);
}
} else {
sbiret = SBI_ERR_INVALID_PARAM;
}
+
+ if (bSnapshot && !sbiret) {
+ //TODO: Add counter overflow support when sscofpmf support is added
+ kvpmu->sdata->ctr_values[i] = pmc->counter_val;
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ }
+
if (flags & SBI_PMU_STOP_FLAG_RESET) {
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
clear_bit(pmc_index, kvpmu->pmc_in_use);
+ if (bSnapshot) {
+ /* Clear the snapshot area for the upcoming deletion event */
+ kvpmu->sdata->ctr_values[i] = 0;
+ kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
+ sizeof(struct riscv_pmu_snapshot_data));
+ }
}
}
@@ -517,8 +625,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
kvpmu->fw_event[event_code].started = true;
} else {
ret = kvm_pmu_create_perf_event(pmc, &attr, flags, eidx, evtdata);
- if (ret)
- return ret;
+ if (ret) {
+ sbiret = SBI_ERR_NOT_SUPPORTED;
+ goto out;
+ }
}
set_bit(ctr_idx, kvpmu->pmc_in_use);
@@ -566,6 +676,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
kvpmu->num_hw_ctrs = num_hw_ctrs + 1;
kvpmu->num_fw_ctrs = SBI_PMU_FW_MAX;
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
+ kvpmu->snapshot_addr = INVALID_GPA;
if (kvpmu->num_hw_ctrs > RISCV_KVM_MAX_HW_CTRS) {
pr_warn_once("Limiting the hardware counters to 32 as specified by the ISA");
@@ -585,6 +696,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc = &kvpmu->pmc[i];
pmc->idx = i;
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
+ pmc->vcpu = vcpu;
if (i < kvpmu->num_hw_ctrs) {
pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
if (i < 3)
@@ -625,6 +737,7 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
}
bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
+ kvpmu->snapshot_addr = INVALID_GPA;
}
void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index 7eca72df2cbd..77c20a61fd7d 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -64,6 +64,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
case SBI_EXT_PMU_COUNTER_FW_READ:
ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
break;
+ case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
+ ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
+ break;
default:
retdata->err_val = SBI_ERR_NOT_SUPPORTED;
}
--
2.34.1
KVM enables perf for the guest via counter virtualization. However,
sampling cannot be supported, as there is no mechanism in the ISA yet to
trap/emulate scountovf. Rely on the SBI PMU snapshot to provide the
counter overflow data via the shared memory.
In case of a sampling event, the host first gets the LCOFI interrupt
and injects it into the guest via the irq filtering mechanism defined in
the AIA specification. Thus, Ssaia must be enabled in the host in order to
use perf sampling in the guest. No other AIA dependency w.r.t. the kernel
is required.
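For illustration only (not part of this patch): a minimal sketch of how the overflow data lands in the shared snapshot area. The overflow bit and counter value are stored at an index relative to the counter base, not at the absolute counter index. The struct and helper names below are hypothetical and the layout is simplified; only the ctr_overflow_mask/ctr_values field names come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SNAPSHOT_MAX_COUNTERS 64 /* assumption made for this sketch */

/* Simplified, hypothetical stand-in for struct riscv_pmu_snapshot_data. */
struct snapshot_sketch {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[SNAPSHOT_MAX_COUNTERS];
};

/*
 * Record the value (and overflow status) of the counter with absolute
 * index pmc_index. The slot used is relative to ctr_base.
 * Returns false if the counter does not fit in the snapshot area.
 */
static bool snapshot_record(struct snapshot_sketch *s, unsigned int ctr_base,
			    unsigned int pmc_index, uint64_t value,
			    bool overflown)
{
	unsigned int i;

	if (pmc_index < ctr_base)
		return false;
	i = pmc_index - ctr_base;
	if (i >= SNAPSHOT_MAX_COUNTERS)
		return false;

	s->ctr_values[i] = value;
	if (overflown)
		s->ctr_overflow_mask |= (1ULL << i);
	return true;
}

/* Clear one relative slot, as done on a stop with the reset flag. */
static void snapshot_reset(struct snapshot_sketch *s, unsigned int i)
{
	if (i >= SNAPSHOT_MAX_COUNTERS)
		return;
	s->ctr_values[i] = 0;
	s->ctr_overflow_mask &= ~(1ULL << i);
}
```

With a counter base of 3, absolute counter 5 ends up in slot 2 of both the value array and the overflow mask.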
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/main.c | 1 +
arch/riscv/kvm/vcpu.c | 8 ++--
arch/riscv/kvm/vcpu_onereg.c | 1 +
arch/riscv/kvm/vcpu_pmu.c | 69 ++++++++++++++++++++++++++++---
6 files changed, 73 insertions(+), 10 deletions(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 88cdc8a3e654..bec09b33e2f0 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -168,7 +168,8 @@
#define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
#define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
(_AC(1, UL) << IRQ_S_TIMER) | \
- (_AC(1, UL) << IRQ_S_EXT))
+ (_AC(1, UL) << IRQ_S_EXT) | \
+ (_AC(1, UL) << IRQ_PMU_OVF))
/* AIA CSR bits */
#define TOPI_IID_SHIFT 16
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index 60d3b21dead7..741c16f4518e 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
KVM_RISCV_ISA_EXT_ZIHPM,
KVM_RISCV_ISA_EXT_SMSTATEEN,
KVM_RISCV_ISA_EXT_ZICOND,
+ KVM_RISCV_ISA_EXT_SSCOFPMF,
KVM_RISCV_ISA_EXT_MAX,
};
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 225a435d9c9a..5a3a4cee0e3d 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
csr_write(CSR_HCOUNTEREN, 0x02);
csr_write(CSR_HVIP, 0);
+ csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
kvm_riscv_aia_enable();
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index e087c809073c..2d9f252356c3 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -380,7 +380,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
if (irq < IRQ_LOCAL_MAX &&
irq != IRQ_VS_SOFT &&
irq != IRQ_VS_TIMER &&
- irq != IRQ_VS_EXT)
+ irq != IRQ_VS_EXT &&
+ irq != IRQ_PMU_OVF)
return -EINVAL;
set_bit(irq, vcpu->arch.irqs_pending);
@@ -395,14 +396,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
{
/*
- * We only allow VS-mode software, timer, and external
+ * We only allow VS-mode software, timer, counter overflow and external
* interrupts when irq is one of the local interrupts
* defined by RISC-V privilege specification.
*/
if (irq < IRQ_LOCAL_MAX &&
irq != IRQ_VS_SOFT &&
irq != IRQ_VS_TIMER &&
- irq != IRQ_VS_EXT)
+ irq != IRQ_VS_EXT &&
+ irq != IRQ_PMU_OVF)
return -EINVAL;
clear_bit(irq, vcpu->arch.irqs_pending);
diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
index f8c9fa0c03c5..19a0e4eaf0df 100644
--- a/arch/riscv/kvm/vcpu_onereg.c
+++ b/arch/riscv/kvm/vcpu_onereg.c
@@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
/* Multi letter extensions (alphabetically sorted) */
KVM_ISA_EXT_ARR(SMSTATEEN),
KVM_ISA_EXT_ARR(SSAIA),
+ KVM_ISA_EXT_ARR(SSCOFPMF),
KVM_ISA_EXT_ARR(SSTC),
KVM_ISA_EXT_ARR(SVINVAL),
KVM_ISA_EXT_ARR(SVNAPOT),
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 622c4ee89e7b..86c8e92f92d3 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
return 0;
}
+static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+ struct kvm_vcpu *vcpu = pmc->vcpu;
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
+ u64 period;
+
+ /*
+ * Stop the event counting by directly accessing the perf_event.
+ * Otherwise, this needs to be deferred via a workqueue.
+ * That would introduce skew in the counter value because the actual
+ * physical counter would start after returning from this function.
+ * It would be stopped again once the workqueue is scheduled.
+ */
+ rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
+
+ /*
+ * The hw counter would start automatically when this function returns.
+ * Thus, the host may continue to receive interrupts and inject them into
+ * the guest even without the guest configuring the next event. Depending on
+ * the hardware, the host may see some sluggishness only if privilege mode
+ * filtering is not available. In an ideal world, where qemu is not the only
+ * capable hardware, this can be removed.
+ * FYI: ARM64 does it this way, while x86 doesn't do anything as such.
+ * TODO: Should we keep it for RISC-V?
+ */
+ period = -(local64_read(&perf_event->count));
+
+ local64_set(&perf_event->hw.period_left, 0);
+ perf_event->attr.sample_period = period;
+ perf_event->hw.sample_period = period;
+
+ set_bit(pmc->idx, kvpmu->pmc_overflown);
+ kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
+
+ rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
+}
+
static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
unsigned long flags, unsigned long eidx, unsigned long evtdata)
{
@@ -247,7 +288,7 @@ static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr
*/
attr->sample_period = kvm_pmu_get_sample_period(pmc);
- event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
+ event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
if (IS_ERR(event)) {
pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
return PTR_ERR(event);
@@ -466,6 +507,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
}
}
+ /* The guest has serviced the interrupt and is starting the counter again */
+ if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
+ clear_bit(pmc_index, kvpmu->pmc_overflown);
+ kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
+ }
+
out:
retdata->err_val = sbiret;
@@ -537,7 +584,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
}
if (bSnapshot && !sbiret) {
- //TODO: Add counter overflow support when sscofpmf support is added
+ /* The counter and overflow indices in the snapshot region are w.r.t.
+ * cbase. Modify the set bit in the counter mask instead of the pmc_index
+ * which indicates the absolute counter index.
+ */
+ if (test_bit(pmc_index, kvpmu->pmc_overflown))
+ kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
kvpmu->sdata->ctr_values[i] = pmc->counter_val;
kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
sizeof(struct riscv_pmu_snapshot_data));
@@ -546,15 +598,19 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
if (flags & SBI_PMU_STOP_FLAG_RESET) {
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
clear_bit(pmc_index, kvpmu->pmc_in_use);
+ clear_bit(pmc_index, kvpmu->pmc_overflown);
if (bSnapshot) {
/* Clear the snapshot area for the upcoming deletion event */
kvpmu->sdata->ctr_values[i] = 0;
+ /* Only clear the given counter as the caller is responsible for
+ * validating both the overflow mask and configured counters.
+ */
+ kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
sizeof(struct riscv_pmu_snapshot_data));
}
}
}
-
out:
retdata->err_val = sbiret;
@@ -729,15 +785,16 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
if (!kvpmu)
return;
- for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
+ for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
pmc = &kvpmu->pmc[i];
pmc->counter_val = 0;
kvm_pmu_release_perf_event(pmc);
pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
}
- bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
+ bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
+ bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
- kvpmu->snapshot_addr = INVALID_GPA;
+ kvm_pmu_clear_snapshot_area(vcpu);
}
void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
--
2.34.1
SBI v2.0 introduced a fw_read_hi function to read 64-bit firmware
counters on RV32-based systems.
Add infrastructure to support that.
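For illustration only (not part of this patch): a minimal sketch of how the two 32-bit halves of a 64-bit firmware counter relate to each other on RV32. The high half mirrors the `counter_val >> 32` done in pmu_fw_ctr_read_hi(); the helper names are hypothetical.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit firmware counter (what fw_read_hi returns). */
static inline uint32_t fw_ctr_hi(uint64_t value)
{
	return (uint32_t)(value >> 32);
}

/* Lower 32 bits of a 64-bit firmware counter (what fw_read returns on RV32). */
static inline uint32_t fw_ctr_lo(uint64_t value)
{
	return (uint32_t)value;
}

/* Rebuild the full 64-bit value from the two halves on the caller side. */
static inline uint64_t fw_ctr_combine(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

Since the two halves are fetched by separate calls, a caller that needs a consistent value may have to re-read the halves if the counter can advance in between; that is outside the scope of this sketch.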
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 6 ++++-
arch/riscv/kvm/vcpu_pmu.c | 38 ++++++++++++++++++++++++++-
arch/riscv/kvm/vcpu_sbi_pmu.c | 7 +++++
3 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 64c75acad6ba..dd655315e706 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -20,7 +20,7 @@ static_assert(RISCV_KVM_MAX_COUNTERS <= 64);
struct kvm_fw_event {
/* Current value of the event */
- unsigned long value;
+ uint64_t value;
/* Event monitoring status */
bool started;
@@ -91,6 +91,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
struct kvm_vcpu_sbi_return *retdata);
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
+#if defined(CONFIG_32BIT)
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata);
+#endif
void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
unsigned long saddr_high, unsigned long flags,
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 86c8e92f92d3..5b4a93647256 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -195,6 +195,28 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
return kvm_pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
}
+#if defined(CONFIG_32BIT)
+static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u64 enabled, running;
+ int fevent_code;
+
+ pmc = &kvpmu->pmc[cidx];
+
+ if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW)
+ return -EINVAL;
+
+ fevent_code = get_event_code(pmc->event_idx);
+ pmc->counter_val = kvpmu->fw_event[fevent_code].value;
+
+ *out_val = pmc->counter_val >> 32;
+
+ return 0;
+}
+#endif
static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *out_val)
@@ -696,6 +718,20 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
return 0;
}
+#if defined(CONFIG_32BIT)
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+ int ret;
+
+ ret = pmu_fw_ctr_read_hi(vcpu, cidx, &retdata->out_val);
+ if (ret == -EINVAL)
+ retdata->err_val = SBI_ERR_INVALID_PARAM;
+
+ return 0;
+}
+#endif
+
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata)
{
@@ -769,7 +805,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc->cinfo.csr = CSR_CYCLE + i;
} else {
pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
- pmc->cinfo.width = BITS_PER_LONG - 1;
+ pmc->cinfo.width = 63;
}
}
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index 77c20a61fd7d..0cd051d5a448 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -64,6 +64,13 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
case SBI_EXT_PMU_COUNTER_FW_READ:
ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
break;
+ case SBI_EXT_PMU_COUNTER_FW_READ_HI:
+#if defined(CONFIG_32BIT)
+ ret = kvm_riscv_vcpu_pmu_fw_ctr_read_hi(vcpu, cp->a0, retdata);
+#else
+ retdata->out_val = 0;
+#endif
+ break;
case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
break;
--
2.34.1
05.12.2023 05:43, Atish Patra wrote:
>
> KVM enables perf for the guest via counter virtualization. However,
> sampling cannot be supported, as there is no mechanism in the ISA yet to
> trap/emulate scountovf. Rely on the SBI PMU snapshot to provide the
> counter overflow data via the shared memory.
>
> In case of a sampling event, the host first gets the LCOFI interrupt
> and injects it into the guest via the irq filtering mechanism defined in
> the AIA specification. Thus, Ssaia must be enabled in the host in order to
> use perf sampling in the guest. No other AIA dependency w.r.t. the kernel
> is required.
I don't understand why we need HVIEN and AIA. Why can't HIDELEG be used for this purpose?
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/uapi/asm/kvm.h | 1 +
> arch/riscv/kvm/main.c | 1 +
> arch/riscv/kvm/vcpu.c | 8 ++--
> arch/riscv/kvm/vcpu_onereg.c | 1 +
> arch/riscv/kvm/vcpu_pmu.c | 69 ++++++++++++++++++++++++++++---
> 6 files changed, 73 insertions(+), 10 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 88cdc8a3e654..bec09b33e2f0 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -168,7 +168,8 @@
> #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> (_AC(1, UL) << IRQ_S_TIMER) | \
> - (_AC(1, UL) << IRQ_S_EXT))
> + (_AC(1, UL) << IRQ_S_EXT) | \
> + (_AC(1, UL) << IRQ_PMU_OVF))
>
> /* AIA CSR bits */
> #define TOPI_IID_SHIFT 16
> diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> index 60d3b21dead7..741c16f4518e 100644
> --- a/arch/riscv/include/uapi/asm/kvm.h
> +++ b/arch/riscv/include/uapi/asm/kvm.h
> @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> KVM_RISCV_ISA_EXT_ZIHPM,
> KVM_RISCV_ISA_EXT_SMSTATEEN,
> KVM_RISCV_ISA_EXT_ZICOND,
> + KVM_RISCV_ISA_EXT_SSCOFPMF,
> KVM_RISCV_ISA_EXT_MAX,
> };
>
> diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> index 225a435d9c9a..5a3a4cee0e3d 100644
> --- a/arch/riscv/kvm/main.c
> +++ b/arch/riscv/kvm/main.c
> @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> csr_write(CSR_HCOUNTEREN, 0x02);
>
> csr_write(CSR_HVIP, 0);
> + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
Is my understanding correct that this will break KVM for non-AIA CPUs?
As far as I remember, HVIEN depends on AIA.
>
> kvm_riscv_aia_enable();
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index e087c809073c..2d9f252356c3 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -380,7 +380,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> set_bit(irq, vcpu->arch.irqs_pending);
> @@ -395,14 +396,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> {
> /*
> - * We only allow VS-mode software, timer, and external
> + * We only allow VS-mode software, timer, counter overflow and external
> * interrupts when irq is one of the local interrupts
> * defined by RISC-V privilege specification.
> */
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> clear_bit(irq, vcpu->arch.irqs_pending);
> diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> index f8c9fa0c03c5..19a0e4eaf0df 100644
> --- a/arch/riscv/kvm/vcpu_onereg.c
> +++ b/arch/riscv/kvm/vcpu_onereg.c
> @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> /* Multi letter extensions (alphabetically sorted) */
> KVM_ISA_EXT_ARR(SMSTATEEN),
> KVM_ISA_EXT_ARR(SSAIA),
> + KVM_ISA_EXT_ARR(SSCOFPMF),
> KVM_ISA_EXT_ARR(SSTC),
> KVM_ISA_EXT_ARR(SVINVAL),
> KVM_ISA_EXT_ARR(SVNAPOT),
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 622c4ee89e7b..86c8e92f92d3 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> return 0;
> }
>
> +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> + u64 period;
> +
> + /*
> + * Stop the event counting by directly accessing the perf_event.
> + * Otherwise, this needs to be deferred via a workqueue.
> + * That would introduce skew in the counter value because the actual
> + * physical counter would start after returning from this function.
> + * It would be stopped again once the workqueue is scheduled.
> + */
> + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> +
> + /*
> + * The hw counter would start automatically when this function returns.
> + * Thus, the host may continue to receive interrupts and inject them into
> + * the guest even without the guest configuring the next event. Depending on
> + * the hardware, the host may see some sluggishness only if privilege mode
> + * filtering is not available. In an ideal world, where qemu is not the only
> + * capable hardware, this can be removed.
> + * FYI: ARM64 does it this way, while x86 doesn't do anything as such.
> + * TODO: Should we keep it for RISC-V?
> + */
> + period = -(local64_read(&perf_event->count));
> +
> + local64_set(&perf_event->hw.period_left, 0);
> + perf_event->attr.sample_period = period;
> + perf_event->hw.sample_period = period;
> +
> + set_bit(pmc->idx, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> +
> + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> +}
> +
> static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> unsigned long flags, unsigned long eidx, unsigned long evtdata)
> {
> @@ -247,7 +288,7 @@ static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr
> */
> attr->sample_period = kvm_pmu_get_sample_period(pmc);
>
> - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> if (IS_ERR(event)) {
> pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> return PTR_ERR(event);
> @@ -466,6 +507,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
> }
>
> + /* The guest has serviced the interrupt and is starting the counter again */
> + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> + }
> +
> out:
> retdata->err_val = sbiret;
>
> @@ -537,7 +584,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
>
> if (bSnapshot && !sbiret) {
> - //TODO: Add counter overflow support when sscofpmf support is added
> + /* The counter and overflow indices in the snapshot region are w.r.t.
> + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> + * which indicates the absolute counter index.
> + */
> + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> @@ -546,15 +598,19 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> if (bSnapshot) {
> /* Clear the snapshot area for the upcoming deletion event */
> kvpmu->sdata->ctr_values[i] = 0;
> + /* Only clear the given counter as the caller is responsible for
> + * validating both the overflow mask and configured counters.
> + */
> + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> }
> }
> }
> -
> out:
> retdata->err_val = sbiret;
>
> @@ -729,15 +785,16 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> if (!kvpmu)
> return;
>
> - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> pmc = &kvpmu->pmc[i];
> pmc->counter_val = 0;
> kvm_pmu_release_perf_event(pmc);
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> }
> - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> - kvpmu->snapshot_addr = INVALID_GPA;
> + kvm_pmu_clear_snapshot_area(vcpu);
> }
>
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> --
> 2.34.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Thank you,
Vladimir Isaev
On Tue, Dec 5, 2023 at 10:43 PM Vladimir Isaev
<[email protected]> wrote:
>
> 05.12.2023 05:43, Atish Patra wrote:
> >
> > KVM enables perf for the guest via counter virtualization. However,
> > sampling cannot be supported, as there is no mechanism in the ISA yet to
> > trap/emulate scountovf. Rely on the SBI PMU snapshot to provide the
> > counter overflow data via the shared memory.
> >
> > In case of a sampling event, the host first gets the LCOFI interrupt
> > and injects it into the guest via the irq filtering mechanism defined in
> > the AIA specification. Thus, Ssaia must be enabled in the host in order to
> > use perf sampling in the guest. No other AIA dependency w.r.t. the kernel
> > is required.
>
> I don't understand why we need HVIEN and AIA. Why can't HIDELEG be used for this purpose?
>
If it is enabled in HIDELEG, the guest gets the interrupt directly. As
the counters are virtualized, the host needs to get the interrupt and
inject it into the guest by setting the hvip bit.
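For illustration only: a minimal user-space sketch of the injection bookkeeping this relies on, modeled on the checks in kvm_riscv_vcpu_set_interrupt()/kvm_riscv_vcpu_unset_interrupt() after this patch. The struct, the helper names and the 64-bit bitmap are hypothetical, and the interrupt numbers are my understanding of csr.h, so treat them as assumptions.

```c
#include <errno.h>
#include <stdint.h>

/* Interrupt numbers as assumed for this sketch. */
enum {
	SK_IRQ_VS_SOFT   = 2,
	SK_IRQ_VS_TIMER  = 6,
	SK_IRQ_VS_EXT    = 10,
	SK_IRQ_PMU_OVF   = 13,
	SK_IRQ_LOCAL_MAX = 14,
};

/* Hypothetical stand-in for the per-vcpu pending interrupt bitmap. */
struct vcpu_sketch {
	uint64_t irqs_pending;
};

/* Local interrupts are only accepted if they are in the allowed set. */
static int sketch_irq_allowed(unsigned int irq)
{
	if (irq >= 64) /* limit of this sketch's bitmap */
		return 0;
	if (irq < SK_IRQ_LOCAL_MAX &&
	    irq != SK_IRQ_VS_SOFT &&
	    irq != SK_IRQ_VS_TIMER &&
	    irq != SK_IRQ_VS_EXT &&
	    irq != SK_IRQ_PMU_OVF)
		return 0;
	return 1;
}

/* Host side: mark an interrupt pending so it is later reflected in hvip. */
static int sketch_set_interrupt(struct vcpu_sketch *v, unsigned int irq)
{
	if (!sketch_irq_allowed(irq))
		return -EINVAL;
	v->irqs_pending |= (1ULL << irq);
	return 0;
}

/* Host side: clear the pending bit, e.g. once the guest restarts the counter. */
static int sketch_unset_interrupt(struct vcpu_sketch *v, unsigned int irq)
{
	if (!sketch_irq_allowed(irq))
		return -EINVAL;
	v->irqs_pending &= ~(1ULL << irq);
	return 0;
}
```

The point is that the host owns the pending bit: the overflow handler sets it, and the counter start path clears it after the guest has serviced the interrupt.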
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/csr.h | 3 +-
> > arch/riscv/include/uapi/asm/kvm.h | 1 +
> > arch/riscv/kvm/main.c | 1 +
> > arch/riscv/kvm/vcpu.c | 8 ++--
> > arch/riscv/kvm/vcpu_onereg.c | 1 +
> > arch/riscv/kvm/vcpu_pmu.c | 69 ++++++++++++++++++++++++++++---
> > 6 files changed, 73 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> > index 88cdc8a3e654..bec09b33e2f0 100644
> > --- a/arch/riscv/include/asm/csr.h
> > +++ b/arch/riscv/include/asm/csr.h
> > @@ -168,7 +168,8 @@
> > #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> > #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> > (_AC(1, UL) << IRQ_S_TIMER) | \
> > - (_AC(1, UL) << IRQ_S_EXT))
> > + (_AC(1, UL) << IRQ_S_EXT) | \
> > + (_AC(1, UL) << IRQ_PMU_OVF))
> >
> > /* AIA CSR bits */
> > #define TOPI_IID_SHIFT 16
> > diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> > index 60d3b21dead7..741c16f4518e 100644
> > --- a/arch/riscv/include/uapi/asm/kvm.h
> > +++ b/arch/riscv/include/uapi/asm/kvm.h
> > @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> > KVM_RISCV_ISA_EXT_ZIHPM,
> > KVM_RISCV_ISA_EXT_SMSTATEEN,
> > KVM_RISCV_ISA_EXT_ZICOND,
> > + KVM_RISCV_ISA_EXT_SSCOFPMF,
> > KVM_RISCV_ISA_EXT_MAX,
> > };
> >
> > diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> > index 225a435d9c9a..5a3a4cee0e3d 100644
> > --- a/arch/riscv/kvm/main.c
> > +++ b/arch/riscv/kvm/main.c
> > @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> > csr_write(CSR_HCOUNTEREN, 0x02);
> >
> > csr_write(CSR_HVIP, 0);
> > + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
>
> Is my understanding correct that this will break KVM for non-AIA CPUs?
>
> As far as I remember, HVIEN depends on AIA.
>
Yes. It was supposed to be inside kvm_riscv_aia_enable. My bad.
I will fix it and send a v2.
We should also advertise sscofpmf to the guest only if ssaia is available
in the host. I will work on that too.
> >
> > kvm_riscv_aia_enable();
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index e087c809073c..2d9f252356c3 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -380,7 +380,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > set_bit(irq, vcpu->arch.irqs_pending);
> > @@ -395,14 +396,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > {
> > /*
> > - * We only allow VS-mode software, timer, and external
> > + * We only allow VS-mode software, timer, counter overflow and external
> > * interrupts when irq is one of the local interrupts
> > * defined by RISC-V privilege specification.
> > */
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > clear_bit(irq, vcpu->arch.irqs_pending);
> > diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> > index f8c9fa0c03c5..19a0e4eaf0df 100644
> > --- a/arch/riscv/kvm/vcpu_onereg.c
> > +++ b/arch/riscv/kvm/vcpu_onereg.c
> > @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> > /* Multi letter extensions (alphabetically sorted) */
> > KVM_ISA_EXT_ARR(SMSTATEEN),
> > KVM_ISA_EXT_ARR(SSAIA),
> > + KVM_ISA_EXT_ARR(SSCOFPMF),
> > KVM_ISA_EXT_ARR(SSTC),
> > KVM_ISA_EXT_ARR(SVINVAL),
> > KVM_ISA_EXT_ARR(SVNAPOT),
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 622c4ee89e7b..86c8e92f92d3 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> > return 0;
> > }
> >
> > +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> > + struct perf_sample_data *data,
> > + struct pt_regs *regs)
> > +{
> > + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > + struct kvm_vcpu *vcpu = pmc->vcpu;
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> > + u64 period;
> > +
> > + /*
> > + * Stop the event counting by directly accessing the perf_event.
> > + * Otherwise, this needs to be deferred via a workqueue.
> > + * That would introduce skew in the counter value because the actual
> > + * physical counter would start after returning from this function.
> > + * It would be stopped again once the workqueue is scheduled.
> > + */
> > + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> > +
> > + /*
> > + * The hw counter would start automatically when this function returns.
> > + * Thus, the host may continue to receive interrupts and inject them into
> > + * the guest even without the guest configuring the next event. Depending on
> > + * the hardware, the host may see some sluggishness only if privilege mode
> > + * filtering is not available. In an ideal world, where qemu is not the only
> > + * capable hardware, this can be removed.
> > + * FYI: ARM64 does it this way, while x86 doesn't do anything as such.
> > + * TODO: Should we keep it for RISC-V?
> > + */
> > + period = -(local64_read(&perf_event->count));
> > +
> > + local64_set(&perf_event->hw.period_left, 0);
> > + perf_event->attr.sample_period = period;
> > + perf_event->hw.sample_period = period;
> > +
> > + set_bit(pmc->idx, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> > +
> > + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> > +}
> > +
> > static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> > unsigned long flags, unsigned long eidx, unsigned long evtdata)
> > {
> > @@ -247,7 +288,7 @@ static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr
> > */
> > attr->sample_period = kvm_pmu_get_sample_period(pmc);
> >
> > - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> > + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> > if (IS_ERR(event)) {
> > pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> > return PTR_ERR(event);
> > @@ -466,6 +507,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > }
> > }
> >
> > + /* The guest has serviced the interrupt and is starting the counter again */
> > + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> > + }
> > +
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -537,7 +584,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > }
> >
> > if (bSnapshot && !sbiret) {
> > - //TODO: Add counter overflow support when sscofpmf support is added
> > + /* The counter and overflow indices in the snapshot region are w.r.t.
> > + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> > + * which indicates the absolute counter index.
> > + */
> > + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> > + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> > kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > @@ -546,15 +598,19 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > if (flags & SBI_PMU_STOP_FLAG_RESET) {
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > clear_bit(pmc_index, kvpmu->pmc_in_use);
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > if (bSnapshot) {
> > /* Clear the snapshot area for the upcoming deletion event */
> > kvpmu->sdata->ctr_values[i] = 0;
> > + /* Only clear the given counter as the caller is responsible for
> > + * validating both the overflow mask and configured counters.
> > + */
> > + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > }
> > }
> > }
> > -
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -729,15 +785,16 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > if (!kvpmu)
> > return;
> >
> > - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> > + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> > pmc = &kvpmu->pmc[i];
> > pmc->counter_val = 0;
> > kvm_pmu_release_perf_event(pmc);
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > }
> > - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> > memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> > - kvpmu->snapshot_addr = INVALID_GPA;
> > + kvm_pmu_clear_snapshot_area(vcpu);
> > }
> >
> > void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > --
> > 2.34.1
> >
>
> Thank you,
> Vladimir Isaev
Hey Atish,
On Mon, Dec 04, 2023 at 06:43:01PM -0800, Atish Patra wrote:
> This series implements SBI PMU improvements done in SBI v2.0[1] i.e. PMU snapshot
> and fw_read_hi() functions.
I don't see any commentary in this cover letter as to why the series is
an RFC. v2.0 is a frozen spec per the Releases tab on GitHub, so that
has ruled out the usual reason for spec related things being RFCs.
What is it about the series that you are not yet willing to stand over?
Cheers,
Conor.
> SBI v2.0 introduced PMU snapshot feature which allows the SBI implementation
> to provide counter information (i.e. values/overflow status) via a shared
> memory between the SBI implementation and supervisor OS. This minimizes
> the number of traps when perf is being used inside a kvm guest as it relies on
> SBI PMU + trap/emulation of the counters.
>
> The current set of ratified RISC-V specification also doesn't allow scountovf
> to be trap/emulated by the hypervisor. The SBI PMU snapshot bridges the gap
> in ISA as well and enables perf sampling in the guest. However, LCOFI in the
> guest only works via IRQ filtering in AIA specification. That's why, AIA
> has to be enabled in the hardware (at least the Ssaia extension) in order to
> use the sampling support in the perf.
>
> Here are the patch wise implementation details.
>
> PATCH 1-2 : Generic cleanups/improvements.
> PATCH 3,4,9 : FW_READ_HI function implementation
> PATCH 5-6: Add PMU snapshot feature in sbi pmu driver
> PATCH 7-8: KVM implementation for snapshot and sampling in kvm guests
>
> The series is based on v6.70-rc3 and is available at:
>
> https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v1
>
> The kvmtool patch is also available at:
> https://github.com/atishp04/kvmtool/tree/sscofpmf
>
> It also requires Ssaia ISA extension to be present in the hardware in order to
> get perf sampling support in the guest. In Qemu virt machine, it can be done
> by the following config.
>
> ```
> -cpu rv64,sscofpmf=true,x-ssaia=true
> ```
>
> There are no other dependencies on AIA apart from that. Thus, Ssaia must be disabled
> for the guest if AIA patches are not available. Here is the example command.
>
> ```
> ./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
> ```
>
> The series has been tested only in Qemu.
> Here is the snippet of the perf running inside a kvm guest.
>
> ===================================================
> # perf record -e cycles -e instructions perf bench sched messaging -g 5
> ...
> # Running 'sched/messaging' benchmark:
> ...
> [ 45.928723] perf_duration_warn: 2 callbacks suppressed
> [ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
> # 20 sender and receiver processes per group
> # 5 groups == 200 processes run
>
> Total time: 14.220 [sec]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
> # perf report --stdio
> # To display the perf.data header info, please use --header/--header-only optio>
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 943 of event 'cycles'
> # Event count (approx.): 5128976844
> #
> # Overhead Command Shared Object Symbol >
> # ........ ............... ........................... .....................>
> #
> 7.59% sched-messaging [kernel.kallsyms] [k] memcpy
> 5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
> 5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
> 4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
> 3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
> 3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
> 3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
> 3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
> 3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
> 3.16% sched-messaging [kernel.kallsyms] [k] clear_page
> 3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
> 2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
>
> ===================================================
>
> [1] https://github.com/riscv-non-isa/riscv-sbi-doc
>
> Atish Patra (9):
> RISC-V: Fix the typo in Scountovf CSR name
> drivers/perf: riscv: Add a flag to indicate SBI v2.0 support
> RISC-V: Add FIRMWARE_READ_HI definition
> drivers/perf: riscv: Read upper bits of a firmware counter
> RISC-V: Add SBI PMU snapshot definitions
> drivers/perf: riscv: Implement SBI PMU snapshot function
> RISC-V: KVM: Implement SBI PMU Snapshot feature
> RISC-V: KVM: Add perf sampling support for guests
> RISC-V: KVM: Support 64 bit firmware counters on RV32
>
> arch/riscv/include/asm/csr.h | 5 +-
> arch/riscv/include/asm/errata_list.h | 2 +-
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +-
> arch/riscv/include/asm/sbi.h | 11 ++
> arch/riscv/include/uapi/asm/kvm.h | 1 +
> arch/riscv/kvm/main.c | 1 +
> arch/riscv/kvm/vcpu.c | 8 +-
> arch/riscv/kvm/vcpu_onereg.c | 1 +
> arch/riscv/kvm/vcpu_pmu.c | 232 ++++++++++++++++++++++++--
> arch/riscv/kvm/vcpu_sbi_pmu.c | 10 ++
> drivers/perf/riscv_pmu.c | 1 +
> drivers/perf/riscv_pmu_sbi.c | 219 ++++++++++++++++++++++--
> include/linux/perf/riscv_pmu.h | 6 +
> 13 files changed, 478 insertions(+), 35 deletions(-)
>
> --
> 2.34.1
>
On Mon, Dec 04, 2023 at 06:43:02PM -0800, Atish Patra wrote:
> The counter overflow CSR name is "scountovf" not "sscountovf".
>
> Fix the csr name.
>
> Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support")
>
^^ No blank line here.
Reviewed-by: Conor Dooley <[email protected]>
Cheers,
Conor.
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/csr.h | 2 +-
> arch/riscv/include/asm/errata_list.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 306a19a5509c..88cdc8a3e654 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -281,7 +281,7 @@
> #define CSR_HPMCOUNTER30H 0xc9e
> #define CSR_HPMCOUNTER31H 0xc9f
>
> -#define CSR_SSCOUNTOVF 0xda0
> +#define CSR_SCOUNTOVF 0xda0
>
> #define CSR_SSTATUS 0x100
> #define CSR_SIE 0x104
> diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
> index 83ed25e43553..7026fba12eeb 100644
> --- a/arch/riscv/include/asm/errata_list.h
> +++ b/arch/riscv/include/asm/errata_list.h
> @@ -152,7 +152,7 @@ asm volatile(ALTERNATIVE_2( \
>
> #define ALT_SBI_PMU_OVERFLOW(__ovl) \
> asm volatile(ALTERNATIVE( \
> - "csrr %0, " __stringify(CSR_SSCOUNTOVF), \
> + "csrr %0, " __stringify(CSR_SCOUNTOVF), \
> "csrr %0, " __stringify(THEAD_C9XX_CSR_SCOUNTEROF), \
> THEAD_VENDOR_ID, ERRATA_THEAD_PMU, \
> CONFIG_ERRATA_THEAD_PMU) \
> --
> 2.34.1
>
On Mon, Dec 04, 2023 at 06:43:03PM -0800, Atish Patra wrote:
> SBI v2.0 added few functions to improve SBI PMU extension. In order
> to be backward compatible, the driver must use these functions only
> if SBI v2.0 is available.
>
> Signed-off-by: Atish Patra <[email protected]>
IMO this does not make sense in a patch of its own and should probably
be squashed with the first user for it.
> ---
> drivers/perf/riscv_pmu_sbi.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 16acd4dcdb96..40a335350d08 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -35,6 +35,8 @@
> PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> +static bool sbi_v2_available;
> +
> static struct attribute *riscv_arch_formats_attr[] = {
> &format_attr_event.attr,
> &format_attr_firmware.attr,
> @@ -1108,6 +1110,9 @@ static int __init pmu_sbi_devinit(void)
> return 0;
> }
>
> + if (sbi_spec_version >= sbi_mk_version(2, 0))
> + sbi_v2_available = true;
> +
> ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
> "perf/riscv/pmu:starting",
> pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
> --
> 2.34.1
>
On Mon, Dec 04, 2023 at 06:43:04PM -0800, Atish Patra wrote:
> SBI v2.0 added another function to SBI PMU extension to read
> the upper bits of a counter with width larger than XLEN.
This definition here is quite a lot less specific than that in 11/1 of
the spec. I don't think that really matters much in reality since we
only support exactly one XLEN where that is the case.
Acked-by: Conor Dooley <[email protected]>
Cheers,
Conor.
> Add the definition for that function.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/sbi.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index 0892f4421bc4..f3eeca79a02d 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -121,6 +121,7 @@ enum sbi_ext_pmu_fid {
> SBI_EXT_PMU_COUNTER_START,
> SBI_EXT_PMU_COUNTER_STOP,
> SBI_EXT_PMU_COUNTER_FW_READ,
> + SBI_EXT_PMU_COUNTER_FW_READ_HI,
> };
>
> union sbi_pmu_ctr_info {
> --
> 2.34.1
>
On Mon, Dec 04, 2023 at 06:43:05PM -0800, Atish Patra wrote:
> SBI v2.0 introduced an explicit function to read the upper bits
> for any firmware counter width that is longer than XLEN. Currently,
> this is only applicable for RV32 where firmware counter can be
> 64 bit.
The v2.0 spec explicitly says that this function returns the upper
32 bits of the counter for rv32 and will always return 0 for rv64
or higher. The commit message here seems overly generic compared to
the actual definition in the spec, and makes it seem like it could
be used with a 128 bit counter on rv64 to get the upper 64 bits.
I tried to think about what "generic" situation the commit message
had been written for, but the things I came up with would all require
changes to the spec to define behaviour for FID #5 and/or FID #1, so
in the end I couldn't figure out the rationale behind the non-committal
wording used here.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> drivers/perf/riscv_pmu_sbi.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 40a335350d08..1c9049e6b574 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -490,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
> struct sbiret ret;
> - union sbi_pmu_ctr_info info;
> u64 val = 0;
> + union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
>
> if (pmu_sbi_is_fw_event(event)) {
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> hwc->idx, 0, 0, 0, 0, 0);
> if (!ret.error)
> val = ret.value;
> +#if defined(CONFIG_32BIT)
Why is this not IS_ENABLED()? The code below uses one. You could then
fold it into the if statement below.
> + if (sbi_v2_available && info.width >= 32) {
>= 32? I know it is from the spec, but why does the spec define it as
"One less than number of bits in CSR"? Saving bits in the structure I
guess?
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
> + hwc->idx, 0, 0, 0, 0, 0);
> + if (!ret.error)
> + val = val | ((u64)ret.value << 32);
If the first ecall fails but the second one doesn't, won't we corrupt
val by only setting the upper bits? If returning val == 0 is the thing
to do in the error case (which it is in the existing code), should the
first `if (!ret.error)` become `if (ret.error)` -> `return 0`?
> + val = val | ((u64)ret.value << 32);
Also, |= ?
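To make the two points above concrete, here is a rough userspace sketch of
the shape I have in mind. Everything prefixed `sketch_` is a hypothetical
stand-in (the mock firmware, the struct loosely mirroring `struct sbiret`),
not the driver's code, and it models only the rv32 case:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sbiret, with a 32-bit value to model rv32. */
struct sketch_sbiret {
	long error;
	uint32_t value;
};

/* Hypothetical mock firmware state: one 64-bit counter plus failure switches. */
static uint64_t sketch_counter = 0x0000000500000007ULL;
static bool sketch_fail_lo;
static bool sketch_fail_hi;

/* Models SBI_EXT_PMU_COUNTER_FW_READ: returns the low 32 bits. */
static struct sketch_sbiret sketch_fw_read(void)
{
	struct sketch_sbiret ret = { 0, (uint32_t)sketch_counter };

	if (sketch_fail_lo) {
		ret.error = -1;
		ret.value = 0;
	}
	return ret;
}

/* Models SBI_EXT_PMU_COUNTER_FW_READ_HI: returns the upper 32 bits. */
static struct sketch_sbiret sketch_fw_read_hi(void)
{
	struct sketch_sbiret ret = { 0, (uint32_t)(sketch_counter >> 32) };

	if (sketch_fail_hi) {
		ret.error = -1;
		ret.value = 0;
	}
	return ret;
}

static uint64_t sketch_fw_ctr_read(bool sbi_v2_available, unsigned int width)
{
	struct sketch_sbiret ret = sketch_fw_read();
	uint64_t val;

	/* Bail out early so a later successful high read cannot leave only upper bits set. */
	if (ret.error)
		return 0;
	val = ret.value;

	/* width is one less than the number of bits, so >= 32 means wider than 32 bits. */
	if (sbi_v2_available && width >= 32) {
		ret = sketch_fw_read_hi();
		if (!ret.error)
			val |= (uint64_t)ret.value << 32;
	}

	return val;
}
```

With the early return, a failed low read yields 0 regardless of what the
high read does, which matches what the existing code returns on error.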
Cheers,
Conor.
> + }
> +#endif
> } else {
> - info = pmu_ctr_list[idx];
> val = riscv_pmu_ctr_read_csr(info.csr);
> if (IS_ENABLED(CONFIG_32BIT))
> val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> --
> 2.34.1
>
On Mon, Dec 04, 2023 at 06:43:06PM -0800, Atish Patra wrote:
> SBI PMU Snapshot function optimizes the number of traps to
> higher privilege mode by leveraging a shared memory between the S/VS-mode
> and the M/HS mode. Add the definitions for that extension
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/sbi.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index f3eeca79a02d..29821addb9b7 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -122,6 +122,7 @@ enum sbi_ext_pmu_fid {
> SBI_EXT_PMU_COUNTER_STOP,
> SBI_EXT_PMU_COUNTER_FW_READ,
> SBI_EXT_PMU_COUNTER_FW_READ_HI,
> + SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
> };
>
> union sbi_pmu_ctr_info {
> @@ -138,6 +139,13 @@ union sbi_pmu_ctr_info {
> };
> };
>
> +/* Data structure to contain the pmu snapshot data */
> +struct riscv_pmu_snapshot_data {
> + uint64_t ctr_overflow_mask;
> + uint64_t ctr_values[64];
> + uint64_t reserved[447];
> +};
> +
> #define RISCV_PMU_RAW_EVENT_MASK GENMASK_ULL(47, 0)
> #define RISCV_PMU_RAW_EVENT_IDX 0x20000
>
> @@ -234,9 +242,11 @@ enum sbi_pmu_ctr_type {
>
> /* Flags defined for counter start function */
> #define SBI_PMU_START_FLAG_SET_INIT_VALUE (1 << 0)
> +#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT (1 << 1)
>
> /* Flags defined for counter stop function */
> #define SBI_PMU_STOP_FLAG_RESET (1 << 0)
> +#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT (1 << 1)
If we can use GENMASK in this file, why can we not use BIT()?
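For illustration, a userspace sketch of what BIT()-style flags would look
like, plus a compile-time check of the snapshot layout. `SKETCH_BIT` and the
other `sketch_`/`SKETCH_` names are hypothetical stand-ins for the kernel's
definitions. The 4096-byte figure is simply what the field sizes in this
patch add up to (1 + 64 + 447 u64s); I am assuming that is meant to fill one
4 KiB page:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define SKETCH_BIT(nr)	(1UL << (nr))

/* Flags for the counter stop function, written with the BIT()-style macro. */
#define SKETCH_PMU_STOP_FLAG_RESET		SKETCH_BIT(0)
#define SKETCH_PMU_STOP_FLAG_TAKE_SNAPSHOT	SKETCH_BIT(1)

/* Same field layout as struct riscv_pmu_snapshot_data in the patch. */
struct sketch_pmu_snapshot_data {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/* 1 + 64 + 447 = 512 u64s, i.e. 4096 bytes. */
_Static_assert(sizeof(struct sketch_pmu_snapshot_data) == 4096,
	       "snapshot data is expected to be 4096 bytes");
_Static_assert(offsetof(struct sketch_pmu_snapshot_data, ctr_values) == 8,
	       "counter values are expected to follow the overflow mask");
```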
>
> enum sbi_ext_dbcn_fid {
> SBI_EXT_DBCN_CONSOLE_WRITE = 0,
> --
> 2.34.1
>
Hey Atish,
On Mon, Dec 04, 2023 at 06:43:07PM -0800, Atish Patra wrote:
> SBI v2.0 introduced the PMU snapshot feature, which adds the following
> features.
>
> 1. Read counter values directly from the shared memory instead of
> csr read.
> 2. Start multiple counters with initial values with one SBI call.
>
> These functionalities reduce the number of traps to the higher
> privilege mode. If the kernel is in VS mode while the hypervisor
> deploys the trap & emulate method, this would minimize the hpmcounter
> CSR read traps. If the kernel is running in S-mode, the benefits are
> reduced to CSR latency vs DRAM/cache latency as there is no trap
> involved while accessing the hpmcounter CSRs.
>
> In both modes, it does save the number of ecalls while starting
> multiple counters together with initial values. This is a likely
> scenario if multiple counters overflow at the same time.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> drivers/perf/riscv_pmu.c | 1 +
> drivers/perf/riscv_pmu_sbi.c | 203 ++++++++++++++++++++++++++++++---
> include/linux/perf/riscv_pmu.h | 6 +
> 3 files changed, 197 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index 0dda70e1ef90..5b57acb770d3 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -412,6 +412,7 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> cpuc->n_events = 0;
> for (i = 0; i < RISCV_MAX_COUNTERS; i++)
> cpuc->events[i] = NULL;
> + cpuc->snapshot_addr = NULL;
> }
> pmu->pmu = (struct pmu) {
> .event_init = riscv_pmu_event_init,
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 1c9049e6b574..1b8b6de63b69 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -36,6 +36,9 @@ PMU_FORMAT_ATTR(event, "config:0-47");
> PMU_FORMAT_ATTR(firmware, "config:63");
>
> static bool sbi_v2_available;
> +static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available);
> +#define sbi_pmu_snapshot_available() \
> + static_branch_unlikely(&sbi_pmu_snapshot_available)
>
> static struct attribute *riscv_arch_formats_attr[] = {
> &format_attr_event.attr,
> @@ -485,14 +488,101 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig)
> return ret;
> }
>
> +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
> +{
> + int cpu;
> + struct cpu_hw_events *cpu_hw_evt;
This is only used inside the scope of the for loop.
> +
> + for_each_possible_cpu(cpu) {
> + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> + if (!cpu_hw_evt->snapshot_addr)
> + continue;
Could you add a blank line here please?
> + free_page((unsigned long)cpu_hw_evt->snapshot_addr);
> + cpu_hw_evt->snapshot_addr = NULL;
> + cpu_hw_evt->snapshot_addr_phys = 0;
Why do these need to be explicitly zeroed?
> + }
> +}
> +
> +static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu)
> +{
> + int cpu;
> + struct page *snapshot_page;
> + struct cpu_hw_events *cpu_hw_evt;
Same here re scope
> +
> + for_each_possible_cpu(cpu) {
> + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> + if (cpu_hw_evt->snapshot_addr)
> + continue;
Same here re blank line
> + snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> + if (!snapshot_page) {
> + pmu_sbi_snapshot_free(pmu);
> + return -ENOMEM;
> + }
> + cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page);
> + cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page);
> + }
> +
> + return 0;
> +}
> +
> +static void pmu_sbi_snapshot_disable(void)
> +{
> + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, -1,
> + -1, 0, 0, 0, 0);
> +}
> +
> +static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu)
> +{
> + struct cpu_hw_events *cpu_hw_evt;
> + struct sbiret ret = {0};
> + int rc;
> +
> + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> + if (!cpu_hw_evt->snapshot_addr_phys)
> + return -EINVAL;
> +
> + if (cpu_hw_evt->snapshot_set_done)
> + return 0;
> +
> +#if defined(CONFIG_32BIT)
Why does this need to be an `#if defined()`? Does the code not compile
if you use IS_ENABLED()?
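Roughly what I mean, as a userspace sketch rather than driver code: the
`is_32bit` parameter stands in for IS_ENABLED(CONFIG_32BIT), and the struct
and function names are hypothetical. Both branches compile on either XLEN
because the address is widened to 64 bits before the shift:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of arguments for the set-shmem call: low and high address words. */
struct sketch_shmem_args {
	uint64_t lo;
	uint64_t hi;
};

/* is_32bit stands in for IS_ENABLED(CONFIG_32BIT) in this sketch. */
static struct sketch_shmem_args sketch_split_shmem_addr(uint64_t phys, bool is_32bit)
{
	/* On 64-bit the whole address fits in the low argument and the high one is 0. */
	struct sketch_shmem_args args = { phys, 0 };

	if (is_32bit) {
		/* On 32-bit the address is split into two XLEN-sized halves. */
		args.lo = (uint32_t)phys;
		args.hi = phys >> 32;
	}

	return args;
}
```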
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
> + (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0);
> +#else
> + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
> + 0, 0, 0, 0, 0);
> +#endif
> + /* Free up the snapshot area memory and fall back to default SBI */
What does "fall back to default SBI" mean? SBI is an interface, so I
don't understand what it means in this context. Secondly,
> + if (ret.error) {
> + if (ret.error != SBI_ERR_NOT_SUPPORTED)
> + pr_warn("%s: pmu snapshot setup failed with error %ld\n", __func__,
> + ret.error);
Why is the function relevant here? Is the error message in-and-of-itself
not sufficient here? Where else would one be setting up the snapshots
other than the setup function?
> + rc = sbi_err_map_linux_errno(ret.error);
> + if (rc)
> + return rc;
Is it even possible for !rc at this point? You've already checked that
ret.error is non zero, so this just becomes
`return sbi_err_map_linux_errno(ret.error);`?
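A userspace sketch of why the inner check looks redundant to me. The mapping
below is a simplified mock, not the kernel's sbi_err_map_linux_errno(): the
`SKETCH_` error values follow my reading of the SBI spec, and EOPNOTSUPP is a
stand-in for the kernel-internal errno used for the default case. The point
is only that a mapping of this shape never returns 0 for a non-zero error:

```c
#include <errno.h>

/* SBI error codes as I understand them from the spec (assumed values). */
enum {
	SKETCH_SBI_SUCCESS = 0,
	SKETCH_SBI_ERR_FAILURE = -1,
	SKETCH_SBI_ERR_NOT_SUPPORTED = -2,
	SKETCH_SBI_ERR_INVALID_PARAM = -3,
	SKETCH_SBI_ERR_DENIED = -4,
	SKETCH_SBI_ERR_INVALID_ADDRESS = -5,
};

/* Simplified mock of an SBI error to Linux errno mapping. */
static int sketch_err_map(long err)
{
	switch (err) {
	case SKETCH_SBI_SUCCESS:
		return 0;
	case SKETCH_SBI_ERR_DENIED:
		return -EPERM;
	case SKETCH_SBI_ERR_INVALID_PARAM:
		return -EINVAL;
	case SKETCH_SBI_ERR_INVALID_ADDRESS:
		return -EFAULT;
	default:
		/* Stand-in errno for "not supported" and anything unrecognised. */
		return -EOPNOTSUPP;
	}
}

/* The error path collapses to a single return once ret.error is known non-zero. */
static int sketch_snapshot_setup(long sbi_error)
{
	if (sbi_error)
		return sketch_err_map(sbi_error);

	return 0;
}
```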
> + }
> +
> + cpu_hw_evt->snapshot_set_done = true;
> +
> + return 0;
> +}
> +
> static u64 pmu_sbi_ctr_read(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
> struct sbiret ret;
> u64 val = 0;
> + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
>
> + /* Read the value from the shared memory directly */
Statement of the obvious, no?
> + if (sbi_pmu_snapshot_available()) {
> + val = sdata->ctr_values[idx];
> + goto done;
s/goto done/return val/
There's no cleanup to be done here, what purpose does the goto serve?
> + }
> +
> if (pmu_sbi_is_fw_event(event)) {
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> hwc->idx, 0, 0, 0, 0, 0);
> @@ -512,6 +602,7 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> }
>
> +done:
> return val;
> }
>
> @@ -539,6 +630,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> struct hw_perf_event *hwc = &event->hw;
> unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
>
> + /* There is no benefit setting SNAPSHOT FLAG for a single counter */
> #if defined(CONFIG_32BIT)
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx,
> 1, flag, ival, ival >> 32, 0);
> @@ -559,16 +651,29 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> {
> struct sbiret ret;
> struct hw_perf_event *hwc = &event->hw;
> + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
>
> if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> pmu_sbi_reset_scounteren((void *)event);
>
> + if (sbi_pmu_snapshot_available())
> + flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
> +
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> - if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> - flag != SBI_PMU_STOP_FLAG_RESET)
> + if (!ret.error && sbi_pmu_snapshot_available()) {
> + /* Snapshot is taken relative to the counter idx base. Apply a fixup. */
> + if (hwc->idx > 0) {
> + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
> + sdata->ctr_values[0] = 0;
Why is this being zeroed in this manner? Why is zeroing it not required
if hwc->idx == 0? You've got a comment there that could probably do with
elaboration.
> + }
> + } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> + flag != SBI_PMU_STOP_FLAG_RESET) {
> pr_err("Stopping counter idx %d failed with error %d\n",
> hwc->idx, sbi_err_map_linux_errno(ret.error));
> + }
> }
>
> static int pmu_sbi_find_num_ctrs(void)
> @@ -626,10 +731,14 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu)
> static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> {
> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + unsigned long flag = 0;
> +
> + if (sbi_pmu_snapshot_available())
> + flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
>
> /* No need to check the error here as we can't do anything about the error */
> sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0,
> - cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0);
> + cpu_hw_evt->used_hw_ctrs[0], flag, 0, 0, 0);
> }
>
> /*
> @@ -638,11 +747,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> * while the overflowed counters need to be started with updated initialization
> * value.
> */
> -static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> - unsigned long ctr_ovf_mask)
> +static noinline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt,
> + unsigned long ctr_ovf_mask)
> {
> int idx = 0;
> - struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> struct perf_event *event;
> unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
> unsigned long ctr_start_mask = 0;
> @@ -677,6 +785,49 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> }
> }
>
> +static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
> + unsigned long ctr_ovf_mask)
> +{
> + int idx = 0;
> + struct perf_event *event;
> + unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
> + uint64_t max_period;
> + struct hw_perf_event *hwc;
> + u64 init_val = 0;
> + unsigned long ctr_start_mask = 0;
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> +
> + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
> + if (ctr_ovf_mask & (1 << idx)) {
> + event = cpu_hw_evt->events[idx];
> + hwc = &event->hw;
> + max_period = riscv_pmu_ctr_get_width_mask(event);
> + init_val = local64_read(&hwc->prev_count) & max_period;
> + sdata->ctr_values[idx] = init_val;
> + }
> + /* We donot need to update the non-overflow counters the previous
/*
* We don't need to update the non-overflow counters as the previous
> + * value should have been there already.
> + */
> + }
> +
> + ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0];
> +
> + /* Start all the counters in a single shot */
> + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask,
> + flag, 0, 0, 0);
> +}
> +
> +static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> + unsigned long ctr_ovf_mask)
> +{
> + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> +
> + if (sbi_pmu_snapshot_available())
> + pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask);
> + else
> + pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask);
> +}
> +
> static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> {
> struct perf_sample_data data;
> @@ -690,6 +841,7 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> unsigned long overflowed_ctrs = 0;
> struct cpu_hw_events *cpu_hw_evt = dev;
> u64 start_clock = sched_clock();
> + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
>
> if (WARN_ON_ONCE(!cpu_hw_evt))
> return IRQ_NONE;
> @@ -711,8 +863,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> pmu_sbi_stop_hw_ctrs(pmu);
>
> /* Overflow status register should only be read after counter are stopped */
> - ALT_SBI_PMU_OVERFLOW(overflow);
> -
> + if (sbi_pmu_snapshot_available())
> + overflow = sdata->ctr_overflow_mask;
> + else
> + ALT_SBI_PMU_OVERFLOW(overflow);
> /*
> * Overflow interrupt pending bit should only be cleared after stopping
> * all the counters to avoid any race condition.
> @@ -774,6 +928,7 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> {
> struct riscv_pmu *pmu = hlist_entry_safe(node, struct riscv_pmu, node);
> struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> + int ret = 0;
>
> /*
> * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> @@ -794,7 +949,10 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE);
> }
>
> - return 0;
> + if (sbi_pmu_snapshot_available())
> + ret = pmu_sbi_snapshot_setup(pmu, cpu);
> +
> + return ret;
I'd just write this as
if (sbi_pmu_snapshot_available())
return pmu_sbi_snapshot_setup(pmu, cpu);
return 0;
and drop the newly added variable I think.
> }
>
> static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
> @@ -807,6 +965,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
> /* Disable all counters access for user mode now */
> csr_write(CSR_SCOUNTEREN, 0x0);
>
> + if (sbi_pmu_snapshot_available())
> + pmu_sbi_snapshot_disable();
> +
> return 0;
> }
>
> @@ -1076,10 +1237,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> pmu->event_unmapped = pmu_sbi_event_unmapped;
> pmu->csr_index = pmu_sbi_csr_index;
>
> - ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> - if (ret)
> - return ret;
> -
> ret = riscv_pm_pmu_register(pmu);
> if (ret)
> goto out_unregister;
> @@ -1088,8 +1245,28 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> if (ret)
> goto out_unregister;
>
> + /* SBI PMU Snasphot is only available in SBI v2.0 */
s/Snasphot/Snapshot/
> + if (sbi_v2_available) {
> + ret = pmu_sbi_snapshot_alloc(pmu);
> + if (ret)
> + goto out_unregister;
A blank line here aids readability by breaking up the reuse of ret.
> + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
> + if (!ret) {
> + pr_info("SBI PMU snapshot is available to optimize the PMU traps\n");
Why the verbose message? Could we standardise on one wording for the SBI
function probing stuff? Most users seem to be "SBI FOO extension detected".
Only IPI has additional wording and PMU differs slightly.
> + /* We enable it once here for the boot cpu. If snapshot shmem fails during
Again, comment style here. What does "snapshot shmem" mean? I think
there's a missing action here. Registration? Allocation?
> + * cpu hotplug on, it should bail out.
Should or will? What action does "bail out" correspond to?
Thanks,
Conor.
> + */
> + static_branch_enable(&sbi_pmu_snapshot_available);
> + }
> + /* Snapshot is an optional feature. Continue if not available */
> + }
> +
> register_sysctl("kernel", sbi_pmu_sysctl_table);
>
> + ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> + if (ret)
> + return ret;
> +
> return 0;
>
> out_unregister:
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 43282e22ebe1..c3fa90970042 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -39,6 +39,12 @@ struct cpu_hw_events {
> DECLARE_BITMAP(used_hw_ctrs, RISCV_MAX_COUNTERS);
> /* currently enabled firmware counters */
> DECLARE_BITMAP(used_fw_ctrs, RISCV_MAX_COUNTERS);
> + /* The virtual address of the shared memory where counter snapshot will be taken */
> + void *snapshot_addr;
> + /* The physical address of the shared memory where counter snapshot will be taken */
> + phys_addr_t snapshot_addr_phys;
> + /* Boolean flag to indicate setup is already done */
> + bool snapshot_set_done;
> };
>
> struct riscv_pmu {
> --
> 2.34.1
>
On Thu, Dec 7, 2023 at 4:03 AM Conor Dooley <[email protected]> wrote:
>
> Hey Atish,
>
> On Mon, Dec 04, 2023 at 06:43:01PM -0800, Atish Patra wrote:
> > This series implements SBI PMU improvements done in SBI v2.0[1] i.e. PMU snapshot
> > and fw_read_hi() functions.
>
> I don't see any commentary in this cover letter as to why the series is
> an RFC. v2.0 is a frozen spec per the Releases tab on GitHub, so that
> has ruled out the usual reason for spec related things being RFCs.
>
> What is it about the series that you are not yet willing to stand over?
>
Nothing. It's just my script where I tag any first version of a
feature series as RFC :).
I am planning to send the next one with a version tag this week as I
got some feedback.
Thanks for reviewing the patches :).
> Cheers,
> Conor.
>
> > SBI v2.0 introduced PMU snapshot feature which allows the SBI implementation
> > to provide counter information (i.e. values/overflow status) via a shared
> > memory between the SBI implementation and supervisor OS. This minimizes
> > the number of traps when perf is being used inside a kvm guest as it relies on
> > SBI PMU + trap/emulation of the counters.
> >
> > The current set of ratified RISC-V specification also doesn't allow scountovf
> > to be trap/emulated by the hypervisor. The SBI PMU snapshot bridges the gap
> > in ISA as well and enables perf sampling in the guest. However, LCOFI in the
> > guest only works via IRQ filtering in AIA specification. That's why, AIA
> > has to be enabled in the hardware (at least the Ssaia extension) in order to
> > use the sampling support in the perf.
> >
> > Here are the patch wise implementation details.
> >
> > PATCH 1-2 : Generic cleanups/improvements.
> > PATCH 3,4,9 : FW_READ_HI function implementation
> > PATCH 5-6: Add PMU snapshot feature in sbi pmu driver
> > PATCH 7-8: KVM implementation for snapshot and sampling in kvm guests
> >
> > The series is based on v6.70-rc3 and is available at:
> >
> > https://github.com/atishp04/linux/tree/kvm_pmu_snapshot_v1
> >
> > The kvmtool patch is also available at:
> > https://github.com/atishp04/kvmtool/tree/sscofpmf
> >
> > It also requires Ssaia ISA extension to be present in the hardware in order to
> > get perf sampling support in the guest. In Qemu virt machine, it can be done
> > by the following config.
> >
> > ```
> > -cpu rv64,sscofpmf=true,x-ssaia=true
> > ```
> >
> > There are no other dependencies on AIA apart from that. Thus, Ssaia must be disabled
> > for the guest if AIA patches are not available. Here is the example command.
> >
> > ```
> > ./lkvm-static run -m 256 -c2 --console serial -p "console=ttyS0 earlycon" --disable-ssaia -k ./Image --debug
> > ```
> >
> > The series has been tested only in Qemu.
> > Here is the snippet of the perf running inside a kvm guest.
> >
> > ===================================================
> > # perf record -e cycles -e instructions perf bench sched messaging -g 5
> > ...
> > # Running 'sched/messaging' benchmark:
> > ...
> > [ 45.928723] perf_duration_warn: 2 callbacks suppressed
> > [ 45.929000] perf: interrupt took too long (484426 > 483186), lowering kernel.perf_event_max_sample_rate to 250
> > # 20 sender and receiver processes per group
> > # 5 groups == 200 processes run
> >
> > Total time: 14.220 [sec]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.117 MB perf.data (1942 samples) ]
> > # perf report --stdio
> > # To display the perf.data header info, please use --header/--header-only optio>
> > #
> > #
> > # Total Lost Samples: 0
> > #
> > # Samples: 943 of event 'cycles'
> > # Event count (approx.): 5128976844
> > #
> > # Overhead Command Shared Object Symbol >
> > # ........ ............... ........................... .....................>
> > #
> > 7.59% sched-messaging [kernel.kallsyms] [k] memcpy
> > 5.48% sched-messaging [kernel.kallsyms] [k] percpu_counter_ad>
> > 5.24% sched-messaging [kernel.kallsyms] [k] __sbi_rfence_v02_>
> > 4.00% sched-messaging [kernel.kallsyms] [k] _raw_spin_unlock_>
> > 3.79% sched-messaging [kernel.kallsyms] [k] set_pte_range
> > 3.72% sched-messaging [kernel.kallsyms] [k] next_uptodate_fol>
> > 3.46% sched-messaging [kernel.kallsyms] [k] filemap_map_pages
> > 3.31% sched-messaging [kernel.kallsyms] [k] handle_mm_fault
> > 3.20% sched-messaging [kernel.kallsyms] [k] finish_task_switc>
> > 3.16% sched-messaging [kernel.kallsyms] [k] clear_page
> > 3.03% sched-messaging [kernel.kallsyms] [k] mtree_range_walk
> > 2.42% sched-messaging [kernel.kallsyms] [k] flush_icache_pte
> >
> > ===================================================
> >
> > [1] https://github.com/riscv-non-isa/riscv-sbi-doc
> >
> > Atish Patra (9):
> > RISC-V: Fix the typo in Scountovf CSR name
> > drivers/perf: riscv: Add a flag to indicate SBI v2.0 support
> > RISC-V: Add FIRMWARE_READ_HI definition
> > drivers/perf: riscv: Read upper bits of a firmware counter
> > RISC-V: Add SBI PMU snapshot definitions
> > drivers/perf: riscv: Implement SBI PMU snapshot function
> > RISC-V: KVM: Implement SBI PMU Snapshot feature
> > RISC-V: KVM: Add perf sampling support for guests
> > RISC-V: KVM: Support 64 bit firmware counters on RV32
> >
> > arch/riscv/include/asm/csr.h | 5 +-
> > arch/riscv/include/asm/errata_list.h | 2 +-
> > arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +-
> > arch/riscv/include/asm/sbi.h | 11 ++
> > arch/riscv/include/uapi/asm/kvm.h | 1 +
> > arch/riscv/kvm/main.c | 1 +
> > arch/riscv/kvm/vcpu.c | 8 +-
> > arch/riscv/kvm/vcpu_onereg.c | 1 +
> > arch/riscv/kvm/vcpu_pmu.c | 232 ++++++++++++++++++++++++--
> > arch/riscv/kvm/vcpu_sbi_pmu.c | 10 ++
> > drivers/perf/riscv_pmu.c | 1 +
> > drivers/perf/riscv_pmu_sbi.c | 219 ++++++++++++++++++++++--
> > include/linux/perf/riscv_pmu.h | 6 +
> > 13 files changed, 478 insertions(+), 35 deletions(-)
> >
> > --
> > 2.34.1
> >
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> The counter overflow CSR name is "scountovf" not "sscountovf".
>
> Fix the csr name.
>
> Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support")
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/csr.h | 2 +-
> arch/riscv/include/asm/errata_list.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 306a19a5509c..88cdc8a3e654 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -281,7 +281,7 @@
> #define CSR_HPMCOUNTER30H 0xc9e
> #define CSR_HPMCOUNTER31H 0xc9f
>
> -#define CSR_SSCOUNTOVF 0xda0
> +#define CSR_SCOUNTOVF 0xda0
>
> #define CSR_SSTATUS 0x100
> #define CSR_SIE 0x104
> diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
> index 83ed25e43553..7026fba12eeb 100644
> --- a/arch/riscv/include/asm/errata_list.h
> +++ b/arch/riscv/include/asm/errata_list.h
> @@ -152,7 +152,7 @@ asm volatile(ALTERNATIVE_2( \
>
> #define ALT_SBI_PMU_OVERFLOW(__ovl) \
> asm volatile(ALTERNATIVE( \
> - "csrr %0, " __stringify(CSR_SSCOUNTOVF), \
> + "csrr %0, " __stringify(CSR_SCOUNTOVF), \
> "csrr %0, " __stringify(THEAD_C9XX_CSR_SCOUNTEROF), \
> THEAD_VENDOR_ID, ERRATA_THEAD_PMU, \
> CONFIG_ERRATA_THEAD_PMU) \
> --
> 2.34.1
>
On Thu, Dec 7, 2023 at 5:39 PM Conor Dooley <[email protected]> wrote:
>
> On Mon, Dec 04, 2023 at 06:43:03PM -0800, Atish Patra wrote:
> > SBI v2.0 added few functions to improve SBI PMU extension. In order
> > to be backward compatible, the driver must use these functions only
> > if SBI v2.0 is available.
> >
> > Signed-off-by: Atish Patra <[email protected]>
>
> IMO this does not make sense in a patch of its own and should probably
> be squashed with the first user for it.
I agree. This patch should be squashed into patch4 where the
flag is first used.
Regards,
Anup
>
> > ---
> > drivers/perf/riscv_pmu_sbi.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 16acd4dcdb96..40a335350d08 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -35,6 +35,8 @@
> > PMU_FORMAT_ATTR(event, "config:0-47");
> > PMU_FORMAT_ATTR(firmware, "config:63");
> >
> > +static bool sbi_v2_available;
> > +
> > static struct attribute *riscv_arch_formats_attr[] = {
> > &format_attr_event.attr,
> > &format_attr_firmware.attr,
> > @@ -1108,6 +1110,9 @@ static int __init pmu_sbi_devinit(void)
> > return 0;
> > }
> >
> > + if (sbi_spec_version >= sbi_mk_version(2, 0))
> > + sbi_v2_available = true;
> > +
> > ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
> > "perf/riscv/pmu:starting",
> > pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
> > --
> > 2.34.1
> >
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> SBI v2.0 added another function to SBI PMU extension to read
> the upper bits of a counter with width larger than XLEN.
>
> Add the definition for that function.
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/sbi.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index 0892f4421bc4..f3eeca79a02d 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -121,6 +121,7 @@ enum sbi_ext_pmu_fid {
> SBI_EXT_PMU_COUNTER_START,
> SBI_EXT_PMU_COUNTER_STOP,
> SBI_EXT_PMU_COUNTER_FW_READ,
> + SBI_EXT_PMU_COUNTER_FW_READ_HI,
> };
>
> union sbi_pmu_ctr_info {
> --
> 2.34.1
>
On Thu, Dec 7, 2023 at 6:03 PM Conor Dooley <[email protected]> wrote:
>
> On Mon, Dec 04, 2023 at 06:43:05PM -0800, Atish Patra wrote:
> > SBI v2.0 introduced an explicit function to read the upper bits
> > for any firmware counter width that is longer than XLEN. Currently,
> > this is only applicable for RV32 where firmware counter can be
> > 64 bit.
>
> The v2.0 spec explicitly says that this function returns the upper
> 32 bits of the counter for rv32 and will always return 0 for rv64
> or higher. The commit message here seems overly generic compared to
> the actual definition in the spec, and makes it seem like it could
> be used with a 128 bit counter on rv64 to get the upper 64 bits.
>
> I tried to think about what "generic" situation the commit message
> had been written for, but the things I came up with would all require
> changes to the spec to define behaviour for FID #5 and/or FID #1, so
> in the end I couldn't figure out the rationale behind the non-committal
> wording used here.
>
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > drivers/perf/riscv_pmu_sbi.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 40a335350d08..1c9049e6b574 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -490,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > struct hw_perf_event *hwc = &event->hw;
> > int idx = hwc->idx;
> > struct sbiret ret;
> > - union sbi_pmu_ctr_info info;
> > u64 val = 0;
> > + union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
> >
> > if (pmu_sbi_is_fw_event(event)) {
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> > hwc->idx, 0, 0, 0, 0, 0);
> > if (!ret.error)
> > val = ret.value;
> > +#if defined(CONFIG_32BIT)
>
> Why is this not IS_ENABLED()? The code below uses one. You could then
> fold it into the if statement below.
>
> > + if (sbi_v2_available && info.width >= 32) {
>
> >= 32? I know it is from the spec, but why does the spec define it as
> "One less than number of bits in CSR"? Saving bits in the structure I
> guess?
Yes, it is to use fewer bits in counter_info.
The maximum width of a HW counter is 64 bits. Encoding the value 64
requires 7 bits in counter_info, whereas the value 63 fits in 6 bits.
Also, a HW counter that exists has at least 1 bit implemented, so a
width of 0 bits never needs to be encoded.
Regards,
Anup
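A minimal sketch of the encoding discussed above, assuming a 6-bit field that stores "number of bits minus one". The helper names are hypothetical and are not part of the kernel or the SBI spec; they only illustrate why a field value of 32 or more means the counter is wider than 32 bits.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Encode a counter width in bits (1..64) as the 6-bit "width - 1" value. */
static inline uint8_t ctr_width_encode(unsigned int bits)
{
	return (uint8_t)((bits - 1) & 0x3f);
}

/* Decode the 6-bit field back into a width in bits (1..64). */
static inline unsigned int ctr_width_decode(uint8_t field)
{
	return (unsigned int)(field & 0x3f) + 1;
}

/* On RV32 the upper half only matters if the counter is wider than 32 bits. */
static inline int ctr_needs_read_hi(uint8_t field)
{
	return ctr_width_decode(field) > 32;
}
```

With this encoding a 64-bit counter is stored as 63, and the "info.width >= 32" check in the patch corresponds to a counter of at least 33 bits.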
>
> > + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
> > + hwc->idx, 0, 0, 0, 0, 0);
>
> > + if (!ret.error)
> > + val = val | ((u64)ret.value << 32);
>
> If the first ecall fails but the second one doesn't won't we corrupt
> val by only setting the upper bits? If returning val == 0 is the thing
> to do in the error case (which it is in the existing code) should the
> first `if (!ret.error)` become `if (ret.error)` -> `return 0`?
>
>
> > + val = val | ((u64)ret.value << 32);
>
> Also, |= ?
>
> Cheers,
> Conor.
>
> > + }
> > +#endif
> > } else {
> > - info = pmu_ctr_list[idx];
> > val = riscv_pmu_ctr_read_csr(info.csr);
> > if (IS_ENABLED(CONFIG_32BIT))
> > val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> > --
> > 2.34.1
> >
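The error-handling concern raised above for the two ecalls can be sketched in plain C as follows. This is only an illustration under stated assumptions: "struct sbiret_sketch" is a simplified stand-in for the kernel's struct sbiret, and "fw_ctr_combine" is a hypothetical helper, not a function from the series. It returns 0 when the low read fails, and keeps only the low half when the high read fails, which is one possible policy.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's struct sbiret (assumption). */
struct sbiret_sketch {
	long error;
	long value;
};

/*
 * Combine the result of a low-half read and a high-half read into one
 * 64-bit counter value, as needed on a 32-bit system.
 */
static uint64_t fw_ctr_combine(struct sbiret_sketch lo, struct sbiret_sketch hi)
{
	uint64_t val;

	/* A failed low read must not leave only the upper bits set. */
	if (lo.error)
		return 0;

	val = (uint32_t)lo.value;

	/* Merge the upper 32 bits only if the second call succeeded. */
	if (!hi.error)
		val |= (uint64_t)(uint32_t)hi.value << 32;

	return val;
}
```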
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> SBI PMU Snapshot function optimizes the number of traps to
> higher privilege mode by leveraging a shared memory between the S/VS-mode
> and the M/HS mode. Add the definitions for that extension
>
> Signed-off-by: Atish Patra <[email protected]>
LGTM.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
> ---
> arch/riscv/include/asm/sbi.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index f3eeca79a02d..29821addb9b7 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -122,6 +122,7 @@ enum sbi_ext_pmu_fid {
> SBI_EXT_PMU_COUNTER_STOP,
> SBI_EXT_PMU_COUNTER_FW_READ,
> SBI_EXT_PMU_COUNTER_FW_READ_HI,
> + SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
> };
>
> union sbi_pmu_ctr_info {
> @@ -138,6 +139,13 @@ union sbi_pmu_ctr_info {
> };
> };
>
> +/* Data structure to contain the pmu snapshot data */
> +struct riscv_pmu_snapshot_data {
> + uint64_t ctr_overflow_mask;
> + uint64_t ctr_values[64];
> + uint64_t reserved[447];
> +};
> +
> #define RISCV_PMU_RAW_EVENT_MASK GENMASK_ULL(47, 0)
> #define RISCV_PMU_RAW_EVENT_IDX 0x20000
>
> @@ -234,9 +242,11 @@ enum sbi_pmu_ctr_type {
>
> /* Flags defined for counter start function */
> #define SBI_PMU_START_FLAG_SET_INIT_VALUE (1 << 0)
> +#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT (1 << 1)
>
> /* Flags defined for counter stop function */
> #define SBI_PMU_STOP_FLAG_RESET (1 << 0)
> +#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT (1 << 1)
>
> enum sbi_ext_dbcn_fid {
> SBI_EXT_DBCN_CONSOLE_WRITE = 0,
> --
> 2.34.1
>
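As a quick check of the snapshot layout in the patch above, the structure can be compiled in userspace with a compile-time size assertion. This is a sketch rather than part of the series; the only assumption is that uint64_t is 8 bytes with no padding, which gives 1 + 64 + 447 = 512 words, i.e. 4096 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the structure added in the patch. */
struct riscv_pmu_snapshot_data {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/* 512 64-bit words in total, so the structure is exactly 4096 bytes. */
_Static_assert(sizeof(struct riscv_pmu_snapshot_data) == 4096,
	       "snapshot data must be 4096 bytes");

/* The counter values start right after the overflow mask. */
_Static_assert(offsetof(struct riscv_pmu_snapshot_data, ctr_values) == 8,
	       "ctr_values must follow ctr_overflow_mask");
```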
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware
> counters for RV32 based systems.
>
> Add infrastructure to support that.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 6 ++++-
> arch/riscv/kvm/vcpu_pmu.c | 38 ++++++++++++++++++++++++++-
> arch/riscv/kvm/vcpu_sbi_pmu.c | 7 +++++
> 3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index 64c75acad6ba..dd655315e706 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -20,7 +20,7 @@ static_assert(RISCV_KVM_MAX_COUNTERS <= 64);
>
> struct kvm_fw_event {
> /* Current value of the event */
> - unsigned long value;
> + uint64_t value;
>
> /* Event monitoring status */
> bool started;
> @@ -91,6 +91,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> struct kvm_vcpu_sbi_return *retdata);
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata);
> +#if defined(CONFIG_32BIT)
> +int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + struct kvm_vcpu_sbi_return *retdata);
> +#endif
> void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> unsigned long saddr_high, unsigned long flags,
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 86c8e92f92d3..5b4a93647256 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -195,6 +195,28 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
>
> return kvm_pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
> }
Newline here.
> +#if defined(CONFIG_32BIT)
Just like other patches, let's use IS_ENABLED() here.
> +static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *out_val)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u64 enabled, running;
> + int fevent_code;
> +
> + pmc = &kvpmu->pmc[cidx];
> +
> + if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW)
> + return -EINVAL;
> +
> + fevent_code = get_event_code(pmc->event_idx);
> + pmc->counter_val = kvpmu->fw_event[fevent_code].value;
> +
> + *out_val = pmc->counter_val >> 32;
> +
> + return 0;
> +}
> +#endif
>
> static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> unsigned long *out_val)
> @@ -696,6 +718,20 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> return 0;
> }
>
> +#if defined(CONFIG_32BIT)
> +int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
> + struct kvm_vcpu_sbi_return *retdata)
> +{
> + int ret;
> +
> + ret = pmu_fw_ctr_read_hi(vcpu, cidx, &retdata->out_val);
> + if (ret == -EINVAL)
> + retdata->err_val = SBI_ERR_INVALID_PARAM;
> +
> + return 0;
> +}
> +#endif
> +
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata)
> {
> @@ -769,7 +805,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> pmc->cinfo.csr = CSR_CYCLE + i;
> } else {
> pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
> - pmc->cinfo.width = BITS_PER_LONG - 1;
> + pmc->cinfo.width = 63;
> }
> }
>
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> index 77c20a61fd7d..0cd051d5a448 100644
> --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -64,6 +64,13 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> case SBI_EXT_PMU_COUNTER_FW_READ:
> ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
> break;
> + case SBI_EXT_PMU_COUNTER_FW_READ_HI:
> +#if defined(CONFIG_32BIT)
Same as above, use IS_ENABLED() here.
> + ret = kvm_riscv_vcpu_pmu_fw_ctr_read_hi(vcpu, cp->a0, retdata);
> +#else
> + retdata->out_val = 0;
> +#endif
> + break;
> case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
> ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
> break;
> --
> 2.34.1
>
Apart from minor nits above, this looks good to me.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
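To illustrate the IS_ENABLED() suggestion made in this review, here is a userspace sketch. SK_IS_ENABLED() is a simplified imitation of the kernel macro (it only handles options defined to 1), and CONFIG_SK_32BIT and read_hi_sketch() are hypothetical names. The point is that both branches are always compiled and type-checked, and the compiler drops the dead one.

```c
/* Simplified imitation of the kernel's IS_ENABLED() (assumption: =1 only). */
#define SK_ARG_PLACEHOLDER_1 0,
#define sk_take_second_arg(ignored, val, ...) val
#define sk_is_defined(x) sk_is_defined2(x)
#define sk_is_defined2(val) sk_is_defined3(SK_ARG_PLACEHOLDER_##val)
#define sk_is_defined3(arg1_or_junk) sk_take_second_arg(arg1_or_junk 1, 0, 0)
#define SK_IS_ENABLED(option) sk_is_defined(option)

/* Pretend this option is set; the name is hypothetical. */
#define CONFIG_SK_32BIT 1

/*
 * Return the upper 32 bits of a counter when the 32-bit option is set,
 * and 0 otherwise. No #if/#else is needed in the function body.
 */
static unsigned long read_hi_sketch(unsigned long long counter)
{
	if (!SK_IS_ENABLED(CONFIG_SK_32BIT))
		return 0;

	return (unsigned long)(counter >> 32);
}
```

An option that is never defined, such as a made-up CONFIG_SK_NOT_SET, evaluates to 0 with the same macro.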
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> The PMU snapshot function minimizes the number of traps when the
> guest configures/accesses the hpmcounters. If the snapshot feature
> is enabled, the hypervisor updates the shared memory with counter
> data and state of overflown counters. The guest can just read the
> shared memory instead of relying on trap & emulate done by the hypervisor.
>
> This patch doesn't implement the counter overflow yet.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 10 ++
> arch/riscv/kvm/vcpu_pmu.c | 129 ++++++++++++++++++++++++--
> arch/riscv/kvm/vcpu_sbi_pmu.c | 3 +
> 3 files changed, 134 insertions(+), 8 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index 395518a1664e..64c75acad6ba 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -36,6 +36,7 @@ struct kvm_pmc {
> bool started;
> /* Monitoring event ID */
> unsigned long event_idx;
> + struct kvm_vcpu *vcpu;
Where is this used?
> };
>
> /* PMU data structure per vcpu */
> @@ -50,6 +51,12 @@ struct kvm_pmu {
> bool init_done;
> /* Bit map of all the virtual counter used */
> DECLARE_BITMAP(pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + /* Bit map of all the virtual counter overflown */
> + DECLARE_BITMAP(pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> + /* The address of the counter snapshot area (guest physical address) */
> + unsigned long snapshot_addr;
> + /* The actual data of the snapshot */
> + struct riscv_pmu_snapshot_data *sdata;
> };
>
> #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu_context)
> @@ -85,6 +92,9 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> struct kvm_vcpu_sbi_return *retdata);
> void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> + unsigned long saddr_high, unsigned long flags,
> + struct kvm_vcpu_sbi_return *retdata);
> void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 86391a5061dd..622c4ee89e7b 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -310,6 +310,79 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> return ret;
> }
>
> +static void kvm_pmu_clear_snapshot_area(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> +
> + if (kvpmu->sdata) {
> + memset(kvpmu->sdata, 0, snapshot_area_size);
> + if (kvpmu->snapshot_addr != INVALID_GPA)
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr,
> + kvpmu->sdata, snapshot_area_size);
We should free "kvpmu->sdata" and set it to NULL here. That way,
re-enabling the snapshot later won't leak kernel memory.
> + }
> + kvpmu->snapshot_addr = INVALID_GPA;
> +}
> +
> +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> + unsigned long saddr_high, unsigned long flags,
> + struct kvm_vcpu_sbi_return *retdata)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> + int sbiret = 0;
> + gpa_t saddr;
> + unsigned long hva;
> + bool writable;
> +
> + if (!kvpmu) {
> + sbiret = SBI_ERR_INVALID_PARAM;
> + goto out;
> + }
> +
> + if (saddr_low == -1 && saddr_high == -1) {
> + kvm_pmu_clear_snapshot_area(vcpu);
> + return 0;
> + }
> +
> + saddr = saddr_low;
> +
> + if (saddr_high != 0) {
> +#ifdef CONFIG_32BIT
> + saddr |= ((gpa_t)saddr << 32);
> +#else
> + sbiret = SBI_ERR_INVALID_ADDRESS;
> + goto out;
> +#endif
> + }
> +
> + if (kvm_is_error_gpa(vcpu->kvm, saddr)) {
> + sbiret = SBI_ERR_INVALID_PARAM;
> + goto out;
> + }
> +
> + hva = kvm_vcpu_gfn_to_hva_prot(vcpu, saddr >> PAGE_SHIFT, &writable);
> + if (kvm_is_error_hva(hva) || !writable) {
> + sbiret = SBI_ERR_INVALID_ADDRESS;
> + goto out;
> + }
> +
> + kvpmu->snapshot_addr = saddr;
> + kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC);
> + if (!kvpmu->sdata)
> + return -ENOMEM;
> +
> + if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) {
> + kfree(kvpmu->sdata);
> + kvpmu->snapshot_addr = INVALID_GPA;
> + sbiret = SBI_ERR_FAILURE;
> + }
Newline here.
> +out:
> + retdata->err_val = sbiret;
> +
> + return 0;
> +}
> +
> int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu,
> struct kvm_vcpu_sbi_return *retdata)
> {
> @@ -343,8 +416,10 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> int i, pmc_index, sbiret = 0;
> struct kvm_pmc *pmc;
> int fevent_code;
> + bool bSnapshot = flags & SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
>
> - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
> + (bSnapshot && kvpmu->snapshot_addr == INVALID_GPA)) {
We have a different error code when shared memory is not available.
> sbiret = SBI_ERR_INVALID_PARAM;
> goto out;
> }
> @@ -355,8 +430,14 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> if (!test_bit(pmc_index, kvpmu->pmc_in_use))
> continue;
> pmc = &kvpmu->pmc[pmc_index];
> - if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> + if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE) {
> pmc->counter_val = ival;
> + } else if (bSnapshot) {
> + kvm_vcpu_read_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + pmc->counter_val = kvpmu->sdata->ctr_values[pmc_index];
> + }
> +
> if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
> fevent_code = get_event_code(pmc->event_idx);
> if (fevent_code >= SBI_PMU_FW_MAX) {
> @@ -400,8 +481,10 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> u64 enabled, running;
> struct kvm_pmc *pmc;
> int fevent_code;
> + bool bSnapshot = flags & SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
>
> - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
> + (bSnapshot && (kvpmu->snapshot_addr == INVALID_GPA))) {
Same as above.
> sbiret = SBI_ERR_INVALID_PARAM;
> goto out;
> }
> @@ -423,27 +506,52 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> sbiret = SBI_ERR_ALREADY_STOPPED;
>
> kvpmu->fw_event[fevent_code].started = false;
> + /* No need to increment the value as it is absolute for firmware events */
> + pmc->counter_val = kvpmu->fw_event[fevent_code].value;
This change does not relate to the current patch.
> } else if (pmc->perf_event) {
> if (pmc->started) {
> /* Stop counting the counter */
> perf_event_disable(pmc->perf_event);
> - pmc->started = false;
Same as above.
> } else {
> sbiret = SBI_ERR_ALREADY_STOPPED;
> }
>
> - if (flags & SBI_PMU_STOP_FLAG_RESET) {
> - /* Relase the counter if this is a reset request */
> + /* Stop counting the counter */
> + perf_event_disable(pmc->perf_event);
> +
> + /* We only update if stopped is already called. The caller may stop/reset
> + * the event in two steps.
> + */
Use a double winged style multiline comment block.
> + if (pmc->started) {
> pmc->counter_val += perf_event_read_value(pmc->perf_event,
> &enabled, &running);
> + pmc->started = false;
> + }
> +
> + if (flags & SBI_PMU_STOP_FLAG_RESET) {
No need for braces here.
> + /* Relase the counter if this is a reset request */
s/Relase/Release/
> kvm_pmu_release_perf_event(pmc);
> }
> } else {
> sbiret = SBI_ERR_INVALID_PARAM;
> }
> +
> + if (bSnapshot && !sbiret) {
> + //TODO: Add counter overflow support when sscofpmf support is added
Use "/* */"
> + kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + }
> +
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + if (bSnapshot) {
> + /* Clear the snapshot area for the upcoming deletion event */
> + kvpmu->sdata->ctr_values[i] = 0;
> + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> + sizeof(struct riscv_pmu_snapshot_data));
> + }
> }
> }
>
> @@ -517,8 +625,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> kvpmu->fw_event[event_code].started = true;
> } else {
> ret = kvm_pmu_create_perf_event(pmc, &attr, flags, eidx, evtdata);
> - if (ret)
> - return ret;
> + if (ret) {
> + sbiret = SBI_ERR_NOT_SUPPORTED;
> + goto out;
> + }
This also looks like a change not related to the current patch.
> }
>
> set_bit(ctr_idx, kvpmu->pmc_in_use);
> @@ -566,6 +676,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> kvpmu->num_hw_ctrs = num_hw_ctrs + 1;
> kvpmu->num_fw_ctrs = SBI_PMU_FW_MAX;
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> + kvpmu->snapshot_addr = INVALID_GPA;
>
> if (kvpmu->num_hw_ctrs > RISCV_KVM_MAX_HW_CTRS) {
> pr_warn_once("Limiting the hardware counters to 32 as specified by the ISA");
> @@ -585,6 +696,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> pmc = &kvpmu->pmc[i];
> pmc->idx = i;
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> + pmc->vcpu = vcpu;
> if (i < kvpmu->num_hw_ctrs) {
> pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
> if (i < 3)
> @@ -625,6 +737,7 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> }
> bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> + kvpmu->snapshot_addr = INVALID_GPA;
You need to also free the sdata pointer.
> }
>
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> index 7eca72df2cbd..77c20a61fd7d 100644
> --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -64,6 +64,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> case SBI_EXT_PMU_COUNTER_FW_READ:
> ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
> break;
> + case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
> + ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
> + break;
> default:
> retdata->err_val = SBI_ERR_NOT_SUPPORTED;
> }
> --
> 2.34.1
>
Regards,
Anup
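The buffer lifetime comments in this review (free sdata on clear and on deinit, and avoid a leak when the snapshot is re-enabled) can be sketched in userspace C as below. All names here are hypothetical, and free()/calloc() stand in for kfree()/kzalloc(); it is not the code that will go into the next version.

```c
#include <stdlib.h>

#define SNAPSHOT_SIZE_SKETCH 4096
#define INVALID_ADDR_SKETCH (~0UL)

/* Hypothetical, trimmed-down stand-in for struct kvm_pmu. */
struct pmu_sketch {
	unsigned long snapshot_addr;
	void *sdata;
};

/* Enable (or re-enable) the snapshot area without leaking an old buffer. */
static int snapshot_enable(struct pmu_sketch *pmu, unsigned long addr)
{
	/* Drop any previous buffer first; free(NULL) is a no-op. */
	free(pmu->sdata);

	pmu->sdata = calloc(1, SNAPSHOT_SIZE_SKETCH);
	if (!pmu->sdata) {
		pmu->snapshot_addr = INVALID_ADDR_SKETCH;
		return -1;
	}

	pmu->snapshot_addr = addr;
	return 0;
}

/* Disable the snapshot area; safe to call more than once. */
static void snapshot_disable(struct pmu_sketch *pmu)
{
	free(pmu->sdata);
	pmu->sdata = NULL;
	pmu->snapshot_addr = INVALID_ADDR_SKETCH;
}
```

Setting the pointer to NULL after freeing is what makes a second disable, or a disable followed by deinit, harmless.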
On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
>
> KVM enables perf for the guest via counter virtualization. However,
> sampling cannot be supported as there is no mechanism to enable
> trap/emulation of scountovf in the ISA yet. Rely on the SBI PMU snapshot
> to provide the counter overflow data via the shared memory.
>
> In case of a sampling event, the host first gets the LCOFI interrupt
> and injects it into the guest via the IRQ filtering mechanism defined in AIA
> specification. Thus, ssaia must be enabled in the host in order to
> use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
s/dpeendancy/dependency/
> is required.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/uapi/asm/kvm.h | 1 +
> arch/riscv/kvm/main.c | 1 +
> arch/riscv/kvm/vcpu.c | 8 ++--
> arch/riscv/kvm/vcpu_onereg.c | 1 +
> arch/riscv/kvm/vcpu_pmu.c | 69 ++++++++++++++++++++++++++++---
> 6 files changed, 73 insertions(+), 10 deletions(-)
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 88cdc8a3e654..bec09b33e2f0 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -168,7 +168,8 @@
> #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> (_AC(1, UL) << IRQ_S_TIMER) | \
> - (_AC(1, UL) << IRQ_S_EXT))
> + (_AC(1, UL) << IRQ_S_EXT) | \
> + (_AC(1, UL) << IRQ_PMU_OVF))
>
> /* AIA CSR bits */
> #define TOPI_IID_SHIFT 16
> diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> index 60d3b21dead7..741c16f4518e 100644
> --- a/arch/riscv/include/uapi/asm/kvm.h
> +++ b/arch/riscv/include/uapi/asm/kvm.h
> @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> KVM_RISCV_ISA_EXT_ZIHPM,
> KVM_RISCV_ISA_EXT_SMSTATEEN,
> KVM_RISCV_ISA_EXT_ZICOND,
> + KVM_RISCV_ISA_EXT_SSCOFPMF,
> KVM_RISCV_ISA_EXT_MAX,
> };
>
> diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> index 225a435d9c9a..5a3a4cee0e3d 100644
> --- a/arch/riscv/kvm/main.c
> +++ b/arch/riscv/kvm/main.c
> @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> csr_write(CSR_HCOUNTEREN, 0x02);
>
> csr_write(CSR_HVIP, 0);
> + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
>
> kvm_riscv_aia_enable();
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index e087c809073c..2d9f252356c3 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -380,7 +380,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> set_bit(irq, vcpu->arch.irqs_pending);
> @@ -395,14 +396,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> {
> /*
> - * We only allow VS-mode software, timer, and external
> + * We only allow VS-mode software, timer, counter overflow and external
> * interrupts when irq is one of the local interrupts
> * defined by RISC-V privilege specification.
> */
> if (irq < IRQ_LOCAL_MAX &&
> irq != IRQ_VS_SOFT &&
> irq != IRQ_VS_TIMER &&
> - irq != IRQ_VS_EXT)
> + irq != IRQ_VS_EXT &&
> + irq != IRQ_PMU_OVF)
> return -EINVAL;
>
> clear_bit(irq, vcpu->arch.irqs_pending);
> diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> index f8c9fa0c03c5..19a0e4eaf0df 100644
> --- a/arch/riscv/kvm/vcpu_onereg.c
> +++ b/arch/riscv/kvm/vcpu_onereg.c
> @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> /* Multi letter extensions (alphabetically sorted) */
> KVM_ISA_EXT_ARR(SMSTATEEN),
> KVM_ISA_EXT_ARR(SSAIA),
> + KVM_ISA_EXT_ARR(SSCOFPMF),
Sscofpmf can't be disabled for the guest, so we should add it to
kvm_riscv_vcpu_isa_disable_allowed(), no?
> KVM_ISA_EXT_ARR(SSTC),
> KVM_ISA_EXT_ARR(SVINVAL),
> KVM_ISA_EXT_ARR(SVNAPOT),
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 622c4ee89e7b..86c8e92f92d3 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> return 0;
> }
>
> +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> + struct kvm_vcpu *vcpu = pmc->vcpu;
Ahh, the "vcpu" field is used here. Move that change from
patch7 to this patch.
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> + u64 period;
> +
> + /*
> + * Stop the event counting by directly accessing the perf_event.
> + * Otherwise, this needs to deferred via a workqueue.
> + * That will introduce skew in the counter value because the actual
> + * physical counter would start after returning from this function.
> + * It will be stopped again once the workqueue is scheduled
> + */
> + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> +
> + /*
> + * The hw counter would start automatically when this function returns.
> + * Thus, the host may continue to interrupts and inject it to the guest
> + * even without guest configuring the next event. Depending on the hardware
> + * the host may some sluggishness only if privilege mode filtering is not
> + * available. In an ideal world, where qemu is not the only capable hardware,
> + * this can be removed.
> + * FYI: ARM64 does this way while x86 doesn't do anything as such.
> + * TODO: Should we keep it for RISC-V ?
> + */
> + period = -(local64_read(&perf_event->count));
> +
> + local64_set(&perf_event->hw.period_left, 0);
> + perf_event->attr.sample_period = period;
> + perf_event->hw.sample_period = period;
> +
> + set_bit(pmc->idx, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> +
> + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> +}
> +
> static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> unsigned long flags, unsigned long eidx, unsigned long evtdata)
> {
> @@ -247,7 +288,7 @@ static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr
> */
> attr->sample_period = kvm_pmu_get_sample_period(pmc);
>
> - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> if (IS_ERR(event)) {
> pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> return PTR_ERR(event);
> @@ -466,6 +507,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
> }
>
> + /* The guest have serviced the interrupt and starting the counter again */
> + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> + }
> +
> out:
> retdata->err_val = sbiret;
>
> @@ -537,7 +584,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> }
>
> if (bSnapshot && !sbiret) {
> - //TODO: Add counter overflow support when sscofpmf support is added
> + /* The counter and overflow indicies in the snapshot region are w.r.to
> + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> + * which indicates the absolute counter index.
> + */
Use a double winged comment block here.
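To illustrate the style meant here, a minimal userspace sketch follows. The helper name mark_overflow_rel() is hypothetical and not part of the patch, and using a 64-bit constant for the shift is a choice made for this sketch only.

```c
#include <stdint.h>

/*
 * The counter and overflow indices in the snapshot region are relative
 * to cbase. Set the bit for the relative index i in the overflow mask
 * instead of using the absolute pmc_index.
 */
static uint64_t mark_overflow_rel(uint64_t ctr_overflow_mask, unsigned int i)
{
	return ctr_overflow_mask | (UINT64_C(1) << i);
}
```

The opening and closing markers sit on their own lines, which is the block comment form the kernel coding style asks for outside of net/.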
> + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> @@ -546,15 +598,19 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> if (flags & SBI_PMU_STOP_FLAG_RESET) {
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> clear_bit(pmc_index, kvpmu->pmc_in_use);
> + clear_bit(pmc_index, kvpmu->pmc_overflown);
> if (bSnapshot) {
> /* Clear the snapshot area for the upcoming deletion event */
> kvpmu->sdata->ctr_values[i] = 0;
> + /* Only clear the given counter as the caller is responsible to
> + * validate both the overflow mask and configured counters.
> + */
Use a double winged comment block here.
> + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> sizeof(struct riscv_pmu_snapshot_data));
> }
> }
> }
> -
> out:
> retdata->err_val = sbiret;
>
> @@ -729,15 +785,16 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> if (!kvpmu)
> return;
>
> - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> pmc = &kvpmu->pmc[i];
> pmc->counter_val = 0;
> kvm_pmu_release_perf_event(pmc);
> pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> }
> - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> - kvpmu->snapshot_addr = INVALID_GPA;
> + kvm_pmu_clear_snapshot_area(vcpu);
> }
>
> void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> --
> 2.34.1
>
Regards,
Anup
On Thu, Dec 7, 2023 at 4:34 AM Conor Dooley <[email protected]> wrote:
>
> On Mon, Dec 04, 2023 at 06:43:06PM -0800, Atish Patra wrote:
> > SBI PMU Snapshot function optimizes the number of traps to
> > higher privilege mode by leveraging a shared memory between the S/VS-mode
> > and the M/HS mode. Add the definitions for that extension
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/sbi.h | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> > index f3eeca79a02d..29821addb9b7 100644
> > --- a/arch/riscv/include/asm/sbi.h
> > +++ b/arch/riscv/include/asm/sbi.h
> > @@ -122,6 +122,7 @@ enum sbi_ext_pmu_fid {
> > SBI_EXT_PMU_COUNTER_STOP,
> > SBI_EXT_PMU_COUNTER_FW_READ,
> > SBI_EXT_PMU_COUNTER_FW_READ_HI,
> > + SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
> > };
> >
> > union sbi_pmu_ctr_info {
> > @@ -138,6 +139,13 @@ union sbi_pmu_ctr_info {
> > };
> > };
> >
> > +/* Data structure to contain the pmu snapshot data */
> > +struct riscv_pmu_snapshot_data {
> > + uint64_t ctr_overflow_mask;
> > + uint64_t ctr_values[64];
> > + uint64_t reserved[447];
> > +};
> > +
> > #define RISCV_PMU_RAW_EVENT_MASK GENMASK_ULL(47, 0)
> > #define RISCV_PMU_RAW_EVENT_IDX 0x20000
> >
> > @@ -234,9 +242,11 @@ enum sbi_pmu_ctr_type {
> >
> > /* Flags defined for counter start function */
> > #define SBI_PMU_START_FLAG_SET_INIT_VALUE (1 << 0)
> > +#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT (1 << 1)
> >
> > /* Flags defined for counter stop function */
> > #define SBI_PMU_STOP_FLAG_RESET (1 << 0)
> > +#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT (1 << 1)
>
> If we can use GENMASK in this file, why can we not use BIT()?
>
Sure. Done. I will change the other ones in a separate patch as well.
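As a rough sketch of what the BIT() form of these flags could look like (the BIT() definition below is a userspace stand-in for the kernel helper so that the snippet compiles on its own; the flag names and bit positions are the ones from the quoted patch):

```c
/* Userspace stand-in for the kernel's BIT() helper. */
#define BIT(nr)	(1UL << (nr))

/* Flags defined for counter start function */
#define SBI_PMU_START_FLAG_SET_INIT_VALUE	BIT(0)
#define SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT	BIT(1)

/* Flags defined for counter stop function */
#define SBI_PMU_STOP_FLAG_RESET			BIT(0)
#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT		BIT(1)
```

The values are unchanged from the (1 << n) form, so callers that OR the flags together are not affected.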
> >
> > enum sbi_ext_dbcn_fid {
> > SBI_EXT_DBCN_CONSOLE_WRITE = 0,
> > --
> > 2.34.1
> >
--
Regards,
Atish
On Thu, Dec 14, 2023 at 4:30 AM Anup Patel <[email protected]> wrote:
>
> On Thu, Dec 7, 2023 at 6:03 PM Conor Dooley <[email protected]> wrote:
> >
> > On Mon, Dec 04, 2023 at 06:43:05PM -0800, Atish Patra wrote:
> > > SBI v2.0 introduced a explicit function to read the upper bits
> > > for any firmwar counter width that is longer than XLEN. Currently,
> > > this is only applicable for RV32 where firmware counter can be
> > > 64 bit.
> >
> > The v2.0 spec explicitly says that this function returns the upper
> > 32 bits of the counter for rv32 and will always return 0 for rv64
> > or higher. The commit message here seems overly generic compared to
> > the actual definition in the spec, and makes it seem like it could
> > be used with a 128 bit counter on rv64 to get the upper 64 bits.
> >
> > I tried to think about what "generic" situation the commit message
> > had been written for, but the things I came up with would all require
> > changes to the spec to define behaviour for FID #5 and/or FID #1, so
> > in the end I couldn't figure out the rationale behind the non-committal
> > wording used here.
> >
The intention was to show that this can be extended in the future to
other XLEN systems (obviously with a spec modification). But I get your
point. We can update it whenever we have such systems and the spec.
I have modified the commit text to match what is in the spec.
> > >
> > > Signed-off-by: Atish Patra <[email protected]>
> > > ---
> > > drivers/perf/riscv_pmu_sbi.c | 11 +++++++++--
> > > 1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > > index 40a335350d08..1c9049e6b574 100644
> > > --- a/drivers/perf/riscv_pmu_sbi.c
> > > +++ b/drivers/perf/riscv_pmu_sbi.c
> > > @@ -490,16 +490,23 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > > struct hw_perf_event *hwc = &event->hw;
> > > int idx = hwc->idx;
> > > struct sbiret ret;
> > > - union sbi_pmu_ctr_info info;
> > > u64 val = 0;
> > > + union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
> > >
> > > if (pmu_sbi_is_fw_event(event)) {
> > > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> > > hwc->idx, 0, 0, 0, 0, 0);
> > > if (!ret.error)
> > > val = ret.value;
> > > +#if defined(CONFIG_32BIT)
> >
> > Why is this not IS_ENABLED()? The code below uses one. You could then
> > fold it into the if statement below.
> >
Done.
> > > + if (sbi_v2_available && info.width >= 32) {
> >
> > >= 32? I know it is from the spec, but why does the spec define it as
> > "One less than number of bits in CSR"? Saving bits in the structure I
> > guess?
>
> Yes, it is for using fewer bits in counter_info.
>
> The maximum width of a HW counter is 64 bits. The absolute value 64
> requires 7 bits in counter_info whereas absolute value 63 requires 6 bits
> in counter_info. Also, a HW counter if it exists will have at least 1 bit
> implemented otherwise the HW counter does not exist.
>
> Regards,
> Anup
>
> >
> > > + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI,
> > > + hwc->idx, 0, 0, 0, 0, 0);
> >
> > > + if (!ret.error)
> > > + val = val | ((u64)ret.value << 32);
> >
> > If the first ecall fails but the second one doesn't won't we corrupt
> > val by only setting the upper bits? If returning val == 0 is the thing
> > to do in the error case (which it is in the existing code) should the
> > first `if (!ret.error)` become `if (ret.error)` -> `return 0`?
> >
Sure. Fixed it.
> >
> > > + val = val | ((u64)ret.value << 32);
> >
> > Also, |= ?
> >
Done.
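To summarize the points above, here is a minimal userspace sketch of the RV32 read logic. It is not the code in the next revision: struct sketch_sbiret and both helper names are hypothetical stand-ins. It assumes the width field holds "number of bits minus one" as discussed, returns 0 when the low read fails, and ORs in the upper half only when the high read succeeds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct sbiret. */
struct sketch_sbiret {
	long error;
	long value;
};

/*
 * The width field stores (number of bits - 1), so any value >= 32
 * means the counter is wider than a 32-bit register.
 */
static bool ctr_needs_hi_read(unsigned int width_field)
{
	return width_field >= 32;
}

/* Combine the low and high halves of a 64-bit firmware counter. */
static uint64_t fw_ctr_combine(struct sketch_sbiret lo, struct sketch_sbiret hi)
{
	uint64_t val;

	/* A failed low read yields 0, matching the existing error behaviour. */
	if (lo.error)
		return 0;

	val = (uint32_t)lo.value;
	if (!hi.error)
		val |= (uint64_t)(uint32_t)hi.value << 32;

	return val;
}
```

With this shape the upper bits are never set on top of a failed low read, which was the corruption case pointed out above.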
> > Cheers,
> > Conor.
> >
> > > + }
> > > +#endif
> > > } else {
> > > - info = pmu_ctr_list[idx];
> > > val = riscv_pmu_ctr_read_csr(info.csr);
> > > if (IS_ENABLED(CONFIG_32BIT))
> > > val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> > > --
> > > 2.34.1
> > >
On Thu, Dec 14, 2023 at 4:16 AM Anup Patel <[email protected]> wrote:
>
> On Thu, Dec 7, 2023 at 5:39 PM Conor Dooley <[email protected]> wrote:
> >
> > On Mon, Dec 04, 2023 at 06:43:03PM -0800, Atish Patra wrote:
> > > SBI v2.0 added few functions to improve SBI PMU extension. In order
> > > to be backward compatible, the driver must use these functions only
> > > if SBI v2.0 is available.
> > >
> > > Signed-off-by: Atish Patra <[email protected]>
> >
> > IMO this does not make sense in a patch of its own and should probably
> > be squashed with the first user for it.
>
> I agree. This patch should be squashed into patch4 where the
> flag is first used.
>
Done. Thanks.
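For readers following along, a small userspace sketch of the version check being squashed. The sketch_* names and macros are hypothetical; the layout (major version in bits 30:24, minor version in bits 23:0) is taken from the SBI spec's description of the spec version value.

```c
#include <stdbool.h>

#define SKETCH_MAJOR_SHIFT	24
#define SKETCH_MAJOR_MASK	0x7fUL
#define SKETCH_MINOR_MASK	0xffffffUL

/* Pack a major/minor pair the way the SBI spec version value is laid out. */
static unsigned long sketch_mk_version(unsigned long major, unsigned long minor)
{
	return ((major & SKETCH_MAJOR_MASK) << SKETCH_MAJOR_SHIFT) |
	       (minor & SKETCH_MINOR_MASK);
}

/* True when the reported spec version is v2.0 or newer. */
static bool sketch_v2_available(unsigned long spec_version)
{
	return spec_version >= sketch_mk_version(2, 0);
}
```

Because the major number sits in the higher bits, a plain integer comparison orders versions correctly.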
> Regards,
> Anup
>
> >
> > > ---
> > > drivers/perf/riscv_pmu_sbi.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > > index 16acd4dcdb96..40a335350d08 100644
> > > --- a/drivers/perf/riscv_pmu_sbi.c
> > > +++ b/drivers/perf/riscv_pmu_sbi.c
> > > @@ -35,6 +35,8 @@
> > > PMU_FORMAT_ATTR(event, "config:0-47");
> > > PMU_FORMAT_ATTR(firmware, "config:63");
> > >
> > > +static bool sbi_v2_available;
> > > +
> > > static struct attribute *riscv_arch_formats_attr[] = {
> > > &format_attr_event.attr,
> > > &format_attr_firmware.attr,
> > > @@ -1108,6 +1110,9 @@ static int __init pmu_sbi_devinit(void)
> > > return 0;
> > > }
> > >
> > > + if (sbi_spec_version >= sbi_mk_version(2, 0))
> > > + sbi_v2_available = true;
> > > +
> > > ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING,
> > > "perf/riscv/pmu:starting",
> > > pmu_sbi_starting_cpu, pmu_sbi_dying_cpu);
> > > --
> > > 2.34.1
> > >
On Thu, Dec 7, 2023 at 5:06 AM Conor Dooley <[email protected]> wrote:
>
> Hey Atish,
>
> On Mon, Dec 04, 2023 at 06:43:07PM -0800, Atish Patra wrote:
> > SBI v2.0 SBI introduced PMU snapshot feature which adds the following
> > features.
> >
> > 1. Read counter values directly from the shared memory instead of
> > csr read.
> > 2. Start multiple counters with initial values with one SBI call.
> >
> > These functionalities optimizes the number of traps to the higher
> > privilege mode. If the kernel is in VS mode while the hypervisor
> > deploy trap & emulate method, this would minimize all the hpmcounter
> > CSR read traps. If the kernel is running in S-mode, the benfits
> > reduced to CSR latency vs DRAM/cache latency as there is no trap
> > involved while accessing the hpmcounter CSRs.
> >
> > In both modes, it does saves the number of ecalls while starting
> > multiple counter together with an initial values. This is a likely
> > scenario if multiple counters overflow at the same time.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > drivers/perf/riscv_pmu.c | 1 +
> > drivers/perf/riscv_pmu_sbi.c | 203 ++++++++++++++++++++++++++++++---
> > include/linux/perf/riscv_pmu.h | 6 +
> > 3 files changed, 197 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > index 0dda70e1ef90..5b57acb770d3 100644
> > --- a/drivers/perf/riscv_pmu.c
> > +++ b/drivers/perf/riscv_pmu.c
> > @@ -412,6 +412,7 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> > cpuc->n_events = 0;
> > for (i = 0; i < RISCV_MAX_COUNTERS; i++)
> > cpuc->events[i] = NULL;
> > + cpuc->snapshot_addr = NULL;
> > }
> > pmu->pmu = (struct pmu) {
> > .event_init = riscv_pmu_event_init,
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 1c9049e6b574..1b8b6de63b69 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -36,6 +36,9 @@ PMU_FORMAT_ATTR(event, "config:0-47");
> > PMU_FORMAT_ATTR(firmware, "config:63");
> >
> > static bool sbi_v2_available;
> > +static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available);
> > +#define sbi_pmu_snapshot_available() \
> > + static_branch_unlikely(&sbi_pmu_snapshot_available)
> >
> > static struct attribute *riscv_arch_formats_attr[] = {
> > &format_attr_event.attr,
> > @@ -485,14 +488,101 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig)
> > return ret;
> > }
> >
> > +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
> > +{
> > + int cpu;
>
> > + struct cpu_hw_events *cpu_hw_evt;
>
> This is only used inside the scope of the for loop.
>
Are you suggesting mixed declarations? Personally, I prefer all the
declarations up front for readability.
Let me know if you think that's an issue or that it violates the coding style.
> > +
> > + for_each_possible_cpu(cpu) {
> > + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> > + if (!cpu_hw_evt->snapshot_addr)
> > + continue;
>
> Could you add a blank line here please?
Done.
>
> > + free_page((unsigned long)cpu_hw_evt->snapshot_addr);
> > + cpu_hw_evt->snapshot_addr = NULL;
> > + cpu_hw_evt->snapshot_addr_phys = 0;
>
> Why do these need to be explicitly zeroed?
>
We may hit an allocation failure partway through allocating for all the cpus.
In that case, we need to free the pages already allocated and zero out the
pointers for all the possible cpus.
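The all-or-nothing pattern can be sketched in userspace as follows. This is only an illustration: struct cpu_slot, the function names, and the fail_at parameter (used to simulate a failing allocation) are hypothetical, and calloc()/free() stand in for alloc_page()/free_page().

```c
#include <stdlib.h>

/* Hypothetical per-cpu slot holding the snapshot buffer pointer. */
struct cpu_slot {
	void *snapshot_addr;
};

/* Free every allocated buffer and reset the pointer so a retry starts clean. */
static void snapshot_free_all(struct cpu_slot *slots, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!slots[i].snapshot_addr)
			continue;

		free(slots[i].snapshot_addr);
		slots[i].snapshot_addr = NULL;
	}
}

/* Allocate one buffer per slot; undo everything if any allocation fails. */
static int snapshot_alloc_all(struct cpu_slot *slots, int n, int fail_at)
{
	void *page;
	int i;

	for (i = 0; i < n; i++) {
		if (slots[i].snapshot_addr)
			continue;

		page = (i == fail_at) ? NULL : calloc(1, 4096);
		if (!page) {
			snapshot_free_all(slots, n);
			return -1;
		}
		slots[i].snapshot_addr = page;
	}

	return 0;
}
```

Resetting the pointers matters because both loops use a NULL check to decide whether a slot still needs work; a stale pointer would be skipped by a later allocation attempt or freed twice.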
> > + }
> > +}
> > +
> > +static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu)
> > +{
> > + int cpu;
>
> > + struct page *snapshot_page;
> > + struct cpu_hw_events *cpu_hw_evt;
>
> Same here re scope
>
same reply as above.
> > +
> > + for_each_possible_cpu(cpu) {
> > + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> > + if (cpu_hw_evt->snapshot_addr)
> > + continue;
>
> Same here re blank line
>
Done.
> > + snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> > + if (!snapshot_page) {
> > + pmu_sbi_snapshot_free(pmu);
> > + return -ENOMEM;
> > + }
> > + cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page);
> > + cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static void pmu_sbi_snapshot_disable(void)
> > +{
> > + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, -1,
> > + -1, 0, 0, 0, 0);
> > +}
> > +
> > +static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu)
> > +{
> > + struct cpu_hw_events *cpu_hw_evt;
> > + struct sbiret ret = {0};
> > + int rc;
> > +
> > + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu);
> > + if (!cpu_hw_evt->snapshot_addr_phys)
> > + return -EINVAL;
> > +
> > + if (cpu_hw_evt->snapshot_set_done)
> > + return 0;
> > +
> > +#if defined(CONFIG_32BIT)
>
> Why does this need to be an `#if defined()`? Does the code not compile
> if you use IS_ENABLED()?
>
Changed it to IS_ENABLED().
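A minimal sketch of how the two branches could fold into one helper once the preprocessor check is gone. struct shmem_args and shmem_split() are hypothetical names, and is_32bit stands in for IS_ENABLED(CONFIG_32BIT) so the snippet builds in userspace.

```c
#include <stdint.h>

/* Hypothetical container for the two shmem address arguments. */
struct shmem_args {
	uint64_t lo;
	uint64_t hi;
};

/* Split a physical address into the low/high arguments of the SBI call. */
static struct shmem_args shmem_split(uint64_t phys, int is_32bit)
{
	struct shmem_args args;

	if (is_32bit) {
		/* On RV32 the upper 32 bits go in the second argument. */
		args.lo = (uint32_t)phys;
		args.hi = phys >> 32;
	} else {
		/* On RV64 the whole address fits in the first argument. */
		args.lo = phys;
		args.hi = 0;
	}

	return args;
}
```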
> > + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
> > + (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0);
> > +#else
> > + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, cpu_hw_evt->snapshot_addr_phys,
> > + 0, 0, 0, 0, 0);
> > +#endif
>
> > + /* Free up the snapshot area memory and fall back to default SBI */
>
> What does "fall back to the default SBI mean"? SBI is an interface so I
> don't understand what it means in this context. Secondly,
In the absence of SBI PMU snapshot, the driver would read the
counters directly and end up trapping.
Also, it would not use the SBI PMU snapshot flags in the SBI start/stop calls.
Snapshot is an alternative mechanism to minimize the traps. I just
wanted to highlight that.
How about this?
/* Free up the snapshot area memory and fall back to default SBI PMU
calls without snapshot */
> > + if (ret.error) {
> > + if (ret.error != SBI_ERR_NOT_SUPPORTED)
> > + pr_warn("%s: pmu snapshot setup failed with error %ld\n", __func__,
> > + ret.error);
>
> Why is the function relevant here? Is the error message in-and-of-itself
> not sufficient here? Where else would one be setting up the snapshots
> other than the setup function?
>
The SBI implementation (e.g. OpenSBI) may or may not provide the snapshot
feature. This error message indicates that the SBI implementation
supports PMU snapshot but the setup failed for some other reason.
> > + rc = sbi_err_map_linux_errno(ret.error);
>
> > + if (rc)
> > + return rc;
>
> Is it even possible for !rc at this point? You've already checked that
> ret.error is non zero, so this just becomes
> `return sbi_err_map_linux_errno(ret.error);`?
>
Good catch. Thanks. Fixed it.
> > + }
> > +
> > + cpu_hw_evt->snapshot_set_done = true;
> > +
> > + return 0;
> > +}
> > +
> > static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > {
> > struct hw_perf_event *hwc = &event->hw;
> > int idx = hwc->idx;
> > struct sbiret ret;
> > u64 val = 0;
> > + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> > + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> > union sbi_pmu_ctr_info info = pmu_ctr_list[idx];
> >
> > + /* Read the value from the shared memory directly */
>
> Statement of the obvious, no?
>
Probably. I just wanted to be explicit for readers who haven't read
the spec and don't know how snapshot works.
> > + if (sbi_pmu_snapshot_available()) {
> > + val = sdata->ctr_values[idx];
> > + goto done;
>
> s/goto done/return val/
> There's no cleanup to be done here, what purpose does the goto serve?
>
Sure. Done.
> > + }
> > +
> > if (pmu_sbi_is_fw_event(event)) {
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ,
> > hwc->idx, 0, 0, 0, 0, 0);
> > @@ -512,6 +602,7 @@ static u64 pmu_sbi_ctr_read(struct perf_event *event)
> > val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val;
> > }
> >
> > +done:
> > return val;
> > }
> >
> > @@ -539,6 +630,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > struct hw_perf_event *hwc = &event->hw;
> > unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
> >
> > + /* There is no benefit setting SNAPSHOT FLAG for a single counter */
> > #if defined(CONFIG_32BIT)
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx,
> > 1, flag, ival, ival >> 32, 0);
> > @@ -559,16 +651,29 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > {
> > struct sbiret ret;
> > struct hw_perf_event *hwc = &event->hw;
> > + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu);
> > + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> >
> > if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) &&
> > (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > pmu_sbi_reset_scounteren((void *)event);
> >
> > + if (sbi_pmu_snapshot_available())
> > + flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
> > +
> > ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> > - if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> > - flag != SBI_PMU_STOP_FLAG_RESET)
> > + if (!ret.error && sbi_pmu_snapshot_available()) {
>
> > + /* Snapshot is taken relative to the counter idx base. Apply a fixup. */
> > + if (hwc->idx > 0) {
> > + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
> > + sdata->ctr_values[0] = 0;
>
> Why is this being zeroed in this manner? Why is zeroing it not required
> if hwc->idx == 0? You've got a comment there that could probably do with
> elaboration.
>
hwc->idx is the counter_idx_base here. If it is zero, the counter0
value is already updated in the right slot of the shared memory.
However, if the base is > 0, the snapshot is taken relative to the base
(i.e. at index 0), so we need to move the value to the absolute counter
index in the shared memory. Does that make sense?
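A small userspace sketch of that fixup, for illustration only. struct snapshot_sketch mirrors the layout of riscv_pmu_snapshot_data from the earlier patch, which works out to (1 + 64 + 447) * 8 = 4096 bytes; the function name is hypothetical.

```c
#include <stdint.h>

/* Same layout as riscv_pmu_snapshot_data in the earlier patch. */
struct snapshot_sketch {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[64];
	uint64_t reserved[447];
};

/*
 * A single-counter stop uses idx as the base, so the value lands at
 * relative index 0. Move it to the absolute index and clear slot 0.
 */
static void snapshot_fixup_single(struct snapshot_sketch *sdata, unsigned int idx)
{
	/* With base 0 the relative and absolute indices coincide. */
	if (idx == 0)
		return;

	sdata->ctr_values[idx] = sdata->ctr_values[0];
	sdata->ctr_values[0] = 0;
}
```

When idx is 0 nothing is moved, which is why the quoted code only zeroes slot 0 in the idx > 0 case.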
> > + }
> > + } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> > + flag != SBI_PMU_STOP_FLAG_RESET) {
> > pr_err("Stopping counter idx %d failed with error %d\n",
> > hwc->idx, sbi_err_map_linux_errno(ret.error));
> > + }
> > }
> >
> > static int pmu_sbi_find_num_ctrs(void)
> > @@ -626,10 +731,14 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu)
> > static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> > {
> > struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > + unsigned long flag = 0;
> > +
> > + if (sbi_pmu_snapshot_available())
> > + flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
> >
> > /* No need to check the error here as we can't do anything about the error */
> > sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0,
> > - cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0);
> > + cpu_hw_evt->used_hw_ctrs[0], flag, 0, 0, 0);
> > }
> >
> > /*
> > @@ -638,11 +747,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu)
> > * while the overflowed counters need to be started with updated initialization
> > * value.
> > */
> > -static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> > - unsigned long ctr_ovf_mask)
> > +static noinline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt,
> > + unsigned long ctr_ovf_mask)
> > {
> > int idx = 0;
> > - struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > struct perf_event *event;
> > unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE;
> > unsigned long ctr_start_mask = 0;
> > @@ -677,6 +785,49 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> > }
> > }
> >
> > +static noinline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt,
> > + unsigned long ctr_ovf_mask)
> > +{
> > + int idx = 0;
> > + struct perf_event *event;
> > + unsigned long flag = SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
> > + uint64_t max_period;
> > + struct hw_perf_event *hwc;
> > + u64 init_val = 0;
> > + unsigned long ctr_start_mask = 0;
> > + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> > +
> > + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) {
> > + if (ctr_ovf_mask & (1 << idx)) {
> > + event = cpu_hw_evt->events[idx];
> > + hwc = &event->hw;
> > + max_period = riscv_pmu_ctr_get_width_mask(event);
> > + init_val = local64_read(&hwc->prev_count) & max_period;
> > + sdata->ctr_values[idx] = init_val;
> > + }
> > + /* We donot need to update the non-overflow counters the previous
>
> /*
> * We don't need to update the non-overflow counters as the previous
>
>
> > + * value should have been there already.
> > + */
> > + }
> > +
> > + ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0];
> > +
> > + /* Start all the counters in a single shot */
> > + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask,
> > + flag, 0, 0, 0);
> > +}
> > +
> > +static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu,
> > + unsigned long ctr_ovf_mask)
> > +{
> > + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > +
> > + if (sbi_pmu_snapshot_available())
> > + pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask);
> > + else
> > + pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask);
> > +}
> > +
> > static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> > {
> > struct perf_sample_data data;
> > @@ -690,6 +841,7 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> > unsigned long overflowed_ctrs = 0;
> > struct cpu_hw_events *cpu_hw_evt = dev;
> > u64 start_clock = sched_clock();
> > + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr;
> >
> > if (WARN_ON_ONCE(!cpu_hw_evt))
> > return IRQ_NONE;
> > @@ -711,8 +863,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
> > pmu_sbi_stop_hw_ctrs(pmu);
> >
> > /* Overflow status register should only be read after counter are stopped */
> > - ALT_SBI_PMU_OVERFLOW(overflow);
> > -
> > + if (sbi_pmu_snapshot_available())
> > + overflow = sdata->ctr_overflow_mask;
> > + else
> > + ALT_SBI_PMU_OVERFLOW(overflow);
> > /*
> > * Overflow interrupt pending bit should only be cleared after stopping
> > * all the counters to avoid any race condition.
> > @@ -774,6 +928,7 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> > {
> > struct riscv_pmu *pmu = hlist_entry_safe(node, struct riscv_pmu, node);
> > struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > + int ret = 0;
> >
> > /*
> > * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> > @@ -794,7 +949,10 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> > enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE);
> > }
> >
> > - return 0;
> > + if (sbi_pmu_snapshot_available())
> > + ret = pmu_sbi_snapshot_setup(pmu, cpu);
> > +
> > + return ret;
>
> I'd just write this as
>
> if (sbi_pmu_snapshot_available())
> return pmu_sbi_snapshot_setup(pmu, cpu);
>
> return 0;
>
> and drop the newly added variable I think.
>
Sure. Just a preference thingy.
> > }
> >
> > static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
> > @@ -807,6 +965,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node)
> > /* Disable all counters access for user mode now */
> > csr_write(CSR_SCOUNTEREN, 0x0);
> >
> > + if (sbi_pmu_snapshot_available())
> > + pmu_sbi_snapshot_disable();
> > +
> > return 0;
> > }
> >
> > @@ -1076,10 +1237,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > pmu->event_unmapped = pmu_sbi_event_unmapped;
> > pmu->csr_index = pmu_sbi_csr_index;
> >
> > - ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > - if (ret)
> > - return ret;
> > -
> > ret = riscv_pm_pmu_register(pmu);
> > if (ret)
> > goto out_unregister;
> > @@ -1088,8 +1245,28 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > if (ret)
> > goto out_unregister;
> >
> > + /* SBI PMU Snasphot is only available in SBI v2.0 */
>
> s/Snasphot/Snapshot/
>
Thanks. Fixed.
> > + if (sbi_v2_available) {
> > + ret = pmu_sbi_snapshot_alloc(pmu);
> > + if (ret)
> > + goto out_unregister;
>
> A blank line here aids readability by breaking up the reuse of ret.
done.
>
> > + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
> > + if (!ret) {
> > + pr_info("SBI PMU snapshot is available to optimize the PMU traps\n");
>
> Why the verbose message? Could we standardise on one wording for the SBI
> function probing stuff? Most users seem to be "SBI FOO extension detected".
> Only IPI has additional wording and PMU differs slightly.
The additional information is there so users understand that the PMU
uses fewer traps on this system.
We can just switch to the following and expect users to read up on the
purpose of the snapshot in the spec:
"SBI PMU snapshot available"
>
> > + /* We enable it once here for the boot cpu. If snapshot shmem fails during
>
> Again, comment style here. What does "snapshot shmem" mean? I think
> there's a missing action here. Registration? Allocation?
>
Fixed it. It is supposed to be "snapshot shmem setup"
> > + * cpu hotplug on, it should bail out.
>
> Should or will? What action does "bail out" correspond to?
>
It will bail out of the cpu hotplug process. We don't support
heterogeneous PMUs for snapshot.
If the SBI implementation returns success for SBI_EXT_PMU_SNAPSHOT_SET_SHMEM
on the boot cpu but fails for other cpus while bringing them up, that is
problematic to handle.
> Thanks,
> Conor.
>
> > + */
> > + static_branch_enable(&sbi_pmu_snapshot_available);
> > + }
> > + /* Snapshot is an optional feature. Continue if not available */
> > + }
> > +
> > register_sysctl("kernel", sbi_pmu_sysctl_table);
> >
> > + ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > + if (ret)
> > + return ret;
> > +
> > return 0;
> >
> > out_unregister:
> > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > index 43282e22ebe1..c3fa90970042 100644
> > --- a/include/linux/perf/riscv_pmu.h
> > +++ b/include/linux/perf/riscv_pmu.h
> > @@ -39,6 +39,12 @@ struct cpu_hw_events {
> > DECLARE_BITMAP(used_hw_ctrs, RISCV_MAX_COUNTERS);
> > /* currently enabled firmware counters */
> > DECLARE_BITMAP(used_fw_ctrs, RISCV_MAX_COUNTERS);
> > + /* The virtual address of the shared memory where counter snapshot will be taken */
> > + void *snapshot_addr;
> > + /* The physical address of the shared memory where counter snapshot will be taken */
> > + phys_addr_t snapshot_addr_phys;
> > + /* Boolean flag to indicate setup is already done */
> > + bool snapshot_set_done;
> > };
> >
> > struct riscv_pmu {
> > --
> > 2.34.1
> >
On Thu, Dec 14, 2023 at 8:02 AM Anup Patel <[email protected]> wrote:
>
> On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
> >
> > KVM enables perf for guest via counter virtualization. However, the
> > sampling can not be supported as there is no mechanism to enabled
> > trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot
> > to provide the counter overflow data via the shared memory.
> >
> > In case of sampling event, the host first guest the LCOFI interrupt
> > and injects to the guest via irq filtering mechanism defined in AIA
> > specification. Thus, ssaia must be enabled in the host in order to
> > use perf sampling in the guest. No other AIA dpeendancy w.r.t kernel
>
> s/dpeendancy/dependency/
>
Fixed.
> > is required.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/csr.h | 3 +-
> > arch/riscv/include/uapi/asm/kvm.h | 1 +
> > arch/riscv/kvm/main.c | 1 +
> > arch/riscv/kvm/vcpu.c | 8 ++--
> > arch/riscv/kvm/vcpu_onereg.c | 1 +
> > arch/riscv/kvm/vcpu_pmu.c | 69 ++++++++++++++++++++++++++++---
> > 6 files changed, 73 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> > index 88cdc8a3e654..bec09b33e2f0 100644
> > --- a/arch/riscv/include/asm/csr.h
> > +++ b/arch/riscv/include/asm/csr.h
> > @@ -168,7 +168,8 @@
> > #define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
> > #define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
> > (_AC(1, UL) << IRQ_S_TIMER) | \
> > - (_AC(1, UL) << IRQ_S_EXT))
> > + (_AC(1, UL) << IRQ_S_EXT) | \
> > + (_AC(1, UL) << IRQ_PMU_OVF))
> >
> > /* AIA CSR bits */
> > #define TOPI_IID_SHIFT 16
> > diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
> > index 60d3b21dead7..741c16f4518e 100644
> > --- a/arch/riscv/include/uapi/asm/kvm.h
> > +++ b/arch/riscv/include/uapi/asm/kvm.h
> > @@ -139,6 +139,7 @@ enum KVM_RISCV_ISA_EXT_ID {
> > KVM_RISCV_ISA_EXT_ZIHPM,
> > KVM_RISCV_ISA_EXT_SMSTATEEN,
> > KVM_RISCV_ISA_EXT_ZICOND,
> > + KVM_RISCV_ISA_EXT_SSCOFPMF,
> > KVM_RISCV_ISA_EXT_MAX,
> > };
> >
> > diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> > index 225a435d9c9a..5a3a4cee0e3d 100644
> > --- a/arch/riscv/kvm/main.c
> > +++ b/arch/riscv/kvm/main.c
> > @@ -43,6 +43,7 @@ int kvm_arch_hardware_enable(void)
> > csr_write(CSR_HCOUNTEREN, 0x02);
> >
> > csr_write(CSR_HVIP, 0);
> > + csr_write(CSR_HVIEN, 1UL << IRQ_PMU_OVF);
> >
> > kvm_riscv_aia_enable();
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index e087c809073c..2d9f252356c3 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -380,7 +380,8 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > set_bit(irq, vcpu->arch.irqs_pending);
> > @@ -395,14 +396,15 @@ int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
> > {
> > /*
> > - * We only allow VS-mode software, timer, and external
> > + * We only allow VS-mode software, timer, counter overflow and external
> > * interrupts when irq is one of the local interrupts
> > * defined by RISC-V privilege specification.
> > */
> > if (irq < IRQ_LOCAL_MAX &&
> > irq != IRQ_VS_SOFT &&
> > irq != IRQ_VS_TIMER &&
> > - irq != IRQ_VS_EXT)
> > + irq != IRQ_VS_EXT &&
> > + irq != IRQ_PMU_OVF)
> > return -EINVAL;
> >
> > clear_bit(irq, vcpu->arch.irqs_pending);
> > diff --git a/arch/riscv/kvm/vcpu_onereg.c b/arch/riscv/kvm/vcpu_onereg.c
> > index f8c9fa0c03c5..19a0e4eaf0df 100644
> > --- a/arch/riscv/kvm/vcpu_onereg.c
> > +++ b/arch/riscv/kvm/vcpu_onereg.c
> > @@ -36,6 +36,7 @@ static const unsigned long kvm_isa_ext_arr[] = {
> > /* Multi letter extensions (alphabetically sorted) */
> > KVM_ISA_EXT_ARR(SMSTATEEN),
> > KVM_ISA_EXT_ARR(SSAIA),
> > + KVM_ISA_EXT_ARR(SSCOFPMF),
>
> Sscofpmf can't be disabled for guest so we should add it to
> kvm_riscv_vcpu_isa_disable_allowed(), no ?
>
Just to clarify: it can't be disabled from KVM user space via the ONE_REG
interface if KVM already exposes it to the guest. However, KVM will not
expose Sscofpmf to the guest if Ssaia is not available.
I have added these fixes in the next version.
> > KVM_ISA_EXT_ARR(SSTC),
> > KVM_ISA_EXT_ARR(SVINVAL),
> > KVM_ISA_EXT_ARR(SVNAPOT),
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 622c4ee89e7b..86c8e92f92d3 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -229,6 +229,47 @@ static int kvm_pmu_validate_counter_mask(struct kvm_pmu *kvpmu, unsigned long ct
> > return 0;
> > }
> >
> > +static void kvm_riscv_pmu_overflow(struct perf_event *perf_event,
> > + struct perf_sample_data *data,
> > + struct pt_regs *regs)
> > +{
> > + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > + struct kvm_vcpu *vcpu = pmc->vcpu;
>
> Ahh, the "vcpu" field is used here. Move that change from
> patch7 to this patch.
>
done.
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct riscv_pmu *rpmu = to_riscv_pmu(perf_event->pmu);
> > + u64 period;
> > +
> > + /*
> > + * Stop the event counting by directly accessing the perf_event.
> > + * Otherwise, this needs to be deferred via a workqueue.
> > + * That will introduce skew in the counter value because the actual
> > + * physical counter would start after returning from this function.
> > + * It will be stopped again once the workqueue is scheduled
> > + */
> > + rpmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> > +
> > + /*
> > + * The hw counter would start automatically when this function returns.
> > + * Thus, the host may continue to interrupt and inject it to the guest
> > + * even without the guest configuring the next event. Depending on the hardware
> > + * the host may have some sluggishness only if privilege mode filtering is not
> > + * available. In an ideal world, where qemu is not the only capable hardware,
> > + * this can be removed.
> > + * FYI: ARM64 does this way while x86 doesn't do anything as such.
> > + * TODO: Should we keep it for RISC-V ?
> > + */
> > + period = -(local64_read(&perf_event->count));
> > +
> > + local64_set(&perf_event->hw.period_left, 0);
> > + perf_event->attr.sample_period = period;
> > + perf_event->hw.sample_period = period;
> > +
> > + set_bit(pmc->idx, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_PMU_OVF);
> > +
> > + rpmu->pmu.start(perf_event, PERF_EF_RELOAD);
> > +}
> > +
> > static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr *attr,
> > unsigned long flags, unsigned long eidx, unsigned long evtdata)
> > {
> > @@ -247,7 +288,7 @@ static int kvm_pmu_create_perf_event(struct kvm_pmc *pmc, struct perf_event_attr
> > */
> > attr->sample_period = kvm_pmu_get_sample_period(pmc);
> >
> > - event = perf_event_create_kernel_counter(attr, -1, current, NULL, pmc);
> > + event = perf_event_create_kernel_counter(attr, -1, current, kvm_riscv_pmu_overflow, pmc);
> > if (IS_ERR(event)) {
> > pr_err("kvm pmu event creation failed for eidx %lx: %ld\n", eidx, PTR_ERR(event));
> > return PTR_ERR(event);
> > @@ -466,6 +507,12 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > }
> > }
> >
> > + /* The guest has serviced the interrupt and is starting the counter again */
> > + if (test_bit(IRQ_PMU_OVF, vcpu->arch.irqs_pending)) {
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > + kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_PMU_OVF);
> > + }
> > +
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -537,7 +584,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > }
> >
> > if (bSnapshot && !sbiret) {
> > - //TODO: Add counter overflow support when sscofpmf support is added
> > + /* The counter and overflow indices in the snapshot region are w.r.t.
> > + * cbase. Modify the set bit in the counter mask instead of the pmc_index
> > + * which indicates the absolute counter index.
> > + */
>
> Use a double winged comment block here.
>
> > + if (test_bit(pmc_index, kvpmu->pmc_overflown))
> > + kvpmu->sdata->ctr_overflow_mask |= (1UL << i);
> > kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > @@ -546,15 +598,19 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > if (flags & SBI_PMU_STOP_FLAG_RESET) {
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > clear_bit(pmc_index, kvpmu->pmc_in_use);
> > + clear_bit(pmc_index, kvpmu->pmc_overflown);
> > if (bSnapshot) {
> > /* Clear the snapshot area for the upcoming deletion event */
> > kvpmu->sdata->ctr_values[i] = 0;
> > + /* Only clear the given counter as the caller is responsible to
> > + * validate both the overflow mask and configured counters.
> > + */
>
> Use a double winged comment block here.
>
Fixed all the comment styling.
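
To illustrate the indexing that the two comments above describe, here is a minimal standalone sketch (userspace C, not the KVM code). The function name and its parameters are made up for illustration; the only thing taken from the patch is that bit i of the snapshot overflow mask refers to counter (cbase + i), while pmc_index is the absolute counter index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: set or clear the overflow bit of one counter in a
 * snapshot overflow mask. Bit i of the mask refers to counter
 * (ctr_base + i), so the absolute index is converted to a relative one
 * before the mask is touched. Only the bit of the given counter changes.
 */
static uint64_t overflow_mask_update(uint64_t mask, unsigned int ctr_base,
				     unsigned int pmc_index, bool overflown)
{
	unsigned int i;

	assert(pmc_index >= ctr_base);
	i = pmc_index - ctr_base;
	assert(i < 64);

	if (overflown)
		mask |= UINT64_C(1) << i;
	else
		mask &= ~(UINT64_C(1) << i);

	return mask;
}
```

With ctr_base = 3 and pmc_index = 5, the helper touches bit 2 of the mask, not bit 5.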
> > + kvpmu->sdata->ctr_overflow_mask &= ~(1UL << i);
> > kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > sizeof(struct riscv_pmu_snapshot_data));
> > }
> > }
> > }
> > -
> > out:
> > retdata->err_val = sbiret;
> >
> > @@ -729,15 +785,16 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > if (!kvpmu)
> > return;
> >
> > - for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_MAX_COUNTERS) {
> > + for_each_set_bit(i, kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS) {
> > pmc = &kvpmu->pmc[i];
> > pmc->counter_val = 0;
> > kvm_pmu_release_perf_event(pmc);
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > }
> > - bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> > + bitmap_zero(kvpmu->pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> > memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> > - kvpmu->snapshot_addr = INVALID_GPA;
> > + kvm_pmu_clear_snapshot_area(vcpu);
> > }
> >
> > void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > --
> > 2.34.1
> >
>
> Regards,
> Anup
On Thu, Dec 14, 2023 at 5:46 AM Anup Patel <[email protected]> wrote:
>
> On Tue, Dec 5, 2023 at 8:13 AM Atish Patra <[email protected]> wrote:
> >
> > PMU Snapshot function allows to minimize the number of traps when the
> > > guest configures/accesses the hpmcounters. If the snapshot feature
> > is enabled, the hypervisor updates the shared memory with counter
> > data and state of overflown counters. The guest can just read the
> > shared memory instead of trap & emulate done by the hypervisor.
> >
> > This patch doesn't implement the counter overflow yet.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/kvm_vcpu_pmu.h | 10 ++
> > arch/riscv/kvm/vcpu_pmu.c | 129 ++++++++++++++++++++++++--
> > arch/riscv/kvm/vcpu_sbi_pmu.c | 3 +
> > 3 files changed, 134 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > index 395518a1664e..64c75acad6ba 100644
> > --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > @@ -36,6 +36,7 @@ struct kvm_pmc {
> > bool started;
> > /* Monitoring event ID */
> > unsigned long event_idx;
> > + struct kvm_vcpu *vcpu;
>
> Where is this used ?
>
Moved it to the next patch as suggested there.
> > };
> >
> > /* PMU data structure per vcpu */
> > @@ -50,6 +51,12 @@ struct kvm_pmu {
> > bool init_done;
> > /* Bit map of all the virtual counter used */
> > DECLARE_BITMAP(pmc_in_use, RISCV_KVM_MAX_COUNTERS);
> > + /* Bit map of all the virtual counter overflown */
> > + DECLARE_BITMAP(pmc_overflown, RISCV_KVM_MAX_COUNTERS);
> > + /* The address of the counter snapshot area (guest physical address) */
> > + unsigned long snapshot_addr;
> > + /* The actual data of the snapshot */
> > + struct riscv_pmu_snapshot_data *sdata;
> > };
> >
> > #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu_context)
> > @@ -85,6 +92,9 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> > int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > struct kvm_vcpu_sbi_return *retdata);
> > void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> > +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> > + unsigned long saddr_high, unsigned long flags,
> > + struct kvm_vcpu_sbi_return *retdata);
> > void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> > void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> >
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 86391a5061dd..622c4ee89e7b 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -310,6 +310,79 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> > return ret;
> > }
> >
> > +static void kvm_pmu_clear_snapshot_area(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> > +
> > + if (kvpmu->sdata) {
> > + memset(kvpmu->sdata, 0, snapshot_area_size);
> > + if (kvpmu->snapshot_addr != INVALID_GPA)
> > + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr,
> > + kvpmu->sdata, snapshot_area_size);
>
> We should free the "kvpmu->sdata" and set it to NULL. This way subsequent
> re-enabling of snapshot won't leak the kernel memory.
>
Done.
> > + }
> > + kvpmu->snapshot_addr = INVALID_GPA;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_setup_snapshot(struct kvm_vcpu *vcpu, unsigned long saddr_low,
> > + unsigned long saddr_high, unsigned long flags,
> > + struct kvm_vcpu_sbi_return *retdata)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + int snapshot_area_size = sizeof(struct riscv_pmu_snapshot_data);
> > + int sbiret = 0;
> > + gpa_t saddr;
> > + unsigned long hva;
> > + bool writable;
> > +
> > + if (!kvpmu) {
> > + sbiret = SBI_ERR_INVALID_PARAM;
> > + goto out;
> > + }
> > +
> > + if (saddr_low == -1 && saddr_high == -1) {
> > + kvm_pmu_clear_snapshot_area(vcpu);
> > + return 0;
> > + }
> > +
> > + saddr = saddr_low;
> > +
> > + if (saddr_high != 0) {
> > +#ifdef CONFIG_32BIT
> > > + saddr |= ((gpa_t)saddr_high << 32);
> > +#else
> > + sbiret = SBI_ERR_INVALID_ADDRESS;
> > + goto out;
> > +#endif
> > + }
> > +
> > + if (kvm_is_error_gpa(vcpu->kvm, saddr)) {
> > + sbiret = SBI_ERR_INVALID_PARAM;
> > + goto out;
> > + }
> > +
> > + hva = kvm_vcpu_gfn_to_hva_prot(vcpu, saddr >> PAGE_SHIFT, &writable);
> > + if (kvm_is_error_hva(hva) || !writable) {
> > + sbiret = SBI_ERR_INVALID_ADDRESS;
> > + goto out;
> > + }
> > +
> > + kvpmu->snapshot_addr = saddr;
> > + kvpmu->sdata = kzalloc(snapshot_area_size, GFP_ATOMIC);
> > + if (!kvpmu->sdata)
> > + return -ENOMEM;
> > +
> > + if (kvm_vcpu_write_guest(vcpu, saddr, kvpmu->sdata, snapshot_area_size)) {
> > + kfree(kvpmu->sdata);
> > + kvpmu->snapshot_addr = INVALID_GPA;
> > + sbiret = SBI_ERR_FAILURE;
> > + }
>
> Newline here.
>
Done.
> > +out:
> > + retdata->err_val = sbiret;
> > +
> > + return 0;
> > +}
> > +
> > int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu,
> > struct kvm_vcpu_sbi_return *retdata)
> > {
> > @@ -343,8 +416,10 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > int i, pmc_index, sbiret = 0;
> > struct kvm_pmc *pmc;
> > int fevent_code;
> > + bool bSnapshot = flags & SBI_PMU_START_FLAG_INIT_FROM_SNAPSHOT;
> >
> > - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> > + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
> > + (bSnapshot && kvpmu->snapshot_addr == INVALID_GPA)) {
>
> We have a different error code when shared memory is not available.
>
Fixed.
> > sbiret = SBI_ERR_INVALID_PARAM;
> > goto out;
> > }
> > @@ -355,8 +430,14 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > if (!test_bit(pmc_index, kvpmu->pmc_in_use))
> > continue;
> > pmc = &kvpmu->pmc[pmc_index];
> > - if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> > + if (flags & SBI_PMU_START_FLAG_SET_INIT_VALUE) {
> > pmc->counter_val = ival;
> > + } else if (bSnapshot) {
> > + kvm_vcpu_read_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > + sizeof(struct riscv_pmu_snapshot_data));
> > + pmc->counter_val = kvpmu->sdata->ctr_values[pmc_index];
> > + }
> > +
> > if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
> > fevent_code = get_event_code(pmc->event_idx);
> > if (fevent_code >= SBI_PMU_FW_MAX) {
> > @@ -400,8 +481,10 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > u64 enabled, running;
> > struct kvm_pmc *pmc;
> > int fevent_code;
> > + bool bSnapshot = flags & SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT;
> >
> > - if (kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) {
> > + if ((kvm_pmu_validate_counter_mask(kvpmu, ctr_base, ctr_mask) < 0) ||
> > + (bSnapshot && (kvpmu->snapshot_addr == INVALID_GPA))) {
>
> Same as above.
>
> > sbiret = SBI_ERR_INVALID_PARAM;
> > goto out;
> > }
> > @@ -423,27 +506,52 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > sbiret = SBI_ERR_ALREADY_STOPPED;
> >
> > kvpmu->fw_event[fevent_code].started = false;
> > + /* No need to increment the value as it is absolute for firmware events */
> > + pmc->counter_val = kvpmu->fw_event[fevent_code].value;
>
> This change does not relate to the current patch.
>
Actually it does. We need to assign pmc->counter_val here because the
shared memory needs to be updated with the actual counter value.
However, we should do it only if the snapshot is enabled. Otherwise, it
will be updated in pmu_ctr_read anyway. I have fixed that and moved this
into the if condition with bSnapshot below.
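
As a rough standalone sketch of that behaviour (userspace C, not the KVM code): firmware counters carry an absolute value, so it is assigned rather than accumulated, and only when a snapshot is going to be published; hardware counters keep accumulating what perf counted. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ctr_type_sketch { CTR_TYPE_HW_SKETCH, CTR_TYPE_FW_SKETCH };

/* Hypothetical, reduced stand-in for struct kvm_pmc. */
struct pmc_sketch {
	enum ctr_type_sketch type;
	uint64_t counter_val;	/* value that would be written to the snapshot */
};

/*
 * fw_value: absolute value tracked for a firmware event.
 * hw_delta: events counted by the perf event since the counter was started.
 */
static void stop_update_counter_val(struct pmc_sketch *pmc, bool snapshot,
				    uint64_t fw_value, uint64_t hw_delta)
{
	assert(pmc);

	if (pmc->type == CTR_TYPE_FW_SKETCH) {
		/* Absolute value: assign, and only if it will be published. */
		if (snapshot)
			pmc->counter_val = fw_value;
	} else {
		/* Hardware counters accumulate what perf counted. */
		pmc->counter_val += hw_delta;
	}
}
```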
> > } else if (pmc->perf_event) {
> > if (pmc->started) {
> > /* Stop counting the counter */
> > perf_event_disable(pmc->perf_event);
> > - pmc->started = false;
>
> Same as above.
>
> > } else {
> > sbiret = SBI_ERR_ALREADY_STOPPED;
> > }
> >
> > - if (flags & SBI_PMU_STOP_FLAG_RESET) {
> > - /* Relase the counter if this is a reset request */
> > + /* Stop counting the counter */
> > + perf_event_disable(pmc->perf_event);
> > +
This is not needed as we would have already stopped when started = true.
> > + /* We only update if stopped is already called. The caller may stop/reset
> > + * the event in two steps.
> > + */
>
> Use a double winged style multiline comment block.
>
Fixed.
> > + if (pmc->started) {
> > pmc->counter_val += perf_event_read_value(pmc->perf_event,
> > &enabled, &running);
> > + pmc->started = false;
> > + }
> > +
> > + if (flags & SBI_PMU_STOP_FLAG_RESET) {
>
> No need for braces here.
>
> > + /* Relase the counter if this is a reset request */
>
> s/Relase/Release/
>
Fixed.
> > kvm_pmu_release_perf_event(pmc);
> > }
> > } else {
> > sbiret = SBI_ERR_INVALID_PARAM;
> > }
> > +
> > + if (bSnapshot && !sbiret) {
> > + //TODO: Add counter overflow support when sscofpmf support is added
>
> Use "/* */"
>
> > + kvpmu->sdata->ctr_values[i] = pmc->counter_val;
> > + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > + sizeof(struct riscv_pmu_snapshot_data));
> > + }
> > +
> > if (flags & SBI_PMU_STOP_FLAG_RESET) {
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > clear_bit(pmc_index, kvpmu->pmc_in_use);
> > + if (bSnapshot) {
> > + /* Clear the snapshot area for the upcoming deletion event */
> > + kvpmu->sdata->ctr_values[i] = 0;
> > + kvm_vcpu_write_guest(vcpu, kvpmu->snapshot_addr, kvpmu->sdata,
> > + sizeof(struct riscv_pmu_snapshot_data));
> > + }
> > }
> > }
> >
> > @@ -517,8 +625,10 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> > kvpmu->fw_event[event_code].started = true;
> > } else {
> > ret = kvm_pmu_create_perf_event(pmc, &attr, flags, eidx, evtdata);
> > - if (ret)
> > - return ret;
> > + if (ret) {
> > + sbiret = SBI_ERR_NOT_SUPPORTED;
> > + goto out;
> > + }
>
> This also looks like a change not related to the current patch.
>
Moved to a separate patch.
> > }
> >
> > set_bit(ctr_idx, kvpmu->pmc_in_use);
> > @@ -566,6 +676,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > kvpmu->num_hw_ctrs = num_hw_ctrs + 1;
> > kvpmu->num_fw_ctrs = SBI_PMU_FW_MAX;
> > memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> > + kvpmu->snapshot_addr = INVALID_GPA;
> >
> > if (kvpmu->num_hw_ctrs > RISCV_KVM_MAX_HW_CTRS) {
> > pr_warn_once("Limiting the hardware counters to 32 as specified by the ISA");
> > @@ -585,6 +696,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > pmc = &kvpmu->pmc[i];
> > pmc->idx = i;
> > pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
> > + pmc->vcpu = vcpu;
> > if (i < kvpmu->num_hw_ctrs) {
> > pmc->cinfo.type = SBI_PMU_CTR_TYPE_HW;
> > if (i < 3)
> > @@ -625,6 +737,7 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > }
> > bitmap_zero(kvpmu->pmc_in_use, RISCV_MAX_COUNTERS);
> > memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
> > + kvpmu->snapshot_addr = INVALID_GPA;
>
> You need to also free the sdata pointer.
>
Fixed. Thanks.
> > }
> >
> > void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> > index 7eca72df2cbd..77c20a61fd7d 100644
> > --- a/arch/riscv/kvm/vcpu_sbi_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> > @@ -64,6 +64,9 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> > case SBI_EXT_PMU_COUNTER_FW_READ:
> > ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
> > break;
> > + case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
> > + ret = kvm_riscv_vcpu_pmu_setup_snapshot(vcpu, cp->a0, cp->a1, cp->a2, retdata);
> > + break;
> > default:
> > retdata->err_val = SBI_ERR_NOT_SUPPORTED;
> > }
> > --
> > 2.34.1
> >
>
> Regards,
> Anup
On Sat, Dec 16, 2023 at 05:39:12PM -0800, Atish Kumar Patra wrote:
> On Thu, Dec 7, 2023 at 5:06 AM Conor Dooley <[email protected]> wrote:
> > On Mon, Dec 04, 2023 at 06:43:07PM -0800, Atish Patra wrote:
> > > +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
> > > +{
> > > + int cpu;
> >
> > > + struct cpu_hw_events *cpu_hw_evt;
> >
> > This is only used inside the scope of the for loop.
> >
>
> Do you intend to suggest using mixed declarations ? Personally, I
> prefer all the declarations upfront for readability.
> Let me know if you think that's an issue or violates coding style.
I was suggesting

	int cpu;

	for_each_possible_cpu(cpu)
		struct cpu_hw_events *cpu_hw_evt = per....()

I've been asked to do this in some subsystems I submitted code to,
and checkpatch etc do not complain about it. I don't think there is any
specific commentary in the coding style about minimising the scope of
variables however.
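
For reference, a self-contained sketch of the loop-scoped form (userspace C with hypothetical stand-ins for the per-cpu state and for for_each_possible_cpu(), not the driver code). The braces around the loop body are required, since a declaration is not a statement in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu PMU state. */
struct cpu_hw_events_sketch {
	void *snapshot_addr;
};

/* Hypothetical stand-in for for_each_possible_cpu(). */
#define for_each_cpu_sketch(cpu, nr) for ((cpu) = 0; (cpu) < (nr); (cpu)++)

/* Clears every registered snapshot address, returns how many were cleared. */
static int snapshot_free_sketch(struct cpu_hw_events_sketch *events, int nr_cpus)
{
	int cpu;
	int cleared = 0;

	assert(events != NULL);

	for_each_cpu_sketch(cpu, nr_cpus) {
		/* Scoped to the loop body, the only place it is used. */
		struct cpu_hw_events_sketch *cpu_hw_evt = &events[cpu];

		if (cpu_hw_evt->snapshot_addr) {
			cpu_hw_evt->snapshot_addr = NULL;
			cleared++;
		}
	}

	return cleared;
}
```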
> > > + /* Free up the snapshot area memory and fall back to default SBI */
> >
> > What does "fall back to the default SBI mean"? SBI is an interface so I
> > don't understand what it means in this context. Secondly,
>
> In the absence of SBI PMU snapshot, the driver would try to read the
> counters directly and end up trapping.
> Also, it would not use the SBI PMU snapshot flags in the SBI start/stop calls.
> Snapshot is an alternative mechanism to minimize the traps. I just
> wanted to highlight that.
>
> How about this ?
> "Free up the snapshot area memory and fall back to default SBI PMU
> calls without snapshot */
Yeah, that's fine (modulo the */ placement). The original comment just
seemed truncated.
> > > + if (ret.error) {
> > > + if (ret.error != SBI_ERR_NOT_SUPPORTED)
> > > + pr_warn("%s: pmu snapshot setup failed with error %ld\n", __func__,
> > > + ret.error);
> >
> > Why is the function relevant here? Is the error message in-and-of-itself
> > not sufficient here? Where else would one be setting up the snapshots
> > other than the setup function?
> >
>
> The SBI implementation (i.e OpenSBI) may or may not provide a snapshot
> feature. This error message indicates
> that SBI implementation supports PMU snapshot but setup failed for
> some other error.
I don't see what this has to do with printing out the function. This is
a unique error message, and there is no other place where the setup is
done AFAICT.
> > > + /* Snapshot is taken relative to the counter idx base. Apply a fixup. */
> > > + if (hwc->idx > 0) {
> > > + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
> > > + sdata->ctr_values[0] = 0;
> >
> > Why is this being zeroed in this manner? Why is zeroing it not required
> > if hwc->idx == 0? You've got a comment there that could probably do with
> > elaboration.
> >
>
> hwc->idx is the counter_idx_base here. If it is zero, that means the
> counter0 value is updated
> in the shared memory. However, if the base > 0, we need to update the
> relative counter value
> from the shared memory. Does it make sense ?
Please expand on the comment so that it contains this information.
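
Something along these lines, as a standalone illustration of the fixup (userspace C, not the driver code). The struct is a reduced stand-in for the snapshot area and the array size is an assumption of the sketch. It also assumes the SBI call was made with the counter idx base equal to the counter's own index, so the value lands at relative index 0.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COUNTERS_SKETCH 64 /* assumed array size for this sketch */

/* Reduced, hypothetical stand-in for the snapshot shared memory. */
struct snapshot_data_sketch {
	uint64_t ctr_overflow_mask;
	uint64_t ctr_values[MAX_COUNTERS_SKETCH];
};

/*
 * The snapshot is taken relative to the counter idx base. With the base
 * equal to idx, the value is stored at relative index 0. For idx == 0 that
 * is already the right slot. For idx > 0, move the value to the slot of
 * the absolute index and clear slot 0 so that a stale value is not read
 * back as counter 0 later.
 */
static void snapshot_fixup_sketch(struct snapshot_data_sketch *sdata,
				  unsigned int idx)
{
	assert(sdata);
	assert(idx < MAX_COUNTERS_SKETCH);

	if (idx > 0) {
		sdata->ctr_values[idx] = sdata->ctr_values[0];
		sdata->ctr_values[0] = 0;
	}
}
```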
> > > + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
> > > + if (!ret) {
> > > + pr_info("SBI PMU snapshot is available to optimize the PMU traps\n");
> >
> > Why the verbose message? Could we standardise on one wording for the SBI
> > function probing stuff? Most users seem to be "SBI FOO extension detected".
> > Only IPI has additional wording and PMU differs slightly.
>
> The additional information is for users to understand that PMU functionality
> uses fewer traps on this system.
> We can just resort to the message below and expect users to read up on the
> purpose of the snapshot from the spec.
> "SBI PMU snapshot available"
What I was asking for was alignment with the majority of other SBI
extensions that use the format I mentioned above.
>
> >
> > > + /* We enable it once here for the boot cpu. If snapshot shmem fails during
> >
> > Again, comment style here. What does "snapshot shmem" mean? I think
> > there's a missing action here. Registration? Allocation?
> >
>
> Fixed it. It is supposed to be "snapshot shmem setup"
>
> > > + * cpu hotplug on, it should bail out.
> >
> > Should or will? What action does "bail out" correspond to?
> >
>
> Bail out of the cpu hotplug process. We don't support heterogeneous pmus
> for snapshot.
> If the SBI implementation returns success for SBI_EXT_PMU_SNAPSHOT_SET_SHMEM
> on the boot cpu but fails for other cpus while bringing them up, it is
> problematic to handle that.
"bail out" should be replaced by a more technical explanation of what is
going to happen. "should" is a weird word to use, either the cpuhotplug
code does or does not deal with this case, and since that code is also
in the kernel, this patchset should ensure that it does handle the case,
no? If the kernel does handle it "should" should be replaced with more
definitive wording.
Thanks,
Conor.
On Sun, Dec 17, 2023 at 4:10 AM Conor Dooley <[email protected]> wrote:
>
> On Sat, Dec 16, 2023 at 05:39:12PM -0800, Atish Kumar Patra wrote:
> > On Thu, Dec 7, 2023 at 5:06 AM Conor Dooley <[email protected]> wrote:
> > > On Mon, Dec 04, 2023 at 06:43:07PM -0800, Atish Patra wrote:
>
> > > > +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu)
> > > > +{
> > > > + int cpu;
> > >
> > > > + struct cpu_hw_events *cpu_hw_evt;
> > >
> > > This is only used inside the scope of the for loop.
> > >
> >
> > Do you intend to suggest using mixed declarations ? Personally, I
> > prefer all the declarations upfront for readability.
> > Let me know if you think that's an issue or violates coding style.
>
> I was suggesting
>
> int cpu;
>
> for_each_possible_cpu(cpu)
> struct cpu_hw_events *cpu_hw_evt = per....()
>
That's what I meant by mixed declarations.
> I've been asked to do this in some subsystems I submitted code to,
> and checkpatch etc do not complain about it. I don't think there is any
> specific commentary in the coding style about minimising the scope of
> variables however.
>
I didn't know of any subsystem that prefers mixed declarations over upfront ones.
> > > > + /* Free up the snapshot area memory and fall back to default SBI */
> > >
> > > What does "fall back to the default SBI mean"? SBI is an interface so I
> > > don't understand what it means in this context. Secondly,
> >
> > In the absence of SBI PMU snapshot, the driver would try to read the
> > counters directly and end up trapping.
> > Also, it would not use the SBI PMU snapshot flags in the SBI start/stop calls.
> > Snapshot is an alternative mechanism to minimize the traps. I just
> > wanted to highlight that.
> >
> > How about this ?
> > "Free up the snapshot area memory and fall back to default SBI PMU
> > calls without snapshot */
>
> Yeah, that's fine (modulo the */ placement). The original comment just
> seemed truncated.
>
ok.
> > > > + if (ret.error) {
> > > > + if (ret.error != SBI_ERR_NOT_SUPPORTED)
> > > > + pr_warn("%s: pmu snapshot setup failed with error %ld\n", __func__,
> > > > + ret.error);
> > >
> > > Why is the function relevant here? Is the error message in-and-of-itself
> > > not sufficient here? Where else would one be setting up the snapshots
> > > other than the setup function?
> > >
> >
> > The SBI implementation (i.e OpenSBI) may or may not provide a snapshot
> > feature. This error message indicates
> > that SBI implementation supports PMU snapshot but setup failed for
> > some other error.
>
> I don't see what this has to do with printing out the function. This is
> a unique error message, and there is no other place where the setup is
> done AFAICT.
>
Ahh you were concerned about the function name in the log. I
misunderstood it at first.
The function name is not relevant and has already been removed.
> > > > + /* Snapshot is taken relative to the counter idx base. Apply a fixup. */
> > > > + if (hwc->idx > 0) {
> > > > + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0];
> > > > + sdata->ctr_values[0] = 0;
> > >
> > > Why is this being zeroed in this manner? Why is zeroing it not required
> > > if hwc->idx == 0? You've got a comment there that could probably do with
> > > elaboration.
> > >
> >
> > hwc->idx is the counter_idx_base here. If it is zero, that means the
> > counter0 value is updated
> > in the shared memory. However, if the base > 0, we need to update the
> > relative counter value
> > from the shared memory. Does it make sense ?
>
> Please expand on the comment so that it contains this information.
>
Sure.
> > > > + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id());
> > > > + if (!ret) {
> > > > + pr_info("SBI PMU snapshot is available to optimize the PMU traps\n");
> > >
> > > Why the verbose message? Could we standardise on one wording for the SBI
> > > function probing stuff? Most users seem to be "SBI FOO extension detected".
> > > Only IPI has additional wording and PMU differs slightly.
> >
> > The additional information is for users to understand that PMU functionality
> > uses fewer traps on this system.
> > We can just resort to the message below and expect users to read up on the
> > purpose of the snapshot from the spec.
> > "SBI PMU snapshot available"
>
> What I was asking for was alignment with the majority of other SBI
> extensions that use the format I mentioned above.
>
PMU snapshot is a function, and my previous suggestion aligns with the PMU
extension availability log message.
I can change it to "SBI PMU snapshot detected".
> >
> > >
> > > > + /* We enable it once here for the boot cpu. If snapshot shmem fails during
> > >
> > > Again, comment style here. What does "snapshot shmem" mean? I think
> > > there's a missing action here. Registration? Allocation?
> > >
> >
> > Fixed it. It is supposed to be "snapshot shmem setup"
> >
> > > > + * cpu hotplug on, it should bail out.
> > >
> > > Should or will? What action does "bail out" correspond to?
> > >
> >
> > Bail out of the cpu hotplug process. We don't support heterogeneous pmus
> > for snapshot.
> > If the SBI implementation returns success for SBI_EXT_PMU_SNAPSHOT_SET_SHMEM
> > on the boot cpu but fails for other cpus while bringing them up, it is
> > problematic to handle that.
>
> "bail out" should be replaced by a more technical explanation of what is
> going to happen. "should" is a weird word to use, either the cpuhotplug
> code does or does not deal with this case, and since that code is also
> in the kernel, this patchset should ensure that it does handle the case,
> no? If the kernel does handle it "should" should be replaced with more
> definitive wording.
>
ok. I will improve the comment to explain a bit more.
> Thanks,
> Conor.