2022-07-18 17:10:06

by Atish Kumar Patra

Subject: [RFC 0/9] KVM perf support

This series extends perf support for KVM. The KVM implementation relies
on the SBI PMU extension and trap & emulation of the hpmcounter CSRs.
The KVM implementation exposes virtual counters to the guest and internally
manages them using kernel perf counters.

This series doesn't support counter overflow yet, as the Sscofpmf extension
doesn't allow a trap & emulation mechanism for the scountovf CSR. The required
changes to allow that are still under discussion. Supporting the overflow
interrupt also requires AIA support, which is not frozen either.

This series can be found at github[1] as well. It depends on Anup's CSR
emulation framework series[2].

perf stat works in KVM guests with this series.

Here is an example of running perf stat in a guest running on KVM:
===========================================================================
/ # /host/apps/perf stat -e instructions -e cycles -e r8000000000000005 \
> -e r8000000000000006 -e r8000000000000007 -e r8000000000000008 \
> -e r800000000000000a perf bench sched messaging -g 5 -l 15
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 5 groups == 200 processes run

Total time: 5.210 [sec]

Performance counter stats for 'perf bench sched messaging -g 5 -l 15':

37209585734 instructions # 1.00 insn per cycle
37177435570 cycles
2740 r8000000000000005
3727 r8000000000000006
3655 r8000000000000007
10 r8000000000000008
0 r800000000000000a

5.863014800 seconds time elapsed

0.569373000 seconds user
10.771533000 seconds sys

[1] https://github.com/atishp04/linux/tree/kvm_perf_rfc
[2] https://lkml.org/lkml/2022/6/15/389

Atish Patra (9):
RISC-V: Define a helper function to probe number of hardware counters
RISC-V: Define a helper function to return counter width
RISC-V: KVM: Define a probe function for SBI extension data structures
RISC-V: KVM: Improve privilege mode filtering for perf
RISC-V: KVM: Add skeleton support for perf
RISC-V: KVM: Add SBI PMU extension support
RISC-V: KVM: Implement trap & emulate for hpmcounters
RISC-V: KVM: Implement perf support
RISC-V: KVM: Implement firmware events

arch/riscv/include/asm/kvm_host.h | 3 +
arch/riscv/include/asm/kvm_vcpu_pmu.h | 102 +++++
arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
arch/riscv/include/asm/sbi.h | 2 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/main.c | 3 +-
arch/riscv/kvm/tlb.c | 6 +-
arch/riscv/kvm/vcpu.c | 5 +
arch/riscv/kvm/vcpu_insn.c | 4 +-
arch/riscv/kvm/vcpu_pmu.c | 517 ++++++++++++++++++++++++++
arch/riscv/kvm/vcpu_sbi.c | 11 +
arch/riscv/kvm/vcpu_sbi_base.c | 13 +-
arch/riscv/kvm/vcpu_sbi_pmu.c | 81 ++++
arch/riscv/kvm/vcpu_sbi_replace.c | 7 +
drivers/perf/riscv_pmu_sbi.c | 75 +++-
include/linux/perf/riscv_pmu.h | 7 +
16 files changed, 823 insertions(+), 17 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_pmu.h
create mode 100644 arch/riscv/kvm/vcpu_pmu.c
create mode 100644 arch/riscv/kvm/vcpu_sbi_pmu.c

--
2.25.1


2022-07-18 17:10:19

by Atish Kumar Patra

Subject: [RFC 1/9] RISC-V: Define a helper function to probe number of hardware counters

The KVM module needs to know how many hardware counters the platform
supports. Otherwise, it will not be able to advertise an optimal number
of virtual counters to the guest.

Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 23 +++++++++++++++++------
include/linux/perf/riscv_pmu.h | 4 ++++
2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 24124546844c..1723af68ffa1 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -27,6 +27,7 @@
*/
static union sbi_pmu_ctr_info *pmu_ctr_list;
static unsigned int riscv_pmu_irq;
+static struct riscv_pmu *rvpmu;

struct sbi_pmu_event_data {
union {
@@ -227,6 +228,12 @@ static const struct sbi_pmu_event_data pmu_cache_event_map[PERF_COUNT_HW_CACHE_M
},
};

+int riscv_pmu_sbi_get_num_hw_ctrs(void)
+{
+ return rvpmu ? rvpmu->num_hw_counters : 0;
+}
+EXPORT_SYMBOL(riscv_pmu_sbi_get_num_hw_ctrs);
+
static int pmu_sbi_ctr_get_width(int idx)
{
return pmu_ctr_list[idx].width;
@@ -443,7 +450,7 @@ static int pmu_sbi_find_num_ctrs(void)
return sbi_err_map_linux_errno(ret.error);
}

-static int pmu_sbi_get_ctrinfo(int nctr)
+static int pmu_sbi_get_ctrinfo(int nctr, int *num_hw_ctrs)
{
struct sbiret ret;
int i, num_hw_ctr = 0, num_fw_ctr = 0;
@@ -453,7 +460,7 @@ static int pmu_sbi_get_ctrinfo(int nctr)
if (!pmu_ctr_list)
return -ENOMEM;

- for (i = 0; i <= nctr; i++) {
+ for (i = 0; i < nctr; i++) {
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_GET_INFO, i, 0, 0, 0, 0, 0);
if (ret.error)
/* The logical counter ids are not expected to be contiguous */
@@ -466,6 +473,7 @@ static int pmu_sbi_get_ctrinfo(int nctr)
pmu_ctr_list[i].value = cinfo.value;
}

+ *num_hw_ctrs = num_hw_ctr;
pr_info("%d firmware and %d hardware counters\n", num_fw_ctr, num_hw_ctr);

return 0;
@@ -698,7 +706,7 @@ static int pmu_sbi_setup_irqs(struct riscv_pmu *pmu, struct platform_device *pde
static int pmu_sbi_device_probe(struct platform_device *pdev)
{
struct riscv_pmu *pmu = NULL;
- int num_counters;
+ int num_counters, num_hw_ctrs = 0;
int ret = -ENODEV;

pr_info("SBI PMU extension is available\n");
@@ -713,7 +721,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
}

/* cache all the information about counters now */
- if (pmu_sbi_get_ctrinfo(num_counters))
+ if (pmu_sbi_get_ctrinfo(num_counters, &num_hw_ctrs))
goto out_free;

ret = pmu_sbi_setup_irqs(pmu, pdev);
@@ -723,6 +731,7 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
pmu->pmu.capabilities |= PERF_PMU_CAP_NO_EXCLUDE;
}
pmu->num_counters = num_counters;
+ pmu->num_hw_counters = num_hw_ctrs;
pmu->ctr_start = pmu_sbi_ctr_start;
pmu->ctr_stop = pmu_sbi_ctr_stop;
pmu->event_map = pmu_sbi_event_map;
@@ -733,14 +742,16 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)

ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
if (ret)
- return ret;
+ goto out_free;

ret = perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
if (ret) {
cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
- return ret;
+ goto out_free;
}

+ rvpmu = pmu;
+
return 0;

out_free:
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 46f9b6fe306e..fc47167e000c 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -46,6 +46,7 @@ struct riscv_pmu {
irqreturn_t (*handle_irq)(int irq_num, void *dev);

int num_counters;
+ int num_hw_counters;
u64 (*ctr_read)(struct perf_event *event);
int (*ctr_get_idx)(struct perf_event *event);
int (*ctr_get_width)(int idx);
@@ -69,6 +70,9 @@ void riscv_pmu_legacy_skip_init(void);
static inline void riscv_pmu_legacy_skip_init(void) {};
#endif
struct riscv_pmu *riscv_pmu_alloc(void);
+#ifdef CONFIG_RISCV_PMU_SBI
+int riscv_pmu_sbi_get_num_hw_ctrs(void);
+#endif

#endif /* CONFIG_RISCV_PMU */

--
2.25.1

2022-07-18 17:10:46

by Atish Kumar Patra

Subject: [RFC 2/9] RISC-V: Define a helper function to return counter width

The virtual hardware counters need to have the same width as the
logical hardware counters for simplicity. However, there shouldn't
be any mapping between virtual hardware counters and logical hardware
counters. As we don't support heterogeneous harts or counters with
different widths as of now, the implementation relies on the counter
width of the first available programmable counter.

Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 25 +++++++++++++++++++++++++
include/linux/perf/riscv_pmu.h | 1 +
2 files changed, 26 insertions(+)

diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 1723af68ffa1..5d0eef3ef136 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -250,6 +250,31 @@ static bool pmu_sbi_ctr_is_fw(int cidx)
return (info->type == SBI_PMU_CTR_TYPE_FW) ? true : false;
}

+/*
+ * Returns the counter width of a programmable counter
+ * As we don't support heterogeneous CPUs yet, it is okay to just
+ * return the counter width of the first programmable counter.
+ */
+int riscv_pmu_sbi_hpmc_width(void)
+{
+ int i;
+ union sbi_pmu_ctr_info *info;
+
+ if (!rvpmu)
+ return -EINVAL;
+
+ for (i = 0; i < rvpmu->num_counters; i++) {
+ info = &pmu_ctr_list[i];
+ if (!info)
+ continue;
+ if (info->type == SBI_PMU_CTR_TYPE_HW)
+ return info->width;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(riscv_pmu_sbi_hpmc_width);
+
static int pmu_sbi_ctr_get_idx(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index fc47167e000c..6fee211c27b5 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -72,6 +72,7 @@ static inline void riscv_pmu_legacy_skip_init(void) {};
struct riscv_pmu *riscv_pmu_alloc(void);
#ifdef CONFIG_RISCV_PMU_SBI
int riscv_pmu_sbi_get_num_hw_ctrs(void);
+int riscv_pmu_sbi_hpmc_width(void);
#endif

#endif /* CONFIG_RISCV_PMU */
--
2.25.1

2022-07-18 17:11:42

by Atish Kumar Patra

Subject: [RFC 4/9] RISC-V: KVM: Improve privilege mode filtering for perf

Currently, the host driver doesn't have any method to identify whether the
requested perf event is from KVM or bare metal. As KVM runs in HS
mode, there is no separate hypervisor privilege mode to distinguish
between the attributes for guest and host.

Improve the privilege mode filtering by using the event-specific
config1 field.
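As a standalone sketch of the flag selection above (the SBI_PMU_CFG_FLAG_*
values below are taken from the SBI PMU extension spec and reproduced here
only to make the example self-contained; struct attr_sketch is an
illustrative stand-in for the relevant perf_event_attr bits, not a kernel
type):

```c
#include <assert.h>
#include <stdbool.h>

/* Counter-config-match inhibit flags, per the SBI PMU extension spec
 * (stand-ins for the kernel's definitions in asm/sbi.h). */
#define SBI_PMU_CFG_FLAG_SET_VUINH (1UL << 3) /* inhibit VU-mode counting */
#define SBI_PMU_CFG_FLAG_SET_VSINH (1UL << 4) /* inhibit VS-mode counting */
#define SBI_PMU_CFG_FLAG_SET_UINH  (1UL << 5) /* inhibit U-mode counting */
#define SBI_PMU_CFG_FLAG_SET_SINH  (1UL << 6) /* inhibit S/HS-mode counting */

/* Illustrative subset of perf_event_attr: config1 guest marker plus
 * the exclude_* filter bits. */
struct attr_sketch {
	bool guest_events; /* config1 & RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS */
	bool exclude_kernel, exclude_user;
	bool exclude_hv, exclude_host, exclude_guest;
};

static unsigned long filter_flags(const struct attr_sketch *a)
{
	unsigned long cflags = 0;

	/* For guest events, "kernel" and "user" mean the guest's VS/VU
	 * modes; for host events they mean HS/U modes. */
	if (a->exclude_kernel)
		cflags |= a->guest_events ? SBI_PMU_CFG_FLAG_SET_VSINH
					  : SBI_PMU_CFG_FLAG_SET_SINH;
	if (a->exclude_user)
		cflags |= a->guest_events ? SBI_PMU_CFG_FLAG_SET_VUINH
					  : SBI_PMU_CFG_FLAG_SET_UINH;
	/* The hypervisor itself runs in HS mode. */
	if (a->guest_events && a->exclude_hv)
		cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
	if (a->exclude_host)
		cflags |= SBI_PMU_CFG_FLAG_SET_UINH | SBI_PMU_CFG_FLAG_SET_SINH;
	if (a->exclude_guest)
		cflags |= SBI_PMU_CFG_FLAG_SET_VSINH | SBI_PMU_CFG_FLAG_SET_VUINH;

	return cflags;
}
```

The point of the config1 marker is visible here: the same exclude_kernel
bit maps to VSINH for a KVM-originated event but to SINH for a bare-metal
one.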

Signed-off-by: Atish Patra <[email protected]>
---
drivers/perf/riscv_pmu_sbi.c | 27 ++++++++++++++++++++++-----
include/linux/perf/riscv_pmu.h | 2 ++
2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 5d0eef3ef136..34f9fcc221a8 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -275,6 +275,27 @@ int riscv_pmu_sbi_hpmc_width(void)
}
EXPORT_SYMBOL(riscv_pmu_sbi_hpmc_width);

+static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
+{
+ unsigned long cflags = 0;
+ bool guest_events = false;
+
+ if (event->attr.config1 & RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS)
+ guest_events = true;
+ if (event->attr.exclude_kernel)
+ cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VSINH : SBI_PMU_CFG_FLAG_SET_SINH;
+ if (event->attr.exclude_user)
+ cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VUINH : SBI_PMU_CFG_FLAG_SET_UINH;
+ if (guest_events && event->attr.exclude_hv)
+ cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
+ if (event->attr.exclude_host)
+ cflags |= SBI_PMU_CFG_FLAG_SET_UINH | SBI_PMU_CFG_FLAG_SET_SINH;
+ if (event->attr.exclude_guest)
+ cflags |= SBI_PMU_CFG_FLAG_SET_VSINH | SBI_PMU_CFG_FLAG_SET_VUINH;
+
+ return cflags;
+}
+
static int pmu_sbi_ctr_get_idx(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -286,11 +307,7 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
uint64_t cmask = GENMASK_ULL(rvpmu->num_counters - 1, 0);
unsigned long cflags = 0;

- if (event->attr.exclude_kernel)
- cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
- if (event->attr.exclude_user)
- cflags |= SBI_PMU_CFG_FLAG_SET_UINH;
-
+ cflags = pmu_sbi_get_filter_flags(event);
/* retrieve the available counter index */
#if defined(CONFIG_32BIT)
ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase, cmask,
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 6fee211c27b5..825b95253bc5 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -26,6 +26,8 @@

#define RISCV_PMU_STOP_FLAG_RESET 1

+#define RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS 0x1
+
struct cpu_hw_events {
/* currently enabled events */
int n_events;
--
2.25.1

2022-07-18 17:12:34

by Atish Kumar Patra

Subject: [RFC 9/9] RISC-V: KVM: Implement firmware events

The SBI PMU extension defines a set of firmware events which can provide
useful information to guests about the number of SBI calls. As the
hypervisor implements the SBI PMU extension, these firmware events
correspond to ecall invocations between VS and HS mode. All other firmware
events will always report zero if monitored, as KVM doesn't implement them.
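For reference, a firmware event is selected through the same 20-bit SBI PMU
event index as any other event: a 4-bit type above a 16-bit code, with
firmware events using type 0xf. The sketch below mirrors the
get_event_type/get_event_code macros added later in this series (mask
values per the SBI PMU spec; helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Event index layout per the SBI PMU extension: bits [19:16] hold the
 * event type, bits [15:0] the event code. */
#define SBI_PMU_EVENT_IDX_TYPE_MASK 0xf0000UL
#define SBI_PMU_EVENT_IDX_CODE_MASK 0x0ffffUL
#define SBI_PMU_EVENT_TYPE_FW       0xfUL

/* Build the event index a guest would pass for a firmware event. */
static inline unsigned long fw_event_idx(uint16_t code)
{
	return (SBI_PMU_EVENT_TYPE_FW << 16) | code;
}

static inline unsigned long event_type(unsigned long eidx)
{
	return (eidx & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16;
}

static inline unsigned long event_code(unsigned long eidx)
{
	return eidx & SBI_PMU_EVENT_IDX_CODE_MASK;
}
```

So, for example, SBI_PMU_FW_IPI_SENT (code 6) is requested as event index
0xf0006, and KVM uses the decoded code to index its per-vcpu fw_event array.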

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +++++
arch/riscv/include/asm/sbi.h | 2 +-
arch/riscv/kvm/tlb.c | 6 +-
arch/riscv/kvm/vcpu_pmu.c | 90 +++++++++++++++++++++++----
arch/riscv/kvm/vcpu_sbi_replace.c | 7 +++
5 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 5410236b62a8..d68b17ea796b 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -15,6 +15,14 @@
#ifdef CONFIG_RISCV_PMU_SBI
#define RISCV_KVM_MAX_FW_CTRS 32

+struct kvm_fw_event {
+ /* Current value of the event */
+ unsigned long value;
+
+ /* Event monitoring status */
+ bool started;
+};
+
/* Per virtual pmu counter data */
struct kvm_pmc {
u8 idx;
@@ -22,11 +30,14 @@ struct kvm_pmc {
struct perf_event *perf_event;
uint64_t counter_val;
union sbi_pmu_ctr_info cinfo;
+ /* Monitoring event ID */
+ unsigned long event_idx;
};

/* PMU data structure per vcpu */
struct kvm_pmu {
struct kvm_pmc pmc[RISCV_MAX_COUNTERS];
+ struct kvm_fw_event fw_event[RISCV_KVM_MAX_FW_CTRS];
/* Number of the virtual firmware counters available */
int num_fw_ctrs;
/* Number of the virtual hardware counters available */
@@ -48,6 +59,7 @@ struct kvm_pmu {
{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
#endif

+int kvm_riscv_vcpu_pmu_incr_fw(struct kvm_vcpu *vcpu, unsigned long fid);
int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
unsigned long *val, unsigned long new_val,
unsigned long wr_mask);
@@ -75,6 +87,10 @@ struct kvm_pmu {
#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
{ .base = 0, .count = 0, .func = NULL },

+static inline int kvm_riscv_vcpu_pmu_incr_fw(struct kvm_vcpu *vcpu, unsigned long fid)
+{
+ return 0;
+}

static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
{
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 2a0ef738695e..a192a95a34eb 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -171,7 +171,7 @@ enum sbi_pmu_fw_generic_events_t {
SBI_PMU_FW_IPI_SENT = 6,
SBI_PMU_FW_IPI_RECVD = 7,
SBI_PMU_FW_FENCE_I_SENT = 8,
- SBI_PMU_FW_FENCE_I_RECVD = 9,
+ SBI_PMU_FW_FENCE_I_RCVD = 9,
SBI_PMU_FW_SFENCE_VMA_SENT = 10,
SBI_PMU_FW_SFENCE_VMA_RCVD = 11,
SBI_PMU_FW_SFENCE_VMA_ASID_SENT = 12,
diff --git a/arch/riscv/kvm/tlb.c b/arch/riscv/kvm/tlb.c
index 1a76d0b1907d..0793d39e8ff7 100644
--- a/arch/riscv/kvm/tlb.c
+++ b/arch/riscv/kvm/tlb.c
@@ -240,6 +240,7 @@ void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu)

void kvm_riscv_fence_i_process(struct kvm_vcpu *vcpu)
{
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_FENCE_I_RCVD);
local_flush_icache_all();
}

@@ -323,15 +324,18 @@ void kvm_riscv_hfence_process(struct kvm_vcpu *vcpu)
d.addr, d.size, d.order);
break;
case KVM_RISCV_HFENCE_VVMA_ASID_GVA:
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_ASID_RCVD);
kvm_riscv_local_hfence_vvma_asid_gva(
READ_ONCE(v->vmid), d.asid,
d.addr, d.size, d.order);
break;
case KVM_RISCV_HFENCE_VVMA_ASID_ALL:
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_ASID_RCVD);
kvm_riscv_local_hfence_vvma_asid_all(
READ_ONCE(v->vmid), d.asid);
break;
case KVM_RISCV_HFENCE_VVMA_GVA:
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_RCVD);
kvm_riscv_local_hfence_vvma_gva(
READ_ONCE(v->vmid),
d.addr, d.size, d.order);
@@ -382,7 +386,7 @@ void kvm_riscv_fence_i(struct kvm *kvm,
unsigned long hbase, unsigned long hmask)
{
make_xfence_request(kvm, hbase, hmask, KVM_REQ_FENCE_I,
- KVM_REQ_FENCE_I, NULL);
+ KVM_REQ_FENCE_I, NULL);
}

void kvm_riscv_hfence_gvma_vmid_gpa(struct kvm *kvm,
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 278c261efad3..f451d7ac2608 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -168,21 +168,39 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
return pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
}

+int kvm_riscv_vcpu_pmu_incr_fw(struct kvm_vcpu *vcpu, unsigned long fid)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_fw_event *fevent;
+
+ if (!kvpmu || fid >= SBI_PMU_FW_MAX)
+ return -EINVAL;
+
+ fevent = &kvpmu->fw_event[fid];
+ if (fevent->started)
+ fevent->value++;
+
+ return 0;
+}
+
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *out_val)
{
struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
u64 enabled, running;
+ int fevent_code;

if (!kvpmu)
return -EINVAL;

pmc = &kvpmu->pmc[cidx];
- if (!pmc->perf_event)
- return -EINVAL;

- pmc->counter_val += perf_event_read_value(pmc->perf_event, &enabled, &running);
+ if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
+ fevent_code = get_event_code(pmc->event_idx);
+ pmc->counter_val = kvpmu->fw_event[fevent_code].value;
+ } else if (pmc->perf_event)
+ pmc->counter_val += perf_event_read_value(pmc->perf_event, &enabled, &running);
*out_val = pmc->counter_val;

return 0;
@@ -237,6 +255,7 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
int i, num_ctrs, pmc_index;
struct kvm_pmc *pmc;
+ int fevent_code;

num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
if (ctr_base + __fls(ctr_mask) >= num_ctrs)
@@ -250,7 +269,14 @@ int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
pmc = &kvpmu->pmc[pmc_index];
if (flag & SBI_PMU_START_FLAG_SET_INIT_VALUE)
pmc->counter_val = ival;
- if (pmc->perf_event) {
+ if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
+ fevent_code = get_event_code(pmc->event_idx);
+ if (fevent_code >= SBI_PMU_FW_MAX)
+ return -EINVAL;
+
+ kvpmu->fw_event[fevent_code].started = true;
+ kvpmu->fw_event[fevent_code].value = pmc->counter_val;
+ } else if (pmc->perf_event) {
perf_event_period(pmc->perf_event, pmu_get_sample_period(pmc));
perf_event_enable(pmc->perf_event);
}
@@ -266,6 +292,7 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
int i, num_ctrs, pmc_index;
u64 enabled, running;
struct kvm_pmc *pmc;
+ int fevent_code;

num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
if ((ctr_base + __fls(ctr_mask)) >= num_ctrs)
@@ -277,7 +304,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
if (!test_bit(pmc_index, kvpmu->used_pmc))
continue;
pmc = &kvpmu->pmc[pmc_index];
- if (pmc->perf_event) {
+ if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
+ fevent_code = get_event_code(pmc->event_idx);
+ if (fevent_code >= SBI_PMU_FW_MAX)
+ return -EINVAL;
+ kvpmu->fw_event[fevent_code].started = false;
+ } else if (pmc->perf_event) {
/* Stop counting the counter */
perf_event_disable(pmc->perf_event);
if (flag & SBI_PMU_STOP_FLAG_RESET) {
@@ -285,9 +317,12 @@ int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
pmc->counter_val += perf_event_read_value(pmc->perf_event,
&enabled, &running);
pmu_release_perf_event(pmc);
- clear_bit(pmc_index, kvpmu->used_pmc);
}
}
+ if (flag & SBI_PMU_STOP_FLAG_RESET) {
+ pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
+ clear_bit(pmc_index, kvpmu->used_pmc);
+ }
}

return 0;
@@ -303,14 +338,19 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
int num_ctrs, ctr_idx;
u32 etype = pmu_get_perf_event_type(eidx);
u64 config;
- struct kvm_pmc *pmc;
+ struct kvm_pmc *pmc = NULL;
+ bool is_fevent;
+ unsigned long event_code;

num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
if ((etype == PERF_TYPE_MAX) || ((ctr_base + __fls(ctr_mask)) >= num_ctrs))
return -EINVAL;

- if (pmu_is_fw_event(eidx))
+ event_code = get_event_code(eidx);
+ is_fevent = pmu_is_fw_event(eidx);
+ if (is_fevent && event_code >= SBI_PMU_FW_MAX)
return -EOPNOTSUPP;
+
/*
* SKIP_MATCH flag indicates the caller is aware of the assigned counter
* for this event. Just do a sanity check if it already marked used.
@@ -319,13 +359,23 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
if (!test_bit(ctr_base, kvpmu->used_pmc))
return -EINVAL;
ctr_idx = ctr_base;
- goto match_done;
+ if (is_fevent)
+ goto perf_event_done;
+ else
+ goto match_done;
}

ctr_idx = pmu_get_pmc_index(kvpmu, eidx, ctr_base, ctr_mask);
if (ctr_idx < 0)
return -EOPNOTSUPP;

+ /*
+ * No need to create perf events for firmware events as the firmware counter
+ * is supposed to return the measurement of VS->HS mode invocations.
+ */
+ if (is_fevent)
+ goto perf_event_done;
+
match_done:
pmc = &kvpmu->pmc[ctr_idx];
pmu_release_perf_event(pmc);
@@ -363,17 +413,26 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
return -EOPNOTSUPP;
}

- set_bit(ctr_idx, kvpmu->used_pmc);
pmc->perf_event = event;
- if (flag & SBI_PMU_CFG_FLAG_AUTO_START)
- perf_event_enable(pmc->perf_event);
+perf_event_done:
+ if (flag & SBI_PMU_CFG_FLAG_AUTO_START) {
+ if (is_fevent)
+ kvpmu->fw_event[event_code].started = true;
+ else
+ perf_event_enable(pmc->perf_event);
+ }
+ /* This should be only true for firmware events */
+ if (!pmc)
+ pmc = &kvpmu->pmc[ctr_idx];
+ pmc->event_idx = eidx;
+ set_bit(ctr_idx, kvpmu->used_pmc);

return ctr_idx;
}

int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
{
- int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
+ int i, num_hw_ctrs, num_fw_ctrs, hpm_width;
struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;

@@ -395,6 +454,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
bitmap_zero(kvpmu->used_pmc, RISCV_MAX_COUNTERS);
kvpmu->num_hw_ctrs = num_hw_ctrs;
kvpmu->num_fw_ctrs = num_fw_ctrs;
+ memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
/*
* There is no correlation between the logical hardware counter and virtual counters.
* However, we need to encode a hpmcounter CSR in the counter info field so that
@@ -409,6 +469,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc->idx = i;
pmc->counter_val = 0;
pmc->vcpu = vcpu;
+ pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
if (i < kvpmu->num_hw_ctrs) {
kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
if (i < 3)
@@ -444,7 +505,10 @@ void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
for_each_set_bit(i, kvpmu->used_pmc, RISCV_MAX_COUNTERS) {
pmc = &kvpmu->pmc[i];
pmu_release_perf_event(pmc);
+ pmc->counter_val = 0;
+ pmc->event_idx = SBI_PMU_EVENT_IDX_INVALID;
}
+ memset(&kvpmu->fw_event, 0, SBI_PMU_FW_MAX * sizeof(struct kvm_fw_event));
}

void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
diff --git a/arch/riscv/kvm/vcpu_sbi_replace.c b/arch/riscv/kvm/vcpu_sbi_replace.c
index 4c034d8a606a..614ae127e102 100644
--- a/arch/riscv/kvm/vcpu_sbi_replace.c
+++ b/arch/riscv/kvm/vcpu_sbi_replace.c
@@ -12,6 +12,7 @@
#include <asm/csr.h>
#include <asm/sbi.h>
#include <asm/kvm_vcpu_timer.h>
+#include <asm/kvm_vcpu_pmu.h>
#include <asm/kvm_vcpu_sbi.h>

static int kvm_sbi_ext_time_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
@@ -25,6 +26,7 @@ static int kvm_sbi_ext_time_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
if (cp->a6 != SBI_EXT_TIME_SET_TIMER)
return -EINVAL;

+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_SET_TIMER);
#if __riscv_xlen == 32
next_cycle = ((u64)cp->a1 << 32) | (u64)cp->a0;
#else
@@ -55,6 +57,7 @@ static int kvm_sbi_ext_ipi_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
if (cp->a6 != SBI_EXT_IPI_SEND_IPI)
return -EINVAL;

+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_IPI_SENT);
kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
if (hbase != -1UL) {
if (tmp->vcpu_id < hbase)
@@ -65,6 +68,7 @@ static int kvm_sbi_ext_ipi_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
ret = kvm_riscv_vcpu_set_interrupt(tmp, IRQ_VS_SOFT);
if (ret < 0)
break;
+ kvm_riscv_vcpu_pmu_incr_fw(tmp, SBI_PMU_FW_IPI_RECVD);
}

return ret;
@@ -89,6 +93,7 @@ static int kvm_sbi_ext_rfence_handler(struct kvm_vcpu *vcpu, struct kvm_run *run
switch (funcid) {
case SBI_EXT_RFENCE_REMOTE_FENCE_I:
kvm_riscv_fence_i(vcpu->kvm, hbase, hmask);
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_FENCE_I_SENT);
break;
case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
if (cp->a2 == 0 && cp->a3 == 0)
@@ -96,6 +101,7 @@ static int kvm_sbi_ext_rfence_handler(struct kvm_vcpu *vcpu, struct kvm_run *run
else
kvm_riscv_hfence_vvma_gva(vcpu->kvm, hbase, hmask,
cp->a2, cp->a3, PAGE_SHIFT);
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_SENT);
break;
case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
if (cp->a2 == 0 && cp->a3 == 0)
@@ -106,6 +112,7 @@ static int kvm_sbi_ext_rfence_handler(struct kvm_vcpu *vcpu, struct kvm_run *run
hbase, hmask,
cp->a2, cp->a3,
PAGE_SHIFT, cp->a4);
+ kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_ASID_SENT);
break;
case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
--
2.25.1

2022-07-18 17:12:37

by Atish Kumar Patra

Subject: [RFC 3/9] RISC-V: KVM: Define a probe function for SBI extension data structures

Currently, the probe function just checks if an SBI extension is
registered or not. However, the extension may not want to advertise
itself depending on some other condition.
An additional extension-specific probe function will allow
extensions to decide if they want to be advertised to the caller or
not. Any extension that does not require an additional dependency check
is not required to implement this function.
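The fallback behaviour described above can be sketched in isolation (struct
and helper names here are abbreviated stand-ins for this series' types, not
the exact kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Abbreviated stand-in for struct kvm_vcpu_sbi_extension: only the
 * optional probe() callback matters for this sketch. */
struct sbi_ext_sketch {
	unsigned long (*probe)(unsigned long extid);
};

/* Value returned to the guest for SBI_EXT_BASE_PROBE_EXT:
 *   - extension not registered          -> 0
 *   - registered, probe() implemented   -> whatever probe() decides
 *   - registered, no probe()            -> 1 (unconditionally available)
 */
static unsigned long ext_probe_value(const struct sbi_ext_sketch *ext,
				     unsigned long extid)
{
	if (!ext)
		return 0;
	if (ext->probe)
		return ext->probe(extid);
	return 1;
}

/* Example probe: an extension hiding itself because a dependency
 * (e.g. a host PMU) is missing. */
static unsigned long hidden_probe(unsigned long extid)
{
	(void)extid;
	return 0;
}
```

This is what lets the PMU extension later report itself as unavailable
when the host has no usable PMU, even though its handler is registered.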

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +++
arch/riscv/kvm/vcpu_sbi_base.c | 13 +++++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index 83d6d4d2b1df..5853a1ef71ea 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -25,6 +25,9 @@ struct kvm_vcpu_sbi_extension {
int (*handler)(struct kvm_vcpu *vcpu, struct kvm_run *run,
unsigned long *out_val, struct kvm_cpu_trap *utrap,
bool *exit);
+
+ /* Extension specific probe function */
+ unsigned long (*probe)(unsigned long extid);
};

void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run);
diff --git a/arch/riscv/kvm/vcpu_sbi_base.c b/arch/riscv/kvm/vcpu_sbi_base.c
index 48f431091cdb..14be1a819588 100644
--- a/arch/riscv/kvm/vcpu_sbi_base.c
+++ b/arch/riscv/kvm/vcpu_sbi_base.c
@@ -22,6 +22,7 @@ static int kvm_sbi_ext_base_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
int ret = 0;
struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
struct sbiret ecall_ret;
+ const struct kvm_vcpu_sbi_extension *sbi_ext;

switch (cp->a6) {
case SBI_EXT_BASE_GET_SPEC_VERSION:
@@ -46,8 +47,16 @@ static int kvm_sbi_ext_base_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
*/
kvm_riscv_vcpu_sbi_forward(vcpu, run);
*exit = true;
- } else
- *out_val = kvm_vcpu_sbi_find_ext(cp->a0) ? 1 : 0;
+ } else {
+ sbi_ext = kvm_vcpu_sbi_find_ext(cp->a0);
+ if (sbi_ext) {
+ if (sbi_ext->probe)
+ *out_val = sbi_ext->probe(cp->a0);
+ else
+ *out_val = 1;
+ } else
+ *out_val = 0;
+ }
break;
case SBI_EXT_BASE_GET_MVENDORID:
case SBI_EXT_BASE_GET_MARCHID:
--
2.25.1

2022-07-18 17:13:27

by Atish Kumar Patra

Subject: [RFC 8/9] RISC-V: KVM: Implement perf support

The RISC-V SBI PMU & Sscofpmf ISA extensions allow supporting perf in
the virtualization environment as well. The KVM implementation
relies on the SBI PMU extension for most of the work, while trapping
& emulating the CSR reads for counter access.

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_pmu.c | 318 ++++++++++++++++++++++++++++++++++++--
1 file changed, 301 insertions(+), 17 deletions(-)

diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 5434051f495d..278c261efad3 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -11,9 +11,163 @@
#include <linux/kvm_host.h>
#include <linux/perf/riscv_pmu.h>
#include <asm/csr.h>
+#include <asm/bitops.h>
#include <asm/kvm_vcpu_pmu.h>
#include <linux/kvm_host.h>

+#define get_event_type(x) ((x & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16)
+#define get_event_code(x) (x & SBI_PMU_EVENT_IDX_CODE_MASK)
+
+static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
+{
+ u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);
+ u64 sample_period;
+
+ if (!pmc->counter_val)
+ sample_period = counter_val_mask;
+ else
+ sample_period = pmc->counter_val & counter_val_mask;
+
+ return sample_period;
+}
+
+static u32 pmu_get_perf_event_type(unsigned long eidx)
+{
+ enum sbi_pmu_event_type etype = get_event_type(eidx);
+ u32 type;
+
+ if (etype == SBI_PMU_EVENT_TYPE_HW)
+ type = PERF_TYPE_HARDWARE;
+ else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
+ type = PERF_TYPE_HW_CACHE;
+ else if (etype == SBI_PMU_EVENT_TYPE_RAW || etype == SBI_PMU_EVENT_TYPE_FW)
+ type = PERF_TYPE_RAW;
+ else
+ type = PERF_TYPE_MAX;
+
+ return type;
+}
+
+static inline bool pmu_is_fw_event(unsigned long eidx)
+{
+ enum sbi_pmu_event_type etype = get_event_type(eidx);
+
+ return (etype == SBI_PMU_EVENT_TYPE_FW) ? true : false;
+}
+
+static void pmu_release_perf_event(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ perf_event_disable(pmc->perf_event);
+ perf_event_release_kernel(pmc->perf_event);
+ pmc->perf_event = NULL;
+ }
+}
+
+static u64 pmu_get_perf_event_hw_config(u32 sbi_event_code)
+{
+ /* SBI PMU HW event code is offset by 1 from perf hw event codes */
+ return (u64)sbi_event_code - 1;
+}
+
+static u64 pmu_get_perf_event_cache_config(u32 sbi_event_code)
+{
+ u64 config = U64_MAX;
+ unsigned int cache_type, cache_op, cache_result;
+
+ /* All the cache event masks lie within 0xFF. No separate masking is necessary */
+ cache_type = (sbi_event_code & SBI_PMU_EVENT_CACHE_ID_CODE_MASK) >> 3;
+ cache_op = (sbi_event_code & SBI_PMU_EVENT_CACHE_OP_ID_CODE_MASK) >> 1;
+ cache_result = sbi_event_code & SBI_PMU_EVENT_CACHE_RESULT_ID_CODE_MASK;
+
+ if (cache_type >= PERF_COUNT_HW_CACHE_MAX ||
+ cache_op >= PERF_COUNT_HW_CACHE_OP_MAX ||
+ cache_result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
+ goto out;
+ config = cache_type | (cache_op << 8) | (cache_result << 16);
+out:
+ return config;
+}
+
+static u64 pmu_get_perf_event_config(unsigned long eidx, uint64_t edata)
+{
+ enum sbi_pmu_event_type etype = get_event_type(eidx);
+ u32 ecode = get_event_code(eidx);
+ u64 config = U64_MAX;
+
+ if (etype == SBI_PMU_EVENT_TYPE_HW)
+ config = pmu_get_perf_event_hw_config(ecode);
+ else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
+ config = pmu_get_perf_event_cache_config(ecode);
+ else if (etype == SBI_PMU_EVENT_TYPE_RAW)
+ config = edata & RISCV_PMU_RAW_EVENT_MASK;
+ else if ((etype == SBI_PMU_EVENT_TYPE_FW) && (ecode < SBI_PMU_FW_MAX))
+ config = (1ULL << 63) | ecode;
+
+ return config;
+}
+
+static int pmu_get_fixed_pmc_index(unsigned long eidx)
+{
+ u32 etype = pmu_get_perf_event_type(eidx);
+ u32 ecode = get_event_code(eidx);
+ int ctr_idx;
+
+ if (etype != SBI_PMU_EVENT_TYPE_HW)
+ return -EINVAL;
+
+ if (ecode == SBI_PMU_HW_CPU_CYCLES)
+ ctr_idx = 0;
+ else if (ecode == SBI_PMU_HW_INSTRUCTIONS)
+ ctr_idx = 2;
+ else
+ return -EINVAL;
+
+ return ctr_idx;
+}
+
+static int pmu_get_programmable_pmc_index(struct kvm_pmu *kvpmu, unsigned long eidx,
+ unsigned long cbase, unsigned long cmask)
+{
+ int ctr_idx = -1;
+ int i, pmc_idx;
+ int min, max;
+
+ if (pmu_is_fw_event(eidx)) {
+ /* Firmware counters are mapped 1:1 starting from num_hw_ctrs for simplicity */
+ min = kvpmu->num_hw_ctrs;
+ max = min + kvpmu->num_fw_ctrs;
+ } else {
+ /* First 3 counters are reserved for fixed counters */
+ min = 3;
+ max = kvpmu->num_hw_ctrs;
+ }
+
+ for_each_set_bit(i, &cmask, BITS_PER_LONG) {
+ pmc_idx = i + cbase;
+ if ((pmc_idx >= min && pmc_idx < max) &&
+ !test_bit(pmc_idx, kvpmu->used_pmc)) {
+ ctr_idx = pmc_idx;
+ break;
+ }
+ }
+
+ return ctr_idx;
+}
+
+static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
+ unsigned long cbase, unsigned long cmask)
+{
+ int ret;
+
+ /* Fixed counters need to have a fixed mapping as they have different widths */
+ ret = pmu_get_fixed_pmc_index(eidx);
+ if (ret >= 0)
+ return ret;
+
+ return pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
+}
+
int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *out_val)
{
@@ -43,7 +197,6 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,

if (!kvpmu)
return KVM_INSN_EXIT_TO_USER_SPACE;
- //TODO: Should we check if vcpu pmu is initialized or not!
if (wr_mask)
return KVM_INSN_ILLEGAL_TRAP;
cidx = csr_num - CSR_CYCLE;
@@ -81,14 +234,62 @@ int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
unsigned long ctr_mask, unsigned long flag, uint64_t ival)
{
- /* TODO */
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int i, num_ctrs, pmc_index;
+ struct kvm_pmc *pmc;
+
+ num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
+ if (ctr_base + __fls(ctr_mask) >= num_ctrs)
+ return -EINVAL;
+
+ /* Start the counters that have been configured and requested by the guest */
+ for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
+ pmc_index = i + ctr_base;
+ if (!test_bit(pmc_index, kvpmu->used_pmc))
+ continue;
+ pmc = &kvpmu->pmc[pmc_index];
+ if (flag & SBI_PMU_START_FLAG_SET_INIT_VALUE)
+ pmc->counter_val = ival;
+ if (pmc->perf_event) {
+ perf_event_period(pmc->perf_event, pmu_get_sample_period(pmc));
+ perf_event_enable(pmc->perf_event);
+ }
+ }
+
return 0;
}

int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
unsigned long ctr_mask, unsigned long flag)
{
- /* TODO */
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int i, num_ctrs, pmc_index;
+ u64 enabled, running;
+ struct kvm_pmc *pmc;
+
+ num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
+ if ((ctr_base + __fls(ctr_mask)) >= num_ctrs)
+ return -EINVAL;
+
+ /* Stop the counters that have been configured and requested by the guest */
+ for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
+ pmc_index = i + ctr_base;
+ if (!test_bit(pmc_index, kvpmu->used_pmc))
+ continue;
+ pmc = &kvpmu->pmc[pmc_index];
+ if (pmc->perf_event) {
+ /* Stop counting the counter */
+ perf_event_disable(pmc->perf_event);
+ if (flag & SBI_PMU_STOP_FLAG_RESET) {
+ /* Release the counter if this is a reset request */
+ pmc->counter_val += perf_event_read_value(pmc->perf_event,
+ &enabled, &running);
+ pmu_release_perf_event(pmc);
+ clear_bit(pmc_index, kvpmu->used_pmc);
+ }
+ }
+ }
+
return 0;
}

@@ -96,14 +297,85 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
unsigned long ctr_mask, unsigned long flag,
unsigned long eidx, uint64_t edata)
{
- /* TODO */
- return 0;
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event;
+ struct perf_event_attr attr;
+ int num_ctrs, ctr_idx;
+ u32 etype = pmu_get_perf_event_type(eidx);
+ u64 config;
+ struct kvm_pmc *pmc;
+
+ num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
+ if ((etype == PERF_TYPE_MAX) || ((ctr_base + __fls(ctr_mask)) >= num_ctrs))
+ return -EINVAL;
+
+ if (pmu_is_fw_event(eidx))
+ return -EOPNOTSUPP;
+ /*
+ * SKIP_MATCH flag indicates the caller is aware of the assigned counter
+ * for this event. Just sanity check that it is already marked as used.
+ */
+ if (flag & SBI_PMU_CFG_FLAG_SKIP_MATCH) {
+ if (!test_bit(ctr_base, kvpmu->used_pmc))
+ return -EINVAL;
+ ctr_idx = ctr_base;
+ goto match_done;
+ }
+
+ ctr_idx = pmu_get_pmc_index(kvpmu, eidx, ctr_base, ctr_mask);
+ if (ctr_idx < 0)
+ return -EOPNOTSUPP;
+
+match_done:
+ pmc = &kvpmu->pmc[ctr_idx];
+ pmu_release_perf_event(pmc);
+ pmc->idx = ctr_idx;
+
+ config = pmu_get_perf_event_config(eidx, edata);
+ memset(&attr, 0, sizeof(struct perf_event_attr));
+ attr.type = etype;
+ attr.size = sizeof(attr);
+ attr.pinned = true;
+
+ /*
+ * We should never reach here if the platform doesn't support the
+ * sscofpmf extension, as mode filtering won't work without it.
+ */
+ attr.exclude_host = true;
+ attr.exclude_hv = true;
+ attr.exclude_user = flag & SBI_PMU_CFG_FLAG_SET_UINH ? 1 : 0;
+ attr.exclude_kernel = flag & SBI_PMU_CFG_FLAG_SET_SINH ? 1 : 0;
+ attr.config = config;
+ attr.config1 = RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS;
+ if (flag & SBI_PMU_CFG_FLAG_CLEAR_VALUE) {
+ /* TODO: Do we really want to clear the value in the hardware counter? */
+ pmc->counter_val = 0;
+ }
+ /*
+ * Set the default sample_period for now. The guest specified value
+ * will be updated in the start call.
+ */
+ attr.sample_period = pmu_get_sample_period(pmc);
+
+ event = perf_event_create_kernel_counter(&attr, -1, current, NULL, pmc);
+ if (IS_ERR(event)) {
+ pr_err("kvm pmu event creation failed event %pe for eidx %lx\n", event, eidx);
+ return -EOPNOTSUPP;
+ }
+
+ set_bit(ctr_idx, kvpmu->used_pmc);
+ pmc->perf_event = event;
+ if (flag & SBI_PMU_CFG_FLAG_AUTO_START)
+ perf_event_enable(pmc->perf_event);
+
+ return ctr_idx;
}

int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
{
int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;

if (!kvpmu)
return -EINVAL;
@@ -120,6 +392,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
return -EINVAL;
}

+ bitmap_zero(kvpmu->used_pmc, RISCV_MAX_COUNTERS);
kvpmu->num_hw_ctrs = num_hw_ctrs;
kvpmu->num_fw_ctrs = num_fw_ctrs;
/*
@@ -132,38 +405,49 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
/* TIME CSR shouldn't be read from perf interface */
if (i == 1)
continue;
- kvpmu->pmc[i].idx = i;
- kvpmu->pmc[i].vcpu = vcpu;
+ pmc = &kvpmu->pmc[i];
+ pmc->idx = i;
+ pmc->counter_val = 0;
+ pmc->vcpu = vcpu;
if (i < kvpmu->num_hw_ctrs) {
kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
if (i < 3)
/* CY, IR counters */
- kvpmu->pmc[i].cinfo.width = 63;
+ pmc->cinfo.width = 63;
else
- kvpmu->pmc[i].cinfo.width = hpm_width;
+ pmc->cinfo.width = hpm_width;
/*
* The CSR number doesn't have any relation with the logical
* hardware counters. The CSR numbers are encoded sequentially
* to avoid maintaining a map between the virtual counter
* and CSR number.
*/
- kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
+ pmc->cinfo.csr = CSR_CYCLE + i;
} else {
- kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
- kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
+ pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
+ pmc->cinfo.width = BITS_PER_LONG - 1;
}
}

return 0;
}

-void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
+void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
{
- /* TODO */
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int i;
+
+ if (!kvpmu)
+ return;
+
+ for_each_set_bit(i, kvpmu->used_pmc, RISCV_MAX_COUNTERS) {
+ pmc = &kvpmu->pmc[i];
+ pmu_release_perf_event(pmc);
+ }
}

-void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
+void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
{
- /* TODO */
+ kvm_riscv_vcpu_pmu_deinit(vcpu);
}
-
--
2.25.1

2022-07-18 17:14:16

by Atish Kumar Patra

Subject: [RFC 7/9] RISC-V: KVM: Implement trap & emulate for hpmcounters

As the KVM guests only see the virtual PMU counters, all hpmcounter
accesses should trap, and KVM emulates the read access on behalf of the
guests.

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +++++++++
arch/riscv/kvm/vcpu_insn.c | 1 +
arch/riscv/kvm/vcpu_pmu.c | 47 +++++++++++++++++++++++----
3 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index bffee052f2ae..5410236b62a8 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -39,6 +39,19 @@ struct kvm_pmu {
#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)

+#if defined(CONFIG_32BIT)
+#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
+{ .base = CSR_CYCLEH, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm }, \
+{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
+#else
+#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
+{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
+#endif
+
+int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
+ unsigned long *val, unsigned long new_val,
+ unsigned long wr_mask);
+
int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *ctr_info);
@@ -59,6 +72,9 @@ void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
#else
struct kvm_pmu {
};
+#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
+{ .base = 0, .count = 0, .func = NULL },
+

static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
{
diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
index 0aa334f853c8..7c2a4b1a69f7 100644
--- a/arch/riscv/kvm/vcpu_insn.c
+++ b/arch/riscv/kvm/vcpu_insn.c
@@ -215,6 +215,7 @@ struct csr_func {
};

static const struct csr_func csr_funcs[] = {
+ KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS
};

/**
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 3168ed740bdd..5434051f495d 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -14,6 +14,46 @@
#include <asm/kvm_vcpu_pmu.h>
#include <linux/kvm_host.h>

+int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u64 enabled, running;
+
+ if (!kvpmu)
+ return -EINVAL;
+
+ pmc = &kvpmu->pmc[cidx];
+ if (!pmc->perf_event)
+ return -EINVAL;
+
+ pmc->counter_val += perf_event_read_value(pmc->perf_event, &enabled, &running);
+ *out_val = pmc->counter_val;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
+ unsigned long *val, unsigned long new_val,
+ unsigned long wr_mask)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+ int cidx, ret = KVM_INSN_CONTINUE_NEXT_SEPC;
+
+ if (!kvpmu)
+ return KVM_INSN_EXIT_TO_USER_SPACE;
+ //TODO: Should we check if vcpu pmu is initialized or not!
+ if (wr_mask)
+ return KVM_INSN_ILLEGAL_TRAP;
+ cidx = csr_num - CSR_CYCLE;
+
+ if (kvm_riscv_vcpu_pmu_ctr_read(vcpu, cidx, val) < 0)
+ return KVM_INSN_EXIT_TO_USER_SPACE;
+
+ return ret;
+}
+
int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
{
struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
@@ -60,13 +100,6 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
return 0;
}

-int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
- unsigned long *out_val)
-{
- /* TODO */
- return 0;
-}
-
int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
{
int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
--
2.25.1

2022-07-18 17:25:02

by Atish Kumar Patra

Subject: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

The SBI PMU extension allows KVM guests to configure/start/stop/query
the PMU counters in a virtualized environment as well.

In order to allow that, KVM implements the entire SBI PMU extension.

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_sbi.c | 11 +++++
arch/riscv/kvm/vcpu_sbi_pmu.c | 81 +++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)
create mode 100644 arch/riscv/kvm/vcpu_sbi_pmu.c

diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index d45e7da3f0d3..da9f7959340e 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -50,6 +50,16 @@ extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;

+#ifdef CONFIG_RISCV_PMU_SBI
+extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu;
+#else
+static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
+ .extid_start = -1UL,
+ .extid_end = -1UL,
+ .handler = NULL,
+};
+#endif
+
static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
&vcpu_sbi_ext_v01,
&vcpu_sbi_ext_base,
@@ -58,6 +68,7 @@ static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
&vcpu_sbi_ext_rfence,
&vcpu_sbi_ext_srst,
&vcpu_sbi_ext_hsm,
+ &vcpu_sbi_ext_pmu,
&vcpu_sbi_ext_experimental,
&vcpu_sbi_ext_vendor,
};
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
new file mode 100644
index 000000000000..90c51a95d4f4
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Rivos Inc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <asm/csr.h>
+#include <asm/sbi.h>
+#include <asm/kvm_vcpu_sbi.h>
+
+static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ unsigned long *out_val,
+ struct kvm_cpu_trap *utrap,
+ bool *exit)
+{
+ int ret = -EOPNOTSUPP;
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+ unsigned long funcid = cp->a6;
+ uint64_t temp;
+
+ switch (funcid) {
+ case SBI_EXT_PMU_NUM_COUNTERS:
+ ret = kvm_riscv_vcpu_pmu_num_ctrs(vcpu, out_val);
+ break;
+ case SBI_EXT_PMU_COUNTER_GET_INFO:
+ ret = kvm_riscv_vcpu_pmu_ctr_info(vcpu, cp->a0, out_val);
+ break;
+ case SBI_EXT_PMU_COUNTER_CFG_MATCH:
+#if defined(CONFIG_32BIT)
+ temp = ((uint64_t)cp->a5 << 32) | cp->a4;
+#else
+ temp = cp->a4;
+#endif
+ ret = kvm_riscv_vcpu_pmu_ctr_cfg_match(vcpu, cp->a0, cp->a1, cp->a2, cp->a3, temp);
+ if (ret >= 0) {
+ *out_val = ret;
+ ret = 0;
+ }
+ break;
+ case SBI_EXT_PMU_COUNTER_START:
+#if defined(CONFIG_32BIT)
+ temp = ((uint64_t)cp->a4 << 32) | cp->a3;
+#else
+ temp = cp->a3;
+#endif
+ ret = kvm_riscv_vcpu_pmu_ctr_start(vcpu, cp->a0, cp->a1, cp->a2, temp);
+ break;
+ case SBI_EXT_PMU_COUNTER_STOP:
+ ret = kvm_riscv_vcpu_pmu_ctr_stop(vcpu, cp->a0, cp->a1, cp->a2);
+ break;
+ case SBI_EXT_PMU_COUNTER_FW_READ:
+ ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, out_val);
+ break;
+ default:
+ ret = -EOPNOTSUPP;
+ }
+
+ return ret;
+}
+
+unsigned long kvm_sbi_ext_pmu_probe(unsigned long extid)
+{
+ /*
+ * The PMU extension is only available to guests if privilege mode
+ * filtering is available. Otherwise, the guest will also count events
+ * while execution is in hypervisor mode.
+ */
+ return riscv_isa_extension_available(NULL, SSCOFPMF);
+}
+
+const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
+ .extid_start = SBI_EXT_PMU,
+ .extid_end = SBI_EXT_PMU,
+ .handler = kvm_sbi_ext_pmu_handler,
+ .probe = kvm_sbi_ext_pmu_probe,
+};
--
2.25.1

2022-07-18 17:40:52

by Atish Kumar Patra

Subject: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

This patch only adds the barebones structure of the perf implementation.
Most of the functions return zero at this point and will be implemented
fully in the future.

Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 3 +
arch/riscv/include/asm/kvm_vcpu_pmu.h | 70 +++++++++++++
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/main.c | 3 +-
arch/riscv/kvm/vcpu.c | 5 +
arch/riscv/kvm/vcpu_insn.c | 3 +-
arch/riscv/kvm/vcpu_pmu.c | 136 ++++++++++++++++++++++++++
7 files changed, 219 insertions(+), 2 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_pmu.h
create mode 100644 arch/riscv/kvm/vcpu_pmu.c

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 59a0cf2ca7b9..5d2312828bb2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -18,6 +18,7 @@
#include <asm/kvm_vcpu_fp.h>
#include <asm/kvm_vcpu_insn.h>
#include <asm/kvm_vcpu_timer.h>
+#include <asm/kvm_vcpu_pmu.h>

#define KVM_MAX_VCPUS 1024

@@ -226,6 +227,8 @@ struct kvm_vcpu_arch {

/* Don't run the VCPU (blocked) */
bool pause;
+
+ struct kvm_pmu pmu;
};

static inline void kvm_arch_hardware_unsetup(void) {}
diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
new file mode 100644
index 000000000000..bffee052f2ae
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2022 Rivos Inc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#ifndef _KVM_VCPU_RISCV_PMU_H
+#define _KVM_VCPU_RISCV_PMU_H
+
+#include <linux/perf/riscv_pmu.h>
+#include <asm/sbi.h>
+
+#ifdef CONFIG_RISCV_PMU_SBI
+#define RISCV_KVM_MAX_FW_CTRS 32
+
+/* Per virtual pmu counter data */
+struct kvm_pmc {
+ u8 idx;
+ struct kvm_vcpu *vcpu;
+ struct perf_event *perf_event;
+ uint64_t counter_val;
+ union sbi_pmu_ctr_info cinfo;
+};
+
+/* PMU data structure per vcpu */
+struct kvm_pmu {
+ struct kvm_pmc pmc[RISCV_MAX_COUNTERS];
+ /* Number of the virtual firmware counters available */
+ int num_fw_ctrs;
+ /* Number of the virtual hardware counters available */
+ int num_hw_ctrs;
+ /* Bit map of all the virtual counter used */
+ DECLARE_BITMAP(used_pmc, RISCV_MAX_COUNTERS);
+};
+
+#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
+#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
+#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
+
+int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
+int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *ctr_info);
+
+int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag, uint64_t ival);
+int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag);
+int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag,
+ unsigned long eidx, uint64_t edata);
+int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val);
+int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
+
+#else
+struct kvm_pmu {
+};
+
+static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+static inline void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu) {}
+static inline void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu) {}
+#endif
+#endif
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 019df9208bdd..342d7199e89d 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -25,3 +25,4 @@ kvm-y += vcpu_sbi_base.o
kvm-y += vcpu_sbi_replace.o
kvm-y += vcpu_sbi_hsm.o
kvm-y += vcpu_timer.o
+kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o vcpu_pmu.o
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 1549205fe5fe..d41ab6d1987d 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -49,7 +49,8 @@ int kvm_arch_hardware_enable(void)
hideleg |= (1UL << IRQ_VS_EXT);
csr_write(CSR_HIDELEG, hideleg);

- csr_write(CSR_HCOUNTEREN, -1UL);
+ /* VS mode should only access the TM bit; everything else should trap */
+ csr_write(CSR_HCOUNTEREN, 0x02);

csr_write(CSR_HVIP, 0);

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 3c95924d38c7..4cc964aaf2ad 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -122,6 +122,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)

WRITE_ONCE(vcpu->arch.irqs_pending, 0);
WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
+ kvm_riscv_vcpu_pmu_reset(vcpu);

vcpu->arch.hfence_head = 0;
vcpu->arch.hfence_tail = 0;
@@ -174,6 +175,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
/* Setup VCPU timer */
kvm_riscv_vcpu_timer_init(vcpu);

+ /* setup performance monitoring */
+ kvm_riscv_vcpu_pmu_init(vcpu);
+
/* Reset VCPU */
kvm_riscv_reset_vcpu(vcpu);

@@ -196,6 +200,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
/* Cleanup VCPU timer */
kvm_riscv_vcpu_timer_deinit(vcpu);

+ kvm_riscv_vcpu_pmu_deinit(vcpu);
/* Free unused pages pre-allocated for G-stage page table mappings */
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
}
diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
index 7eb90a47b571..0aa334f853c8 100644
--- a/arch/riscv/kvm/vcpu_insn.c
+++ b/arch/riscv/kvm/vcpu_insn.c
@@ -214,7 +214,8 @@ struct csr_func {
unsigned long wr_mask);
};

-static const struct csr_func csr_funcs[] = { };
+static const struct csr_func csr_funcs[] = {
+};

/**
* kvm_riscv_vcpu_csr_return -- Handle CSR read/write after user space
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
new file mode 100644
index 000000000000..3168ed740bdd
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Rivos Inc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <linux/perf/riscv_pmu.h>
+#include <asm/csr.h>
+#include <asm/kvm_vcpu_pmu.h>
+#include <linux/kvm_host.h>
+
+int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+
+ if (!kvpmu)
+ return -EINVAL;
+
+ *out_val = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *ctr_info)
+{
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+
+ if (!kvpmu || (cidx >= RISCV_MAX_COUNTERS) || (cidx == 1))
+ return -EINVAL;
+
+ *ctr_info = kvpmu->pmc[cidx].cinfo.value;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag, uint64_t ival)
+{
+ /* TODO */
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag)
+{
+ /* TODO */
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
+ unsigned long ctr_mask, unsigned long flag,
+ unsigned long eidx, uint64_t edata)
+{
+ /* TODO */
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val)
+{
+ /* TODO */
+ return 0;
+}
+
+int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
+{
+ int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
+ struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+
+ if (!kvpmu)
+ return -EINVAL;
+
+ num_hw_ctrs = riscv_pmu_sbi_get_num_hw_ctrs();
+ if ((num_hw_ctrs + RISCV_KVM_MAX_FW_CTRS) > RISCV_MAX_COUNTERS)
+ num_fw_ctrs = RISCV_MAX_COUNTERS - num_hw_ctrs;
+ else
+ num_fw_ctrs = RISCV_KVM_MAX_FW_CTRS;
+
+ hpm_width = riscv_pmu_sbi_hpmc_width();
+ if (hpm_width <= 0) {
+ pr_err("Cannot initialize PMU for vcpu as hpmcounter width is not available\n");
+ return -EINVAL;
+ }
+
+ kvpmu->num_hw_ctrs = num_hw_ctrs;
+ kvpmu->num_fw_ctrs = num_fw_ctrs;
+ /*
+ * There is no correlation between the logical hardware counters and the
+ * virtual counters. However, we need to encode a hpmcounter CSR in the
+ * counter info field so that KVM can trap and emulate the read. This also
+ * works well in the migration use case as KVM doesn't care whether the
+ * actual hpmcounter is available in the hardware or not.
+ */
+ for (i = 0; i < num_hw_ctrs + num_fw_ctrs; i++) {
+ /* TIME CSR shouldn't be read from perf interface */
+ if (i == 1)
+ continue;
+ kvpmu->pmc[i].idx = i;
+ kvpmu->pmc[i].vcpu = vcpu;
+ if (i < kvpmu->num_hw_ctrs) {
+ kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
+ if (i < 3)
+ /* CY, IR counters */
+ kvpmu->pmc[i].cinfo.width = 63;
+ else
+ kvpmu->pmc[i].cinfo.width = hpm_width;
+ /*
+ * The CSR number doesn't have any relation with the logical
+ * hardware counters. The CSR numbers are encoded sequentially
+ * to avoid maintaining a map between the virtual counter
+ * and CSR number.
+ */
+ kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
+ } else {
+ kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
+ kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
+ }
+ }
+
+ return 0;
+}
+
+void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
+{
+ /* TODO */
+}
+
+void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
+{
+ /* TODO */
+}
+
--
2.25.1

2022-09-20 02:41:43

by Eric Lin

Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

Hi Atish,

On Tue, Jul 19, 2022 at 2:01 AM Atish Patra <[email protected]> wrote:
>
> The RISC-V SBI PMU & Sscofpmf ISA extensions allow supporting perf in
> the virtualized environment as well. The KVM implementation
> relies on the SBI PMU extension for most of the work, while it traps
> & emulates the CSR reads for counter access.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/kvm/vcpu_pmu.c | 318 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 301 insertions(+), 17 deletions(-)
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 5434051f495d..278c261efad3 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -11,9 +11,163 @@
> #include <linux/kvm_host.h>
> #include <linux/perf/riscv_pmu.h>
> #include <asm/csr.h>
> +#include <asm/bitops.h>
> #include <asm/kvm_vcpu_pmu.h>
> #include <linux/kvm_host.h>
>
> +#define get_event_type(x) ((x & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16)
> +#define get_event_code(x) (x & SBI_PMU_EVENT_IDX_CODE_MASK)
> +
> +static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
> +{
> + u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);
> + u64 sample_period;
> +
> + if (!pmc->counter_val)
> + sample_period = counter_val_mask;
> + else
> + sample_period = pmc->counter_val & counter_val_mask;

I think sample_period should be:
sample_period = (-pmc->counter_val) & counter_val_mask

When we are doing event counting, the PMU counter initial value that
comes from the guest kernel is 0x800000000000000X.
If we let the sample period be (pmc->counter_val) & counter_val_mask,
the sample_period will be 0x800000000000000X.
After we pass this sample_period to the host PMU driver,
riscv_pmu_event_set_period() will make the final PMU counter initial
value 0xffffffffffXX.
This will make the PMU counter overflow interrupt fire frequently and
trigger a soft lockup in the KVM guest.

I also checked the arm64 KVM perf implementation (below); its
sample_period is attr.sample_period = (-counter) & GENMASK(63, 0):
624 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64
select_idx)
625 {
....
688 /* The initial sample period (overflow count) of
an event. */
689 if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
690 attr.sample_period = (-counter) & GENMASK(63, 0);
691 else
692 attr.sample_period = (-counter) & GENMASK(31, 0);

After I apply the patch below, the counter overflow interrupt no longer
fires, and the PMU counter initial value is the same as when we do event
counting on the host.

--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -26,7 +26,7 @@ static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
if (!pmc->counter_val)
sample_period = counter_val_mask;
else
- sample_period = pmc->counter_val & counter_val_mask;
+ sample_period = (-pmc->counter_val) & counter_val_mask;

Thanks,
Eric Lin

> +
> + return sample_period;
> +}
> +
> +static u32 pmu_get_perf_event_type(unsigned long eidx)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> + u32 type;
> +
> + if (etype == SBI_PMU_EVENT_TYPE_HW)
> + type = PERF_TYPE_HARDWARE;
> + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> + type = PERF_TYPE_HW_CACHE;
> + else if (etype == SBI_PMU_EVENT_TYPE_RAW || etype == SBI_PMU_EVENT_TYPE_FW)
> + type = PERF_TYPE_RAW;
> + else
> + type = PERF_TYPE_MAX;
> +
> + return type;
> +}
> +
> +static inline bool pmu_is_fw_event(unsigned long eidx)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> +
> + return (etype == SBI_PMU_EVENT_TYPE_FW) ? true : false;
> +}
> +
> +static void pmu_release_perf_event(struct kvm_pmc *pmc)
> +{
> + if (pmc->perf_event) {
> + perf_event_disable(pmc->perf_event);
> + perf_event_release_kernel(pmc->perf_event);
> + pmc->perf_event = NULL;
> + }
> +}
> +
> +static u64 pmu_get_perf_event_hw_config(u32 sbi_event_code)
> +{
> + /* SBI PMU HW event code is offset by 1 from perf hw event codes */
> + return (u64)sbi_event_code - 1;
> +}
> +
> +static u64 pmu_get_perf_event_cache_config(u32 sbi_event_code)
> +{
> + u64 config = U64_MAX;
> + unsigned int cache_type, cache_op, cache_result;
> +
> + /* All the cache event masks lie within 0xFF. No separate masking is necessary */
> + cache_type = (sbi_event_code & SBI_PMU_EVENT_CACHE_ID_CODE_MASK) >> 3;
> + cache_op = (sbi_event_code & SBI_PMU_EVENT_CACHE_OP_ID_CODE_MASK) >> 1;
> + cache_result = sbi_event_code & SBI_PMU_EVENT_CACHE_RESULT_ID_CODE_MASK;
> +
> + if (cache_type >= PERF_COUNT_HW_CACHE_MAX ||
> + cache_op >= PERF_COUNT_HW_CACHE_OP_MAX ||
> + cache_result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
> + goto out;
> + config = cache_type | (cache_op << 8) | (cache_result << 16);
> +out:
> + return config;
> +}
> +
> +static u64 pmu_get_perf_event_config(unsigned long eidx, uint64_t edata)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> + u32 ecode = get_event_code(eidx);
> + u64 config = U64_MAX;
> +
> + if (etype == SBI_PMU_EVENT_TYPE_HW)
> + config = pmu_get_perf_event_hw_config(ecode);
> + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> + config = pmu_get_perf_event_cache_config(ecode);
> + else if (etype == SBI_PMU_EVENT_TYPE_RAW)
> + config = edata & RISCV_PMU_RAW_EVENT_MASK;
> + else if ((etype == SBI_PMU_EVENT_TYPE_FW) && (ecode < SBI_PMU_FW_MAX))
> + config = (1ULL << 63) | ecode;
> +
> + return config;
> +}
> +
> +static int pmu_get_fixed_pmc_index(unsigned long eidx)
> +{
> + u32 etype = pmu_get_perf_event_type(eidx);
> + u32 ecode = get_event_code(eidx);
> + int ctr_idx;
> +
> + if (etype != SBI_PMU_EVENT_TYPE_HW)
> + return -EINVAL;
> +
> + if (ecode == SBI_PMU_HW_CPU_CYCLES)
> + ctr_idx = 0;
> + else if (ecode == SBI_PMU_HW_INSTRUCTIONS)
> + ctr_idx = 2;
> + else
> + return -EINVAL;
> +
> + return ctr_idx;
> +}
> +
> +static int pmu_get_programmable_pmc_index(struct kvm_pmu *kvpmu, unsigned long eidx,
> + unsigned long cbase, unsigned long cmask)
> +{
> + int ctr_idx = -1;
> + int i, pmc_idx;
> + int min, max;
> +
> + if (pmu_is_fw_event(eidx)) {
> + /* Firmware counters are mapped 1:1 starting from num_hw_ctrs for simplicity */
> + min = kvpmu->num_hw_ctrs;
> + max = min + kvpmu->num_fw_ctrs;
> + } else {
> + /* First 3 counters are reserved for fixed counters */
> + min = 3;
> + max = kvpmu->num_hw_ctrs;
> + }
> +
> + for_each_set_bit(i, &cmask, BITS_PER_LONG) {
> + pmc_idx = i + cbase;
> + if ((pmc_idx >= min && pmc_idx < max) &&
> + !test_bit(pmc_idx, kvpmu->used_pmc)) {
> + ctr_idx = pmc_idx;
> + break;
> + }
> + }
> +
> + return ctr_idx;
> +}
> +
> +static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
> + unsigned long cbase, unsigned long cmask)
> +{
> + int ret;
> +
> + /* Fixed counters need to have a fixed mapping as they have different widths */
> + ret = pmu_get_fixed_pmc_index(eidx);
> + if (ret >= 0)
> + return ret;
> +
> + return pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
> +}
> +
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> unsigned long *out_val)
> {
> @@ -43,7 +197,6 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
>
> if (!kvpmu)
> return KVM_INSN_EXIT_TO_USER_SPACE;
> - //TODO: Should we check if vcpu pmu is initialized or not!
> if (wr_mask)
> return KVM_INSN_ILLEGAL_TRAP;
> cidx = csr_num - CSR_CYCLE;
> @@ -81,14 +234,62 @@ int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int i, num_ctrs, pmc_index;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if (ctr_base + __fls(ctr_mask) >= num_ctrs)
> + return -EINVAL;
> +
> + /* Start the counters that have been configured and requested by the guest */
> + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> + pmc_index = i + ctr_base;
> + if (!test_bit(pmc_index, kvpmu->used_pmc))
> + continue;
> + pmc = &kvpmu->pmc[pmc_index];
> + if (flag & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> + pmc->counter_val = ival;
> + if (pmc->perf_event) {
> + perf_event_period(pmc->perf_event, pmu_get_sample_period(pmc));
> + perf_event_enable(pmc->perf_event);
> + }
> + }
> +
> return 0;
> }
>
> int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> unsigned long ctr_mask, unsigned long flag)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int i, num_ctrs, pmc_index;
> + u64 enabled, running;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if ((ctr_base + __fls(ctr_mask)) >= num_ctrs)
> + return -EINVAL;
> +
> + /* Stop the counters that have been configured and requested by the guest */
> + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> + pmc_index = i + ctr_base;
> + if (!test_bit(pmc_index, kvpmu->used_pmc))
> + continue;
> + pmc = &kvpmu->pmc[pmc_index];
> + if (pmc->perf_event) {
> + /* Stop counting the counter */
> + perf_event_disable(pmc->perf_event);
> + if (flag & SBI_PMU_STOP_FLAG_RESET) {
> + /* Release the counter if this is a reset request */
> + pmc->counter_val += perf_event_read_value(pmc->perf_event,
> + &enabled, &running);
> + pmu_release_perf_event(pmc);
> + clear_bit(pmc_index, kvpmu->used_pmc);
> + }
> + }
> + }
> +
> return 0;
> }
>
> @@ -96,14 +297,85 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> unsigned long ctr_mask, unsigned long flag,
> unsigned long eidx, uint64_t edata)
> {
> - /* TODO */
> - return 0;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> + struct perf_event_attr attr;
> + int num_ctrs, ctr_idx;
> + u32 etype = pmu_get_perf_event_type(eidx);
> + u64 config;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if ((etype == PERF_TYPE_MAX) || ((ctr_base + __fls(ctr_mask)) >= num_ctrs))
> + return -EINVAL;
> +
> + if (pmu_is_fw_event(eidx))
> + return -EOPNOTSUPP;
> + /*
> + * SKIP_MATCH flag indicates the caller is aware of the assigned counter
> + * for this event. Just sanity-check that it is already marked as used.
> + */
> + if (flag & SBI_PMU_CFG_FLAG_SKIP_MATCH) {
> + if (!test_bit(ctr_base, kvpmu->used_pmc))
> + return -EINVAL;
> + ctr_idx = ctr_base;
> + goto match_done;
> + }
> +
> + ctr_idx = pmu_get_pmc_index(kvpmu, eidx, ctr_base, ctr_mask);
> + if (ctr_idx < 0)
> + return -EOPNOTSUPP;
> +
> +match_done:
> + pmc = &kvpmu->pmc[ctr_idx];
> + pmu_release_perf_event(pmc);
> + pmc->idx = ctr_idx;
> +
> + config = pmu_get_perf_event_config(eidx, edata);
> + memset(&attr, 0, sizeof(struct perf_event_attr));
> + attr.type = etype;
> + attr.size = sizeof(attr);
> + attr.pinned = true;
> +
> + /*
> + * It should never reach here if the platform doesn't support the Sscofpmf extension,
> + * as mode filtering won't work without it.
> + */
> + attr.exclude_host = true;
> + attr.exclude_hv = true;
> + attr.exclude_user = flag & SBI_PMU_CFG_FLAG_SET_UINH ? 1 : 0;
> + attr.exclude_kernel = flag & SBI_PMU_CFG_FLAG_SET_SINH ? 1 : 0;
> + attr.config = config;
> + attr.config1 = RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS;
> + if (flag & SBI_PMU_CFG_FLAG_CLEAR_VALUE) {
> + //TODO: Do we really want to clear the value in the hardware counter?
> + pmc->counter_val = 0;
> + }
> + /*
> + * Set the default sample_period for now. The guest specified value
> + * will be updated in the start call.
> + */
> + attr.sample_period = pmu_get_sample_period(pmc);
> +
> + event = perf_event_create_kernel_counter(&attr, -1, current, NULL, pmc);
> + if (IS_ERR(event)) {
> + pr_err("kvm pmu event creation failed event %pe for eidx %lx\n", event, eidx);
> + return -EOPNOTSUPP;
> + }
> +
> + set_bit(ctr_idx, kvpmu->used_pmc);
> + pmc->perf_event = event;
> + if (flag & SBI_PMU_CFG_FLAG_AUTO_START)
> + perf_event_enable(pmc->perf_event);
> +
> + return ctr_idx;
> }
>
> int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> {
> int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
>
> if (!kvpmu)
> return -EINVAL;
> @@ -120,6 +392,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> return -EINVAL;
> }
>
> + bitmap_zero(kvpmu->used_pmc, RISCV_MAX_COUNTERS);
> kvpmu->num_hw_ctrs = num_hw_ctrs;
> kvpmu->num_fw_ctrs = num_fw_ctrs;
> /*
> @@ -132,38 +405,49 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> /* TIME CSR shouldn't be read from perf interface */
> if (i == 1)
> continue;
> - kvpmu->pmc[i].idx = i;
> - kvpmu->pmc[i].vcpu = vcpu;
> + pmc = &kvpmu->pmc[i];
> + pmc->idx = i;
> + pmc->counter_val = 0;
> + pmc->vcpu = vcpu;
> if (i < kvpmu->num_hw_ctrs) {
> kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> if (i < 3)
> /* CY, IR counters */
> - kvpmu->pmc[i].cinfo.width = 63;
> + pmc->cinfo.width = 63;
> else
> - kvpmu->pmc[i].cinfo.width = hpm_width;
> + pmc->cinfo.width = hpm_width;
> /*
> * The CSR number doesn't have any relation with the logical
> * hardware counters. The CSR numbers are encoded sequentially
> * to avoid maintaining a map between the virtual counter
> * and CSR number.
> */
> - kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> + pmc->cinfo.csr = CSR_CYCLE + i;
> } else {
> - kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> - kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> + pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
> + pmc->cinfo.width = BITS_PER_LONG - 1;
> }
> }
>
> return 0;
> }
>
> -void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + int i;
> +
> + if (!kvpmu)
> + return;
> +
> + for_each_set_bit(i, kvpmu->used_pmc, RISCV_MAX_COUNTERS) {
> + pmc = &kvpmu->pmc[i];
> + pmu_release_perf_event(pmc);
> + }
> }
>
> -void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> {
> - /* TODO */
> + kvm_riscv_vcpu_pmu_deinit(vcpu);
> }
> -
> --
> 2.25.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2022-09-23 22:00:13

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Mon, Sep 19, 2022 at 7:24 PM Eric Lin <[email protected]> wrote:
>
> Hi Atish,
>
> On Tue, Jul 19, 2022 at 2:01 AM Atish Patra <[email protected]> wrote:
> >
> > The RISC-V SBI PMU and Sscofpmf ISA extensions allow supporting perf in
> > the virtualization environment as well. The KVM implementation
> > relies on the SBI PMU extension for most of the work, while it traps
> > & emulates the CSR reads for counter access.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/kvm/vcpu_pmu.c | 318 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 301 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 5434051f495d..278c261efad3 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -11,9 +11,163 @@
> > #include <linux/kvm_host.h>
> > #include <linux/perf/riscv_pmu.h>
> > #include <asm/csr.h>
> > +#include <asm/bitops.h>
> > #include <asm/kvm_vcpu_pmu.h>
> > #include <linux/kvm_host.h>
> >
> > +#define get_event_type(x) ((x & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16)
> > +#define get_event_code(x) (x & SBI_PMU_EVENT_IDX_CODE_MASK)
> > +
> > +static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
> > +{
> > + u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);
> > + u64 sample_period;
> > +
> > + if (!pmc->counter_val)
> > + sample_period = counter_val_mask;
> > + else
> > + sample_period = pmc->counter_val & counter_val_mask;
>
> I think sample_period should be =>
> sample_period = (-pmc->counter_val) & counter_val_mask
>
> When we are doing event counting, the PMU counter initial value that
> comes from the guest kernel is 0x800000000000000X.
> If we let the sample period be pmc->counter_val &
> counter_val_mask, the sample_period will be 0x800000000000000X.
> After we pass this sample_period to the host PMU driver,
> riscv_pmu_event_set_period() will set the final PMU counter
> initial value to 0xffffffffffXX.
> This makes the PMU counter overflow interrupt fire frequently and
> triggers a soft lockup in the KVM guest.
>
> I also checked the arm64 KVM perf implementation below; its
> sample_period is attr.sample_period = (-counter) & GENMASK(63, 0):
>
> 624 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64
> select_idx)
> 625 {
> ....
> 688 /* The initial sample period (overflow count) of
> an event. */
> 689 if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
> 690 attr.sample_period = (-counter) & GENMASK(63, 0);
> 691 else
> 692 attr.sample_period = (-counter) & GENMASK(31, 0);
>
> After I applied the patch below, no counter overflow interrupts occur,
> and the PMU counter initial value is the same as when we do event
> counting on the host.
>
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -26,7 +26,7 @@ static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
> if (!pmc->counter_val)
> sample_period = counter_val_mask;
> else
> - sample_period = pmc->counter_val & counter_val_mask;
> + sample_period = (-pmc->counter_val) & counter_val_mask;
>

Yes. Thanks for the catch.
The sample_period should be (-pmc->counter_val) & counter_val_mask.

I will revise the patch along with a few other fixes and send the v2.

> Thanks,
> Eric Lin
>



--
Regards,
Atish

2022-11-01 12:36:06

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 1/9] RISC-V: Define a helper function to probe number of hardware counters

On Mon, Jul 18, 2022 at 10:01:57AM -0700, Atish Patra wrote:
> KVM module needs to know how many hardware counters the platform supports.
> Otherwise, it will not be able to show optimal value of virtual
^ the
> counters to the guest.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> drivers/perf/riscv_pmu_sbi.c | 23 +++++++++++++++++------
> include/linux/perf/riscv_pmu.h | 4 ++++
> 2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 24124546844c..1723af68ffa1 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -27,6 +27,7 @@
> */
> static union sbi_pmu_ctr_info *pmu_ctr_list;
> static unsigned int riscv_pmu_irq;
> +static struct riscv_pmu *rvpmu;

Do we really need rvpmu? From a quick scan of the series it's only used
for num_hw_counters, which has to be added to struct riscv_pmu, and
num_counters. How about instead creating a static global for num_counters
and then getting num_hw_counters by iterating pmu_ctr_list. If we want
riscv_pmu_sbi_get_num_hw_ctrs() to be faster, then we can cache the value
in a static variable in the function.

Thanks,
drew

2022-11-01 13:03:42

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 4/9] RISC-V: KVM: Improve privilege mode filtering for perf

On Mon, Jul 18, 2022 at 10:02:00AM -0700, Atish Patra wrote:
> Currently, the host driver doesn't have any method to identify if the
> requested perf event is from KVM or bare metal. As KVM runs in HS
> mode, there is no separate hypervisor privilege mode to distinguish
> between the attributes for guest/host.
>
> Improve the privilege mode filtering by using the event specific
> config1 field.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> drivers/perf/riscv_pmu_sbi.c | 27 ++++++++++++++++++++++-----
> include/linux/perf/riscv_pmu.h | 2 ++
> 2 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 5d0eef3ef136..34f9fcc221a8 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -275,6 +275,27 @@ int riscv_pmu_sbi_hpmc_width(void)
> }
> EXPORT_SYMBOL(riscv_pmu_sbi_hpmc_width);
>
> +static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> +{
> + unsigned long cflags = 0;
> + bool guest_events = false;
> +
> + if (event->attr.config1 & RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS)
> + guest_events = true;
> + if (event->attr.exclude_kernel)
> + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VSINH : SBI_PMU_CFG_FLAG_SET_SINH;
> + if (event->attr.exclude_user)
> + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VUINH : SBI_PMU_CFG_FLAG_SET_UINH;
> + if (guest_events && event->attr.exclude_hv)
> + cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
> + if (event->attr.exclude_host)
> + cflags |= SBI_PMU_CFG_FLAG_SET_UINH | SBI_PMU_CFG_FLAG_SET_SINH;
> + if (event->attr.exclude_guest)
> + cflags |= SBI_PMU_CFG_FLAG_SET_VSINH | SBI_PMU_CFG_FLAG_SET_VUINH;
> +
> + return cflags;
> +}
> +
> static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
> @@ -286,11 +307,7 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> uint64_t cmask = GENMASK_ULL(rvpmu->num_counters - 1, 0);
> unsigned long cflags = 0;
>
> - if (event->attr.exclude_kernel)
> - cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
> - if (event->attr.exclude_user)
> - cflags |= SBI_PMU_CFG_FLAG_SET_UINH;
> -
> + cflags = pmu_sbi_get_filter_flags(event);
> /* retrieve the available counter index */
> #if defined(CONFIG_32BIT)
> ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase, cmask,
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 6fee211c27b5..825b95253bc5 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -26,6 +26,8 @@
>
> #define RISCV_PMU_STOP_FLAG_RESET 1
>
> +#define RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS 0x1
> +
> struct cpu_hw_events {
> /* currently enabled events */
> int n_events;
> --
> 2.25.1
>

Reviewed-by: Andrew Jones <[email protected]>

2022-11-01 14:34:23

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Mon, Jul 18, 2022 at 10:02:01AM -0700, Atish Patra wrote:
> This patch only adds barebore structure of perf implementation. Most of
a bare bones ^ the

> the function returns zero at this point and will be implemented
functions

> fully in the future.

s/the future/later patches/

>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/kvm_host.h | 3 +
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 70 +++++++++++++
> arch/riscv/kvm/Makefile | 1 +
> arch/riscv/kvm/main.c | 3 +-
> arch/riscv/kvm/vcpu.c | 5 +
> arch/riscv/kvm/vcpu_insn.c | 3 +-
> arch/riscv/kvm/vcpu_pmu.c | 136 ++++++++++++++++++++++++++
> 7 files changed, 219 insertions(+), 2 deletions(-)
> create mode 100644 arch/riscv/include/asm/kvm_vcpu_pmu.h
> create mode 100644 arch/riscv/kvm/vcpu_pmu.c
>
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 59a0cf2ca7b9..5d2312828bb2 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -18,6 +18,7 @@
> #include <asm/kvm_vcpu_fp.h>
> #include <asm/kvm_vcpu_insn.h>
> #include <asm/kvm_vcpu_timer.h>
> +#include <asm/kvm_vcpu_pmu.h>
>
> #define KVM_MAX_VCPUS 1024
>
> @@ -226,6 +227,8 @@ struct kvm_vcpu_arch {
>
> /* Don't run the VCPU (blocked) */
> bool pause;
> +
> + struct kvm_pmu pmu;
> };
>
> static inline void kvm_arch_hardware_unsetup(void) {}
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> new file mode 100644
> index 000000000000..bffee052f2ae
> --- /dev/null
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2022 Rivos Inc
> + *
> + * Authors:
> + * Atish Patra <[email protected]>
> + */
> +
> +#ifndef _KVM_VCPU_RISCV_PMU_H
> +#define _KVM_VCPU_RISCV_PMU_H

The convention seems to be to use leading underscores for these types of
defines, i.e. __KVM_VCPU_RISCV_PMU_H

> +
> +#include <linux/perf/riscv_pmu.h>
> +#include <asm/sbi.h>
> +
> +#ifdef CONFIG_RISCV_PMU_SBI
> +#define RISCV_KVM_MAX_FW_CTRS 32
> +
> +/* Per virtual pmu counter data */
> +struct kvm_pmc {
> + u8 idx;
> + struct kvm_vcpu *vcpu;

I'm not sure we need a vcpu pointer here. If it's just to implement
pmc_to_pmu(), then we can instead implement a pmc_to_vcpu(), like
arm64's kvm_pmc_to_vcpu(). x86 might be able to do that too, since
it appears the conversion macros below originated there.

> + struct perf_event *perf_event;
> + uint64_t counter_val;
> + union sbi_pmu_ctr_info cinfo;
> +};
> +
> +/* PMU data structure per vcpu */
> +struct kvm_pmu {
> + struct kvm_pmc pmc[RISCV_MAX_COUNTERS];
> + /* Number of the virtual firmware counters available */
> + int num_fw_ctrs;
> + /* Number of the virtual hardware counters available */
> + int num_hw_ctrs;
> + /* Bit map of all the virtual counter used */
counters

> + DECLARE_BITMAP(used_pmc, RISCV_MAX_COUNTERS);

How about naming this pmc_in_use like x86?

> +};
> +
> +#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
> +#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
> +#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
> +
> +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
> +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *ctr_info);
> +

nit: no need for this blank line

> +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag, uint64_t ival);
> +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag);
> +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag,
> + unsigned long eidx, uint64_t edata);
> +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *out_val);
> +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> +
> +#else
> +struct kvm_pmu {
> +};
> +
> +static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> +{
> + return 0;
> +}
> +static inline void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu) {}
> +static inline void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu) {}
> +#endif
> +#endif
> diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
> index 019df9208bdd..342d7199e89d 100644
> --- a/arch/riscv/kvm/Makefile
> +++ b/arch/riscv/kvm/Makefile
> @@ -25,3 +25,4 @@ kvm-y += vcpu_sbi_base.o
> kvm-y += vcpu_sbi_replace.o
> kvm-y += vcpu_sbi_hsm.o
> kvm-y += vcpu_timer.o
> +kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o vcpu_pmu.o
> diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> index 1549205fe5fe..d41ab6d1987d 100644
> --- a/arch/riscv/kvm/main.c
> +++ b/arch/riscv/kvm/main.c
> @@ -49,7 +49,8 @@ int kvm_arch_hardware_enable(void)
> hideleg |= (1UL << IRQ_VS_EXT);
> csr_write(CSR_HIDELEG, hideleg);
>
> - csr_write(CSR_HCOUNTEREN, -1UL);
> + /* VS should access only TM bit. Everything else should trap */
> + csr_write(CSR_HCOUNTEREN, 0x02);

This looks like something that should be broken out into a separate patch
with a description of what happens now when guests try to access the newly
trapping counter registers. We should probably also create a TM define.

>
> csr_write(CSR_HVIP, 0);
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 3c95924d38c7..4cc964aaf2ad 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -122,6 +122,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
>
> WRITE_ONCE(vcpu->arch.irqs_pending, 0);
> WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
> + kvm_riscv_vcpu_pmu_reset(vcpu);
>
> vcpu->arch.hfence_head = 0;
> vcpu->arch.hfence_tail = 0;
> @@ -174,6 +175,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> /* Setup VCPU timer */
> kvm_riscv_vcpu_timer_init(vcpu);
>
> + /* setup performance monitoring */
> + kvm_riscv_vcpu_pmu_init(vcpu);
> +
> /* Reset VCPU */
> kvm_riscv_reset_vcpu(vcpu);
>
> @@ -196,6 +200,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> /* Cleanup VCPU timer */
> kvm_riscv_vcpu_timer_deinit(vcpu);
>
> + kvm_riscv_vcpu_pmu_deinit(vcpu);
> /* Free unused pages pre-allocated for G-stage page table mappings */
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
> }
> diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
> index 7eb90a47b571..0aa334f853c8 100644
> --- a/arch/riscv/kvm/vcpu_insn.c
> +++ b/arch/riscv/kvm/vcpu_insn.c
> @@ -214,7 +214,8 @@ struct csr_func {
> unsigned long wr_mask);
> };
>
> -static const struct csr_func csr_funcs[] = { };
> +static const struct csr_func csr_funcs[] = {
> +};

stray change

>
> /**
> * kvm_riscv_vcpu_csr_return -- Handle CSR read/write after user space
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> new file mode 100644
> index 000000000000..3168ed740bdd
> --- /dev/null
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Rivos Inc
> + *
> + * Authors:
> + * Atish Patra <[email protected]>
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/kvm_host.h>
> +#include <linux/perf/riscv_pmu.h>
> +#include <asm/csr.h>
> +#include <asm/kvm_vcpu_pmu.h>
> +#include <linux/kvm_host.h>
> +
> +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> +
> + if (!kvpmu)
> + return -EINVAL;

kvpmu can never be null because arch.pmu isn't a pointer. We probably
shouldn't be making calls to kvm_riscv_vcpu_pmu_num_ctrs() without knowing
we have an initialized pmu anyway, though.

> +
> + *out_val = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *ctr_info)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> +
> + if (!kvpmu || (cidx > RISCV_MAX_COUNTERS) || (cidx == 1))

nit: unnecessary ()

> + return -EINVAL;
> +
> + *ctr_info = kvpmu->pmc[cidx].cinfo.value;
> +
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> +{
> + /* TODO */
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag)
> +{
> + /* TODO */
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> + unsigned long ctr_mask, unsigned long flag,
> + unsigned long eidx, uint64_t edata)
> +{
> + /* TODO */
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *out_val)
> +{
> + /* TODO */
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> +{
> + int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> +
> + if (!kvpmu)
> + return -EINVAL;
> +
> + num_hw_ctrs = riscv_pmu_sbi_get_num_hw_ctrs();
> + if ((num_hw_ctrs + RISCV_KVM_MAX_FW_CTRS) > RISCV_MAX_COUNTERS)
> + num_fw_ctrs = RISCV_MAX_COUNTERS - num_hw_ctrs;
> + else
> + num_fw_ctrs = RISCV_KVM_MAX_FW_CTRS;

Why do we need RISCV_KVM_MAX_FW_CTRS? Can't we just always get the number
with RISCV_MAX_COUNTERS - num_hw_ctrs?

> +
> + hpm_width = riscv_pmu_sbi_hpmc_width();
> + if (hpm_width <= 0) {
> + pr_err("Can not initialize PMU for vcpu as hpmcounter width is not available\n");
> + return -EINVAL;
> + }
> +
> + kvpmu->num_hw_ctrs = num_hw_ctrs;
> + kvpmu->num_fw_ctrs = num_fw_ctrs;

Maybe it's coming later, but we need to give KVM userspace control over
the number of counters to allow it to migrate to a larger set of hosts.
Also, a previous patch said the virtual width must be the same as the
host width for the hw counters, so we need userspace to know what that
is in order to determine to which hosts it can migrate a guest.

> + /*
> + * There is no corelation betwen the logical hardware counter and virtual counters.
> + * However, we need to encode a hpmcounter CSR in the counter info field so that
> + * KVM can trap n emulate the read. This works well in the migraiton usecase as well

s/well//

> + * KVM doesn't care if the actual hpmcounter is available in the hardware or not.
> + */
> + for (i = 0; i < num_hw_ctrs + num_fw_ctrs; i++) {

Maybe we need a helper macro like

#define kvm_pmu_num_counters(pmu) ((pmu)->num_hw_ctrs + (pmu)->num_fw_ctrs)

if we're going to loop over all counters frequently.

> + /* TIME CSR shouldn't be read from perf interface */
> + if (i == 1)
> + continue;
> + kvpmu->pmc[i].idx = i;
> + kvpmu->pmc[i].vcpu = vcpu;
> + if (i < kvpmu->num_hw_ctrs) {
> + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> + if (i < 3)
> + /* CY, IR counters */
> + kvpmu->pmc[i].cinfo.width = 63;
> + else
> + kvpmu->pmc[i].cinfo.width = hpm_width;
> + /*
> + * The CSR number doesn't have any relation with the logical
> + * hardware counters. The CSR numbers are encoded sequentially
> + * to avoid maintaining a map between the virtual counter
> + * and CSR number.
> + */
> + kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> + } else {
> + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> + kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> + }
> + }
> +
> + return 0;
> +}
> +
> +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> +{
> + /* TODO */
> +}
> +
> +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> +{
> + /* TODO */
> +}
> +
> --
> 2.25.1
>

Thanks,
drew

2022-11-01 14:39:19

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Mon, Jul 18, 2022 at 10:02:02AM -0700, Atish Patra wrote:
> SBI PMU extension allows KVM guests to configure/start/stop/query about
> the PMU counters in virtualized enviornment as well.
>
> In order to allow that, KVM implements the entire SBI PMU extension.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/kvm/vcpu_sbi.c | 11 +++++
> arch/riscv/kvm/vcpu_sbi_pmu.c | 81 +++++++++++++++++++++++++++++++++++
> 2 files changed, 92 insertions(+)
> create mode 100644 arch/riscv/kvm/vcpu_sbi_pmu.c
>
> diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
> index d45e7da3f0d3..da9f7959340e 100644
> --- a/arch/riscv/kvm/vcpu_sbi.c
> +++ b/arch/riscv/kvm/vcpu_sbi.c
> @@ -50,6 +50,16 @@ extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
> extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
> extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;
>
> +#ifdef CONFIG_RISCV_PMU_SBI
> +extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu;
> +#else
> +static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
> + .extid_start = -1UL,
> + .extid_end = -1UL,
> + .handler = NULL,
> +};
> +#endif
> +
> static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
> &vcpu_sbi_ext_v01,
> &vcpu_sbi_ext_base,
> @@ -58,6 +68,7 @@ static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
> &vcpu_sbi_ext_rfence,
> &vcpu_sbi_ext_srst,
> &vcpu_sbi_ext_hsm,
> + &vcpu_sbi_ext_pmu,
> &vcpu_sbi_ext_experimental,
> &vcpu_sbi_ext_vendor,
> };
> diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> new file mode 100644
> index 000000000000..90c51a95d4f4
> --- /dev/null
> +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> @@ -0,0 +1,81 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Rivos Inc
> + *
> + * Authors:
> + * Atish Patra <[email protected]>
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/kvm_host.h>
> +#include <asm/csr.h>
> +#include <asm/sbi.h>
> +#include <asm/kvm_vcpu_sbi.h>
> +
> +static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> + unsigned long *out_val,
> + struct kvm_cpu_trap *utrap,
> + bool *exit)
> +{
> + int ret = -EOPNOTSUPP;
> + struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> + unsigned long funcid = cp->a6;
> + uint64_t temp;

I think we need something like

if (!vcpu_to_pmu(vcpu)->enabled)
return -EOPNOTSUPP;

here. Where 'enabled' is only true when we successfully initialize
the pmu. And, successful initialization depends on
IS_ENABLED(CONFIG_RISCV_PMU_SBI) and
riscv_isa_extension_available(NULL, SSCOFPMF) as well as not
failing other things. And, KVM userspace should be able to
disable the pmu, which keeps 'enabled' from being set as well.

> +
> + switch (funcid) {
> + case SBI_EXT_PMU_NUM_COUNTERS:
> + ret = kvm_riscv_vcpu_pmu_num_ctrs(vcpu, out_val);
> + break;
> + case SBI_EXT_PMU_COUNTER_GET_INFO:
> + ret = kvm_riscv_vcpu_pmu_ctr_info(vcpu, cp->a0, out_val);
> + break;
> + case SBI_EXT_PMU_COUNTER_CFG_MATCH:
> +#if defined(CONFIG_32BIT)
> + temp = ((uint64_t)cp->a5 << 32) | cp->a4;
> +#else
> + temp = cp->a4;
> +#endif
> + ret = kvm_riscv_vcpu_pmu_ctr_cfg_match(vcpu, cp->a0, cp->a1, cp->a2, cp->a3, temp);
> + if (ret >= 0) {
> + *out_val = ret;
> + ret = 0;
> + }
> + break;
> + case SBI_EXT_PMU_COUNTER_START:
> +#if defined(CONFIG_32BIT)
> + temp = ((uint64_t)cp->a4 << 32) | cp->a3;
> +#else
> + temp = cp->a3;
> +#endif
> + ret = kvm_riscv_vcpu_pmu_ctr_start(vcpu, cp->a0, cp->a1, cp->a2, temp);
> + break;
> + case SBI_EXT_PMU_COUNTER_STOP:
> + ret = kvm_riscv_vcpu_pmu_ctr_stop(vcpu, cp->a0, cp->a1, cp->a2);
> + break;
> + case SBI_EXT_PMU_COUNTER_FW_READ:
> + ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, out_val);
> + break;
> + default:
> + ret = -EOPNOTSUPP;
> + }
> +
> + return ret;
> +}
> +
> +unsigned long kvm_sbi_ext_pmu_probe(unsigned long extid)
> +{
> + /*
> + * PMU Extension is only available to guests if privilege mode filtering
> + * is available. Otherwise, guest will always count events while the
> + * execution is in hypervisor mode.
> + */
> + return riscv_isa_extension_available(NULL, SSCOFPMF);
> +}
> +
> +const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
> + .extid_start = SBI_EXT_PMU,
> + .extid_end = SBI_EXT_PMU,
> + .handler = kvm_sbi_ext_pmu_handler,
> + .probe = kvm_sbi_ext_pmu_probe,
> +};
> --
> 2.25.1
>

Thanks,
drew

2022-11-01 14:42:21

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 7/9] RISC-V: KVM: Implement trap & emulate for hpmcounters

On Mon, Jul 18, 2022 at 10:02:03AM -0700, Atish Patra wrote:
> As the KVM guests only see the virtual PMU counters, all hpmcounter
> access should trap and KVM emulates the read access on behalf of guests.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +++++++++
> arch/riscv/kvm/vcpu_insn.c | 1 +
> arch/riscv/kvm/vcpu_pmu.c | 47 +++++++++++++++++++++++----
> 3 files changed, 57 insertions(+), 7 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> index bffee052f2ae..5410236b62a8 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> @@ -39,6 +39,19 @@ struct kvm_pmu {
> #define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
> #define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
>
> +#if defined(CONFIG_32BIT)
> +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> +{ .base = CSR_CYCLEH, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm }, \
> +{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
> +#else
> +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> +{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
> +#endif
> +
> +int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> + unsigned long *val, unsigned long new_val,
> + unsigned long wr_mask);
> +
> int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
> int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> unsigned long *ctr_info);
> @@ -59,6 +72,9 @@ void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> #else
> struct kvm_pmu {
> };
> +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> +{ .base = 0, .count = 0, .func = NULL },
> +
>
> static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> {
> diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
> index 0aa334f853c8..7c2a4b1a69f7 100644
> --- a/arch/riscv/kvm/vcpu_insn.c
> +++ b/arch/riscv/kvm/vcpu_insn.c
> @@ -215,6 +215,7 @@ struct csr_func {
> };
>
> static const struct csr_func csr_funcs[] = {
> + KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS
> };
>
> /**
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 3168ed740bdd..5434051f495d 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -14,6 +14,46 @@
> #include <asm/kvm_vcpu_pmu.h>
> #include <linux/kvm_host.h>
>
> +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> + unsigned long *out_val)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u64 enabled, running;
> +
> + if (!kvpmu)
> + return -EINVAL;
> +
> + pmc = &kvpmu->pmc[cidx];
> + if (!pmc->perf_event)
> + return -EINVAL;
> +
> + pmc->counter_val += perf_event_read_value(pmc->perf_event, &enabled, &running);
> + *out_val = pmc->counter_val;
> +
> + return 0;
> +}
> +
> +int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> + unsigned long *val, unsigned long new_val,
> + unsigned long wr_mask)
> +{
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int cidx, ret = KVM_INSN_CONTINUE_NEXT_SEPC;
> +
> + if (!kvpmu)
> + return KVM_INSN_EXIT_TO_USER_SPACE;
> + //TODO: Should we check if vcpu pmu is initialized or not!

I guess it depends on the path to this call. It'd be best to keep the
checks to the minimum, so if this isn't a top level call then I'd say
no, but we need to check in the top level.

> + if (wr_mask)
> + return KVM_INSN_ILLEGAL_TRAP;
> + cidx = csr_num - CSR_CYCLE;
> +
> + if (kvm_riscv_vcpu_pmu_ctr_read(vcpu, cidx, val) < 0)
> + return KVM_INSN_EXIT_TO_USER_SPACE;
> +
> + return ret;
> +}
> +
> int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
> {
> struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> @@ -60,13 +100,6 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> return 0;
> }
>
> -int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> - unsigned long *out_val)
> -{
> - /* TODO */
> - return 0;
> -}
> -
> int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> {
> int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> --
> 2.25.1
>

Thanks,
drew

2022-11-01 16:04:09

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Mon, Jul 18, 2022 at 10:02:04AM -0700, Atish Patra wrote:
> RISC-V SBI PMU & Sscofpmf ISA extension allows supporting perf in
> the virtualization enviornment as well. KVM implementation
> relies on SBI PMU extension for most of the part while traps

...relies on the SBI PMU extension for the most part while trapping
and emulating...

> & emulates the CSRs read for counter access.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/kvm/vcpu_pmu.c | 318 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 301 insertions(+), 17 deletions(-)
>
> diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> index 5434051f495d..278c261efad3 100644
> --- a/arch/riscv/kvm/vcpu_pmu.c
> +++ b/arch/riscv/kvm/vcpu_pmu.c
> @@ -11,9 +11,163 @@
> #include <linux/kvm_host.h>
> #include <linux/perf/riscv_pmu.h>
> #include <asm/csr.h>
> +#include <asm/bitops.h>
> #include <asm/kvm_vcpu_pmu.h>
> #include <linux/kvm_host.h>
>
> +#define get_event_type(x) ((x & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16)
> +#define get_event_code(x) (x & SBI_PMU_EVENT_IDX_CODE_MASK)
> +
> +static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
> +{
> + u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);
> + u64 sample_period;
> +
> + if (!pmc->counter_val)
> + sample_period = counter_val_mask;
> + else
> + sample_period = pmc->counter_val & counter_val_mask;
> +
> + return sample_period;
> +}
> +
> +static u32 pmu_get_perf_event_type(unsigned long eidx)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> + u32 type;
> +
> + if (etype == SBI_PMU_EVENT_TYPE_HW)
> + type = PERF_TYPE_HARDWARE;
> + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> + type = PERF_TYPE_HW_CACHE;
> + else if (etype == SBI_PMU_EVENT_TYPE_RAW || etype == SBI_PMU_EVENT_TYPE_FW)
> + type = PERF_TYPE_RAW;
> + else
> + type = PERF_TYPE_MAX;
> +
> + return type;
> +}
> +
> +static inline bool pmu_is_fw_event(unsigned long eidx)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> +
> + return (etype == SBI_PMU_EVENT_TYPE_FW) ? true : false;

return get_event_type(eidx) == SBI_PMU_EVENT_TYPE_FW;

> +}
> +
> +static void pmu_release_perf_event(struct kvm_pmc *pmc)
> +{
> + if (pmc->perf_event) {
> + perf_event_disable(pmc->perf_event);
> + perf_event_release_kernel(pmc->perf_event);
> + pmc->perf_event = NULL;
> + }
> +}
> +
> +static u64 pmu_get_perf_event_hw_config(u32 sbi_event_code)
> +{
> + /* SBI PMU HW event code is offset by 1 from perf hw event codes */
> + return (u64)sbi_event_code - 1;
> +}
> +
> +static u64 pmu_get_perf_event_cache_config(u32 sbi_event_code)
> +{
> + u64 config = U64_MAX;
> + unsigned int cache_type, cache_op, cache_result;
> +
> + /* All the cache event masks lie within 0xFF. No separate masking is necesssary */
> + cache_type = (sbi_event_code & SBI_PMU_EVENT_CACHE_ID_CODE_MASK) >> 3;
> + cache_op = (sbi_event_code & SBI_PMU_EVENT_CACHE_OP_ID_CODE_MASK) >> 1;
> + cache_result = sbi_event_code & SBI_PMU_EVENT_CACHE_RESULT_ID_CODE_MASK;
> +
> + if (cache_type >= PERF_COUNT_HW_CACHE_MAX ||
> + cache_op >= PERF_COUNT_HW_CACHE_OP_MAX ||
> + cache_result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
> + goto out;

No goto necessary

if (...)
return U64_MAX;
return cache_type | (cache_op << 8) | (cache_result << 16);

> + config = cache_type | (cache_op << 8) | (cache_result << 16);
> +out:
> + return config;
> +}
> +
> +static u64 pmu_get_perf_event_config(unsigned long eidx, uint64_t edata)
> +{
> + enum sbi_pmu_event_type etype = get_event_type(eidx);
> + u32 ecode = get_event_code(eidx);
> + u64 config = U64_MAX;
> +
> + if (etype == SBI_PMU_EVENT_TYPE_HW)
> + config = pmu_get_perf_event_hw_config(ecode);
> + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> + config = pmu_get_perf_event_cache_config(ecode);
> + else if (etype == SBI_PMU_EVENT_TYPE_RAW)
> + config = edata & RISCV_PMU_RAW_EVENT_MASK;
> + else if ((etype == SBI_PMU_EVENT_TYPE_FW) && (ecode < SBI_PMU_FW_MAX))
> + config = (1ULL << 63) | ecode;
> +
> + return config;
> +}
> +
> +static int pmu_get_fixed_pmc_index(unsigned long eidx)
> +{
> + u32 etype = pmu_get_perf_event_type(eidx);
> + u32 ecode = get_event_code(eidx);
> + int ctr_idx;
> +
> + if (etype != SBI_PMU_EVENT_TYPE_HW)
> + return -EINVAL;
> +
> + if (ecode == SBI_PMU_HW_CPU_CYCLES)
> + ctr_idx = 0;
> + else if (ecode == SBI_PMU_HW_INSTRUCTIONS)
> + ctr_idx = 2;
> + else
> + return -EINVAL;
> +
> + return ctr_idx;
> +}
> +
> +static int pmu_get_programmable_pmc_index(struct kvm_pmu *kvpmu, unsigned long eidx,
> + unsigned long cbase, unsigned long cmask)
> +{
> + int ctr_idx = -1;
> + int i, pmc_idx;
> + int min, max;
> +
> + if (pmu_is_fw_event(eidx)) {
> + /* Firmware counters are mapped 1:1 starting from num_hw_ctrs for simplicity */
> + min = kvpmu->num_hw_ctrs;
> + max = min + kvpmu->num_fw_ctrs;
> + } else {
> + /* First 3 counters are reserved for fixed counters */
> + min = 3;
> + max = kvpmu->num_hw_ctrs;
> + }
> +
> + for_each_set_bit(i, &cmask, BITS_PER_LONG) {
> + pmc_idx = i + cbase;
> + if ((pmc_idx >= min && pmc_idx < max) &&
> + !test_bit(pmc_idx, kvpmu->used_pmc)) {
> + ctr_idx = pmc_idx;
> + break;
> + }
> + }
> +
> + return ctr_idx;
> +}
> +
> +static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
> + unsigned long cbase, unsigned long cmask)
> +{
> + int ret;
> +
> + /* Fixed counters need to be have fixed mapping as they have different width */
> + ret = pmu_get_fixed_pmc_index(eidx);
> + if (ret >= 0)
> + return ret;
> +
> + return pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
> +}
> +
> int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> unsigned long *out_val)
> {
> @@ -43,7 +197,6 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
>
> if (!kvpmu)
> return KVM_INSN_EXIT_TO_USER_SPACE;
> - //TODO: Should we check if vcpu pmu is initialized or not!
> if (wr_mask)
> return KVM_INSN_ILLEGAL_TRAP;
> cidx = csr_num - CSR_CYCLE;
> @@ -81,14 +234,62 @@ int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int i, num_ctrs, pmc_index;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if (ctr_base + __fls(ctr_mask) >= num_ctrs)
> + return -EINVAL;
> +
> + /* Start the counters that have been configured and requested by the guest */
> + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> + pmc_index = i + ctr_base;
> + if (!test_bit(pmc_index, kvpmu->used_pmc))
> + continue;
> + pmc = &kvpmu->pmc[pmc_index];
> + if (flag & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> + pmc->counter_val = ival;
> + if (pmc->perf_event) {
> + perf_event_period(pmc->perf_event, pmu_get_sample_period(pmc));
> + perf_event_enable(pmc->perf_event);
> + }

What about checking the "SBI_ERR_ALREADY_STARTED" condition?

> + }
> +
> return 0;
> }
>
> int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> unsigned long ctr_mask, unsigned long flag)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + int i, num_ctrs, pmc_index;
> + u64 enabled, running;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if ((ctr_base + __fls(ctr_mask)) >= num_ctrs)
> + return -EINVAL;
> +
> + /* Stop the counters that have been configured and requested by the guest */
> + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> + pmc_index = i + ctr_base;
> + if (!test_bit(pmc_index, kvpmu->used_pmc))
> + continue;
> + pmc = &kvpmu->pmc[pmc_index];
> + if (pmc->perf_event) {
> + /* Stop counting the counter */
> + perf_event_disable(pmc->perf_event);
> + if (flag & SBI_PMU_STOP_FLAG_RESET) {
> + /* Relase the counter if this is a reset request */
> + pmc->counter_val += perf_event_read_value(pmc->perf_event,
> + &enabled, &running);
> + pmu_release_perf_event(pmc);
> + clear_bit(pmc_index, kvpmu->used_pmc);
> + }

What about checking the "SBI_ERR_ALREADY_STOPPED" condition?

> + }
> + }
> +
> return 0;
> }
>
> @@ -96,14 +297,85 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> unsigned long ctr_mask, unsigned long flag,
> unsigned long eidx, uint64_t edata)
> {
> - /* TODO */
> - return 0;
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> + struct perf_event_attr attr;
> + int num_ctrs, ctr_idx;
> + u32 etype = pmu_get_perf_event_type(eidx);
> + u64 config;
> + struct kvm_pmc *pmc;
> +
> + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> + if ((etype == PERF_TYPE_MAX) || ((ctr_base + __fls(ctr_mask)) >= num_ctrs))

nit: Unnecessary ()

> + return -EINVAL;
> +
> + if (pmu_is_fw_event(eidx))
> + return -EOPNOTSUPP;

nit: need blank line here

> + /*
> + * SKIP_MATCH flag indicates the caller is aware of the assigned counter
> + * for this event. Just do a sanity check if it already marked used.
> + */
> + if (flag & SBI_PMU_CFG_FLAG_SKIP_MATCH) {
> + if (!test_bit(ctr_base, kvpmu->used_pmc))
> + return -EINVAL;
> + ctr_idx = ctr_base;
> + goto match_done;
> + }
> +
> + ctr_idx = pmu_get_pmc_index(kvpmu, eidx, ctr_base, ctr_mask);
> + if (ctr_idx < 0)
> + return -EOPNOTSUPP;
> +
> +match_done:
> + pmc = &kvpmu->pmc[ctr_idx];
> + pmu_release_perf_event(pmc);
> + pmc->idx = ctr_idx;
> +
> + config = pmu_get_perf_event_config(eidx, edata);
> + memset(&attr, 0, sizeof(struct perf_event_attr));
> + attr.type = etype;
> + attr.size = sizeof(attr);
> + attr.pinned = true;
> +
> + /*
> + * It should never reach here if the platform doesn't support sscofpmf extensio
> + * as mode filtering won't work without it.
> + */
> + attr.exclude_host = true;
> + attr.exclude_hv = true;
> + attr.exclude_user = flag & SBI_PMU_CFG_FLAG_SET_UINH ? 1 : 0;
> + attr.exclude_kernel = flag & SBI_PMU_CFG_FLAG_SET_SINH ? 1 : 0;

nit: can use !!(flag & SBI_PMU_CFG_FLAG_SET_UINH)

> + attr.config = config;
> + attr.config1 = RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS;
> + if (flag & SBI_PMU_CFG_FLAG_CLEAR_VALUE) {
> + //TODO: Do we really want to clear the value in hardware counter
> + pmc->counter_val = 0;
> + }

nit: add blank line

> + /*
> + * Set the default sample_period for now. The guest specified value
> + * will be updated in the start call.
> + */
> + attr.sample_period = pmu_get_sample_period(pmc);
> +
> + event = perf_event_create_kernel_counter(&attr, -1, current, NULL, pmc);
> + if (IS_ERR(event)) {
> + pr_err("kvm pmu event creation failed event %pe for eidx %lx\n", event, eidx);
> + return -EOPNOTSUPP;
> + }
> +
> + set_bit(ctr_idx, kvpmu->used_pmc);
> + pmc->perf_event = event;
> + if (flag & SBI_PMU_CFG_FLAG_AUTO_START)
> + perf_event_enable(pmc->perf_event);
> +
> + return ctr_idx;
> }
>
> int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> {
> int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
>
> if (!kvpmu)
> return -EINVAL;
> @@ -120,6 +392,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> return -EINVAL;
> }
>
> + bitmap_zero(kvpmu->used_pmc, RISCV_MAX_COUNTERS);
> kvpmu->num_hw_ctrs = num_hw_ctrs;
> kvpmu->num_fw_ctrs = num_fw_ctrs;
> /*
> @@ -132,38 +405,49 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> /* TIME CSR shouldn't be read from perf interface */
> if (i == 1)
> continue;
> - kvpmu->pmc[i].idx = i;
> - kvpmu->pmc[i].vcpu = vcpu;
> + pmc = &kvpmu->pmc[i];

Better to introduce this with the patch that introduced this loop.

> + pmc->idx = i;
> + pmc->counter_val = 0;
> + pmc->vcpu = vcpu;
> if (i < kvpmu->num_hw_ctrs) {
> kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> if (i < 3)
> /* CY, IR counters */
> - kvpmu->pmc[i].cinfo.width = 63;
> + pmc->cinfo.width = 63;
> else
> - kvpmu->pmc[i].cinfo.width = hpm_width;
> + pmc->cinfo.width = hpm_width;
> /*
> * The CSR number doesn't have any relation with the logical
> * hardware counters. The CSR numbers are encoded sequentially
> * to avoid maintaining a map between the virtual counter
> * and CSR number.
> */
> - kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> + pmc->cinfo.csr = CSR_CYCLE + i;
> } else {
> - kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> - kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> + pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
> + pmc->cinfo.width = BITS_PER_LONG - 1;
> }
> }
>
> return 0;
> }
>
> -void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> {
> - /* TODO */
> + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + int i;
> +
> + if (!kvpmu)
> + return;
> +
> + for_each_set_bit(i, kvpmu->used_pmc, RISCV_MAX_COUNTERS) {
> + pmc = &kvpmu->pmc[i];
> + pmu_release_perf_event(pmc);
> + }
> }
>
> -void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> {
> - /* TODO */
> + kvm_riscv_vcpu_pmu_deinit(vcpu);
> }
> -
> --
> 2.25.1
>

Thanks,
drew

2022-11-01 16:11:08

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 3/9] RISC-V: KVM: Define a probe function for SBI extension data structures

On Mon, Jul 18, 2022 at 10:01:59AM -0700, Atish Patra wrote:
> c,urrently the probe function just check if an SBI extension is
Currently checks

> registered or not. However, the extension may not want to advertise
> itself depending on some other condition.
> An additional extension specific probe function will allow
> extensions to decide if they want to be advertised to the caller or
> not. Any extension that do not require additional dependency check
does checks

> does not required to implement this function.

s/does/is/

>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +++
> arch/riscv/kvm/vcpu_sbi_base.c | 13 +++++++++++--
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
> index 83d6d4d2b1df..5853a1ef71ea 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
> @@ -25,6 +25,9 @@ struct kvm_vcpu_sbi_extension {
> int (*handler)(struct kvm_vcpu *vcpu, struct kvm_run *run,
> unsigned long *out_val, struct kvm_cpu_trap *utrap,
> bool *exit);
> +
> + /* Extension specific probe function */
> + unsigned long (*probe)(unsigned long extid);
> };
>
> void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run);
> diff --git a/arch/riscv/kvm/vcpu_sbi_base.c b/arch/riscv/kvm/vcpu_sbi_base.c
> index 48f431091cdb..14be1a819588 100644
> --- a/arch/riscv/kvm/vcpu_sbi_base.c
> +++ b/arch/riscv/kvm/vcpu_sbi_base.c
> @@ -22,6 +22,7 @@ static int kvm_sbi_ext_base_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> int ret = 0;
> struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> struct sbiret ecall_ret;
> + const struct kvm_vcpu_sbi_extension *sbi_ext;
>
> switch (cp->a6) {
> case SBI_EXT_BASE_GET_SPEC_VERSION:
> @@ -46,8 +47,16 @@ static int kvm_sbi_ext_base_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> */
> kvm_riscv_vcpu_sbi_forward(vcpu, run);
> *exit = true;
> - } else
> - *out_val = kvm_vcpu_sbi_find_ext(cp->a0) ? 1 : 0;
> + } else {
> + sbi_ext = kvm_vcpu_sbi_find_ext(cp->a0);
> + if (sbi_ext) {
> + if (sbi_ext->probe)
> + *out_val = sbi_ext->probe(cp->a0);
> + else
> + *out_val = 1;
> + } else
> + *out_val = 0;
> + }
> break;
> case SBI_EXT_BASE_GET_MVENDORID:
> case SBI_EXT_BASE_GET_MARCHID:
> --
> 2.25.1
>

Other than the commit message fixes

Reviewed-by: Andrew Jones <[email protected]>

2022-11-09 14:06:46

by Sergey Matyukevich

[permalink] [raw]
Subject: Re: [RFC 4/9] RISC-V: KVM: Improve privilege mode filtering for perf

Hi Atish,

> Currently, the host driver doesn't have any method to identify if the
> requested perf event is from kvm or bare metal. As KVM runs in HS
> mode, there are no separate hypervisor privilege mode to distinguish
> between the attributes for guest/host.
>
> Improve the privilege mode filtering by using the event specific
> config1 field.

... [snip]

> +static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> +{
> + unsigned long cflags = 0;
> + bool guest_events = false;
> +
> + if (event->attr.config1 & RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS)
> + guest_events = true;
> + if (event->attr.exclude_kernel)
> + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VSINH : SBI_PMU_CFG_FLAG_SET_SINH;

IIUC we should inhibit host counting if we want guest events:
cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_SINH : SBI_PMU_CFG_FLAG_SET_VSINH;

> + if (event->attr.exclude_user)
> + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VUINH : SBI_PMU_CFG_FLAG_SET_UINH;

Same here.

> + if (guest_events && event->attr.exclude_hv)
> + cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
> + if (event->attr.exclude_host)
> + cflags |= SBI_PMU_CFG_FLAG_SET_UINH | SBI_PMU_CFG_FLAG_SET_SINH;
> + if (event->attr.exclude_guest)
> + cflags |= SBI_PMU_CFG_FLAG_SET_VSINH | SBI_PMU_CFG_FLAG_SET_VUINH;
> +
> + return cflags;
> +}

Regards,
Sergey

2022-11-21 23:58:08

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 1/9] RISC-V: Define a helper function to probe number of hardware counters

On Tue, Nov 1, 2022 at 5:30 AM Andrew Jones <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:01:57AM -0700, Atish Patra wrote:
> > KVM module needs to know how many hardware counters the platform supports.
> > Otherwise, it will not be able to show optimal value of virtual
> ^ the
> > counters to the guest.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > drivers/perf/riscv_pmu_sbi.c | 23 +++++++++++++++++------
> > include/linux/perf/riscv_pmu.h | 4 ++++
> > 2 files changed, 21 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 24124546844c..1723af68ffa1 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -27,6 +27,7 @@
> > */
> > static union sbi_pmu_ctr_info *pmu_ctr_list;
> > static unsigned int riscv_pmu_irq;
> > +static struct riscv_pmu *rvpmu;
>
> Do we really need rvpmu? From a quick scan of the series it's only used
> for num_hw_counters, which has to be added to struct riscv_pmu, and
> num_counters. How about instead creating a static global for num_counters

Yes. I added rvpmu only with possible future usage in mind.

> and then getting num_hw_counters by iterating pmu_ctr_list. If we want

Iterating is fine, since we already do that for hpm_width anyway.

> riscv_pmu_sbi_get_num_hw_ctrs() to be faster, then we can cache the value
> in a static variable in the function.
>

We now have cmask, which can be cached in a static variable. We need to
retrieve both the counter width and the number of hardware counters for
KVM, so I have combined PATCH 1 and PATCH 2 to return both values from
one function.

> Thanks,
> drew



--
Regards,
Atish

2022-11-22 00:38:45

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 4/9] RISC-V: KVM: Improve privilege mode filtering for perf

On Wed, Nov 9, 2022 at 5:42 AM Sergey Matyukevich <[email protected]> wrote:
>
> Hi Atish,
>
> > Currently, the host driver doesn't have any method to identify if the
> > requested perf event is from kvm or bare metal. As KVM runs in HS
> > mode, there are no separate hypervisor privilege mode to distinguish
> > between the attributes for guest/host.
> >
> > Improve the privilege mode filtering by using the event specific
> > config1 field.
>
> ... [snip]
>
> > +static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> > +{
> > + unsigned long cflags = 0;
> > + bool guest_events = false;
> > +
> > + if (event->attr.config1 & RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS)
> > + guest_events = true;
> > + if (event->attr.exclude_kernel)
> > + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VSINH : SBI_PMU_CFG_FLAG_SET_SINH;
>
> IIUC we should inhibit host counting if we want guest events:
> cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_SINH : SBI_PMU_CFG_FLAG_SET_VSINH;
>

guest_events indicates that the user in the guest VM configured the
event to exclude the kernel, i.e. the guest kernel, which runs in VS
mode. That's why we have to set SBI_PMU_CFG_FLAG_SET_VSINH.

To inhibit host counting, the user needs to specify exclude_host
and/or exclude_hv, which is handled below as well.

> > + if (event->attr.exclude_user)
> > + cflags |= guest_events ? SBI_PMU_CFG_FLAG_SET_VUINH : SBI_PMU_CFG_FLAG_SET_UINH;
>
> Same here.
>

Same explanation as above.

> > + if (guest_events && event->attr.exclude_hv)
> > + cflags |= SBI_PMU_CFG_FLAG_SET_SINH;
> > + if (event->attr.exclude_host)
> > + cflags |= SBI_PMU_CFG_FLAG_SET_UINH | SBI_PMU_CFG_FLAG_SET_SINH;
> > + if (event->attr.exclude_guest)
> > + cflags |= SBI_PMU_CFG_FLAG_SET_VSINH | SBI_PMU_CFG_FLAG_SET_VUINH;
> > +
> > + return cflags;
> > +}
>
> Regards,
> Sergey



--
Regards,
Atish

2022-11-22 23:28:43

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Tue, Nov 1, 2022 at 7:26 AM Andrew Jones <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:02:02AM -0700, Atish Patra wrote:
> > SBI PMU extension allows KVM guests to configure/start/stop/query about
> > the PMU counters in virtualized enviornment as well.
> >
> > In order to allow that, KVM implements the entire SBI PMU extension.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/kvm/vcpu_sbi.c | 11 +++++
> > arch/riscv/kvm/vcpu_sbi_pmu.c | 81 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 92 insertions(+)
> > create mode 100644 arch/riscv/kvm/vcpu_sbi_pmu.c
> >
> > diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
> > index d45e7da3f0d3..da9f7959340e 100644
> > --- a/arch/riscv/kvm/vcpu_sbi.c
> > +++ b/arch/riscv/kvm/vcpu_sbi.c
> > @@ -50,6 +50,16 @@ extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
> > extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
> > extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;
> >
> > +#ifdef CONFIG_RISCV_PMU_SBI
> > +extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu;
> > +#else
> > +static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
> > + .extid_start = -1UL,
> > + .extid_end = -1UL,
> > + .handler = NULL,
> > +};
> > +#endif
> > +
> > static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
> > &vcpu_sbi_ext_v01,
> > &vcpu_sbi_ext_base,
> > @@ -58,6 +68,7 @@ static const struct kvm_vcpu_sbi_extension *sbi_ext[] = {
> > &vcpu_sbi_ext_rfence,
> > &vcpu_sbi_ext_srst,
> > &vcpu_sbi_ext_hsm,
> > + &vcpu_sbi_ext_pmu,
> > &vcpu_sbi_ext_experimental,
> > &vcpu_sbi_ext_vendor,
> > };
> > diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
> > new file mode 100644
> > index 000000000000..90c51a95d4f4
> > --- /dev/null
> > +++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
> > @@ -0,0 +1,81 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2022 Rivos Inc
> > + *
> > + * Authors:
> > + * Atish Patra <[email protected]>
> > + */
> > +
> > +#include <linux/errno.h>
> > +#include <linux/err.h>
> > +#include <linux/kvm_host.h>
> > +#include <asm/csr.h>
> > +#include <asm/sbi.h>
> > +#include <asm/kvm_vcpu_sbi.h>
> > +
> > +static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> > + unsigned long *out_val,
> > + struct kvm_cpu_trap *utrap,
> > + bool *exit)
> > +{
> > + int ret = -EOPNOTSUPP;
> > + struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> > + unsigned long funcid = cp->a6;
> > + uint64_t temp;
>
> I think we need something like
>
> if (!vcpu_to_pmu(vcpu)->enabled)
> return -EOPNOTSUPP;
>
> here. Where 'enabled' is only true when we successfully initialize
> the pmu. And, successful initialization depends on

Yes. I have added the flag. But should we do the check here, or in the
respective functions as a paranoid sanity check?

> IS_ENABLED(CONFIG_RISCV_PMU_SBI) and

Why do we need to guard this under CONFIG_RISCV_PMU_SBI?
vcpu_sbi_pmu.c is only compiled when CONFIG_RISCV_PMU_SBI is enabled.

If CONFIG_RISCV_PMU_SBI is not enabled, the probe function will return failure.

In fact, I think we should also add the pmu-enabled check in the probe
function itself. The probe function (kvm_sbi_ext_pmu_probe) should only
return true when both vcpu_to_pmu(vcpu)->enabled and
riscv_isa_extension_available(NULL, SSCOFPMF) are true.

Thoughts?

> riscv_isa_extension_available(NULL, SSCOFPMF) as well as not
> failing other things. And, KVM userspace should be able to
> disable the pmu, which keep enabled from being set as well.
>
We already have provisions for disabling sscofpmf for guests via the
ISA one-reg interface. Do you mean disabling the entire PMU from
userspace?

Currently, arm64 enables the PMU from user space using device control
APIs on the vcpu fd. Are you suggesting we should do something like
that?

If the PMU needs device control APIs (either via the vcpu fd or its
own), we can retrieve the hpmcounter width and count from there as
well.

> > +
> > + switch (funcid) {
> > + case SBI_EXT_PMU_NUM_COUNTERS:
> > + ret = kvm_riscv_vcpu_pmu_num_ctrs(vcpu, out_val);
> > + break;
> > + case SBI_EXT_PMU_COUNTER_GET_INFO:
> > + ret = kvm_riscv_vcpu_pmu_ctr_info(vcpu, cp->a0, out_val);
> > + break;
> > + case SBI_EXT_PMU_COUNTER_CFG_MATCH:
> > +#if defined(CONFIG_32BIT)
> > + temp = ((uint64_t)cp->a5 << 32) | cp->a4;
> > +#else
> > + temp = cp->a4;
> > +#endif
> > + ret = kvm_riscv_vcpu_pmu_ctr_cfg_match(vcpu, cp->a0, cp->a1, cp->a2, cp->a3, temp);
> > + if (ret >= 0) {
> > + *out_val = ret;
> > + ret = 0;
> > + }
> > + break;
> > + case SBI_EXT_PMU_COUNTER_START:
> > +#if defined(CONFIG_32BIT)
> > + temp = ((uint64_t)cp->a4 << 32) | cp->a3;
> > +#else
> > + temp = cp->a3;
> > +#endif
> > + ret = kvm_riscv_vcpu_pmu_ctr_start(vcpu, cp->a0, cp->a1, cp->a2, temp);
> > + break;
> > + case SBI_EXT_PMU_COUNTER_STOP:
> > + ret = kvm_riscv_vcpu_pmu_ctr_stop(vcpu, cp->a0, cp->a1, cp->a2);
> > + break;
> > + case SBI_EXT_PMU_COUNTER_FW_READ:
> > + ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, out_val);
> > + break;
> > + default:
> > + ret = -EOPNOTSUPP;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +unsigned long kvm_sbi_ext_pmu_probe(unsigned long extid)
> > +{
> > + /*
> > + * PMU Extension is only available to guests if privilege mode filtering
> > + * is available. Otherwise, guest will always count events while the
> > + * execution is in hypervisor mode.
> > + */
> > + return riscv_isa_extension_available(NULL, SSCOFPMF);
> > +}
> > +
> > +const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
> > + .extid_start = SBI_EXT_PMU,
> > + .extid_end = SBI_EXT_PMU,
> > + .handler = kvm_sbi_ext_pmu_handler,
> > + .probe = kvm_sbi_ext_pmu_probe,
> > +};
> > --
> > 2.25.1
> >
>
> Thanks,
> drew



--
Regards,
Atish

2022-11-22 23:29:56

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 7/9] RISC-V: KVM: Implement trap & emulate for hpmcounters

On Tue, Nov 1, 2022 at 7:35 AM Andrew Jones <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:02:03AM -0700, Atish Patra wrote:
> > As the KVM guests only see the virtual PMU counters, all hpmcounter
> > access should trap and KVM emulates the read access on behalf of guests.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/kvm_vcpu_pmu.h | 16 +++++++++
> > arch/riscv/kvm/vcpu_insn.c | 1 +
> > arch/riscv/kvm/vcpu_pmu.c | 47 +++++++++++++++++++++++----
> > 3 files changed, 57 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > index bffee052f2ae..5410236b62a8 100644
> > --- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > @@ -39,6 +39,19 @@ struct kvm_pmu {
> > #define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
> > #define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
> >
> > +#if defined(CONFIG_32BIT)
> > +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> > +{ .base = CSR_CYCLEH, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm }, \
> > +{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
> > +#else
> > +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> > +{ .base = CSR_CYCLE, .count = 31, .func = kvm_riscv_vcpu_pmu_read_hpm },
> > +#endif
> > +
> > +int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> > + unsigned long *val, unsigned long new_val,
> > + unsigned long wr_mask);
> > +
> > int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
> > int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > unsigned long *ctr_info);
> > @@ -59,6 +72,9 @@ void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> > #else
> > struct kvm_pmu {
> > };
> > +#define KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS \
> > +{ .base = 0, .count = 0, .func = NULL },
> > +
> >
> > static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > {
> > diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
> > index 0aa334f853c8..7c2a4b1a69f7 100644
> > --- a/arch/riscv/kvm/vcpu_insn.c
> > +++ b/arch/riscv/kvm/vcpu_insn.c
> > @@ -215,6 +215,7 @@ struct csr_func {
> > };
> >
> > static const struct csr_func csr_funcs[] = {
> > + KVM_RISCV_VCPU_HPMCOUNTER_CSR_FUNCS
> > };
> >
> > /**
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 3168ed740bdd..5434051f495d 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -14,6 +14,46 @@
> > #include <asm/kvm_vcpu_pmu.h>
> > #include <linux/kvm_host.h>
> >
> > +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > + unsigned long *out_val)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + u64 enabled, running;
> > +
> > + if (!kvpmu)
> > + return -EINVAL;
> > +
> > + pmc = &kvpmu->pmc[cidx];
> > + if (!pmc->perf_event)
> > + return -EINVAL;
> > +
> > + pmc->counter_val += perf_event_read_value(pmc->perf_event, &enabled, &running);
> > + *out_val = pmc->counter_val;
> > +
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> > + unsigned long *val, unsigned long new_val,
> > + unsigned long wr_mask)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + int cidx, ret = KVM_INSN_CONTINUE_NEXT_SEPC;
> > +
> > + if (!kvpmu)
> > + return KVM_INSN_EXIT_TO_USER_SPACE;
> > + //TODO: Should we check if vcpu pmu is initialized or not!
>
> I guess it depends on the path to this call. It'd be best to keep the
> checks to the minimum, so if this isn't a top level call then I'd say
> no, but we need to check in the top level.
>

Based on the discussion on PATCH 6, we won't require the initialization
check in these functions.

We can leave the paranoid sanity check in kvm_riscv_vcpu_pmu_num_ctrs
and kvm_riscv_vcpu_pmu_ctr_info, though.

> > + if (wr_mask)
> > + return KVM_INSN_ILLEGAL_TRAP;
> > + cidx = csr_num - CSR_CYCLE;
> > +
> > + if (kvm_riscv_vcpu_pmu_ctr_read(vcpu, cidx, val) < 0)
> > + return KVM_INSN_EXIT_TO_USER_SPACE;
> > +
> > + return ret;
> > +}
> > +
> > int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
> > {
> > struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > @@ -60,13 +100,6 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> > return 0;
> > }
> >
> > -int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > - unsigned long *out_val)
> > -{
> > - /* TODO */
> > - return 0;
> > -}
> > -
> > int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > {
> > int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> > --
> > 2.25.1
> >
>
> Thanks,
> drew



--
Regards,
Atish

2022-11-23 01:03:13

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf


On Tue, Nov 1, 2022 at 7:13 AM Andrew Jones <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:02:01AM -0700, Atish Patra wrote:
> > This patch only adds barebore structure of perf implementation. Most of
> a bare bones ^ the
>
> > the function returns zero at this point and will be implemented
> functions
>
> > fully in the future.
>
> s/the future/later patches/
>

Done.

> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/include/asm/kvm_host.h | 3 +
> > arch/riscv/include/asm/kvm_vcpu_pmu.h | 70 +++++++++++++
> > arch/riscv/kvm/Makefile | 1 +
> > arch/riscv/kvm/main.c | 3 +-
> > arch/riscv/kvm/vcpu.c | 5 +
> > arch/riscv/kvm/vcpu_insn.c | 3 +-
> > arch/riscv/kvm/vcpu_pmu.c | 136 ++++++++++++++++++++++++++
> > 7 files changed, 219 insertions(+), 2 deletions(-)
> > create mode 100644 arch/riscv/include/asm/kvm_vcpu_pmu.h
> > create mode 100644 arch/riscv/kvm/vcpu_pmu.c
> >
> > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > index 59a0cf2ca7b9..5d2312828bb2 100644
> > --- a/arch/riscv/include/asm/kvm_host.h
> > +++ b/arch/riscv/include/asm/kvm_host.h
> > @@ -18,6 +18,7 @@
> > #include <asm/kvm_vcpu_fp.h>
> > #include <asm/kvm_vcpu_insn.h>
> > #include <asm/kvm_vcpu_timer.h>
> > +#include <asm/kvm_vcpu_pmu.h>
> >
> > #define KVM_MAX_VCPUS 1024
> >
> > @@ -226,6 +227,8 @@ struct kvm_vcpu_arch {
> >
> > /* Don't run the VCPU (blocked) */
> > bool pause;
> > +
> > + struct kvm_pmu pmu;
> > };
> >
> > static inline void kvm_arch_hardware_unsetup(void) {}
> > diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > new file mode 100644
> > index 000000000000..bffee052f2ae
> > --- /dev/null
> > +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > @@ -0,0 +1,70 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (c) 2022 Rivos Inc
> > + *
> > + * Authors:
> > + * Atish Patra <[email protected]>
> > + */
> > +
> > +#ifndef _KVM_VCPU_RISCV_PMU_H
> > +#define _KVM_VCPU_RISCV_PMU_H
>
> The convention seems to be to leading underscores for these types of
> defines, i.e. __KVM_VCPU_RISCV_PMU_H
>

Yes. It was a typo. Fixed.

> > +
> > +#include <linux/perf/riscv_pmu.h>
> > +#include <asm/sbi.h>
> > +
> > +#ifdef CONFIG_RISCV_PMU_SBI
> > +#define RISCV_KVM_MAX_FW_CTRS 32
> > +
> > +/* Per virtual pmu counter data */
> > +struct kvm_pmc {
> > + u8 idx;
> > + struct kvm_vcpu *vcpu;
>
> I'm not sure we need a vcpu pointer here. If it's just to implement
> pmc_to_pmu(), then we can instead implement a pmc_to_vcpu(), like
> arm64's kvm_pmc_to_vcpu(). x86 might be able to do that too, since
> it appears the conversion macros below originated there.
>

Yes. We can follow the arm64 approach instead of x86.
I just thought the x86 approach of keeping a reference to the vcpu is a
bit simpler than computing the parent offset with container_of multiple
times. This will be invoked once per overflow in the counter overflow
path.

If you feel strongly about the arm64 way, we can follow that. I have
removed it from this series for now.
Depending on the conclusion, I will add it back in the kvm sscofpmf
support series if required.

> > + struct perf_event *perf_event;
> > + uint64_t counter_val;
> > + union sbi_pmu_ctr_info cinfo;
> > +};
> > +
> > +/* PMU data structure per vcpu */
> > +struct kvm_pmu {
> > + struct kvm_pmc pmc[RISCV_MAX_COUNTERS];
> > + /* Number of the virtual firmware counters available */
> > + int num_fw_ctrs;
> > + /* Number of the virtual hardware counters available */
> > + int num_hw_ctrs;
> > + /* Bit map of all the virtual counter used */
> counters
>
> > + DECLARE_BITMAP(used_pmc, RISCV_MAX_COUNTERS);
>
> How about naming this pmc_in_use like x86?
>

Done.

> > +};
> > +
> > +#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
> > +#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
> > +#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
> > +
> > +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
> > +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > + unsigned long *ctr_info);
> > +
>
> nit: no need for this blank line

Fixed.

>
> > +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag, uint64_t ival);
> > +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag);
> > +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag,
> > + unsigned long eidx, uint64_t edata);
> > +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > + unsigned long *out_val);
> > +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> > +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> > +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> > +
> > +#else
> > +struct kvm_pmu {
> > +};
> > +
> > +static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > +{
> > + return 0;
> > +}
> > +static inline void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu) {}
> > +static inline void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu) {}
> > +#endif
> > +#endif
> > diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
> > index 019df9208bdd..342d7199e89d 100644
> > --- a/arch/riscv/kvm/Makefile
> > +++ b/arch/riscv/kvm/Makefile
> > @@ -25,3 +25,4 @@ kvm-y += vcpu_sbi_base.o
> > kvm-y += vcpu_sbi_replace.o
> > kvm-y += vcpu_sbi_hsm.o
> > kvm-y += vcpu_timer.o
> > +kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o vcpu_pmu.o
> > diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> > index 1549205fe5fe..d41ab6d1987d 100644
> > --- a/arch/riscv/kvm/main.c
> > +++ b/arch/riscv/kvm/main.c
> > @@ -49,7 +49,8 @@ int kvm_arch_hardware_enable(void)
> > hideleg |= (1UL << IRQ_VS_EXT);
> > csr_write(CSR_HIDELEG, hideleg);
> >
> > - csr_write(CSR_HCOUNTEREN, -1UL);
> > + /* VS should access only TM bit. Everything else should trap */
> > + csr_write(CSR_HCOUNTEREN, 0x02);
>
> This looks like something that should be broken out into a separate patch
> with a description of what happens now when guests try to access the newly
> trapping counter registers. We should probably also create a TM define.
>

Done.

> >
> > csr_write(CSR_HVIP, 0);
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 3c95924d38c7..4cc964aaf2ad 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -122,6 +122,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
> >
> > WRITE_ONCE(vcpu->arch.irqs_pending, 0);
> > WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
> > + kvm_riscv_vcpu_pmu_reset(vcpu);
> >
> > vcpu->arch.hfence_head = 0;
> > vcpu->arch.hfence_tail = 0;
> > @@ -174,6 +175,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > /* Setup VCPU timer */
> > kvm_riscv_vcpu_timer_init(vcpu);
> >
> > + /* setup performance monitoring */
> > + kvm_riscv_vcpu_pmu_init(vcpu);
> > +
> > /* Reset VCPU */
> > kvm_riscv_reset_vcpu(vcpu);
> >
> > @@ -196,6 +200,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> > /* Cleanup VCPU timer */
> > kvm_riscv_vcpu_timer_deinit(vcpu);
> >
> > + kvm_riscv_vcpu_pmu_deinit(vcpu);
> > /* Free unused pages pre-allocated for G-stage page table mappings */
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
> > }
> > diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
> > index 7eb90a47b571..0aa334f853c8 100644
> > --- a/arch/riscv/kvm/vcpu_insn.c
> > +++ b/arch/riscv/kvm/vcpu_insn.c
> > @@ -214,7 +214,8 @@ struct csr_func {
> > unsigned long wr_mask);
> > };
> >
> > -static const struct csr_func csr_funcs[] = { };
> > +static const struct csr_func csr_funcs[] = {
> > +};
>
> stray change
>

Fixed

> >
> > /**
> > * kvm_riscv_vcpu_csr_return -- Handle CSR read/write after user space
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > new file mode 100644
> > index 000000000000..3168ed740bdd
> > --- /dev/null
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -0,0 +1,136 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2022 Rivos Inc
> > + *
> > + * Authors:
> > + * Atish Patra <[email protected]>
> > + */
> > +
> > +#include <linux/errno.h>
> > +#include <linux/err.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/perf/riscv_pmu.h>
> > +#include <asm/csr.h>
> > +#include <asm/kvm_vcpu_pmu.h>
> > +#include <linux/kvm_host.h>
> > +
> > +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > +
> > + if (!kvpmu)
> > + return -EINVAL;
>
> kvpmu can never be null because arch.pmu isn't a pointer. We probably
> shouldn't be making calls to kvm_riscv_vcpu_pmu_num_ctrs() without knowing
> we have an initialized pmu anyway, though.
>

Yes. I have added an init_done flag to do that sanity check.
I can change it based on the conclusion on PATCH 6.

> > +
> > + *out_val = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > + unsigned long *ctr_info)
> > +{
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > +
> > + if (!kvpmu || (cidx > RISCV_MAX_COUNTERS) || (cidx == 1))
>
> nit: unnecessary ()
>
> > + return -EINVAL;
> > +
> > + *ctr_info = kvpmu->pmc[cidx].cinfo.value;
> > +
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> > +{
> > + /* TODO */
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag)
> > +{
> > + /* TODO */
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > + unsigned long ctr_mask, unsigned long flag,
> > + unsigned long eidx, uint64_t edata)
> > +{
> > + /* TODO */
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > + unsigned long *out_val)
> > +{
> > + /* TODO */
> > + return 0;
> > +}
> > +
> > +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > +{
> > + int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > +
> > + if (!kvpmu)
> > + return -EINVAL;
> > +
> > + num_hw_ctrs = riscv_pmu_sbi_get_num_hw_ctrs();
> > + if ((num_hw_ctrs + RISCV_KVM_MAX_FW_CTRS) > RISCV_MAX_COUNTERS)
> > + num_fw_ctrs = RISCV_MAX_COUNTERS - num_hw_ctrs;
> > + else
> > + num_fw_ctrs = RISCV_KVM_MAX_FW_CTRS;
>
> Why do we need RISCV_KVM_MAX_FW_CTRS? Can't we just always get the number
> with RISCV_MAX_COUNTERS - num_hw_ctrs ?
>
We can, but then we have to allocate fw_event at runtime. As most
platforms don't implement all 29 hpmcounters, you would end up with
more firmware counters than needed.
Currently, the SBI spec only defines 21 firmware counters anyway.

Thus I felt runtime allocation is unnecessary. But it's just a few
bytes, so I don't feel strongly about it.

> > +
> > + hpm_width = riscv_pmu_sbi_hpmc_width();
> > + if (hpm_width <= 0) {
> > + pr_err("Can not initialize PMU for vcpu as hpmcounter width is not available\n");
> > + return -EINVAL;
> > + }
> > +
> > + kvpmu->num_hw_ctrs = num_hw_ctrs;
> > + kvpmu->num_fw_ctrs = num_fw_ctrs;
>
> Maybe it's coming later, but we need to give KVM userspace control over
> the number of counters to allow it to migrate to a larger set of hosts.
> Also, a previous patch said the virtual width must be the same as the
> host width for the hw counters, so we need userspace to know what that
> is in order to determine to which hosts it can migrate a guest.
>

Yes. The entire user space access control needs to be sketched out.
We probably need another one-reg interface to set/get the number of
counters and the width.

However, is it common to migrate a guest between hosts with different
PMU capabilities?

> > + /*
> > + * There is no corelation betwen the logical hardware counter and virtual counters.
> > + * However, we need to encode a hpmcounter CSR in the counter info field so that
> > + * KVM can trap n emulate the read. This works well in the migraiton usecase as well
>
> s/well//
>
> > + * KVM doesn't care if the actual hpmcounter is available in the hardware or not.
> > + */
> > + for (i = 0; i < num_hw_ctrs + num_fw_ctrs; i++) {
>
> Maybe we need a helper macro like
>
> #define kvm_pmu_num_counters(pmu) ((pmu)->num_hw_ctrs + (pmu)->num_fw_ctrs)
>
> if we're going to loop over all counters frequently.
>

Done.


> > + /* TIME CSR shouldn't be read from perf interface */
> > + if (i == 1)
> > + continue;
> > + kvpmu->pmc[i].idx = i;
> > + kvpmu->pmc[i].vcpu = vcpu;
> > + if (i < kvpmu->num_hw_ctrs) {
> > + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> > + if (i < 3)
> > + /* CY, IR counters */
> > + kvpmu->pmc[i].cinfo.width = 63;
> > + else
> > + kvpmu->pmc[i].cinfo.width = hpm_width;
> > + /*
> > + * The CSR number doesn't have any relation with the logical
> > + * hardware counters. The CSR numbers are encoded sequentially
> > + * to avoid maintaining a map between the virtual counter
> > + * and CSR number.
> > + */
> > + kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> > + } else {
> > + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> > + kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > +{
> > + /* TODO */
> > +}
> > +
> > +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > +{
> > + /* TODO */
> > +}
> > +
> > --
> > 2.25.1
> >
>
> Thanks,
> drew



--
Regards,
Atish

2022-11-23 01:24:42

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Tue, Nov 1, 2022 at 8:31 AM Andrew Jones <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:02:04AM -0700, Atish Patra wrote:
> > RISC-V SBI PMU & Sscofpmf ISA extension allows supporting perf in
> > the virtualization enviornment as well. KVM implementation
> > relies on SBI PMU extension for most of the part while traps
>
> ...relies on the SBI PMU extension for the most part while trapping
> and emulating...
>
> > & emulates the CSRs read for counter access.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/kvm/vcpu_pmu.c | 318 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 301 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > index 5434051f495d..278c261efad3 100644
> > --- a/arch/riscv/kvm/vcpu_pmu.c
> > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > @@ -11,9 +11,163 @@
> > #include <linux/kvm_host.h>
> > #include <linux/perf/riscv_pmu.h>
> > #include <asm/csr.h>
> > +#include <asm/bitops.h>
> > #include <asm/kvm_vcpu_pmu.h>
> > #include <linux/kvm_host.h>
> >
> > +#define get_event_type(x) ((x & SBI_PMU_EVENT_IDX_TYPE_MASK) >> 16)
> > +#define get_event_code(x) (x & SBI_PMU_EVENT_IDX_CODE_MASK)
> > +
> > +static inline u64 pmu_get_sample_period(struct kvm_pmc *pmc)
> > +{
> > + u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);
> > + u64 sample_period;
> > +
> > + if (!pmc->counter_val)
> > + sample_period = counter_val_mask;
> > + else
> > + sample_period = pmc->counter_val & counter_val_mask;
> > +
> > + return sample_period;
> > +}
> > +
> > +static u32 pmu_get_perf_event_type(unsigned long eidx)
> > +{
> > + enum sbi_pmu_event_type etype = get_event_type(eidx);
> > + u32 type;
> > +
> > + if (etype == SBI_PMU_EVENT_TYPE_HW)
> > + type = PERF_TYPE_HARDWARE;
> > + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> > + type = PERF_TYPE_HW_CACHE;
> > + else if (etype == SBI_PMU_EVENT_TYPE_RAW || etype == SBI_PMU_EVENT_TYPE_FW)
> > + type = PERF_TYPE_RAW;
> > + else
> > + type = PERF_TYPE_MAX;
> > +
> > + return type;
> > +}
> > +
> > +static inline bool pmu_is_fw_event(unsigned long eidx)
> > +{
> > + enum sbi_pmu_event_type etype = get_event_type(eidx);
> > +
> > + return (etype == SBI_PMU_EVENT_TYPE_FW) ? true : false;
>
> return get_event_type(eidx) == SBI_PMU_EVENT_TYPE_FW;
>

Done.

> > +}
> > +
> > +static void pmu_release_perf_event(struct kvm_pmc *pmc)
> > +{
> > + if (pmc->perf_event) {
> > + perf_event_disable(pmc->perf_event);
> > + perf_event_release_kernel(pmc->perf_event);
> > + pmc->perf_event = NULL;
> > + }
> > +}
> > +
> > +static u64 pmu_get_perf_event_hw_config(u32 sbi_event_code)
> > +{
> > + /* SBI PMU HW event code is offset by 1 from perf hw event codes */
> > + return (u64)sbi_event_code - 1;
> > +}
> > +
> > +static u64 pmu_get_perf_event_cache_config(u32 sbi_event_code)
> > +{
> > + u64 config = U64_MAX;
> > + unsigned int cache_type, cache_op, cache_result;
> > +
> > + /* All the cache event masks lie within 0xFF. No separate masking is necesssary */
> > + cache_type = (sbi_event_code & SBI_PMU_EVENT_CACHE_ID_CODE_MASK) >> 3;
> > + cache_op = (sbi_event_code & SBI_PMU_EVENT_CACHE_OP_ID_CODE_MASK) >> 1;
> > + cache_result = sbi_event_code & SBI_PMU_EVENT_CACHE_RESULT_ID_CODE_MASK;
> > +
> > + if (cache_type >= PERF_COUNT_HW_CACHE_MAX ||
> > + cache_op >= PERF_COUNT_HW_CACHE_OP_MAX ||
> > + cache_result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
> > + goto out;
>
> No goto necessary
>

Done.

> if (...)
> return U64_MAX;
> return cache_type | (cache_op << 8) | (cache_result << 16);
>
> > + config = cache_type | (cache_op << 8) | (cache_result << 16);
> > +out:
> > + return config;
> > +}
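Following the early-return suggestion, the cache-config encoding can be modelled and sanity-checked in userspace (PERF_COUNT_HW_CACHE_* limits copied from the perf UAPI header; the SBI cache-event code layout is an assumption here):

```c
#include <assert.h>
#include <stdint.h>

/* Limits as in include/uapi/linux/perf_event.h. */
#define CACHE_MAX	 7	/* PERF_COUNT_HW_CACHE_MAX */
#define CACHE_OP_MAX	 3	/* PERF_COUNT_HW_CACHE_OP_MAX */
#define CACHE_RESULT_MAX 2	/* PERF_COUNT_HW_CACHE_RESULT_MAX */

static uint64_t cache_config(uint32_t code)
{
	/* Assumed SBI cache-event code layout: [7:3] id, [2:1] op, [0] result */
	unsigned int type = (code & 0xF8) >> 3;
	unsigned int op = (code & 0x06) >> 1;
	unsigned int result = code & 0x01;

	if (type >= CACHE_MAX || op >= CACHE_OP_MAX ||
	    result >= CACHE_RESULT_MAX)
		return UINT64_MAX;

	/* perf hw-cache config encoding: type | (op << 8) | (result << 16) */
	return type | (op << 8) | ((uint64_t)result << 16);
}
```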
> > +
> > +static u64 pmu_get_perf_event_config(unsigned long eidx, uint64_t edata)
> > +{
> > + enum sbi_pmu_event_type etype = get_event_type(eidx);
> > + u32 ecode = get_event_code(eidx);
> > + u64 config = U64_MAX;
> > +
> > + if (etype == SBI_PMU_EVENT_TYPE_HW)
> > + config = pmu_get_perf_event_hw_config(ecode);
> > + else if (etype == SBI_PMU_EVENT_TYPE_CACHE)
> > + config = pmu_get_perf_event_cache_config(ecode);
> > + else if (etype == SBI_PMU_EVENT_TYPE_RAW)
> > + config = edata & RISCV_PMU_RAW_EVENT_MASK;
> > + else if ((etype == SBI_PMU_EVENT_TYPE_FW) && (ecode < SBI_PMU_FW_MAX))
> > + config = (1ULL << 63) | ecode;
> > +
> > + return config;
> > +}
> > +
> > +static int pmu_get_fixed_pmc_index(unsigned long eidx)
> > +{
> > + u32 etype = pmu_get_perf_event_type(eidx);
> > + u32 ecode = get_event_code(eidx);
> > + int ctr_idx;
> > +
> > + if (etype != SBI_PMU_EVENT_TYPE_HW)
> > + return -EINVAL;
> > +
> > + if (ecode == SBI_PMU_HW_CPU_CYCLES)
> > + ctr_idx = 0;
> > + else if (ecode == SBI_PMU_HW_INSTRUCTIONS)
> > + ctr_idx = 2;
> > + else
> > + return -EINVAL;
> > +
> > + return ctr_idx;
> > +}
> > +
> > +static int pmu_get_programmable_pmc_index(struct kvm_pmu *kvpmu, unsigned long eidx,
> > + unsigned long cbase, unsigned long cmask)
> > +{
> > + int ctr_idx = -1;
> > + int i, pmc_idx;
> > + int min, max;
> > +
> > + if (pmu_is_fw_event(eidx)) {
> > + /* Firmware counters are mapped 1:1 starting from num_hw_ctrs for simplicity */
> > + min = kvpmu->num_hw_ctrs;
> > + max = min + kvpmu->num_fw_ctrs;
> > + } else {
> > + /* First 3 counters are reserved for fixed counters */
> > + min = 3;
> > + max = kvpmu->num_hw_ctrs;
> > + }
> > +
> > + for_each_set_bit(i, &cmask, BITS_PER_LONG) {
> > + pmc_idx = i + cbase;
> > + if ((pmc_idx >= min && pmc_idx < max) &&
> > + !test_bit(pmc_idx, kvpmu->used_pmc)) {
> > + ctr_idx = pmc_idx;
> > + break;
> > + }
> > + }
> > +
> > + return ctr_idx;
> > +}
> > +
> > +static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
> > + unsigned long cbase, unsigned long cmask)
> > +{
> > + int ret;
> > +
> > + /* Fixed counters need to have a fixed mapping as they have different widths */
> > + ret = pmu_get_fixed_pmc_index(eidx);
> > + if (ret >= 0)
> > + return ret;
> > +
> > + return pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
> > +}
> > +
> > int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > unsigned long *out_val)
> > {
> > @@ -43,7 +197,6 @@ int kvm_riscv_vcpu_pmu_read_hpm(struct kvm_vcpu *vcpu, unsigned int csr_num,
> >
> > if (!kvpmu)
> > return KVM_INSN_EXIT_TO_USER_SPACE;
> > - //TODO: Should we check if vcpu pmu is initialized or not!
> > if (wr_mask)
> > return KVM_INSN_ILLEGAL_TRAP;
> > cidx = csr_num - CSR_CYCLE;
> > @@ -81,14 +234,62 @@ int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> > {
> > - /* TODO */
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + int i, num_ctrs, pmc_index;
> > + struct kvm_pmc *pmc;
> > +
> > + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> > + if (ctr_base + __fls(ctr_mask) >= num_ctrs)
> > + return -EINVAL;
> > +
> > + /* Start the counters that have been configured and requested by the guest */
> > + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> > + pmc_index = i + ctr_base;
> > + if (!test_bit(pmc_index, kvpmu->used_pmc))
> > + continue;
> > + pmc = &kvpmu->pmc[pmc_index];
> > + if (flag & SBI_PMU_START_FLAG_SET_INIT_VALUE)
> > + pmc->counter_val = ival;
> > + if (pmc->perf_event) {
> > + perf_event_period(pmc->perf_event, pmu_get_sample_period(pmc));
> > + perf_event_enable(pmc->perf_event);
> > + }
>
> What about checking the "SBI_ERR_ALREADY_STARTED" condition?
>

The pmu start and stop functions in the perf core framework return void,
so perf core cannot report an error code from pmu start/stop, and we
cannot derive SBI_ERR_ALREADY_STARTED/STOPPED from them. The core does
maintain a state (PERF_EVENT_STATE_ACTIVE once enabled,
PERF_EVENT_STATE_OFF when disabled). We can check those states, but they
can be inaccurate for other reasons. A debug message can be printed if
enabling/disabling fails, though.

However, the KVM PMU implementation should return
SBI_ERR_ALREADY_STARTED/STOPPED if the event state is not as expected
before enabling/disabling the event. I have added that.
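A minimal model of that check, with kernel types elided (the SBI error values below are taken from the SBI v1.0 spec):

```c
#include <assert.h>
#include <stdbool.h>

#define SBI_ERR_ALREADY_STARTED	(-7)
#define SBI_ERR_ALREADY_STOPPED	(-8)

struct counter {
	bool started;
};

/* Sketch: reject start/stop when the counter is already in that state,
 * mirroring the state check described above. */
static int ctr_start(struct counter *c)
{
	if (c->started)
		return SBI_ERR_ALREADY_STARTED;
	c->started = true;
	return 0;
}

static int ctr_stop(struct counter *c)
{
	if (!c->started)
		return SBI_ERR_ALREADY_STOPPED;
	c->started = false;
	return 0;
}
```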

This brings up another generic error-returning problem in KVM SBI land.
SBI error code numbers deliberately do not align with Linux error codes,
to accommodate other operating systems, even though most SBI error codes
have a 1:1 relationship with a Linux error code. Thus, KVM-internal code
returns a Linux-specific error code and vcpu_sbi maps it to an SBI error
code using kvm_linux_err_map_sbi.

However, this will not work for SBI_ERR_ALREADY_STARTED/STOPPED, as
there are no corresponding Linux error codes. We could return the SBI
error codes directly from vcpu_pmu.c and modify kvm_linux_err_map_sbi to
pass those through, but then we could not map any Linux error code that
collides with an SBI error code. Any other ideas to handle this case?
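One possible shape of the pass-through idea, sketched in plain C (SBI error values from the v1.0 spec; the mapping choices are illustrative, not the actual kvm_linux_err_map_sbi):

```c
#include <assert.h>
#include <errno.h>

/* SBI spec (v1.0) error values. */
#define SBI_ERR_FAILURE		(-1)
#define SBI_ERR_NOT_SUPPORTED	(-2)
#define SBI_ERR_INVALID_PARAM	(-3)
#define SBI_ERR_ALREADY_STARTED	(-7)
#define SBI_ERR_ALREADY_STOPPED	(-8)

/* Sketch of an error mapper that passes native SBI codes through:
 * the two SBI-only codes are returned unchanged, known Linux codes
 * are translated, everything else becomes a generic failure. */
static long err_map_sbi(long err)
{
	switch (err) {
	case 0:
		return 0;
	case SBI_ERR_ALREADY_STARTED:
	case SBI_ERR_ALREADY_STOPPED:
		return err;		/* pass-through */
	case -EINVAL:
		return SBI_ERR_INVALID_PARAM;
	case -EOPNOTSUPP:
		return SBI_ERR_NOT_SUPPORTED;
	default:
		return SBI_ERR_FAILURE;
	}
}
```

The downside noted above is visible here: any Linux errno whose negated value collides with -7 or -8 could no longer be mapped.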


> > + }
> > +
> > return 0;
> > }
> >
> > int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > unsigned long ctr_mask, unsigned long flag)
> > {
> > - /* TODO */
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + int i, num_ctrs, pmc_index;
> > + u64 enabled, running;
> > + struct kvm_pmc *pmc;
> > +
> > + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> > + if ((ctr_base + __fls(ctr_mask)) >= num_ctrs)
> > + return -EINVAL;
> > +
> > + /* Stop the counters that have been configured and requested by the guest */
> > + for_each_set_bit(i, &ctr_mask, RISCV_MAX_COUNTERS) {
> > + pmc_index = i + ctr_base;
> > + if (!test_bit(pmc_index, kvpmu->used_pmc))
> > + continue;
> > + pmc = &kvpmu->pmc[pmc_index];
> > + if (pmc->perf_event) {
> > + /* Stop counting the counter */
> > + perf_event_disable(pmc->perf_event);
> > + if (flag & SBI_PMU_STOP_FLAG_RESET) {
> > + /* Release the counter if this is a reset request */
> > + pmc->counter_val += perf_event_read_value(pmc->perf_event,
> > + &enabled, &running);
> > + pmu_release_perf_event(pmc);
> > + clear_bit(pmc_index, kvpmu->used_pmc);
> > + }
>
> What about checking the "SBI_ERR_ALREADY_STOPPED" condition?
>

> > + }
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -96,14 +297,85 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
> > unsigned long ctr_mask, unsigned long flag,
> > unsigned long eidx, uint64_t edata)
> > {
> > - /* TODO */
> > - return 0;
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct perf_event *event;
> > + struct perf_event_attr attr;
> > + int num_ctrs, ctr_idx;
> > + u32 etype = pmu_get_perf_event_type(eidx);
> > + u64 config;
> > + struct kvm_pmc *pmc;
> > +
> > + num_ctrs = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> > + if ((etype == PERF_TYPE_MAX) || ((ctr_base + __fls(ctr_mask)) >= num_ctrs))
>
> nit: Unnecessary ()
>

Fixed.

> > + return -EINVAL;
> > +
> > + if (pmu_is_fw_event(eidx))
> > + return -EOPNOTSUPP;
>
> nit: need blank line here
>

Fixed.

> > + /*
> > + * SKIP_MATCH flag indicates the caller is aware of the assigned counter
> > + * for this event. Just do a sanity check if it is already marked as used.
> > + */
> > + if (flag & SBI_PMU_CFG_FLAG_SKIP_MATCH) {
> > + if (!test_bit(ctr_base, kvpmu->used_pmc))
> > + return -EINVAL;
> > + ctr_idx = ctr_base;
> > + goto match_done;
> > + }
> > +
> > + ctr_idx = pmu_get_pmc_index(kvpmu, eidx, ctr_base, ctr_mask);
> > + if (ctr_idx < 0)
> > + return -EOPNOTSUPP;
> > +
> > +match_done:
> > + pmc = &kvpmu->pmc[ctr_idx];
> > + pmu_release_perf_event(pmc);
> > + pmc->idx = ctr_idx;
> > +
> > + config = pmu_get_perf_event_config(eidx, edata);
> > + memset(&attr, 0, sizeof(struct perf_event_attr));
> > + attr.type = etype;
> > + attr.size = sizeof(attr);
> > + attr.pinned = true;
> > +
> > + /*
> > + * It should never reach here if the platform doesn't support the sscofpmf extension
> > + * as mode filtering won't work without it.
> > + */
> > + attr.exclude_host = true;
> > + attr.exclude_hv = true;
> > + attr.exclude_user = flag & SBI_PMU_CFG_FLAG_SET_UINH ? 1 : 0;
> > + attr.exclude_kernel = flag & SBI_PMU_CFG_FLAG_SET_SINH ? 1 : 0;
>
> nit: can use !!(flag & SBI_PMU_CFG_FLAG_SET_UINH)

Fixed.

>
> > + attr.config = config;
> > + attr.config1 = RISCV_KVM_PMU_CONFIG1_GUEST_EVENTS;
> > + if (flag & SBI_PMU_CFG_FLAG_CLEAR_VALUE) {
> > + //TODO: Do we really want to clear the value in hardware counter
> > + pmc->counter_val = 0;
> > + }
>
> nit: add blank line

Fixed.

>
> > + /*
> > + * Set the default sample_period for now. The guest specified value
> > + * will be updated in the start call.
> > + */
> > + attr.sample_period = pmu_get_sample_period(pmc);
> > +
> > + event = perf_event_create_kernel_counter(&attr, -1, current, NULL, pmc);
> > + if (IS_ERR(event)) {
> > + pr_err("kvm pmu event creation failed event %pe for eidx %lx\n", event, eidx);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + set_bit(ctr_idx, kvpmu->used_pmc);
> > + pmc->perf_event = event;
> > + if (flag & SBI_PMU_CFG_FLAG_AUTO_START)
> > + perf_event_enable(pmc->perf_event);
> > +
> > + return ctr_idx;
> > }
> >
> > int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > {
> > int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> > struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> >
> > if (!kvpmu)
> > return -EINVAL;
> > @@ -120,6 +392,7 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > return -EINVAL;
> > }
> >
> > + bitmap_zero(kvpmu->used_pmc, RISCV_MAX_COUNTERS);
> > kvpmu->num_hw_ctrs = num_hw_ctrs;
> > kvpmu->num_fw_ctrs = num_fw_ctrs;
> > /*
> > @@ -132,38 +405,49 @@ int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > /* TIME CSR shouldn't be read from perf interface */
> > if (i == 1)
> > continue;
> > - kvpmu->pmc[i].idx = i;
> > - kvpmu->pmc[i].vcpu = vcpu;
> > + pmc = &kvpmu->pmc[i];
>
> Better to introduce this with the patch that introduced this loop.
>

Sure. Done.

> > + pmc->idx = i;
> > + pmc->counter_val = 0;
> > + pmc->vcpu = vcpu;
> > if (i < kvpmu->num_hw_ctrs) {
> > kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> > if (i < 3)
> > /* CY, IR counters */
> > - kvpmu->pmc[i].cinfo.width = 63;
> > + pmc->cinfo.width = 63;
> > else
> > - kvpmu->pmc[i].cinfo.width = hpm_width;
> > + pmc->cinfo.width = hpm_width;
> > /*
> > * The CSR number doesn't have any relation with the logical
> > * hardware counters. The CSR numbers are encoded sequentially
> > * to avoid maintaining a map between the virtual counter
> > * and CSR number.
> > */
> > - kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> > + pmc->cinfo.csr = CSR_CYCLE + i;
> > } else {
> > - kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> > - kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> > + pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
> > + pmc->cinfo.width = BITS_PER_LONG - 1;
> > }
> > }
> >
> > return 0;
> > }
> >
> > -void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > {
> > - /* TODO */
> > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + int i;
> > +
> > + if (!kvpmu)
> > + return;
> > +
> > + for_each_set_bit(i, kvpmu->used_pmc, RISCV_MAX_COUNTERS) {
> > + pmc = &kvpmu->pmc[i];
> > + pmu_release_perf_event(pmc);
> > + }
> > }
> >
> > -void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > {
> > - /* TODO */
> > + kvm_riscv_vcpu_pmu_deinit(vcpu);
> > }
> > -
> > --
> > 2.25.1
> >
>
> Thanks,
> drew



--
Regards,
Atish

2022-11-23 01:49:38

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Tue, Nov 22, 2022 at 4:46 PM Atish Patra <[email protected]> wrote:
>
> at runtime.
>
> On Tue, Nov 1, 2022 at 7:13 AM Andrew Jones <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022 at 10:02:01AM -0700, Atish Patra wrote:
> > > This patch only adds barebore structure of perf implementation. Most of
> > a bare bones ^ the
> >
> > > the function returns zero at this point and will be implemented
> > functions
> >
> > > fully in the future.
> >
> > s/the future/later patches/
> >
>
> Done.
>
> > >
> > > Signed-off-by: Atish Patra <[email protected]>
> > > ---
> > > arch/riscv/include/asm/kvm_host.h | 3 +
> > > arch/riscv/include/asm/kvm_vcpu_pmu.h | 70 +++++++++++++
> > > arch/riscv/kvm/Makefile | 1 +
> > > arch/riscv/kvm/main.c | 3 +-
> > > arch/riscv/kvm/vcpu.c | 5 +
> > > arch/riscv/kvm/vcpu_insn.c | 3 +-
> > > arch/riscv/kvm/vcpu_pmu.c | 136 ++++++++++++++++++++++++++
> > > 7 files changed, 219 insertions(+), 2 deletions(-)
> > > create mode 100644 arch/riscv/include/asm/kvm_vcpu_pmu.h
> > > create mode 100644 arch/riscv/kvm/vcpu_pmu.c
> > >
> > > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > > index 59a0cf2ca7b9..5d2312828bb2 100644
> > > --- a/arch/riscv/include/asm/kvm_host.h
> > > +++ b/arch/riscv/include/asm/kvm_host.h
> > > @@ -18,6 +18,7 @@
> > > #include <asm/kvm_vcpu_fp.h>
> > > #include <asm/kvm_vcpu_insn.h>
> > > #include <asm/kvm_vcpu_timer.h>
> > > +#include <asm/kvm_vcpu_pmu.h>
> > >
> > > #define KVM_MAX_VCPUS 1024
> > >
> > > @@ -226,6 +227,8 @@ struct kvm_vcpu_arch {
> > >
> > > /* Don't run the VCPU (blocked) */
> > > bool pause;
> > > +
> > > + struct kvm_pmu pmu;
> > > };
> > >
> > > static inline void kvm_arch_hardware_unsetup(void) {}
> > > diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > > new file mode 100644
> > > index 000000000000..bffee052f2ae
> > > --- /dev/null
> > > +++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
> > > @@ -0,0 +1,70 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > +/*
> > > + * Copyright (c) 2022 Rivos Inc
> > > + *
> > > + * Authors:
> > > + * Atish Patra <[email protected]>
> > > + */
> > > +
> > > +#ifndef _KVM_VCPU_RISCV_PMU_H
> > > +#define _KVM_VCPU_RISCV_PMU_H
> >
> > The convention seems to be to leading underscores for these types of
> > defines, i.e. __KVM_VCPU_RISCV_PMU_H
> >
>
> Yes. It was a typo. Fixed.
>
> > > +
> > > +#include <linux/perf/riscv_pmu.h>
> > > +#include <asm/sbi.h>
> > > +
> > > +#ifdef CONFIG_RISCV_PMU_SBI
> > > +#define RISCV_KVM_MAX_FW_CTRS 32
> > > +
> > > +/* Per virtual pmu counter data */
> > > +struct kvm_pmc {
> > > + u8 idx;
> > > + struct kvm_vcpu *vcpu;
> >
> > I'm not sure we need a vcpu pointer here. If it's just to implement
> > pmc_to_pmu(), then we can instead implement a pmc_to_vcpu(), like
> > arm64's kvm_pmc_to_vcpu(). x86 might be able to do that too, since
> > it appears the conversion macros below originated there.
> >
>
> Yes. We can implement it the arm64 way instead of x86's. I just thought
> the x86 approach of keeping a reference to the vcpu is a bit simpler
> than computing the parent offset with container_of multiple times. This
> will be invoked once per overflow in the counter overflow path.
>
> If you feel strongly about the arm64 way, we can follow that. I have
> removed it from this series for now. Depending on the conclusion, I
> will add it back in the KVM sscofpmf support series if required.
>
> > > + struct perf_event *perf_event;
> > > + uint64_t counter_val;
> > > + union sbi_pmu_ctr_info cinfo;
> > > +};
> > > +
> > > +/* PMU data structure per vcpu */
> > > +struct kvm_pmu {
> > > + struct kvm_pmc pmc[RISCV_MAX_COUNTERS];
> > > + /* Number of the virtual firmware counters available */
> > > + int num_fw_ctrs;
> > > + /* Number of the virtual hardware counters available */
> > > + int num_hw_ctrs;
> > > + /* Bit map of all the virtual counter used */
> > counters
> >
> > > + DECLARE_BITMAP(used_pmc, RISCV_MAX_COUNTERS);
> >
> > How about naming this pmc_in_use like x86?
> >
>
> Done.
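Incidentally, the allocation logic this bitmap backs (pmu_get_programmable_pmc_index() in the later patch) can be modelled in userspace with a plain 64-bit mask as the in-use bitmap:

```c
#include <assert.h>

/* Find the first free counter in [min, max) whose bit is set in cmask
 * (offset by cbase) and not yet marked in-use — a userspace model of
 * pmu_get_programmable_pmc_index(). */
static int find_free_ctr(unsigned long cbase, unsigned long cmask,
			 unsigned long in_use, int min, int max)
{
	for (int i = 0; i < 64; i++) {
		if (!(cmask & (1UL << i)))
			continue;
		int idx = (int)(i + cbase);
		if (idx >= min && idx < max && !(in_use & (1UL << idx)))
			return idx;
	}
	return -1;
}
```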
>
> > > +};
> > > +
> > > +#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
> > > +#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
> > > +#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
> > > +
> > > +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val);
> > > +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > > + unsigned long *ctr_info);
> > > +
> >
> > nit: no need for this blank line
>
> Fixed.
>
> >
> > > +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag, uint64_t ival);
> > > +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag);
> > > +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag,
> > > + unsigned long eidx, uint64_t edata);
> > > +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > > + unsigned long *out_val);
> > > +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
> > > +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu);
> > > +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu);
> > > +
> > > +#else
> > > +struct kvm_pmu {
> > > +};
> > > +
> > > +static inline int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > > +{
> > > + return 0;
> > > +}
> > > +static inline void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu) {}
> > > +static inline void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu) {}
> > > +#endif
> > > +#endif
> > > diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
> > > index 019df9208bdd..342d7199e89d 100644
> > > --- a/arch/riscv/kvm/Makefile
> > > +++ b/arch/riscv/kvm/Makefile
> > > @@ -25,3 +25,4 @@ kvm-y += vcpu_sbi_base.o
> > > kvm-y += vcpu_sbi_replace.o
> > > kvm-y += vcpu_sbi_hsm.o
> > > kvm-y += vcpu_timer.o
> > > +kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o vcpu_pmu.o
> > > diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
> > > index 1549205fe5fe..d41ab6d1987d 100644
> > > --- a/arch/riscv/kvm/main.c
> > > +++ b/arch/riscv/kvm/main.c
> > > @@ -49,7 +49,8 @@ int kvm_arch_hardware_enable(void)
> > > hideleg |= (1UL << IRQ_VS_EXT);
> > > csr_write(CSR_HIDELEG, hideleg);
> > >
> > > - csr_write(CSR_HCOUNTEREN, -1UL);
> > > + /* VS should access only TM bit. Everything else should trap */
> > > + csr_write(CSR_HCOUNTEREN, 0x02);
> >
> > This looks like something that should be broken out into a separate patch
> > with a description of what happens now when guests try to access the newly
> > trapping counter registers. We should probably also create a TM define.
> >
>
> Done.
>

As we allow cycle & instret access for host userspace now [1], should we
do the same for guests as well? I would prefer not to, but the same
userspace software will start to break when run inside a guest.

[1] https://lore.kernel.org/all/[email protected]/
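For the record, the TM define suggested earlier is just bit 1 of hcounteren; the bit positions mirror the counter CSR order in the privileged spec (CY=0, TM=1, IR=2):

```c
#include <assert.h>

/* hcounteren bit positions per the RISC-V privileged spec. */
#define HCOUNTEREN_CY	(1UL << 0)	/* cycle */
#define HCOUNTEREN_TM	(1UL << 1)	/* time */
#define HCOUNTEREN_IR	(1UL << 2)	/* instret */
```

With a define, the magic `0x02` above becomes `csr_write(CSR_HCOUNTEREN, HCOUNTEREN_TM)`.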

> > >
> > > csr_write(CSR_HVIP, 0);
> > >
> > > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > > index 3c95924d38c7..4cc964aaf2ad 100644
> > > --- a/arch/riscv/kvm/vcpu.c
> > > +++ b/arch/riscv/kvm/vcpu.c
> > > @@ -122,6 +122,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
> > >
> > > WRITE_ONCE(vcpu->arch.irqs_pending, 0);
> > > WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
> > > + kvm_riscv_vcpu_pmu_reset(vcpu);
> > >
> > > vcpu->arch.hfence_head = 0;
> > > vcpu->arch.hfence_tail = 0;
> > > @@ -174,6 +175,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > /* Setup VCPU timer */
> > > kvm_riscv_vcpu_timer_init(vcpu);
> > >
> > > + /* setup performance monitoring */
> > > + kvm_riscv_vcpu_pmu_init(vcpu);
> > > +
> > > /* Reset VCPU */
> > > kvm_riscv_reset_vcpu(vcpu);
> > >
> > > @@ -196,6 +200,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > /* Cleanup VCPU timer */
> > > kvm_riscv_vcpu_timer_deinit(vcpu);
> > >
> > > + kvm_riscv_vcpu_pmu_deinit(vcpu);
> > > /* Free unused pages pre-allocated for G-stage page table mappings */
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
> > > }
> > > diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
> > > index 7eb90a47b571..0aa334f853c8 100644
> > > --- a/arch/riscv/kvm/vcpu_insn.c
> > > +++ b/arch/riscv/kvm/vcpu_insn.c
> > > @@ -214,7 +214,8 @@ struct csr_func {
> > > unsigned long wr_mask);
> > > };
> > >
> > > -static const struct csr_func csr_funcs[] = { };
> > > +static const struct csr_func csr_funcs[] = {
> > > +};
> >
> > stray change
> >
>
> Fixed
>
> > >
> > > /**
> > > * kvm_riscv_vcpu_csr_return -- Handle CSR read/write after user space
> > > diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
> > > new file mode 100644
> > > index 000000000000..3168ed740bdd
> > > --- /dev/null
> > > +++ b/arch/riscv/kvm/vcpu_pmu.c
> > > @@ -0,0 +1,136 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Copyright (c) 2022 Rivos Inc
> > > + *
> > > + * Authors:
> > > + * Atish Patra <[email protected]>
> > > + */
> > > +
> > > +#include <linux/errno.h>
> > > +#include <linux/err.h>
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/perf/riscv_pmu.h>
> > > +#include <asm/csr.h>
> > > +#include <asm/kvm_vcpu_pmu.h>
> > > +#include <linux/kvm_host.h>
> > > +
> > > +int kvm_riscv_vcpu_pmu_num_ctrs(struct kvm_vcpu *vcpu, unsigned long *out_val)
> > > +{
> > > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > > +
> > > + if (!kvpmu)
> > > + return -EINVAL;
> >
> > kvpmu can never be null because arch.pmu isn't a pointer. We probably
> > shouldn't be making calls to kvm_riscv_vcpu_pmu_num_ctrs() without knowing
> > we have an initialized pmu anyway, though.
> >
>
> Yes. I have added an init_done flag to do that sanity check.
> I can change it based on the conclusion on PATCH 6.
>
> > > +
> > > + *out_val = kvpmu->num_fw_ctrs + kvpmu->num_hw_ctrs;
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_ctr_info(struct kvm_vcpu *vcpu, unsigned long cidx,
> > > + unsigned long *ctr_info)
> > > +{
> > > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > > +
> > > + if (!kvpmu || (cidx > RISCV_MAX_COUNTERS) || (cidx == 1))
> >
> > nit: unnecessary ()
> >
> > > + return -EINVAL;
> > > +
> > > + *ctr_info = kvpmu->pmc[cidx].cinfo.value;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_ctr_start(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag, uint64_t ival)
> > > +{
> > > + /* TODO */
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_ctr_stop(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag)
> > > +{
> > > + /* TODO */
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_base,
> > > + unsigned long ctr_mask, unsigned long flag,
> > > + unsigned long eidx, uint64_t edata)
> > > +{
> > > + /* TODO */
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
> > > + unsigned long *out_val)
> > > +{
> > > + /* TODO */
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
> > > +{
> > > + int i = 0, num_hw_ctrs, num_fw_ctrs, hpm_width;
> > > + struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
> > > +
> > > + if (!kvpmu)
> > > + return -EINVAL;
> > > +
> > > + num_hw_ctrs = riscv_pmu_sbi_get_num_hw_ctrs();
> > > + if ((num_hw_ctrs + RISCV_KVM_MAX_FW_CTRS) > RISCV_MAX_COUNTERS)
> > > + num_fw_ctrs = RISCV_MAX_COUNTERS - num_hw_ctrs;
> > > + else
> > > + num_fw_ctrs = RISCV_KVM_MAX_FW_CTRS;
> >
> > Why do we need RISCV_KVM_MAX_FW_CTRS? Can't we just always get the number
> > with RISCV_MAX_COUNTERS - num_hw_ctrs ?
> >
> We can, but then we have to allocate fw_event at runtime. As most
> platforms don't implement all 29 hpmcounters, you end up having more
> firmware counters than needed. The current SBI spec only defines 21
> firmware counters anyway.
>
> Thus I felt it is unnecessary to do runtime allocation. But it's just a
> few bytes, so I don't feel strongly about it.
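The sizing trade-off being discussed reduces to a clamp; a quick sketch (RISCV_KVM_MAX_FW_CTRS = 32 taken from the patch, RISCV_MAX_COUNTERS = 64 assumed from the host PMU driver):

```c
#include <assert.h>

#define RISCV_MAX_COUNTERS	64
#define RISCV_KVM_MAX_FW_CTRS	32

/* Number of firmware counters exposed: a fixed cap, shrunk only when the
 * platform has so many hw counters that the total would overflow. */
static int num_fw_ctrs(int num_hw_ctrs)
{
	if (num_hw_ctrs + RISCV_KVM_MAX_FW_CTRS > RISCV_MAX_COUNTERS)
		return RISCV_MAX_COUNTERS - num_hw_ctrs;
	return RISCV_KVM_MAX_FW_CTRS;
}
```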
>
> > > +
> > > + hpm_width = riscv_pmu_sbi_hpmc_width();
> > > + if (hpm_width <= 0) {
> > > + pr_err("Can not initialize PMU for vcpu as hpmcounter width is not available\n");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + kvpmu->num_hw_ctrs = num_hw_ctrs;
> > > + kvpmu->num_fw_ctrs = num_fw_ctrs;
> >
> > Maybe it's coming later, but we need to give KVM userspace control over
> > the number of counters to allow it to migrate to a larger set of hosts.
> > Also, a previous patch said the virtual width must be the same as the
> > host width for the hw counters, so we need userspace to know what that
> > is in order to determine to which hosts it can migrate a guest.
> >
>
> Yes. The entire user space access control needs to be sketched out.
> We probably need another one reg interface to set/get the number of
> counters/width.
>
> However, is it common to migrate a guest between hosts with different
> PMU capabilities?
>
> > > + /*
> > > + * There is no correlation between the logical hardware counter and virtual counters.
> > > + * However, we need to encode a hpmcounter CSR in the counter info field so that
> > > + * KVM can trap n emulate the read. This works well in the migration usecase as well
> >
> > s/well//
> >
> > > + * KVM doesn't care if the actual hpmcounter is available in the hardware or not.
> > > + */
> > > + for (i = 0; i < num_hw_ctrs + num_fw_ctrs; i++) {
> >
> > Maybe we need a helper macro like
> >
> > #define kvm_pmu_num_counters(pmu) ((pmu)->num_hw_ctrs + (pmu)->num_fw_ctrs)
> >
> > if we're going to loop over all counters frequently.
> >
>
> Done.
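The helper macro suggested above, with a tiny usage check (struct shape assumed from the patch):

```c
#include <assert.h>

struct kvm_pmu {
	int num_fw_ctrs;
	int num_hw_ctrs;
};

#define kvm_pmu_num_counters(pmu) \
	((pmu)->num_hw_ctrs + (pmu)->num_fw_ctrs)
```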
>
>
> > > + /* TIME CSR shouldn't be read from perf interface */
> > > + if (i == 1)
> > > + continue;
> > > + kvpmu->pmc[i].idx = i;
> > > + kvpmu->pmc[i].vcpu = vcpu;
> > > + if (i < kvpmu->num_hw_ctrs) {
> > > + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_HW;
> > > + if (i < 3)
> > > + /* CY, IR counters */
> > > + kvpmu->pmc[i].cinfo.width = 63;
> > > + else
> > > + kvpmu->pmc[i].cinfo.width = hpm_width;
> > > + /*
> > > + * The CSR number doesn't have any relation with the logical
> > > + * hardware counters. The CSR numbers are encoded sequentially
> > > + * to avoid maintaining a map between the virtual counter
> > > + * and CSR number.
> > > + */
> > > + kvpmu->pmc[i].cinfo.csr = CSR_CYCLE + i;
> > > + } else {
> > > + kvpmu->pmc[i].cinfo.type = SBI_PMU_CTR_TYPE_FW;
> > > + kvpmu->pmc[i].cinfo.width = BITS_PER_LONG - 1;
> > > + }
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +void kvm_riscv_vcpu_pmu_reset(struct kvm_vcpu *vcpu)
> > > +{
> > > + /* TODO */
> > > +}
> > > +
> > > +void kvm_riscv_vcpu_pmu_deinit(struct kvm_vcpu *vcpu)
> > > +{
> > > + /* TODO */
> > > +}
> > > +
> > > --
> > > 2.25.1
> > >
> >
> > Thanks,
> > drew
>
>
>
> --
> Regards,
> Atish



--
Regards,
Atish

2022-11-23 14:21:10

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

...
> > > > - csr_write(CSR_HCOUNTEREN, -1UL);
> > > > + /* VS should access only TM bit. Everything else should trap */
> > > > + csr_write(CSR_HCOUNTEREN, 0x02);
> > >
> > > This looks like something that should be broken out into a separate patch
> > > with a description of what happens now when guests try to access the newly
> > > trapping counter registers. We should probably also create a TM define.
> > >
> >
> > Done.
> >
>
> As we allow cycle & instret access for host userspace now [1], should we
> do the same for guests as well? I would prefer not to, but the same
> userspace software will start to break when run inside a guest.
>
> [1] https://lore.kernel.org/all/[email protected]/
>

Yeah, it seems like we should either forbid access to unprivileged users
or ensure the numbers include some random noise. For guests, a privileged
KVM userspace should need to explicitly request access for them, ensuring
that the creation of privileged guests is done by conscious choice.

Thanks,
drew

2022-11-23 14:21:44

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Tue, Nov 22, 2022 at 04:46:14PM -0800, Atish Patra wrote:
...
> > > + kvpmu->num_hw_ctrs = num_hw_ctrs;
> > > + kvpmu->num_fw_ctrs = num_fw_ctrs;
> >
> > Maybe it's coming later, but we need to give KVM userspace control over
> > the number of counters to allow it to migrate to a larger set of hosts.
> > Also, a previous patch said the virtual width must be the same as the
> > host width for the hw counters, so we need userspace to know what that
> > is in order to determine to which hosts it can migrate a guest.
> >
>
> Yes. The entire user space access control needs to be sketched out.
> We probably need another one reg interface to set/get the number of
> counters/width.
>
> However, is it common to migrate a guest between hosts with different
> PMU capabilities?
>

Ideally we'd be able to define a virtual CPU+PMU which represents the
least common denominator of a set of hosts, allowing VMs which use that
VCPU model to migrate among all the hosts. x86 pulls this off pretty well,
but arm64 doesn't. In the least, I think a goal should be to enable
migration of VMs from hosts with less extensions and less PMU counters to
hosts with more, as that would support host upgrades without having to
recreate VMs.

Thanks,
drew

2022-11-23 14:45:17

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Tue, Nov 22, 2022 at 03:08:34PM -0800, Atish Patra wrote:
> On Tue, Nov 1, 2022 at 7:26 AM Andrew Jones <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022 at 10:02:02AM -0700, Atish Patra wrote:
...
> > > +static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> > > + unsigned long *out_val,
> > > + struct kvm_cpu_trap *utrap,
> > > + bool *exit)
> > > +{
> > > + int ret = -EOPNOTSUPP;
> > > + struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> > > + unsigned long funcid = cp->a6;
> > > + uint64_t temp;
> >
> > I think we need something like
> >
> > if (!vcpu_to_pmu(vcpu)->enabled)
> > return -EOPNOTSUPP;
> >
> > here. Where 'enabled' is only true when we successfully initialize
> > the pmu. And, successful initialization depends on
>
> Yes. I have added the flag. But should we do the check here, or in the
> respective functions as a paranoid sanity check?

I think something like above that returns a not-supported error should be
in all the entry points, like the top level SBI call handler. Functions
that should never be called unless the PMU is active could have WARNs
added for sanity checks.

>
> > IS_ENABLED(CONFIG_RISCV_PMU_SBI) and
>
> Why do we need to guard under CONFIG_RISCV_PMU_SBI ?
> vcpu_sbi_pmu.c is only compiled under CONFIG_RISCV_PMU_SBI
>
> If CONFIG_RISCV_PMU_SBI is not enabled, probe function will return failure.

You're right. We don't need explicit config checks for things that can't
exist without the config.

>
> In fact, I think we should also add the pmu enabled check in the probe function
> itself. The probe function (kvm_sbi_ext_pmu_probe) should only return true when
> both vcpu_to_pmu(vcpu)->enabled and
> riscv_isa_extension_available(NULL, SSCOFPMF) are true.
>
> Thoughts?

Agreed.

>
> > riscv_isa_extension_available(NULL, SSCOFPMF) as well as not
> > failing other things. And, KVM userspace should be able to
> > disable the pmu, which keeps 'enabled' from being set as well.
> >
> We already have provisions for disabling sscofpmf from guests via ISA
> one reg interface.
> Do you mean disable the entire PMU from userspace ?

Yes. We may need to configure a VM without a PMU to increase its
migratability, workaround errata, or just for testing/debugging purposes.

>
> Currently, ARM64 enables pmu from user space using device control APIs
> on vcpu fd.
> Are you suggesting we should do something like that ?

Yes. Although choosing which KVM API should be used could probably be
thought-out again. x86 uses VM ioctls.

>
> If PMU needs to have device control APIs (either via vcpu fd or its
> own), we can retrieve
> the hpmcounter width and count from there as well.

Right. We need to decide how the VM/VCPU + PMU user interface should look.
A separate PMU device, like arm64 has, sounds good, but the ioctl
sequences for initialization may get more tricky.

Thanks,
drew

2022-11-23 14:46:57

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Tue, Nov 22, 2022 at 04:45:16PM -0800, Atish Patra wrote:
...
> This brings up another generic error returning problem in KVM SBI
> land. Usually, SBI error code numbers do not align with Linux error
> codes, to accommodate other operating systems. However, most of the
> SBI error codes have a 1-1 relationship with Linux error codes.
> Thus, kvm internal code returns a Linux specific error code and
> vcpu_sbi maps those to SBI error codes using kvm_linux_err_map_sbi.
>
> However, this will not work for SBI_ERR_ALREADY_STARTED/STOPPED as
> there are no corresponding Linux specific error codes. We can
> directly return the SBI error codes from vcpu_pmu.c and modify
> kvm_linux_err_map_sbi to pass those through. In that case, we can't
> map any Linux error code that collides with an SBI error code. Any
> other ideas to handle this case ?
>

It seems like we should drop kvm_linux_err_map_sbi() and add another
parameter to kvm_vcpu_sbi_extension.handler for the SBI error. Another
option is to continue mapping SBI errors to Linux errors, e.g.
SBI_ERR_ALREADY_STARTED == EBUSY, but that may not be too easy in
all cases and the errors become ambiguous, as we can't tell if the
Linux implementation generated the error or if the SBI call did.
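The ambiguity described above can be seen in a small sketch of a mapping in the style of kvm_linux_err_map_sbi() (the SBI error values are from the SBI specification; the particular errno choices here are illustrative, not the kernel's actual table):

```c
#include <errno.h>

/* sbiret.error codes from the SBI specification. */
enum sbi_error {
	SBI_SUCCESS             = 0,
	SBI_ERR_NOT_SUPPORTED   = -2,
	SBI_ERR_INVALID_PARAM   = -3,
	SBI_ERR_DENIED          = -4,
	SBI_ERR_ALREADY_STARTED = -7,
	SBI_ERR_ALREADY_STOPPED = -8,
};

/* Once -EBUSY is chosen for ALREADY_STARTED, no natural errno remains
 * for ALREADY_STOPPED, and a callee's genuine -EBUSY becomes
 * indistinguishable from an emulated "already started" result. */
static long linux_err_to_sbi(int err)
{
	switch (err) {
	case 0:            return SBI_SUCCESS;
	case -EINVAL:      return SBI_ERR_INVALID_PARAM;
	case -EOPNOTSUPP:  return SBI_ERR_NOT_SUPPORTED;
	case -EACCES:      return SBI_ERR_DENIED;
	case -EBUSY:       return SBI_ERR_ALREADY_STARTED;
	default:           return SBI_ERR_NOT_SUPPORTED;
	}
}
```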

Thanks,
drew

2022-11-24 09:14:58

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Wed, Nov 23, 2022 at 5:36 AM Andrew Jones <[email protected]> wrote:
>
> ...
> > > > > - csr_write(CSR_HCOUNTEREN, -1UL);
> > > > > + /* VS should access only TM bit. Everything else should trap */
> > > > > + csr_write(CSR_HCOUNTEREN, 0x02);
> > > >
> > > > This looks like something that should be broken out into a separate patch
> > > > with a description of what happens now when guests try to access the newly
> > > > trapping counter registers. We should probably also create a TM define.
> > > >
> > >
> > > Done.
> > >
> >
> > As we allow cycles & instret for host user space now [1], should we do the same
> > for guests as well? I would prefer not to, but the same user space
> > software will start to break when it runs inside a guest.
> >
> > https://lore.kernel.org/all/[email protected]/
> >
>
> Yeah, it seems like we should either forbid access to unprivileged users
> or ensure the numbers include some random noise. For guests, a privileged
> KVM userspace should need to explicitly request access for them, ensuring
> that the creation of privileged guests is done by conscious choice.
>

If I understand you correctly, you are suggesting we only enable the TM
bit in hcounteren?
We also need a mechanism to enable the hcounteren bits from KVM guest userspace.

I can think of the following approaches.

1. The CYCLE, INSTRET enabling can also be done via the one reg interface.
2. We can enable them during the first virtual instruction trap if these
bits are enabled in the guest scounteren.
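The second approach could be modeled along these lines (bit positions CY=0, TM=1, IR=2 are from the privileged spec's counteren CSRs; the struct and function names are invented for illustration):

```c
/* [h|s]counteren bit positions from the privileged spec. */
#define CNTEN_CY (1UL << 0)  /* cycle   */
#define CNTEN_TM (1UL << 1)  /* time    */
#define CNTEN_IR (1UL << 2)  /* instret */

struct vcpu_cnten_model {
	unsigned long hcounteren;       /* starts out TM-only */
	unsigned long guest_scounteren; /* what the guest kernel programmed */
};

/* Approach 2 above: on the first virtual-instruction trap for a counter
 * CSR, mirror the guest's CY/IR intent from scounteren into hcounteren
 * so subsequent guest reads no longer trap. */
static void lazy_enable_counters(struct vcpu_cnten_model *c)
{
	c->hcounteren |= c->guest_scounteren & (CNTEN_CY | CNTEN_IR);
}
```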

> Thanks,
> drew



--
Regards,
Atish

2022-11-24 10:18:40

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Wed, Nov 23, 2022 at 5:11 AM Andrew Jones <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 04:46:14PM -0800, Atish Patra wrote:
> ...
> > > > + kvpmu->num_hw_ctrs = num_hw_ctrs;
> > > > + kvpmu->num_fw_ctrs = num_fw_ctrs;
> > >
> > > Maybe it's coming later, but we need to give KVM userspace control over
> > > the number of counters to allow it to migrate to a larger set of hosts.
> > > Also, a previous patch said the virtual width must be the same as the
> > > host width for the hw counters, so we need userspace to know what that
> > > is in order to determine to which hosts it can migrate a guest.
> > >
> >
> > Yes. The entire user space access control needs to be sketched out.
> > We probably need another one reg interface to set/get the number of
> > counters/width.
> >
> > However, is it common to migrate a guest between different hosts
> > with different PMU capabilities?
> >
>
> Ideally we'd be able to define a virtual CPU+PMU which represents the
> least common denominator of a set of hosts, allowing VMs which use that
> VCPU model to migrate among all the hosts. x86 pulls this off pretty well,
> but arm64 doesn't. In the least, I think a goal should be to enable
> migration of VMs from hosts with less extensions and less PMU counters to
> hosts with more, as that would support host upgrades without having to
> recreate VMs.
>

We also have the hpmcounter width to worry about. I guess the policy
details will be VMM specific. We need to think through the parameters
that can vary between hosts. As you pointed out earlier:

1. hpmcounter width
2. Number of hpmcounters
3. PMU related ISA extensions - this will be applicable for all the extensions.

Anything else ?
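The three parameters above, plus the "migrate from smaller to larger" goal from earlier in the thread, can be summarized in a small compatibility check (a sketch under the assumption that the virtual counter width must exactly match the host width, as stated for a previous patch; names are illustrative):

```c
#include <stdbool.h>

/* The migration-relevant PMU parameters listed above. */
struct pmu_caps {
	unsigned int hpmcounter_width;
	unsigned int num_hpmcounters;
	unsigned long isa_ext_mask;   /* PMU-related ISA extensions */
};

/* A guest PMU model fits a host when the host is a superset: same
 * counter width (virtual width == host width), at least as many
 * counters, and every guest extension present on the host. */
static bool pmu_model_fits_host(const struct pmu_caps *guest,
				const struct pmu_caps *host)
{
	return guest->hpmcounter_width == host->hpmcounter_width &&
	       guest->num_hpmcounters <= host->num_hpmcounters &&
	       (guest->isa_ext_mask & ~host->isa_ext_mask) == 0;
}
```

This is the least-common-denominator idea: a VMM would define the guest's `pmu_caps` from the smallest host in the pool, so the check passes everywhere.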


> Thanks,
> drew



--
Regards,
Atish

2022-11-24 10:30:21

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Wed, Nov 23, 2022 at 5:58 AM Andrew Jones <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 03:08:34PM -0800, Atish Patra wrote:
> > On Tue, Nov 1, 2022 at 7:26 AM Andrew Jones <[email protected]> wrote:
> > >
> > > On Mon, Jul 18, 2022 at 10:02:02AM -0700, Atish Patra wrote:
> ...
> > > > +static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
> > > > + unsigned long *out_val,
> > > > + struct kvm_cpu_trap *utrap,
> > > > + bool *exit)
> > > > +{
> > > > + int ret = -EOPNOTSUPP;
> > > > + struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> > > > + unsigned long funcid = cp->a6;
> > > > + uint64_t temp;
> > >
> > > I think we need something like
> > >
> > > if (!vcpu_to_pmu(vcpu)->enabled)
> > > return -EOPNOTSUPP;
> > >
> > > here. Where 'enabled' is only true when we successfully initialize
> > > the pmu. And, successful initialization depends on
> >
> > Yes. I have added the flag. But should we do the check here or in the
> > respective functions as a paranoid sanity check?
>
> I think something like above that returns a not-supported error should be
> in all the entry points, like the top level SBI call handler. Functions
> that should never be called unless the PMU is active could have WARNs
> added for sanity checks.
>

Sure. Will add those checks.

> >
> > > IS_ENABLED(CONFIG_RISCV_PMU_SBI) and
> >
> > Why do we need to guard under CONFIG_RISCV_PMU_SBI ?
> > vcpu_sbi_pmu.c is only compiled under CONFIG_RISCV_PMU_SBI
> >
> > If CONFIG_RISCV_PMU_SBI is not enabled, probe function will return failure.
>
> You're right. We don't need explicit config checks for things that can't
> exist without the config.
>
> >
> > In fact, I think we should also add the pmu enabled check in the probe function
> > itself. The probe function (kvm_sbi_ext_pmu_probe) should only return true when
> > both vcpu_to_pmu(vcpu)->enabled and
> > riscv_isa_extension_available(NULL, SSCOFPMF) are true.
> >
> > Thoughts?
>
> Agreed.
>
> >
> > > riscv_isa_extension_available(NULL, SSCOFPMF) as well as not
> > > failing other things. And, KVM userspace should be able to
> > > disable the pmu, which keeps 'enabled' from being set as well.
> > >
> > We already have provisions for disabling sscofpmf from guests via ISA
> > one reg interface.
> > Do you mean disable the entire PMU from userspace ?
>
> Yes. We may need to configure a VM without a PMU to increase its
> migratability, workaround errata, or just for testing/debugging purposes.
>
> >
> > Currently, ARM64 enables pmu from user space using device control APIs
> > on vcpu fd.
> > Are you suggesting we should do something like that ?
>
> Yes. Although choosing which KVM API should be used could probably be
> thought-out again. x86 uses VM ioctls.
>

How does it handle heterogeneous systems in per VM ioctls?

> >
> > If PMU needs to have device control APIs (either via vcpu fd or its
> > own), we can retrieve
> > the hpmcounter width and count from there as well.
>
> Right. We need to decide how the VM/VCPU + PMU user interface should look.
> A separate PMU device, like arm64 has, sounds good, but the ioctl
> sequences for initialization may get more tricky.
>

Do we really need a per VM interface? I was thinking we can just
continue to use the one reg interface for PMU as well. We probably
need two of them:

1. To enable/disable SBI extension
-- The probe function will depend on this
2. PMU specific get/set
-- Number of hpmcounters
-- hpmcounter width
-- enable PMU
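The two-register split above could be encoded roughly as follows. To be clear, the `KVM_REG_RISCV_PMU_HYP` subtype value and the index names below are invented for illustration only and do not exist in the upstream UAPI:

```c
/* Hypothetical ONE_REG layout for the PMU state listed above. */
#define KVM_REG_RISCV_PMU_HYP (0x09ULL << 24)  /* made-up subtype */

enum pmu_reg_index {
	PMU_REG_NUM_HPMCOUNTERS,  /* get/set: number of hpmcounters */
	PMU_REG_HPMCOUNTER_WIDTH, /* get/set: counter width in bits */
	PMU_REG_ENABLE,           /* enable/disable the SBI PMU extension */
};

/* Compose a register id the way the existing riscv ONE_REG types
 * combine a subtype with an index. */
static unsigned long long pmu_reg_id(enum pmu_reg_index idx)
{
	return KVM_REG_RISCV_PMU_HYP | (unsigned long long)idx;
}
```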


> Thanks,
> drew



--
Regards,
Atish

2022-11-24 11:18:01

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Thu, Nov 24, 2022 at 02:18:26AM -0800, Atish Patra wrote:
> On Wed, Nov 23, 2022 at 5:58 AM Andrew Jones <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 03:08:34PM -0800, Atish Patra wrote:
...
> > > Currently, ARM64 enables pmu from user space using device control APIs
> > > on vcpu fd.
> > > Are you suggesting we should do something like that ?
> >
> > Yes. Although choosing which KVM API should be used could probably be
> > thought-out again. x86 uses VM ioctls.
> >
>
> How does it handle heterogeneous systems in per VM ioctls?

I don't think it does, but neither does arm64. Afaik, the only way to run
KVM VMs on heterogeneous systems is to pin the VM to one set of the CPUs,
i.e. make sure the system it runs on is homogeneous.

I agree we shouldn't paint ourselves into a homogeneous-only corner for
riscv, though, so if it's possible to use VCPU APIs, then I guess we
should. Although, one thing to keep in mind is that if the same ioctl
needs to be run on each VCPU, then, when we start building VMs with
hundreds of VCPUs, we'll see slow VM starts.

>
> > >
> > > If PMU needs to have device control APIs (either via vcpu fd or its
> > > own), we can retrieve
> > > the hpmcounter width and count from there as well.
> >
> > Right. We need to decide how the VM/VCPU + PMU user interface should look.
> > A separate PMU device, like arm64 has, sounds good, but the ioctl
> > sequences for initialization may get more tricky.
> >
>
> Do we really need a per VM interface? I was thinking we can just
> continue to use the one reg interface for PMU as well. We probably
> need two of them:
>
> 1. To enable/disable SBI extension
> -- The probe function will depend on this
> 2. PMU specific get/set
> -- Number of hpmcounters
> -- hpmcounter width
> -- enable PMU

ONE_REG is good for registers and virtual registers, which means the
number of hpmcounters and the hpmcounter width are probably good
candidates, but I'm not sure we should use it for enable/init types of
purposes.

Thanks,
drew

2022-11-24 11:20:57

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Thu, Nov 24, 2022 at 01:04:15AM -0800, Atish Patra wrote:
> On Wed, Nov 23, 2022 at 5:36 AM Andrew Jones <[email protected]> wrote:
> >
> > ...
> > > > > > - csr_write(CSR_HCOUNTEREN, -1UL);
> > > > > > + /* VS should access only TM bit. Everything else should trap */
> > > > > > + csr_write(CSR_HCOUNTEREN, 0x02);
> > > > >
> > > > > This looks like something that should be broken out into a separate patch
> > > > > with a description of what happens now when guests try to access the newly
> > > > > trapping counter registers. We should probably also create a TM define.
> > > > >
> > > >
> > > > Done.
> > > >
> > >
> > > As we allow cycles & instret for host user space now [1], should we do the same
> > > for guests as well? I would prefer not to, but the same user space
> > > software will start to break when it runs inside a guest.
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> >
> > Yeah, it seems like we should either forbid access to unprivileged users
> > or ensure the numbers include some random noise. For guests, a privileged
> > KVM userspace should need to explicitly request access for them, ensuring
> > that the creation of privileged guests is done by conscious choice.
> >
>
> If I understand you correctly, you are suggesting we only enable the TM
> bit in hcounteren?

Yeah, and also that I think it'd be nice to revisit this for userspace.

> We also need a mechanism to enable the hcounteren bits from KVM guest userspace.
>
> I can think of the following approaches.
>
> 1. The CYCLE, INSTRET enabling can also be done via the one reg interface.
> 2. We can enable them during the first virtual instruction trap if these
> bits are enabled in the guest scounteren.

Those sound good to me.

Thanks,
drew

2022-11-24 11:51:26

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 5/9] RISC-V: KVM: Add skeleton support for perf

On Thu, Nov 24, 2022 at 01:09:03AM -0800, Atish Patra wrote:
> On Wed, Nov 23, 2022 at 5:11 AM Andrew Jones <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 04:46:14PM -0800, Atish Patra wrote:
> > ...
> > > > > + kvpmu->num_hw_ctrs = num_hw_ctrs;
> > > > > + kvpmu->num_fw_ctrs = num_fw_ctrs;
> > > >
> > > > Maybe it's coming later, but we need to give KVM userspace control over
> > > > the number of counters to allow it to migrate to a larger set of hosts.
> > > > Also, a previous patch said the virtual width must be the same as the
> > > > host width for the hw counters, so we need userspace to know what that
> > > > is in order to determine to which hosts it can migrate a guest.
> > > >
> > >
> > > Yes. The entire user space access control needs to be sketched out.
> > > We probably need another one reg interface to set/get the number of
> > > counters/width.
> > >
> > > However, is it common to migrate a guest between different hosts
> > > with different PMU capabilities?
> > >
> >
> > Ideally we'd be able to define a virtual CPU+PMU which represents the
> > least common denominator of a set of hosts, allowing VMs which use that
> > VCPU model to migrate among all the hosts. x86 pulls this off pretty well,
> > but arm64 doesn't. In the least, I think a goal should be to enable
> > migration of VMs from hosts with less extensions and less PMU counters to
> > hosts with more, as that would support host upgrades without having to
> > recreate VMs.
> >
>
> We also have the hpmcounter width to worry about. I guess the policy
> details will be VMM specific. We need to think through the parameters
> that can vary between hosts. As you pointed out earlier:
>
> 1. hpmcounter width
> 2. Number of hpmcounters
> 3. PMU related ISA extensions - this will be applicable for all the extensions.
>
> Anything else ?

I can't think of anything at the moment. Hopefully most potential
differences (features and errata) will be in extensions or in
number, i.e. smaller sets of things which can be emulated with
larger sets.

Thanks,
drew

2022-11-24 13:41:47

by Anup Patel

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Thu, Nov 24, 2022 at 4:21 PM Andrew Jones <[email protected]> wrote:
>
> On Thu, Nov 24, 2022 at 02:18:26AM -0800, Atish Patra wrote:
> > On Wed, Nov 23, 2022 at 5:58 AM Andrew Jones <[email protected]> wrote:
> > >
> > > On Tue, Nov 22, 2022 at 03:08:34PM -0800, Atish Patra wrote:
> ...
> > > > Currently, ARM64 enables pmu from user space using device control APIs
> > > > on vcpu fd.
> > > > Are you suggesting we should do something like that ?
> > >
> > > Yes. Although choosing which KVM API should be used could probably be
> > > thought-out again. x86 uses VM ioctls.
> > >
> >
> > How does it handle heterogeneous systems in per VM ioctls?
>
> I don't think it does, but neither does arm64. Afaik, the only way to run
> KVM VMs on heterogeneous systems is to pin the VM to one set of the CPUs,
> i.e. make sure the system it runs on is homogeneous.
>
> I agree we shouldn't paint ourselves into a homogeneous-only corner for
> riscv, though, so if it's possible to use VCPU APIs, then I guess we
> should. Although, one thing to keep in mind is that if the same ioctl
> needs to be run on each VCPU, then, when we start building VMs with
> hundreds of VCPUs, we'll see slow VM starts.
>
> >
> > > >
> > > > If PMU needs to have device control APIs (either via vcpu fd or its
> > > > own), we can retrieve
> > > > the hpmcounter width and count from there as well.
> > >
> > > Right. We need to decide how the VM/VCPU + PMU user interface should look.
> > > A separate PMU device, like arm64 has, sounds good, but the ioctl
> > > sequences for initialization may get more tricky.
> > >
> >
> > Do we really need a per VM interface? I was thinking we can just
> > continue to use the one reg interface for PMU as well. We probably
> > need two of them:
> >
> > 1. To enable/disable SBI extension
> > -- The probe function will depend on this
> > 2. PMU specific get/set
> > -- Number of hpmcounters
> > -- hpmcounter width
> > -- enable PMU
>
> ONE_REG is good for registers and virtual registers, which means the
> number of hpmcounters and the hpmcounter width are probably good
> candidates, but I'm not sure we should use it for enable/init types of
> purposes.

We are already using ONE_REG interface to enable/disable
ISA extensions so we should follow the same pattern and have
ONE_REG interface to enable/disable SBI extensions as well.

Regards,
Anup

2022-11-28 21:41:46

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 6/9] RISC-V: KVM: Add SBI PMU extension support

On Thu, Nov 24, 2022 at 4:59 AM Anup Patel <[email protected]> wrote:
>
> On Thu, Nov 24, 2022 at 4:21 PM Andrew Jones <[email protected]> wrote:
> >
> > On Thu, Nov 24, 2022 at 02:18:26AM -0800, Atish Patra wrote:
> > > On Wed, Nov 23, 2022 at 5:58 AM Andrew Jones <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 22, 2022 at 03:08:34PM -0800, Atish Patra wrote:
> > ...
> > > > > Currently, ARM64 enables pmu from user space using device control APIs
> > > > > on vcpu fd.
> > > > > Are you suggesting we should do something like that ?
> > > >
> > > > Yes. Although choosing which KVM API should be used could probably be
> > > > thought-out again. x86 uses VM ioctls.
> > > >
> > >
> > > How does it handle heterogeneous systems in per VM ioctls?
> >
> > I don't think it does, but neither does arm64. Afaik, the only way to run
> > KVM VMs on heterogeneous systems is to pin the VM to one set of the CPUs,
> > i.e. make sure the system it runs on is homogeneous.
> >
> > I agree we shouldn't paint ourselves into a homogeneous-only corner for
> > riscv, though, so if it's possible to use VCPU APIs, then I guess we
> > should. Although, one thing to keep in mind is that if the same ioctl
> > needs to be run on each VCPU, then, when we start building VMs with
> > hundreds of VCPUs, we'll see slow VM starts.
> >
> > >
> > > > >
> > > > > If PMU needs to have device control APIs (either via vcpu fd or its
> > > > > own), we can retrieve
> > > > > the hpmcounter width and count from there as well.
> > > >
> > > > Right. We need to decide how the VM/VCPU + PMU user interface should look.
> > > > A separate PMU device, like arm64 has, sounds good, but the ioctl
> > > > sequences for initialization may get more tricky.
> > > >
> > >
> > > Do we really need a per VM interface ? I was thinking we can just
> > > continue to use
> > > one reg interface for PMU as well. We probably need two of them.
> > >
> > > 1. To enable/disable SBI extension
> > > -- The probe function will depend on this
> > > 2. PMU specific get/set
> > > -- Number of hpmcounters
> > > -- hpmcounter width
> > > -- enable PMU
> >
> > ONE_REG is good for registers and virtual registers, which means the
> > number of hpmcounters and the hpmcounter width are probably good
> > candidates, but I'm not sure we should use it for enable/init types of
> > purposes.
>
> We are already using ONE_REG interface to enable/disable
> ISA extensions so we should follow the same pattern and have
> ONE_REG interface to enable/disable SBI extensions as well.
>

Thinking about it more, it may end up with vcpus that have heterogeneous
capabilities (different SBI extensions). Most likely it will be a bug or
an incorrect configuration from the VMM, but it's a possibility.
For example, there will be two different scenarios for the PMU extension:

1. The boot vcpu has the PMU extension disabled - probably okay, as the
guest queries the PMU extension on the boot cpu only. The PMU extension
won't be available to any other vcpu.

2. The boot vcpu has the PMU extension enabled but others have it
disabled. The run time SBI PMU calls will fail with EOPNOTSUPP and the
user gets confused.

There will be similar cases for the HSM extension as well, as the entire
extension won't be available to the guest if the SBI extension is
disabled for the boot cpu only.
There is also the slow VM start issue Andrew pointed out earlier.

This makes me wonder if a VM ioctl to disable/enable the SBI extension
for all vcpus makes more sense in this case.
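A minimal model of that VM-scoped toggle, ignoring the actual KVM ioctl plumbing and locking (struct and function names are invented for illustration):

```c
#include <stdbool.h>

#define NR_VCPUS 4

struct vcpu_model { bool sbi_pmu_enabled; };
struct vm_model   { struct vcpu_model vcpus[NR_VCPUS]; };

/* One VM-level call updates every VCPU, so the mixed
 * boot-vcpu/secondary-vcpu configurations described above cannot
 * arise, and the cost does not scale with VCPU count from the VMM's
 * point of view. */
static void vm_set_sbi_pmu(struct vm_model *vm, bool enable)
{
	for (int i = 0; i < NR_VCPUS; i++)
		vm->vcpus[i].sbi_pmu_enabled = enable;
}
```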


> Regards,
> Anup



--
Regards,
Atish

2022-12-02 10:07:34

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Wed, Nov 23, 2022 at 6:22 AM Andrew Jones <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 04:45:16PM -0800, Atish Patra wrote:
> ...
> > This brings up another generic error returning problem in KVM SBI
> > land. Usually, SBI error code numbers do not
> > align with Linux error codes to accommodate other operating systems.
> > However, most of the SBI error codes
> > have 1-1 relationship with the Linux error code.
> > Thus, kvm internal code returns a Linux specific error code and
> > vcpu_sbi will map those to SBI error code using
> > kvm_linux_err_map_sbi.
> >
> > However, this will not work for SBI_ERR_ALREADY_STARTED/STOPPED as
> > there are no corresponding
> > Linux specific error codes. We can directly return the SBI error codes
> > from vcpu_pmu.c and modify the
> > kvm_linux_err_map_sbi to pass through those. In that case, we can't
> > map any linux error code that
> > collides with SBI error code. Any other ideas to handle this case ?
> >
>
> It seems like we should drop kvm_linux_err_map_sbi() and add another
> parameter to kvm_vcpu_sbi_extension.handler for the SBI error. Another

That will just move the problem from the generic SBI layer to the
extension specific layer. The root problem remains the same, as we
can't expect the individual extension to return a valid Linux
specific error code.

Maybe we can relax that requirement. Thus, any extension that has
additional SBI error codes may opt to return SBI error codes directly.
For example, the PMU extension implementation would directly return
SBI specific error codes from arch/riscv/kvm/vcpu_pmu.c. In the
future, there will be other extensions (e.g. TEE) with many more
error codes that can leverage this as well.

Does that sound reasonable ?

> option is to continue mapping SBI errors to Linux errors, e.g.
> SBI_ERR_ALREADY_STARTED == EBUSY, but that may not be too easy in
> all cases and the errors become ambiguous, as we can't tell if the
> Linux implementation generated the error or if the SBI call did.
>

We can't distinguish between SBI_ERR_ALREADY_STARTED/STOPPED in that case.

> Thanks,
> drew



--
Regards,
Atish

2022-12-02 11:55:26

by Andrew Jones

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Fri, Dec 02, 2022 at 01:08:47AM -0800, Atish Patra wrote:
> On Wed, Nov 23, 2022 at 6:22 AM Andrew Jones <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 04:45:16PM -0800, Atish Patra wrote:
> > ...
> > > This brings up another generic error returning problem in KVM SBI
> > > land. Usually, SBI error code numbers do not
> > > align with Linux error codes to accommodate other operating systems.
> > > However, most of the SBI error codes
> > > have 1-1 relationship with the Linux error code.
> > > Thus, kvm internal code returns a Linux specific error code and
> > > vcpu_sbi will map those to SBI error code using
> > > kvm_linux_err_map_sbi.
> > >
> > > However, this will not work for SBI_ERR_ALREADY_STARTED/STOPPED as
> > > there are no corresponding
> > > Linux specific error codes. We can directly return the SBI error codes
> > > from vcpu_pmu.c and modify the
> > > kvm_linux_err_map_sbi to pass through those. In that case, we can't
> > > map any linux error code that
> > > collides with SBI error code. Any other ideas to handle this case ?
> > >
> >
> > It seems like we should drop kvm_linux_err_map_sbi() and add another
> > parameter to kvm_vcpu_sbi_extension.handler for the SBI error. Another
>
> That will just move the problem from the generic SBI layer to the
> extension specific layer. The root problem remains the same, as we
> can't expect the individual extension to return a valid Linux
> specific error code.

I'm saying we return both from the extension specific layer, particularly
because only the extension specific layer knows what it should return.
KVM's SBI handlers currently have a return value and *out_val. *out_val
maps directly to SBI's sbiret.value, but the return value does not map to
SBI's sbiret.error. But, all we have to do is add *error_val to the
parameters for the extension handler to get it. Then, cp->a0 should be set
to that, not the return value.

>
> Maybe we can relax that requirement. Thus, any extension that has
> additional SBI error codes may opt to return SBI error codes directly.
> For example, the PMU extension implementation would directly return
> SBI specific error codes from arch/riscv/kvm/vcpu_pmu.c. In the
> future, there will be other extensions (e.g. TEE) with many more
> error codes that can leverage this as well.
>
> Does that sound reasonable ?

I think we need both the Linux return and sbiret.error. The return value
indicates a problem *with* the emulation, while the new parameter I'm
proposing (*error_val) is the return value *of* the emulation. Normally
the Linux return value will be zero (a successful Linux call) even when
emulating a failure (*error_val != SBI_SUCCESS). When the return value
is not zero, then there's something wrong in KVM and the return value
should be propagated to userspace. We could also set the exit_reason to
KVM_EXIT_INTERNAL_ERROR, but KVM_EXIT_UNKNOWN is probably fine too.
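The two-channel return described above can be sketched as follows (the function name and the `already_started` condition are illustrative; only the split between the Linux return value and `*error_val` is the point):

```c
#include <stdbool.h>

#define SBI_SUCCESS              0L
#define SBI_ERR_ALREADY_STARTED -7L  /* sbiret.error per the SBI spec */

/* The Linux return value reports problems *with* the emulation;
 * *error_val carries the emulated call's own sbiret.error, which is
 * what ends up in the guest's a0. */
static int sbi_pmu_ctr_start_emul(bool already_started,
				  long *error_val, unsigned long *out_val)
{
	if (already_started) {
		*error_val = SBI_ERR_ALREADY_STARTED; /* guest-visible */
		return 0;  /* the emulation itself worked */
	}
	*error_val = SBI_SUCCESS;
	*out_val = 0;
	return 0;
	/* A non-zero return here would mean a KVM-internal failure, to
	 * be propagated to userspace rather than to the guest. */
}
```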

>
> > option is to continue mapping SBI errors to Linux errors, e.g.
> > SBI_ERR_ALREADY_STARTED == EBUSY, but that may not be too easy in
> > all cases and the errors become ambiguous, as we can't tell if the
> > Linux implementation generated the error or if the SBI call did.
> >
>
> We can't distinguish between SBI_ERR_ALREADY_STARTED/STOPPED in that case.

That's why I only suggested using EBUSY for STARTED. Mapping STOPPED
was left as an exercise for the reader :-)

Thanks,
drew

2022-12-02 17:23:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Mon, Jul 18, 2022, Atish Patra wrote:
> RISC-V SBI PMU & Sscofpmf ISA extension allows supporting perf in
> the virtualization environment as well. KVM implementation
> relies on SBI PMU extension for most of the part while traps
> & emulates the CSRs read for counter access.

For the benefit of non-RISCV people, the changelog (and documentation?) should
explain why RISC-V doesn't need to tap into kvm_register_perf_callbacks().
Presumably there's something in the "RISC-V SBI PMU & Sscofpmf ISA extension" spec
that allows hardware to differentiate between events that are for guest vs. host?

2022-12-07 08:41:12

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Fri, Dec 2, 2022 at 9:09 AM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Jul 18, 2022, Atish Patra wrote:
> > RISC-V SBI PMU & Sscofpmf ISA extension allows supporting perf in
> > the virtualization environment as well. KVM implementation
> > relies on SBI PMU extension for most of the part while traps
> > & emulates the CSRs read for counter access.
>
> For the benefit of non-RISCV people, the changelog (and documentation?) should
> explain why RISC-V doesn't need to tap into kvm_register_perf_callbacks().

As per my understanding, kvm_register_perf_callbacks is only useful
during event sampling for guests. Please let me know if I missed
something.
This series doesn't support sampling and guest counter overflow
interrupt yet. That's why kvm_register_perf_callbacks support is
missing.
I will add kvm_register_perf_callbacks in the next revision along with
interrupt support.
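The role of those callbacks in sampling can be illustrated with a userspace model; the struct below mirrors the idea behind the kernel's perf guest callbacks, not the actual kernel API or its signatures:

```c
#include <stddef.h>

/* When a counter overflow fires while a guest runs, perf must ask KVM
 * whether the sample belongs to the guest and at which guest PC it
 * landed -- that is what the registered callbacks provide. */
struct guest_cbs_model {
	unsigned int (*state)(void);   /* 0 = host context, else guest */
	unsigned long (*get_ip)(void); /* guest PC for the sample */
};

static const struct guest_cbs_model *registered_cbs; /* set by "KVM" */

static unsigned long sample_ip(unsigned long host_pc)
{
	if (registered_cbs && registered_cbs->state())
		return registered_cbs->get_ip();
	return host_pc;
}

/* Example "guest" callbacks for demonstration. */
static unsigned int demo_state(void) { return 1; }
static unsigned long demo_get_ip(void) { return 0x80001234UL; }
static const struct guest_cbs_model demo_cbs = { demo_state, demo_get_ip };
```

Without the registration step, every sample taken while the guest runs would be misattributed to the host PC, which is why a series that adds sampling also needs to wire this up.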

> Presumably there's something in the "RISC-V SBI PMU & Sscofpmf ISA extension" spec
> that allows hardware to differentiate between events that are for guest vs. host?

Not directly. The Sscofpmf extension does have privilege mode specific
filtering bits[1], i.e. VSINH and VUINH, which indicate whether the
events are specific to the guest.
But that is only true if the hypervisor has enabled those bits. As I
said above, RISC-V does need to tap into kvm_register_perf_callbacks
as well.

[1] https://drive.google.com/file/d/171j4jFjIkKdj5LWcExphq4xG_2sihbfd/edit
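The filtering bits referred to above live in the upper bits of the hpmevent CSRs; the positions below are taken from the draft Sscofpmf proposal linked at [1] and should be double-checked against the ratified text:

```c
/* Privilege-filtering bits in hpmevent CSRs added by Sscofpmf. */
#define HPMEVENT_OF    (1ULL << 63) /* overflow flag */
#define HPMEVENT_MINH  (1ULL << 62) /* inhibit counting in M-mode  */
#define HPMEVENT_SINH  (1ULL << 61) /* inhibit counting in S/HS-mode */
#define HPMEVENT_UINH  (1ULL << 60) /* inhibit counting in U-mode  */
#define HPMEVENT_VSINH (1ULL << 59) /* inhibit counting in VS-mode */
#define HPMEVENT_VUINH (1ULL << 58) /* inhibit counting in VU-mode */

/* The "hypervisor has enabled those bits" case above: a host-side
 * filter so an event counts only while the guest (VS/VU) runs. */
static unsigned long long guest_only_filter(unsigned long long event)
{
	return event | HPMEVENT_MINH | HPMEVENT_SINH | HPMEVENT_UINH;
}
```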

--
Regards,
Atish

2022-12-07 09:17:27

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Fri, Dec 2, 2022 at 3:37 AM Andrew Jones <[email protected]> wrote:
>
> On Fri, Dec 02, 2022 at 01:08:47AM -0800, Atish Patra wrote:
> > On Wed, Nov 23, 2022 at 6:22 AM Andrew Jones <[email protected]> wrote:
> > >
> > > On Tue, Nov 22, 2022 at 04:45:16PM -0800, Atish Patra wrote:
> > > ...
> > > > This brings up another generic error returning problem in KVM SBI
> > > > land. Usually, SBI error code numbers do not
> > > > align with Linux error codes to accommodate other operating systems.
> > > > However, most of the SBI error codes
> > > > have 1-1 relationship with the Linux error code.
> > > > Thus, kvm internal code returns a Linux specific error code and
> > > > vcpu_sbi will map those to SBI error code using
> > > > kvm_linux_err_map_sbi.
> > > >
> > > > However, this will not work for SBI_ERR_ALREADY_STARTED/STOPPED as
> > > > there are no corresponding
> > > > Linux specific error codes. We can directly return the SBI error codes
> > > > from vcpu_pmu.c and modify the
> > > > kvm_linux_err_map_sbi to pass through those. In that case, we can't
> > > > map any linux error code that
> > > > collides with SBI error code. Any other ideas to handle this case ?
> > > >
> > >
> > > It seems like we should drop kvm_linux_err_map_sbi() and add another
> > > parameter to kvm_vcpu_sbi_extension.handler for the SBI error. Another
> >
> > That will just move the problem from the generic SBI layer to
> > extension specific layer.
> > The root problem remains the same as we can't expect the individual
> > extension to return
> > a valid linux specific error code.
>
> I'm saying we return both from the extension specific layer, particularly
> because only the extension specific layer knows what it should return.
> KVM's SBI handlers currently have a return value and *out_val. *out_val
> maps directly to SBI's sbiret.value, but the return value does not map to
> SBI's sbiret.error. But, all we have to do is add *error_val to the
> parameters for the extension handler to get it. Then, cp->a0 should be set
> to that, not the return value.
>

Ahh. Yes. That will solve this issue.
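For concreteness, here is a rough sketch of what a handler following that
contract could look like. The function name, parameters, and body are assumed
for illustration only, not the actual KVM code; only the SBI error values come
from the SBI specification:

```c
/* SBI error codes, as defined by the SBI specification */
#define SBI_SUCCESS               0L
#define SBI_ERR_ALREADY_STARTED  -7L
#define SBI_ERR_ALREADY_STOPPED  -8L

/*
 * Hypothetical handler shape: the int return reports problems *with*
 * the emulation (a Linux errno), while *error_val carries sbiret.error,
 * the result *of* the emulated SBI call, and *out_val carries
 * sbiret.value.
 */
int pmu_ctr_start_emul(int already_started,
		       unsigned long *out_val, long *error_val)
{
	if (already_started) {
		/* The emulation itself succeeded; the guest's call failed */
		*error_val = SBI_ERR_ALREADY_STARTED;
		return 0;
	}
	*out_val = 0;
	*error_val = SBI_SUCCESS;
	return 0;
}
```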

> >
> > Maybe we can relax that requirement: any extension that has additional
> > SBI error codes may opt to return SBI error codes directly. For
> > example, the PMU extension implementation would return SBI-specific
> > error codes directly from arch/riscv/kvm/vcpu_pmu.c. In the future,
> > other extensions (e.g. TEE) with many more error codes could leverage
> > this as well.
> >
> > Does that sound reasonable?
>
> I think we need both the Linux return and sbiret.error. The return value
> indicates a problem *with* the emulation, while the new parameter I'm
> proposing (*error_val) is the return value *of* the emulation. Normally
> the Linux return value will be zero (a successful Linux call) even when
> emulating a failure (*error_val != SBI_SUCCESS). When the return value
> is not zero, then there's something wrong in KVM and the return value
> should be propagated to userspace. We could also set the exit_reason to
> KVM_EXIT_INTERNAL_ERROR, but KVM_EXIT_UNKNOWN is probably fine too.
>

Agreed. I will revise the series accordingly. Thanks!
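To spell out that division of labor, the caller side could look roughly like
the sketch below (structure and names are assumed, not the actual vcpu_sbi
code): a nonzero Linux return escalates to userspace, while on success the SBI
error and value land in the guest's a0/a1 per the SBI calling convention.

```c
#define SBI_SUCCESS              0L
#define SBI_ERR_NOT_SUPPORTED   -2L

/* Minimal stand-ins for the guest register file and exit state */
struct guest_ctx { unsigned long a0, a1; };
struct run_state { int exit_to_user; };

/* A toy extension handler following the proposed two-value contract */
int toy_ext_handler(unsigned long *out_val, long *error_val)
{
	*out_val = 42;
	*error_val = SBI_SUCCESS;
	return 0;		/* the emulation itself worked */
}

/* Caller sketch: propagate Linux errors, reflect SBI errors to the guest */
int sbi_dispatch(struct guest_ctx *cp, struct run_state *run)
{
	unsigned long out_val = 0;
	long error_val = SBI_ERR_NOT_SUPPORTED;
	int ret = toy_ext_handler(&out_val, &error_val);

	if (ret) {
		/* Something wrong in the hypervisor: exit to userspace */
		run->exit_to_user = 1;
		return ret;
	}
	cp->a0 = (unsigned long)error_val;	/* sbiret.error */
	cp->a1 = out_val;			/* sbiret.value */
	return 0;
}
```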

> >
> > > option is to continue mapping SBI errors to Linux errors, e.g.
> > > SBI_ERR_ALREADY_STARTED == EBUSY, but that may not be too easy in
> > > all cases and the errors become ambiguous, as we can't tell if the
> > > Linux implementation generated the error or if the SBI call did.
> > >
> >
> > We can't distinguish between SBI_ERR_ALREADY_STARTED/STOPPED in that case.
>
> That's why I only suggested using EBUSY for STARTED. Mapping STOPPED
> was left as an exercise for the reader :-)
>
> Thanks,
> drew
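For reference, a pass-through variant of the mapping discussed above might
look like this sketch (function name and mapping choices assumed for
illustration; SBI error values per the SBI specification). It also makes the
collision problem visible: a Linux return of -7 or -8 could no longer be
distinguished from an SBI error returned by an extension.

```c
#include <errno.h>

#define SBI_SUCCESS               0L
#define SBI_ERR_FAILURE          -1L
#define SBI_ERR_INVALID_PARAM    -3L
#define SBI_ERR_ALREADY_STARTED  -7L
#define SBI_ERR_ALREADY_STOPPED  -8L

/*
 * Sketch of a pass-through mapper: known SBI error codes returned
 * directly by an extension pass through unchanged, while ordinary
 * Linux errnos are mapped to their SBI counterparts.
 */
long err_map_sbi(long err)
{
	switch (err) {
	case 0:
		return SBI_SUCCESS;
	case SBI_ERR_ALREADY_STARTED:
	case SBI_ERR_ALREADY_STOPPED:
		return err;			/* pass-through */
	case -EINVAL:
		return SBI_ERR_INVALID_PARAM;
	default:
		return SBI_ERR_FAILURE;
	}
}
```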



--
Regards,
Atish

2022-12-07 17:13:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Wed, Dec 07, 2022, Atish Patra wrote:
> On Fri, Dec 2, 2022 at 9:09 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022, Atish Patra wrote:
> > > The RISC-V SBI PMU & Sscofpmf ISA extensions allow supporting perf
> > > in the virtualization environment as well. The KVM implementation
> > > relies on the SBI PMU extension for most of the work, while it traps
> > > & emulates the CSR reads for counter access.
> >
> > For the benefit of non-RISCV people, the changelog (and documentation?) should
> > explain why RISC-V doesn't need to tap into kvm_register_perf_callbacks().
>
> As per my understanding, kvm_register_perf_callbacks is only useful
> during event sampling for guests. Please let me know if I missed
> something.
> This series doesn't support sampling and guest counter overflow interrupt yet.
> That's why kvm_register_perf_callbacks support is missing.

Ah, I missed that connection in the cover letter.

In the future, if a patch adds partial support for a thing/feature, it's very
helpful to call that out in the patch shortlog and changelog, even for RFCs.
Adding a single word to the shortlog and a sentence or two to the changelog
doesn't take much time on your end, and helps avoid cases like this where
drive-by reviewers like me cause a fuss about non-issues.

RISC-V: KVM: Implement partial perf support

...

Counter overflow and interrupts are not supported as the relevant
architectural specifications are still under discussion.

Thanks!

2022-12-08 01:25:01

by Atish Patra

[permalink] [raw]
Subject: Re: [RFC 8/9] RISC-V: KVM: Implement perf support

On Wed, Dec 7, 2022 at 8:31 AM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Dec 07, 2022, Atish Patra wrote:
> > On Fri, Dec 2, 2022 at 9:09 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Mon, Jul 18, 2022, Atish Patra wrote:
> > > > The RISC-V SBI PMU & Sscofpmf ISA extensions allow supporting perf
> > > > in the virtualization environment as well. The KVM implementation
> > > > relies on the SBI PMU extension for most of the work, while it traps
> > > > & emulates the CSR reads for counter access.
> > >
> > > For the benefit of non-RISCV people, the changelog (and documentation?) should
> > > explain why RISC-V doesn't need to tap into kvm_register_perf_callbacks().
> >
> > As per my understanding, kvm_register_perf_callbacks is only useful
> > during event sampling for guests. Please let me know if I missed
> > something.
> > This series doesn't support sampling and guest counter overflow interrupt yet.
> > That's why kvm_register_perf_callbacks support is missing.
>
> Ah, I missed that connection in the cover letter.
>
> In the future, if a patch adds partial support for a thing/feature, it's very
> helpful to call that out in the patch shortlog and changelog, even for RFCs.
> Adding a single word to the shortlog and a sentence or two to the changelog
> doesn't take much time on your end, and helps avoid cases like this where
> drive-by reviewers like me cause a fuss about non-issues.
>

Absolutely. I will update the commit text in v2 accordingly.

> RISC-V: KVM: Implement partial perf support
>
> ...
>
> Counter overflow and interrupts are not supported as the relevant
> architectural specifications are still under discussion.
>
> Thanks!



--
Regards,
Atish