2021-05-19 19:07:32

by Anup Patel

Subject: [PATCH v18 00/18] KVM RISC-V Support

From: Anup Patel <[email protected]>

This series adds initial KVM RISC-V support. Currently, we are able to boot
Linux on RV64/RV32 Guest with multiple VCPUs.

Key aspects of KVM RISC-V added by this series are:
1. No RISC-V specific KVM IOCTL
2. Minimal possible KVM world-switch which touches only GPRs and a few CSRs
3. Both RV64 and RV32 host supported
4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
5. KVM ONE_REG interface for VCPU register access from user-space
6. PLIC emulation is done in user-space
7. Timer and IPI emulation is done in-kernel
8. Both Sv39x4 and Sv48x4 supported for RV64 host
9. MMU notifiers supported
10. Generic dirtylog supported
11. FP lazy save/restore supported
12. SBI v0.1 emulation for KVM Guest available
13. Forward unhandled SBI calls to KVM userspace
14. Hugepage support for Guest/VM
15. IOEVENTFD support for Vhost

Here's a brief TODO list which we will work on after this series:
1. SBI v0.2 emulation in-kernel
2. SBI v0.2 hart state management emulation in-kernel
3. In-kernel PLIC emulation
4. ..... and more .....

This series can be found in the riscv_kvm_v18 branch at:
https://github.com/avpatel/linux.git

Our work-in-progress KVMTOOL RISC-V port can be found in the riscv_v7 branch
at: https://github.com/avpatel/kvmtool.git

The QEMU RISC-V hypervisor emulation is done by Alistair and is available
in master branch at: https://git.qemu.org/git/qemu.git

To play around with KVM RISC-V, refer to the KVM RISC-V wiki at:
https://github.com/kvm-riscv/howto/wiki
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike

Changes since v17:
- Rebased on Linux-5.13-rc2
- Moved to new KVM MMU notifier APIs
- Removed redundant kvm_arch_vcpu_uninit()
- Moved KVM RISC-V sources to drivers/staging for compliance with the
Linux RISC-V patch acceptance policy

Changes since v16:
- Rebased on Linux-5.12-rc5
- Remove redundant kvm_arch_create_memslot(), kvm_arch_vcpu_setup(),
kvm_arch_vcpu_init(), kvm_arch_has_vcpu_debugfs(), and
kvm_arch_create_vcpu_debugfs() from PATCH5
- Make stage2_wp_memory_region() and stage2_ioremap() as static
in PATCH13

Changes since v15:
- Rebased on Linux-5.11-rc3
- Fixed kvm_stage2_map() to use gfn_to_pfn_prot() for determining
writability of a host pfn.
- Use "__u64" in place of "u64" and "__u32" in place of "u32" for
uapi/asm/kvm.h

Changes since v14:
- Rebased on Linux-5.10-rc3
- Fixed Stage2 (G-stage) PGD allocation to ensure it is 16KB aligned

Changes since v13:
- Rebased on Linux-5.9-rc3
- Fixed kvm_riscv_vcpu_set_reg_csr() for SIP updates in PATCH5
- Fixed instruction length computation in PATCH7
- Added ioeventfd support in PATCH7
- Ensure HSTATUS.SPVP is set to the correct value before using HLV/HSV
instructions in PATCH7
- Fixed stage2_map_page() to set PTE 'A' and 'D' bits correctly
in PATCH10
- Added stage2 dirty page logging in PATCH10
- Allow KVM user-space to SET/GET SCOUNTER CSR in PATCH5
- Save/restore SCOUNTEREN in PATCH6
- Reduced quite a few instructions for __kvm_riscv_switch_to() by
using CSR swap instruction in PATCH6
- Detect and use Sv48x4 when available in PATCH10

Changes since v12:
- Rebased patches on Linux-5.8-rc4
- By default enable all counters in HCOUNTEREN
- RISC-V H-Extension v0.6.1 spec support

Changes since v11:
- Rebased patches on Linux-5.7-rc3
- Fixed typo in typecast of stage2_map_size define
- Introduced struct kvm_cpu_trap to represent trap details and
use it as function parameter wherever applicable
- Pass memslot to kvm_riscv_stage2_map() for supporting dirty page
logging in the future
- RISC-V H-Extension v0.6 spec support
- Send-out first three patches as separate series so that it can
be taken by Palmer for Linux RISC-V

Changes since v10:
- Rebased patches on Linux-5.6-rc5
- Reduce RISCV_ISA_EXT_MAX from 256 to 64
- Separate PATCH for removing N-extension related defines
- Added comments as requested by Palmer
- Fixed HIDELEG CSR programming

Changes since v9:
- Rebased patches on Linux-5.5-rc3
- Squash PATCH19 and PATCH20 into PATCH5
- Squash PATCH18 into PATCH11
- Squash PATCH17 into PATCH16
- Added ONE_REG interface for VCPU timer in PATCH13
- Use HTIMEDELTA for VCPU timer in PATCH13
- Updated KVM RISC-V mailing list in MAINTAINERS entry
- Update KVM kconfig option to depend on RISCV_SBI and MMU
- Check for SBI v0.2 and SBI v0.2 RFENCE extension at boot-time
- Use SBI v0.2 RFENCE extension in VMID implementation
- Use SBI v0.2 RFENCE extension in Stage2 MMU implementation
- Use SBI v0.2 RFENCE extension in SBI implementation
- Moved to RISC-V Hypervisor v0.5 draft spec
- Updated Documentation/virt/kvm/api.txt for timer ONE_REG interface

Changes since v8:
- Rebased series on Linux-5.4-rc3 and Atish's SBI v0.2 patches
- Use HRTIMER_MODE_REL instead of HRTIMER_MODE_ABS in timer emulation
- Fixed kvm_riscv_stage2_map() to handle hugepages
- Added patch to forward unhandled SBI calls to user-space
- Added patch for iterative/recursive stage2 page table programming
- Added patch to remove per-CPU vsip_shadow variable
- Added patch to fix race-condition in kvm_riscv_vcpu_sync_interrupts()

Changes since v7:
- Rebased series on Linux-5.4-rc1 and Atish's SBI v0.2 patches
- Removed PATCH1, PATCH3, and PATCH20 because these are already merged
- Use kernel doc style comments for ISA bitmap functions
- Don't parse X, Y, and Z extensions in riscv_fill_hwcap() because they will
be added in the future
- Mark KVM RISC-V kconfig option as EXPERIMENTAL
- Typo fix in commit description of PATCH6 of v7 series
- Use separate structs for CORE and CSR registers of ONE_REG interface
- Explicitly include asm/sbi.h in kvm/vcpu_sbi.c
- Removed implicit switch-case fall-through in kvm_riscv_vcpu_exit()
- No need to set VSSTATUS.MXR bit in kvm_riscv_vcpu_unpriv_read()
- Removed register for instruction length in kvm_riscv_vcpu_unpriv_read()
- Added defines for checking/decoding instruction length
- Added separate patch to forward unhandled SBI calls to userspace tool

Changes since v6:
- Rebased patches on Linux-5.3-rc7
- Added "return_handled" in struct kvm_mmio_decode to ensure that
kvm_riscv_vcpu_mmio_return() updates SEPC only once
- Removed trap_stval parameter from kvm_riscv_vcpu_unpriv_read()
- Updated git repo URL in MAINTAINERS entry

Changes since v5:
- Renamed KVM_REG_RISCV_CONFIG_TIMEBASE register to
KVM_REG_RISCV_CONFIG_TBFREQ register in ONE_REG interface
- Update SEPC in kvm_riscv_vcpu_mmio_return() for MMIO exits
- Use switch case instead of illegal instruction opcode table for simplicity
- Improve comments in stage2_remote_tlb_flush() for a potential remote TLB
flush optimization
- Handle all unsupported SBI calls in default case of
kvm_riscv_vcpu_sbi_ecall() function
- Fixed kvm_riscv_vcpu_sync_interrupts() for software interrupts
- Improved unprivileged reads to handle traps due to Guest stage1 page table
- Added separate patch to document RISC-V specific things in
Documentation/virt/kvm/api.txt

Changes since v4:
- Rebased patches on Linux-5.3-rc5
- Added Paolo's Acked-by and Reviewed-by
- Updated mailing list in MAINTAINERS entry

Changes since v3:
- Moved patch for ISA bitmap from KVM prep series to this series
- Make vsip_shadow a run-time percpu variable instead of compile-time
- Flush Guest TLBs on all Host CPUs whenever we run out of VMIDs

Changes since v2:
- Removed references of KVM_REQ_IRQ_PENDING from all patches
- Use kvm->srcu within in-kernel KVM run loop
- Added percpu vsip_shadow to track last value programmed in VSIP CSR
- Added comments about irqs_pending and irqs_pending_mask
- Used kvm_arch_vcpu_runnable() in place of kvm_riscv_vcpu_has_interrupt()
in system_opcode_insn()
- Removed unwanted smp_wmb() in kvm_riscv_stage2_vmid_update()
- Use kvm_flush_remote_tlbs() in kvm_riscv_stage2_vmid_update()
- Use READ_ONCE() in kvm_riscv_stage2_update_hgatp() for vmid

Changes since v1:
- Fixed compile errors in building KVM RISC-V as module
- Removed unused kvm_riscv_halt_guest() and kvm_riscv_resume_guest()
- Set KVM_CAP_SYNC_MMU capability only after MMU notifiers are implemented
- Made vmid_version an unsigned long instead of atomic
- Renamed KVM_REQ_UPDATE_PGTBL to KVM_REQ_UPDATE_HGATP
- Renamed kvm_riscv_stage2_update_pgtbl() to kvm_riscv_stage2_update_hgatp()
- Configure HIDELEG and HEDELEG in kvm_arch_hardware_enable()
- Updated ONE_REG interface for CSR access to user-space
- Removed irqs_pending_lock and use atomic bitops instead
- Added separate patch for FP ONE_REG interface
- Added separate patch for updating MAINTAINERS file

Anup Patel (14):
RISC-V: Add hypervisor extension related CSR defines
RISC-V: Add initial skeletal KVM support
RISC-V: KVM: Implement VCPU create, init and destroy functions
RISC-V: KVM: Implement VCPU interrupts and requests handling
RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls
RISC-V: KVM: Implement VCPU world-switch
RISC-V: KVM: Handle MMIO exits for VCPU
RISC-V: KVM: Handle WFI exits for VCPU
RISC-V: KVM: Implement VMID allocator
RISC-V: KVM: Implement stage2 page table programming
RISC-V: KVM: Implement MMU notifiers
RISC-V: KVM: Document RISC-V specific parts of KVM API
RISC-V: KVM: Move sources to drivers/staging directory
RISC-V: KVM: Add MAINTAINERS entry

Atish Patra (4):
RISC-V: KVM: Add timer functionality
RISC-V: KVM: FP lazy save/restore
RISC-V: KVM: Implement ONE REG interface for FP registers
RISC-V: KVM: Add SBI v0.1 support

Documentation/virt/kvm/api.rst | 193 +++-
MAINTAINERS | 11 +
arch/riscv/Kconfig | 1 +
arch/riscv/Makefile | 1 +
arch/riscv/include/uapi/asm/kvm.h | 128 +++
drivers/clocksource/timer-riscv.c | 9 +
drivers/staging/riscv/kvm/Kconfig | 36 +
drivers/staging/riscv/kvm/Makefile | 23 +
drivers/staging/riscv/kvm/asm/kvm_csr.h | 105 ++
drivers/staging/riscv/kvm/asm/kvm_host.h | 271 +++++
drivers/staging/riscv/kvm/asm/kvm_types.h | 7 +
.../staging/riscv/kvm/asm/kvm_vcpu_timer.h | 44 +
drivers/staging/riscv/kvm/main.c | 118 +++
drivers/staging/riscv/kvm/mmu.c | 802 ++++++++++++++
drivers/staging/riscv/kvm/riscv_offsets.c | 170 +++
drivers/staging/riscv/kvm/tlb.S | 74 ++
drivers/staging/riscv/kvm/vcpu.c | 997 ++++++++++++++++++
drivers/staging/riscv/kvm/vcpu_exit.c | 701 ++++++++++++
drivers/staging/riscv/kvm/vcpu_sbi.c | 173 +++
drivers/staging/riscv/kvm/vcpu_switch.S | 401 +++++++
drivers/staging/riscv/kvm/vcpu_timer.c | 225 ++++
drivers/staging/riscv/kvm/vm.c | 81 ++
drivers/staging/riscv/kvm/vmid.c | 120 +++
include/clocksource/timer-riscv.h | 16 +
include/uapi/linux/kvm.h | 8 +
25 files changed, 4706 insertions(+), 9 deletions(-)
create mode 100644 arch/riscv/include/uapi/asm/kvm.h
create mode 100644 drivers/staging/riscv/kvm/Kconfig
create mode 100644 drivers/staging/riscv/kvm/Makefile
create mode 100644 drivers/staging/riscv/kvm/asm/kvm_csr.h
create mode 100644 drivers/staging/riscv/kvm/asm/kvm_host.h
create mode 100644 drivers/staging/riscv/kvm/asm/kvm_types.h
create mode 100644 drivers/staging/riscv/kvm/asm/kvm_vcpu_timer.h
create mode 100644 drivers/staging/riscv/kvm/main.c
create mode 100644 drivers/staging/riscv/kvm/mmu.c
create mode 100644 drivers/staging/riscv/kvm/riscv_offsets.c
create mode 100644 drivers/staging/riscv/kvm/tlb.S
create mode 100644 drivers/staging/riscv/kvm/vcpu.c
create mode 100644 drivers/staging/riscv/kvm/vcpu_exit.c
create mode 100644 drivers/staging/riscv/kvm/vcpu_sbi.c
create mode 100644 drivers/staging/riscv/kvm/vcpu_switch.S
create mode 100644 drivers/staging/riscv/kvm/vcpu_timer.c
create mode 100644 drivers/staging/riscv/kvm/vm.c
create mode 100644 drivers/staging/riscv/kvm/vmid.c
create mode 100644 include/clocksource/timer-riscv.h

--
2.25.1



2021-05-19 19:07:35

by Anup Patel

Subject: [PATCH v18 01/18] RISC-V: Add hypervisor extension related CSR defines

This patch adds asm/kvm_csr.h for RISC-V hypervisor extension
related defines.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_csr.h | 105 +++++++++++++++++++++++++++++++
1 file changed, 105 insertions(+)
create mode 100644 arch/riscv/include/asm/kvm_csr.h

diff --git a/arch/riscv/include/asm/kvm_csr.h b/arch/riscv/include/asm/kvm_csr.h
new file mode 100644
index 000000000000..def91f53514c
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_csr.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#ifndef __RISCV_KVM_CSR_H__
+#define __RISCV_KVM_CSR_H__
+
+#include <asm/csr.h>
+
+/* Interrupt causes (minus the high bit) */
+#define IRQ_VS_SOFT 2
+#define IRQ_VS_TIMER 6
+#define IRQ_VS_EXT 10
+
+/* Exception causes */
+#define EXC_INST_ILLEGAL 2
+#define EXC_HYPERVISOR_SYSCALL 9
+#define EXC_SUPERVISOR_SYSCALL 10
+#define EXC_INST_GUEST_PAGE_FAULT 20
+#define EXC_LOAD_GUEST_PAGE_FAULT 21
+#define EXC_VIRTUAL_INST_FAULT 22
+#define EXC_STORE_GUEST_PAGE_FAULT 23
+
+/* HSTATUS flags */
+#ifdef CONFIG_64BIT
+#define HSTATUS_VSXL _AC(0x300000000, UL)
+#define HSTATUS_VSXL_SHIFT 32
+#endif
+#define HSTATUS_VTSR _AC(0x00400000, UL)
+#define HSTATUS_VTW _AC(0x00200000, UL)
+#define HSTATUS_VTVM _AC(0x00100000, UL)
+#define HSTATUS_VGEIN _AC(0x0003f000, UL)
+#define HSTATUS_VGEIN_SHIFT 12
+#define HSTATUS_HU _AC(0x00000200, UL)
+#define HSTATUS_SPVP _AC(0x00000100, UL)
+#define HSTATUS_SPV _AC(0x00000080, UL)
+#define HSTATUS_GVA _AC(0x00000040, UL)
+#define HSTATUS_VSBE _AC(0x00000020, UL)
+
+/* HGATP flags */
+#define HGATP_MODE_OFF _AC(0, UL)
+#define HGATP_MODE_SV32X4 _AC(1, UL)
+#define HGATP_MODE_SV39X4 _AC(8, UL)
+#define HGATP_MODE_SV48X4 _AC(9, UL)
+
+#define HGATP32_MODE_SHIFT 31
+#define HGATP32_VMID_SHIFT 22
+#define HGATP32_VMID_MASK _AC(0x1FC00000, UL)
+#define HGATP32_PPN _AC(0x003FFFFF, UL)
+
+#define HGATP64_MODE_SHIFT 60
+#define HGATP64_VMID_SHIFT 44
+#define HGATP64_VMID_MASK _AC(0x03FFF00000000000, UL)
+#define HGATP64_PPN _AC(0x00000FFFFFFFFFFF, UL)
+
+#define HGATP_PAGE_SHIFT 12
+
+#ifdef CONFIG_64BIT
+#define HGATP_PPN HGATP64_PPN
+#define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
+#define HGATP_VMID_MASK HGATP64_VMID_MASK
+#define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
+#else
+#define HGATP_PPN HGATP32_PPN
+#define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
+#define HGATP_VMID_MASK HGATP32_VMID_MASK
+#define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
+#endif
+
+/* VSIP & HVIP relation */
+#define VSIP_TO_HVIP_SHIFT (IRQ_VS_SOFT - IRQ_S_SOFT)
+#define VSIP_VALID_MASK ((_AC(1, UL) << IRQ_S_SOFT) | \
+ (_AC(1, UL) << IRQ_S_TIMER) | \
+ (_AC(1, UL) << IRQ_S_EXT))
+
+#define CSR_VSSTATUS 0x200
+#define CSR_VSIE 0x204
+#define CSR_VSTVEC 0x205
+#define CSR_VSSCRATCH 0x240
+#define CSR_VSEPC 0x241
+#define CSR_VSCAUSE 0x242
+#define CSR_VSTVAL 0x243
+#define CSR_VSIP 0x244
+#define CSR_VSATP 0x280
+
+#define CSR_HSTATUS 0x600
+#define CSR_HEDELEG 0x602
+#define CSR_HIDELEG 0x603
+#define CSR_HIE 0x604
+#define CSR_HTIMEDELTA 0x605
+#define CSR_HCOUNTEREN 0x606
+#define CSR_HGEIE 0x607
+#define CSR_HTIMEDELTAH 0x615
+#define CSR_HTVAL 0x643
+#define CSR_HIP 0x644
+#define CSR_HVIP 0x645
+#define CSR_HTINST 0x64a
+#define CSR_HGATP 0x680
+#define CSR_HGEIP 0xe12
+
+#endif
--
2.25.1


2021-05-19 19:07:40

by Anup Patel

Subject: [PATCH v18 02/18] RISC-V: Add initial skeletal KVM support

This patch adds initial skeletal KVM RISC-V support which has:
1. A simple implementation of arch specific VM functions
except kvm_vm_ioctl_get_dirty_log() which will be implemented
in the future as part of stage2 page logging.
2. Stubs of required arch specific VCPU functions except
kvm_arch_vcpu_ioctl_run() which is semi-complete and
extended by subsequent patches.
3. Stubs for required arch specific stage2 MMU functions.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/Kconfig | 1 +
arch/riscv/Makefile | 1 +
arch/riscv/include/asm/kvm_host.h | 89 +++++++++
arch/riscv/include/asm/kvm_types.h | 7 +
arch/riscv/include/uapi/asm/kvm.h | 47 +++++
arch/riscv/kvm/Kconfig | 33 +++
arch/riscv/kvm/Makefile | 13 ++
arch/riscv/kvm/main.c | 95 +++++++++
arch/riscv/kvm/mmu.c | 80 ++++++++
arch/riscv/kvm/vcpu.c | 311 +++++++++++++++++++++++++++++
arch/riscv/kvm/vcpu_exit.c | 35 ++++
arch/riscv/kvm/vm.c | 79 ++++++++
12 files changed, 791 insertions(+)
create mode 100644 arch/riscv/include/asm/kvm_host.h
create mode 100644 arch/riscv/include/asm/kvm_types.h
create mode 100644 arch/riscv/include/uapi/asm/kvm.h
create mode 100644 arch/riscv/kvm/Kconfig
create mode 100644 arch/riscv/kvm/Makefile
create mode 100644 arch/riscv/kvm/main.c
create mode 100644 arch/riscv/kvm/mmu.c
create mode 100644 arch/riscv/kvm/vcpu.c
create mode 100644 arch/riscv/kvm/vcpu_exit.c
create mode 100644 arch/riscv/kvm/vm.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 195c6d319ab8..d0602ea394bc 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -555,4 +555,5 @@ source "kernel/power/Kconfig"

endmenu

+source "arch/riscv/kvm/Kconfig"
source "drivers/firmware/Kconfig"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 3eb9590a0775..05687d8b7b99 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -92,6 +92,7 @@ head-y := arch/riscv/kernel/head.o

core-y += arch/riscv/
core-$(CONFIG_RISCV_ERRATA_ALTERNATIVE) += arch/riscv/errata/
+core-$(CONFIG_KVM) += arch/riscv/kvm/

libs-y += arch/riscv/lib/
libs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
new file mode 100644
index 000000000000..2068475bd168
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#ifndef __RISCV_KVM_HOST_H__
+#define __RISCV_KVM_HOST_H__
+
+#include <linux/types.h>
+#include <linux/kvm.h>
+#include <linux/kvm_types.h>
+
+#ifdef CONFIG_64BIT
+#define KVM_MAX_VCPUS (1U << 16)
+#else
+#define KVM_MAX_VCPUS (1U << 9)
+#endif
+
+#define KVM_HALT_POLL_NS_DEFAULT 500000
+
+#define KVM_VCPU_MAX_FEATURES 0
+
+#define KVM_REQ_SLEEP \
+ KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_VCPU_RESET KVM_ARCH_REQ(1)
+
+struct kvm_vm_stat {
+ ulong remote_tlb_flush;
+};
+
+struct kvm_vcpu_stat {
+ u64 halt_successful_poll;
+ u64 halt_attempted_poll;
+ u64 halt_poll_success_ns;
+ u64 halt_poll_fail_ns;
+ u64 halt_poll_invalid;
+ u64 halt_wakeup;
+ u64 ecall_exit_stat;
+ u64 wfi_exit_stat;
+ u64 mmio_exit_user;
+ u64 mmio_exit_kernel;
+ u64 exits;
+};
+
+struct kvm_arch_memory_slot {
+};
+
+struct kvm_arch {
+ /* stage2 page table */
+ pgd_t *pgd;
+ phys_addr_t pgd_phys;
+};
+
+struct kvm_cpu_trap {
+ unsigned long sepc;
+ unsigned long scause;
+ unsigned long stval;
+ unsigned long htval;
+ unsigned long htinst;
+};
+
+struct kvm_vcpu_arch {
+ /* Don't run the VCPU (blocked) */
+ bool pause;
+
+ /* SRCU lock index for in-kernel run loop */
+ int srcu_idx;
+};
+
+static inline void kvm_arch_hardware_unsetup(void) {}
+static inline void kvm_arch_sync_events(struct kvm *kvm) {}
+static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
+
+void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
+int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
+void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
+void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);
+
+int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ struct kvm_cpu_trap *trap);
+
+static inline void __kvm_riscv_switch_to(struct kvm_vcpu_arch *vcpu_arch) {}
+
+#endif /* __RISCV_KVM_HOST_H__ */
diff --git a/arch/riscv/include/asm/kvm_types.h b/arch/riscv/include/asm/kvm_types.h
new file mode 100644
index 000000000000..e476b404eb67
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_types.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_KVM_TYPES_H
+#define _ASM_RISCV_KVM_TYPES_H
+
+#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE 40
+
+#endif /* _ASM_RISCV_KVM_TYPES_H */
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
new file mode 100644
index 000000000000..984d041a3e3b
--- /dev/null
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#ifndef __LINUX_KVM_RISCV_H
+#define __LINUX_KVM_RISCV_H
+
+#ifndef __ASSEMBLY__
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#define __KVM_HAVE_READONLY_MEM
+
+#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+
+/* for KVM_GET_REGS and KVM_SET_REGS */
+struct kvm_regs {
+};
+
+/* for KVM_GET_FPU and KVM_SET_FPU */
+struct kvm_fpu {
+};
+
+/* KVM Debug exit structure */
+struct kvm_debug_exit_arch {
+};
+
+/* for KVM_SET_GUEST_DEBUG */
+struct kvm_guest_debug_arch {
+};
+
+/* definition of registers in kvm_run */
+struct kvm_sync_regs {
+};
+
+/* dummy definition */
+struct kvm_sregs {
+};
+
+#endif
+
+#endif /* __LINUX_KVM_RISCV_H */
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
new file mode 100644
index 000000000000..88edd477b3a8
--- /dev/null
+++ b/arch/riscv/kvm/Kconfig
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# KVM configuration
+#
+
+source "virt/kvm/Kconfig"
+
+menuconfig VIRTUALIZATION
+ bool "Virtualization"
+ help
+ Say Y here to get to see options for using your Linux host to run
+ other operating systems inside virtual machines (guests).
+ This option alone does not add any kernel code.
+
+ If you say N, all options in this submenu will be skipped and
+ disabled.
+
+if VIRTUALIZATION
+
+config KVM
+ tristate "Kernel-based Virtual Machine (KVM) support (EXPERIMENTAL)"
+ depends on RISCV_SBI && MMU
+ select PREEMPT_NOTIFIERS
+ select ANON_INODES
+ select KVM_MMIO
+ select HAVE_KVM_VCPU_ASYNC_IOCTL
+ select SRCU
+ help
+ Support hosting virtualized guest machines.
+
+ If unsure, say N.
+
+endif # VIRTUALIZATION
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
new file mode 100644
index 000000000000..37b5a59d4f4f
--- /dev/null
+++ b/arch/riscv/kvm/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for RISC-V KVM support
+#
+
+common-objs-y = $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o)
+
+ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm
+
+kvm-objs := $(common-objs-y)
+
+kvm-objs += main.o vm.o mmu.o vcpu.o vcpu_exit.o
+
+obj-$(CONFIG_KVM) += kvm.o
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
new file mode 100644
index 000000000000..c717d37fd87f
--- /dev/null
+++ b/arch/riscv/kvm/main.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_csr.h>
+#include <asm/hwcap.h>
+#include <asm/sbi.h>
+
+long kvm_arch_dev_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_check_processor_compat(void *opaque)
+{
+ return 0;
+}
+
+int kvm_arch_hardware_setup(void *opaque)
+{
+ return 0;
+}
+
+int kvm_arch_hardware_enable(void)
+{
+ unsigned long hideleg, hedeleg;
+
+ hedeleg = 0;
+ hedeleg |= (1UL << EXC_INST_MISALIGNED);
+ hedeleg |= (1UL << EXC_BREAKPOINT);
+ hedeleg |= (1UL << EXC_SYSCALL);
+ hedeleg |= (1UL << EXC_INST_PAGE_FAULT);
+ hedeleg |= (1UL << EXC_LOAD_PAGE_FAULT);
+ hedeleg |= (1UL << EXC_STORE_PAGE_FAULT);
+ csr_write(CSR_HEDELEG, hedeleg);
+
+ hideleg = 0;
+ hideleg |= (1UL << IRQ_VS_SOFT);
+ hideleg |= (1UL << IRQ_VS_TIMER);
+ hideleg |= (1UL << IRQ_VS_EXT);
+ csr_write(CSR_HIDELEG, hideleg);
+
+ csr_write(CSR_HCOUNTEREN, -1UL);
+
+ csr_write(CSR_HVIP, 0);
+
+ return 0;
+}
+
+void kvm_arch_hardware_disable(void)
+{
+ csr_write(CSR_HEDELEG, 0);
+ csr_write(CSR_HIDELEG, 0);
+}
+
+int kvm_arch_init(void *opaque)
+{
+ if (!riscv_isa_extension_available(NULL, h)) {
+ kvm_info("hypervisor extension not available\n");
+ return -ENODEV;
+ }
+
+ if (sbi_spec_is_0_1()) {
+ kvm_info("require SBI v0.2 or higher\n");
+ return -ENODEV;
+ }
+
+ if (sbi_probe_extension(SBI_EXT_RFENCE) <= 0) {
+ kvm_info("require SBI RFENCE extension\n");
+ return -ENODEV;
+ }
+
+ kvm_info("hypervisor extension available\n");
+
+ return 0;
+}
+
+void kvm_arch_exit(void)
+{
+}
+
+static int riscv_kvm_init(void)
+{
+ return kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
+}
+module_init(riscv_kvm_init);
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
new file mode 100644
index 000000000000..abfd2b22fa8e
--- /dev/null
+++ b/arch/riscv/kvm/mmu.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/hugetlb.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_host.h>
+#include <linux/sched/signal.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+}
+
+void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free)
+{
+}
+
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
+{
+}
+
+void kvm_arch_flush_shadow_all(struct kvm *kvm)
+{
+ /* TODO: */
+}
+
+void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+}
+
+void kvm_arch_commit_memory_region(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
+{
+ /* TODO: */
+}
+
+int kvm_arch_prepare_memory_region(struct kvm *kvm,
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region *mem,
+ enum kvm_mr_change change)
+{
+ /* TODO: */
+ return 0;
+}
+
+void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+}
+
+int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm)
+{
+ /* TODO: */
+ return 0;
+}
+
+void kvm_riscv_stage2_free_pgd(struct kvm *kvm)
+{
+ /* TODO: */
+}
+
+void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+}
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
new file mode 100644
index 000000000000..d76cecf93de4
--- /dev/null
+++ b/arch/riscv/kvm/vcpu.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kdebug.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_csr.h>
+#include <asm/delay.h>
+#include <asm/hwcap.h>
+
+struct kvm_stats_debugfs_item debugfs_entries[] = {
+ VCPU_STAT("halt_successful_poll", halt_successful_poll),
+ VCPU_STAT("halt_attempted_poll", halt_attempted_poll),
+ VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
+ VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
+ VCPU_STAT("halt_poll_invalid", halt_poll_invalid),
+ VCPU_STAT("halt_wakeup", halt_wakeup),
+ VCPU_STAT("ecall_exit_stat", ecall_exit_stat),
+ VCPU_STAT("wfi_exit_stat", wfi_exit_stat),
+ VCPU_STAT("mmio_exit_user", mmio_exit_user),
+ VCPU_STAT("mmio_exit_kernel", mmio_exit_kernel),
+ VCPU_STAT("exits", exits),
+ { NULL }
+};
+
+int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
+{
+ return 0;
+}
+
+int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return 0;
+}
+
+void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+}
+
+int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return 0;
+}
+
+void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+}
+
+int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return 0;
+}
+
+void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
+{
+}
+
+void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
+{
+}
+
+int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return 0;
+}
+
+int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return 0;
+}
+
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+ return false;
+}
+
+vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+long kvm_arch_vcpu_async_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ /* TODO: */
+ return -ENOIOCTLCMD;
+}
+
+long kvm_arch_vcpu_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ /* TODO: */
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
+ struct kvm_sregs *sregs)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
+ struct kvm_sregs *sregs)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
+ struct kvm_translation *tr)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+ return -EINVAL;
+}
+
+int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
+ struct kvm_mp_state *mp_state)
+{
+ /* TODO: */
+ return 0;
+}
+
+int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
+ struct kvm_mp_state *mp_state)
+{
+ /* TODO: */
+ return 0;
+}
+
+int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
+ struct kvm_guest_debug *dbg)
+{
+ /* TODO: To be implemented later. */
+ return -EINVAL;
+}
+
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ /* TODO: */
+
+ kvm_riscv_stage2_update_hgatp(vcpu);
+}
+
+void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+}
+
+static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
+{
+ /* TODO: */
+}
+
+int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ struct kvm_cpu_trap trap;
+ struct kvm_run *run = vcpu->run;
+
+ vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ /* Process MMIO value returned from user-space */
+ if (run->exit_reason == KVM_EXIT_MMIO) {
+ ret = kvm_riscv_vcpu_mmio_return(vcpu, vcpu->run);
+ if (ret) {
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+ return ret;
+ }
+ }
+
+ if (run->immediate_exit) {
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+ return -EINTR;
+ }
+
+ vcpu_load(vcpu);
+
+ kvm_sigset_activate(vcpu);
+
+ ret = 1;
+ run->exit_reason = KVM_EXIT_UNKNOWN;
+ while (ret > 0) {
+ /* Check conditions before entering the guest */
+ cond_resched();
+
+ kvm_riscv_check_vcpu_requests(vcpu);
+
+ preempt_disable();
+
+ local_irq_disable();
+
+ /*
+ * Exit if we have a signal pending so that we can deliver
+ * the signal to user space.
+ */
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ run->exit_reason = KVM_EXIT_INTR;
+ }
+
+ /*
+ * Ensure we set mode to IN_GUEST_MODE after we disable
+ * interrupts and before the final VCPU requests check.
+ * See the comment in kvm_vcpu_exiting_guest_mode() and
+ * Documentation/virt/kvm/vcpu-requests.rst
+ */
+ vcpu->mode = IN_GUEST_MODE;
+
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+ smp_mb__after_srcu_read_unlock();
+
+ if (ret <= 0 ||
+ kvm_request_pending(vcpu)) {
+ vcpu->mode = OUTSIDE_GUEST_MODE;
+ local_irq_enable();
+ preempt_enable();
+ vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ continue;
+ }
+
+ guest_enter_irqoff();
+
+ __kvm_riscv_switch_to(&vcpu->arch);
+
+ vcpu->mode = OUTSIDE_GUEST_MODE;
+ vcpu->stat.exits++;
+
+ /*
+ * Save SCAUSE, STVAL, HTVAL, and HTINST because we might
+ * get an interrupt between __kvm_riscv_switch_to() and
+ * local_irq_enable() which can potentially change CSRs.
+ */
+ trap.sepc = 0;
+ trap.scause = csr_read(CSR_SCAUSE);
+ trap.stval = csr_read(CSR_STVAL);
+ trap.htval = csr_read(CSR_HTVAL);
+ trap.htinst = csr_read(CSR_HTINST);
+
+ /*
+ * We may have taken a host interrupt in VS/VU-mode (i.e.
+ * while executing the guest). This interrupt is still
+ * pending, as we haven't serviced it yet!
+ *
+ * We're now back in HS-mode with interrupts disabled
+ * so enabling the interrupts now will have the effect
+ * of taking the interrupt again, in HS-mode this time.
+ */
+ local_irq_enable();
+
+ /*
+ * We do local_irq_enable() before calling guest_exit() so
+ * that if a timer interrupt hits while running the guest
+ * we account that tick as being spent in the guest. We
+ * enable preemption after calling guest_exit() so that if
+ * we get preempted we make sure ticks after that are not
+ * counted as guest time.
+ */
+ guest_exit();
+
+ preempt_enable();
+
+ vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ ret = kvm_riscv_vcpu_exit(vcpu, run, &trap);
+ }
+
+ kvm_sigset_deactivate(vcpu);
+
+ vcpu_put(vcpu);
+
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+
+ return ret;
+}
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
new file mode 100644
index 000000000000..4484e9200fe4
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+
+/**
+ * kvm_riscv_vcpu_mmio_return -- Handle MMIO loads after user space emulation
+ * or in-kernel IO emulation
+ *
+ * @vcpu: The VCPU pointer
+ * @run: The VCPU run struct containing the mmio data
+ */
+int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ /* TODO: */
+ return 0;
+}
+
+/*
+ * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
+ * proper exit to userspace.
+ */
+int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ struct kvm_cpu_trap *trap)
+{
+ /* TODO: */
+ return 0;
+}
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
new file mode 100644
index 000000000000..d6776b4819bb
--- /dev/null
+++ b/arch/riscv/kvm/vm.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/kvm_host.h>
+
+int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
+{
+ /* TODO: To be added later. */
+ return -EOPNOTSUPP;
+}
+
+int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
+{
+ int r;
+
+ r = kvm_riscv_stage2_alloc_pgd(kvm);
+ if (r)
+ return r;
+
+ return 0;
+}
+
+void kvm_arch_destroy_vm(struct kvm *kvm)
+{
+ int i;
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ if (kvm->vcpus[i]) {
+ kvm_arch_vcpu_destroy(kvm->vcpus[i]);
+ kvm->vcpus[i] = NULL;
+ }
+ }
+}
+
+int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
+{
+ int r;
+
+ switch (ext) {
+ case KVM_CAP_DEVICE_CTRL:
+ case KVM_CAP_USER_MEMORY:
+ case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
+ case KVM_CAP_ONE_REG:
+ case KVM_CAP_READONLY_MEM:
+ case KVM_CAP_MP_STATE:
+ case KVM_CAP_IMMEDIATE_EXIT:
+ r = 1;
+ break;
+ case KVM_CAP_NR_VCPUS:
+ r = num_online_cpus();
+ break;
+ case KVM_CAP_MAX_VCPUS:
+ r = KVM_MAX_VCPUS;
+ break;
+ case KVM_CAP_NR_MEMSLOTS:
+ r = KVM_USER_MEM_SLOTS;
+ break;
+ default:
+ r = 0;
+ break;
+ }
+
+ return r;
+}
+
+long kvm_arch_vm_ioctl(struct file *filp,
+ unsigned int ioctl, unsigned long arg)
+{
+ return -EINVAL;
+}
--
2.25.1


2021-05-19 19:07:42

by Anup Patel

Subject: [PATCH v18 08/18] RISC-V: KVM: Handle WFI exits for VCPU

We get a virtual instruction trap whenever the Guest/VM executes the
WFI instruction.

This patch handles the WFI trap by blocking the trapped VCPU using the
kvm_vcpu_block() API. The blocked VCPU is automatically resumed
whenever a VCPU interrupt is injected from user-space or from the
in-kernel IRQCHIP emulation.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/riscv/kvm/vcpu_exit.c | 76 ++++++++++++++++++++++++++++++++++++++
1 file changed, 76 insertions(+)

diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 80ab07ff0313..34d9bd9da585 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -12,6 +12,13 @@
#include <linux/kvm_host.h>
#include <asm/kvm_csr.h>

+#define INSN_OPCODE_MASK 0x007c
+#define INSN_OPCODE_SHIFT 2
+#define INSN_OPCODE_SYSTEM 28
+
+#define INSN_MASK_WFI 0xffffff00
+#define INSN_MATCH_WFI 0x10500000
+
#define INSN_MATCH_LB 0x3
#define INSN_MASK_LB 0x707f
#define INSN_MATCH_LH 0x1003
@@ -116,6 +123,71 @@
(s32)(((insn) >> 7) & 0x1f))
#define MASK_FUNCT3 0x7000

+static int truly_illegal_insn(struct kvm_vcpu *vcpu,
+ struct kvm_run *run,
+ ulong insn)
+{
+ struct kvm_cpu_trap utrap = { 0 };
+
+ /* Redirect trap to Guest VCPU */
+ utrap.sepc = vcpu->arch.guest_context.sepc;
+ utrap.scause = EXC_INST_ILLEGAL;
+ utrap.stval = insn;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+
+ return 1;
+}
+
+static int system_opcode_insn(struct kvm_vcpu *vcpu,
+ struct kvm_run *run,
+ ulong insn)
+{
+ if ((insn & INSN_MASK_WFI) == INSN_MATCH_WFI) {
+ vcpu->stat.wfi_exit_stat++;
+ if (!kvm_arch_vcpu_runnable(vcpu)) {
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+ kvm_vcpu_block(vcpu);
+ vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+ }
+ vcpu->arch.guest_context.sepc += INSN_LEN(insn);
+ return 1;
+ }
+
+ return truly_illegal_insn(vcpu, run, insn);
+}
+
+static int virtual_inst_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ struct kvm_cpu_trap *trap)
+{
+ unsigned long insn = trap->stval;
+ struct kvm_cpu_trap utrap = { 0 };
+ struct kvm_cpu_context *ct;
+
+ if (unlikely(INSN_IS_16BIT(insn))) {
+ if (insn == 0) {
+ ct = &vcpu->arch.guest_context;
+ insn = kvm_riscv_vcpu_unpriv_read(vcpu, true,
+ ct->sepc,
+ &utrap);
+ if (utrap.scause) {
+ utrap.sepc = ct->sepc;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+ return 1;
+ }
+ }
+ if (INSN_IS_16BIT(insn))
+ return truly_illegal_insn(vcpu, run, insn);
+ }
+
+ switch ((insn & INSN_OPCODE_MASK) >> INSN_OPCODE_SHIFT) {
+ case INSN_OPCODE_SYSTEM:
+ return system_opcode_insn(vcpu, run, insn);
+ default:
+ return truly_illegal_insn(vcpu, run, insn);
+ }
+}
+
static int emulate_load(struct kvm_vcpu *vcpu, struct kvm_run *run,
unsigned long fault_addr, unsigned long htinst)
{
@@ -596,6 +668,10 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
ret = -EFAULT;
run->exit_reason = KVM_EXIT_UNKNOWN;
switch (trap->scause) {
+ case EXC_VIRTUAL_INST_FAULT:
+ if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
+ ret = virtual_inst_fault(vcpu, run, trap);
+ break;
case EXC_INST_GUEST_PAGE_FAULT:
case EXC_LOAD_GUEST_PAGE_FAULT:
case EXC_STORE_GUEST_PAGE_FAULT:
--
2.25.1


2021-05-19 19:07:42

by Anup Patel

Subject: [PATCH v18 07/18] RISC-V: KVM: Handle MMIO exits for VCPU

We will get stage2 page faults whenever the Guest/VM accesses a
SW-emulated MMIO device or unmapped Guest RAM.

This patch implements MMIO read/write emulation by extracting MMIO
details from the trapped load/store instruction and forwarding the
MMIO read/write to user-space. The actual MMIO emulation happens in
user-space, and the KVM kernel module only takes care of register
updates before resuming the trapped VCPU.

The handling of stage2 page faults for unmapped Guest RAM will be
implemented by a separate patch later.

[jiangyifei: ioeventfd and in-kernel mmio device support]
Signed-off-by: Yifei Jiang <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 22 ++
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/mmu.c | 8 +
arch/riscv/kvm/riscv_offsets.c | 6 +
arch/riscv/kvm/vcpu_exit.c | 592 +++++++++++++++++++++++++++++-
arch/riscv/kvm/vcpu_switch.S | 23 ++
arch/riscv/kvm/vm.c | 1 +
8 files changed, 651 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 25b24606a89c..bd6d49aeebd9 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -54,6 +54,14 @@ struct kvm_arch {
phys_addr_t pgd_phys;
};

+struct kvm_mmio_decode {
+ unsigned long insn;
+ int insn_len;
+ int len;
+ int shift;
+ int return_handled;
+};
+
struct kvm_cpu_trap {
unsigned long sepc;
unsigned long scause;
@@ -152,6 +160,9 @@ struct kvm_vcpu_arch {
unsigned long irqs_pending;
unsigned long irqs_pending_mask;

+ /* MMIO instruction details */
+ struct kvm_mmio_decode mmio_decode;
+
/* VCPU power-off state */
bool power_off;

@@ -167,11 +178,22 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}

+int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *memslot,
+ gpa_t gpa, unsigned long hva, bool is_write);
void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);

+void __kvm_riscv_unpriv_trap(void);
+
+unsigned long kvm_riscv_vcpu_unpriv_read(struct kvm_vcpu *vcpu,
+ bool read_insn,
+ unsigned long guest_addr,
+ struct kvm_cpu_trap *trap);
+void kvm_riscv_vcpu_trap_redirect(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_trap *trap);
int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
struct kvm_cpu_trap *trap);
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index 88edd477b3a8..b42979f84042 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -24,6 +24,7 @@ config KVM
select ANON_INODES
select KVM_MMIO
select HAVE_KVM_VCPU_ASYNC_IOCTL
+ select HAVE_KVM_EVENTFD
select SRCU
help
Support hosting virtualized guest machines.
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index f3b60e8045a5..e121b940c9ec 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -3,6 +3,7 @@
#

common-objs-y = $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o)
+common-objs-y += $(addprefix ../../../virt/kvm/, eventfd.o)

ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm

diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index abfd2b22fa8e..8ec10ef861e7 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -58,6 +58,14 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return 0;
}

+int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *memslot,
+ gpa_t gpa, unsigned long hva, bool is_write)
+{
+ /* TODO: */
+ return 0;
+}
+
void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu)
{
/* TODO: */
diff --git a/arch/riscv/kvm/riscv_offsets.c b/arch/riscv/kvm/riscv_offsets.c
index a3d4effe9947..3c92d2a1ee82 100644
--- a/arch/riscv/kvm/riscv_offsets.c
+++ b/arch/riscv/kvm/riscv_offsets.c
@@ -88,5 +88,11 @@ int main(void)
OFFSET(KVM_ARCH_HOST_STVEC, kvm_vcpu_arch, host_stvec);
OFFSET(KVM_ARCH_HOST_SCOUNTEREN, kvm_vcpu_arch, host_scounteren);

+ OFFSET(KVM_ARCH_TRAP_SEPC, kvm_cpu_trap, sepc);
+ OFFSET(KVM_ARCH_TRAP_SCAUSE, kvm_cpu_trap, scause);
+ OFFSET(KVM_ARCH_TRAP_STVAL, kvm_cpu_trap, stval);
+ OFFSET(KVM_ARCH_TRAP_HTVAL, kvm_cpu_trap, htval);
+ OFFSET(KVM_ARCH_TRAP_HTINST, kvm_cpu_trap, htinst);
+
return 0;
}
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 4484e9200fe4..80ab07ff0313 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -6,9 +6,518 @@
* Anup Patel <[email protected]>
*/

+#include <linux/bitops.h>
#include <linux/errno.h>
#include <linux/err.h>
#include <linux/kvm_host.h>
+#include <asm/kvm_csr.h>
+
+#define INSN_MATCH_LB 0x3
+#define INSN_MASK_LB 0x707f
+#define INSN_MATCH_LH 0x1003
+#define INSN_MASK_LH 0x707f
+#define INSN_MATCH_LW 0x2003
+#define INSN_MASK_LW 0x707f
+#define INSN_MATCH_LD 0x3003
+#define INSN_MASK_LD 0x707f
+#define INSN_MATCH_LBU 0x4003
+#define INSN_MASK_LBU 0x707f
+#define INSN_MATCH_LHU 0x5003
+#define INSN_MASK_LHU 0x707f
+#define INSN_MATCH_LWU 0x6003
+#define INSN_MASK_LWU 0x707f
+#define INSN_MATCH_SB 0x23
+#define INSN_MASK_SB 0x707f
+#define INSN_MATCH_SH 0x1023
+#define INSN_MASK_SH 0x707f
+#define INSN_MATCH_SW 0x2023
+#define INSN_MASK_SW 0x707f
+#define INSN_MATCH_SD 0x3023
+#define INSN_MASK_SD 0x707f
+
+#define INSN_MATCH_C_LD 0x6000
+#define INSN_MASK_C_LD 0xe003
+#define INSN_MATCH_C_SD 0xe000
+#define INSN_MASK_C_SD 0xe003
+#define INSN_MATCH_C_LW 0x4000
+#define INSN_MASK_C_LW 0xe003
+#define INSN_MATCH_C_SW 0xc000
+#define INSN_MASK_C_SW 0xe003
+#define INSN_MATCH_C_LDSP 0x6002
+#define INSN_MASK_C_LDSP 0xe003
+#define INSN_MATCH_C_SDSP 0xe002
+#define INSN_MASK_C_SDSP 0xe003
+#define INSN_MATCH_C_LWSP 0x4002
+#define INSN_MASK_C_LWSP 0xe003
+#define INSN_MATCH_C_SWSP 0xc002
+#define INSN_MASK_C_SWSP 0xe003
+
+#define INSN_16BIT_MASK 0x3
+
+#define INSN_IS_16BIT(insn) (((insn) & INSN_16BIT_MASK) != INSN_16BIT_MASK)
+
+#define INSN_LEN(insn) (INSN_IS_16BIT(insn) ? 2 : 4)
+
+#ifdef CONFIG_64BIT
+#define LOG_REGBYTES 3
+#else
+#define LOG_REGBYTES 2
+#endif
+#define REGBYTES (1 << LOG_REGBYTES)
+
+#define SH_RD 7
+#define SH_RS1 15
+#define SH_RS2 20
+#define SH_RS2C 2
+
+#define RV_X(x, s, n) (((x) >> (s)) & ((1 << (n)) - 1))
+#define RVC_LW_IMM(x) ((RV_X(x, 6, 1) << 2) | \
+ (RV_X(x, 10, 3) << 3) | \
+ (RV_X(x, 5, 1) << 6))
+#define RVC_LD_IMM(x) ((RV_X(x, 10, 3) << 3) | \
+ (RV_X(x, 5, 2) << 6))
+#define RVC_LWSP_IMM(x) ((RV_X(x, 4, 3) << 2) | \
+ (RV_X(x, 12, 1) << 5) | \
+ (RV_X(x, 2, 2) << 6))
+#define RVC_LDSP_IMM(x) ((RV_X(x, 5, 2) << 3) | \
+ (RV_X(x, 12, 1) << 5) | \
+ (RV_X(x, 2, 3) << 6))
+#define RVC_SWSP_IMM(x) ((RV_X(x, 9, 4) << 2) | \
+ (RV_X(x, 7, 2) << 6))
+#define RVC_SDSP_IMM(x) ((RV_X(x, 10, 3) << 3) | \
+ (RV_X(x, 7, 3) << 6))
+#define RVC_RS1S(insn) (8 + RV_X(insn, SH_RD, 3))
+#define RVC_RS2S(insn) (8 + RV_X(insn, SH_RS2C, 3))
+#define RVC_RS2(insn) RV_X(insn, SH_RS2C, 5)
+
+#define SHIFT_RIGHT(x, y) \
+ ((y) < 0 ? ((x) << -(y)) : ((x) >> (y)))
+
+#define REG_MASK \
+ ((1 << (5 + LOG_REGBYTES)) - (1 << LOG_REGBYTES))
+
+#define REG_OFFSET(insn, pos) \
+ (SHIFT_RIGHT((insn), (pos) - LOG_REGBYTES) & REG_MASK)
+
+#define REG_PTR(insn, pos, regs) \
+ ((ulong *)((ulong)(regs) + REG_OFFSET(insn, pos)))
+
+#define GET_RM(insn) (((insn) >> 12) & 7)
+
+#define GET_RS1(insn, regs) (*REG_PTR(insn, SH_RS1, regs))
+#define GET_RS2(insn, regs) (*REG_PTR(insn, SH_RS2, regs))
+#define GET_RS1S(insn, regs) (*REG_PTR(RVC_RS1S(insn), 0, regs))
+#define GET_RS2S(insn, regs) (*REG_PTR(RVC_RS2S(insn), 0, regs))
+#define GET_RS2C(insn, regs) (*REG_PTR(insn, SH_RS2C, regs))
+#define GET_SP(regs) (*REG_PTR(2, 0, regs))
+#define SET_RD(insn, regs, val) (*REG_PTR(insn, SH_RD, regs) = (val))
+#define IMM_I(insn) ((s32)(insn) >> 20)
+#define IMM_S(insn) (((s32)(insn) >> 25 << 5) | \
+ (s32)(((insn) >> 7) & 0x1f))
+#define MASK_FUNCT3 0x7000
+
+static int emulate_load(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ unsigned long fault_addr, unsigned long htinst)
+{
+ u8 data_buf[8];
+ unsigned long insn;
+ int shift = 0, len = 0, insn_len = 0;
+ struct kvm_cpu_trap utrap = { 0 };
+ struct kvm_cpu_context *ct = &vcpu->arch.guest_context;
+
+ /* Determine trapped instruction */
+ if (htinst & 0x1) {
+ /*
+ * Bit[0] == 1 implies trapped instruction value is
+ * transformed instruction or custom instruction.
+ */
+ insn = htinst | INSN_16BIT_MASK;
+ insn_len = (htinst & BIT(1)) ? INSN_LEN(insn) : 2;
+ } else {
+ /*
+ * Bit[0] == 0 implies trapped instruction value is
+ * zero or special value.
+ */
+ insn = kvm_riscv_vcpu_unpriv_read(vcpu, true, ct->sepc,
+ &utrap);
+ if (utrap.scause) {
+ /* Redirect trap if we failed to read instruction */
+ utrap.sepc = ct->sepc;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+ return 1;
+ }
+ insn_len = INSN_LEN(insn);
+ }
+
+ /* Decode length of MMIO and shift */
+ if ((insn & INSN_MASK_LW) == INSN_MATCH_LW) {
+ len = 4;
+ shift = 8 * (sizeof(ulong) - len);
+ } else if ((insn & INSN_MASK_LB) == INSN_MATCH_LB) {
+ len = 1;
+ shift = 8 * (sizeof(ulong) - len);
+ } else if ((insn & INSN_MASK_LBU) == INSN_MATCH_LBU) {
+ len = 1;
+ shift = 8 * (sizeof(ulong) - len);
+#ifdef CONFIG_64BIT
+ } else if ((insn & INSN_MASK_LD) == INSN_MATCH_LD) {
+ len = 8;
+ shift = 8 * (sizeof(ulong) - len);
+ } else if ((insn & INSN_MASK_LWU) == INSN_MATCH_LWU) {
+ len = 4;
+#endif
+ } else if ((insn & INSN_MASK_LH) == INSN_MATCH_LH) {
+ len = 2;
+ shift = 8 * (sizeof(ulong) - len);
+ } else if ((insn & INSN_MASK_LHU) == INSN_MATCH_LHU) {
+ len = 2;
+#ifdef CONFIG_64BIT
+ } else if ((insn & INSN_MASK_C_LD) == INSN_MATCH_C_LD) {
+ len = 8;
+ shift = 8 * (sizeof(ulong) - len);
+ insn = RVC_RS2S(insn) << SH_RD;
+ } else if ((insn & INSN_MASK_C_LDSP) == INSN_MATCH_C_LDSP &&
+ ((insn >> SH_RD) & 0x1f)) {
+ len = 8;
+ shift = 8 * (sizeof(ulong) - len);
+#endif
+ } else if ((insn & INSN_MASK_C_LW) == INSN_MATCH_C_LW) {
+ len = 4;
+ shift = 8 * (sizeof(ulong) - len);
+ insn = RVC_RS2S(insn) << SH_RD;
+ } else if ((insn & INSN_MASK_C_LWSP) == INSN_MATCH_C_LWSP &&
+ ((insn >> SH_RD) & 0x1f)) {
+ len = 4;
+ shift = 8 * (sizeof(ulong) - len);
+ } else {
+ return -EOPNOTSUPP;
+ }
+
+ /* Fault address should be aligned to length of MMIO */
+ if (fault_addr & (len - 1))
+ return -EIO;
+
+ /* Save instruction decode info */
+ vcpu->arch.mmio_decode.insn = insn;
+ vcpu->arch.mmio_decode.insn_len = insn_len;
+ vcpu->arch.mmio_decode.shift = shift;
+ vcpu->arch.mmio_decode.len = len;
+ vcpu->arch.mmio_decode.return_handled = 0;
+
+ /* Update MMIO details in kvm_run struct */
+ run->mmio.is_write = false;
+ run->mmio.phys_addr = fault_addr;
+ run->mmio.len = len;
+
+ /* Try to handle MMIO access in the kernel */
+ if (!kvm_io_bus_read(vcpu, KVM_MMIO_BUS, fault_addr, len, data_buf)) {
+ /* Successfully handled MMIO access in the kernel so resume */
+ memcpy(run->mmio.data, data_buf, len);
+ vcpu->stat.mmio_exit_kernel++;
+ kvm_riscv_vcpu_mmio_return(vcpu, run);
+ return 1;
+ }
+
+ /* Exit to userspace for MMIO emulation */
+ vcpu->stat.mmio_exit_user++;
+ run->exit_reason = KVM_EXIT_MMIO;
+
+ return 0;
+}
+
+static int emulate_store(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ unsigned long fault_addr, unsigned long htinst)
+{
+ u8 data8;
+ u16 data16;
+ u32 data32;
+ u64 data64;
+ ulong data;
+ unsigned long insn;
+ int len = 0, insn_len = 0;
+ struct kvm_cpu_trap utrap = { 0 };
+ struct kvm_cpu_context *ct = &vcpu->arch.guest_context;
+
+ /* Determine trapped instruction */
+ if (htinst & 0x1) {
+ /*
+ * Bit[0] == 1 implies trapped instruction value is
+ * transformed instruction or custom instruction.
+ */
+ insn = htinst | INSN_16BIT_MASK;
+ insn_len = (htinst & BIT(1)) ? INSN_LEN(insn) : 2;
+ } else {
+ /*
+ * Bit[0] == 0 implies trapped instruction value is
+ * zero or special value.
+ */
+ insn = kvm_riscv_vcpu_unpriv_read(vcpu, true, ct->sepc,
+ &utrap);
+ if (utrap.scause) {
+ /* Redirect trap if we failed to read instruction */
+ utrap.sepc = ct->sepc;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+ return 1;
+ }
+ insn_len = INSN_LEN(insn);
+ }
+
+ data = GET_RS2(insn, &vcpu->arch.guest_context);
+ data8 = data16 = data32 = data64 = data;
+
+ if ((insn & INSN_MASK_SW) == INSN_MATCH_SW) {
+ len = 4;
+ } else if ((insn & INSN_MASK_SB) == INSN_MATCH_SB) {
+ len = 1;
+#ifdef CONFIG_64BIT
+ } else if ((insn & INSN_MASK_SD) == INSN_MATCH_SD) {
+ len = 8;
+#endif
+ } else if ((insn & INSN_MASK_SH) == INSN_MATCH_SH) {
+ len = 2;
+#ifdef CONFIG_64BIT
+ } else if ((insn & INSN_MASK_C_SD) == INSN_MATCH_C_SD) {
+ len = 8;
+ data64 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ } else if ((insn & INSN_MASK_C_SDSP) == INSN_MATCH_C_SDSP &&
+ ((insn >> SH_RD) & 0x1f)) {
+ len = 8;
+ data64 = GET_RS2C(insn, &vcpu->arch.guest_context);
+#endif
+ } else if ((insn & INSN_MASK_C_SW) == INSN_MATCH_C_SW) {
+ len = 4;
+ data32 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ } else if ((insn & INSN_MASK_C_SWSP) == INSN_MATCH_C_SWSP &&
+ ((insn >> SH_RD) & 0x1f)) {
+ len = 4;
+ data32 = GET_RS2C(insn, &vcpu->arch.guest_context);
+ } else {
+ return -EOPNOTSUPP;
+ }
+
+ /* Fault address should be aligned to length of MMIO */
+ if (fault_addr & (len - 1))
+ return -EIO;
+
+ /* Save instruction decode info */
+ vcpu->arch.mmio_decode.insn = insn;
+ vcpu->arch.mmio_decode.insn_len = insn_len;
+ vcpu->arch.mmio_decode.shift = 0;
+ vcpu->arch.mmio_decode.len = len;
+ vcpu->arch.mmio_decode.return_handled = 0;
+
+ /* Copy data to kvm_run instance */
+ switch (len) {
+ case 1:
+ *((u8 *)run->mmio.data) = data8;
+ break;
+ case 2:
+ *((u16 *)run->mmio.data) = data16;
+ break;
+ case 4:
+ *((u32 *)run->mmio.data) = data32;
+ break;
+ case 8:
+ *((u64 *)run->mmio.data) = data64;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ /* Update MMIO details in kvm_run struct */
+ run->mmio.is_write = true;
+ run->mmio.phys_addr = fault_addr;
+ run->mmio.len = len;
+
+ /* Try to handle MMIO access in the kernel */
+ if (!kvm_io_bus_write(vcpu, KVM_MMIO_BUS,
+ fault_addr, len, run->mmio.data)) {
+ /* Successfully handled MMIO access in the kernel so resume */
+ vcpu->stat.mmio_exit_kernel++;
+ kvm_riscv_vcpu_mmio_return(vcpu, run);
+ return 1;
+ }
+
+ /* Exit to userspace for MMIO emulation */
+ vcpu->stat.mmio_exit_user++;
+ run->exit_reason = KVM_EXIT_MMIO;
+
+ return 0;
+}
+
+static int stage2_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ struct kvm_cpu_trap *trap)
+{
+ struct kvm_memory_slot *memslot;
+ unsigned long hva, fault_addr;
+ bool writeable;
+ gfn_t gfn;
+ int ret;
+
+ fault_addr = (trap->htval << 2) | (trap->stval & 0x3);
+ gfn = fault_addr >> PAGE_SHIFT;
+ memslot = gfn_to_memslot(vcpu->kvm, gfn);
+ hva = gfn_to_hva_memslot_prot(memslot, gfn, &writeable);
+
+ if (kvm_is_error_hva(hva) ||
+ (trap->scause == EXC_STORE_GUEST_PAGE_FAULT && !writeable)) {
+ switch (trap->scause) {
+ case EXC_LOAD_GUEST_PAGE_FAULT:
+ return emulate_load(vcpu, run, fault_addr,
+ trap->htinst);
+ case EXC_STORE_GUEST_PAGE_FAULT:
+ return emulate_store(vcpu, run, fault_addr,
+ trap->htinst);
+ default:
+ return -EOPNOTSUPP;
+ }
+ }
+
+ ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva,
+ trap->scause == EXC_STORE_GUEST_PAGE_FAULT);
+ if (ret < 0)
+ return ret;
+
+ return 1;
+}
+
+/**
+ * kvm_riscv_vcpu_unpriv_read -- Read machine word from Guest memory
+ *
+ * @vcpu: The VCPU pointer
+ * @read_insn: Flag representing whether we are reading an instruction
+ * @guest_addr: Guest address to read
+ * @trap: Output pointer to trap details
+ */
+unsigned long kvm_riscv_vcpu_unpriv_read(struct kvm_vcpu *vcpu,
+ bool read_insn,
+ unsigned long guest_addr,
+ struct kvm_cpu_trap *trap)
+{
+ register unsigned long taddr asm("a0") = (unsigned long)trap;
+ register unsigned long ttmp asm("a1");
+ register unsigned long val asm("t0");
+ register unsigned long tmp asm("t1");
+ register unsigned long addr asm("t2") = guest_addr;
+ unsigned long flags;
+ unsigned long old_stvec, old_hstatus;
+
+ local_irq_save(flags);
+
+ old_hstatus = csr_swap(CSR_HSTATUS, vcpu->arch.guest_context.hstatus);
+ old_stvec = csr_swap(CSR_STVEC, (ulong)&__kvm_riscv_unpriv_trap);
+
+ if (read_insn) {
+ /*
+ * HLVX.HU instruction
+ * 0110010 00011 rs1 100 rd 1110011
+ */
+ asm volatile ("\n"
+ ".option push\n"
+ ".option norvc\n"
+ "add %[ttmp], %[taddr], 0\n"
+ /*
+ * HLVX.HU %[val], (%[addr])
+ * HLVX.HU t0, (t2)
+ * 0110010 00011 00111 100 00101 1110011
+ */
+ ".word 0x6433c2f3\n"
+ "andi %[tmp], %[val], 3\n"
+ "addi %[tmp], %[tmp], -3\n"
+ "bne %[tmp], zero, 2f\n"
+ "addi %[addr], %[addr], 2\n"
+ /*
+ * HLVX.HU %[tmp], (%[addr])
+ * HLVX.HU t1, (t2)
+ * 0110010 00011 00111 100 00110 1110011
+ */
+ ".word 0x6433c373\n"
+ "sll %[tmp], %[tmp], 16\n"
+ "add %[val], %[val], %[tmp]\n"
+ "2:\n"
+ ".option pop"
+ : [val] "=&r" (val), [tmp] "=&r" (tmp),
+ [taddr] "+&r" (taddr), [ttmp] "+&r" (ttmp),
+ [addr] "+&r" (addr) : : "memory");
+
+ if (trap->scause == EXC_LOAD_PAGE_FAULT)
+ trap->scause = EXC_INST_PAGE_FAULT;
+ } else {
+ /*
+ * HLV.D instruction
+ * 0110110 00000 rs1 100 rd 1110011
+ *
+ * HLV.W instruction
+ * 0110100 00000 rs1 100 rd 1110011
+ */
+ asm volatile ("\n"
+ ".option push\n"
+ ".option norvc\n"
+ "add %[ttmp], %[taddr], 0\n"
+#ifdef CONFIG_64BIT
+ /*
+ * HLV.D %[val], (%[addr])
+ * HLV.D t0, (t2)
+ * 0110110 00000 00111 100 00101 1110011
+ */
+ ".word 0x6c03c2f3\n"
+#else
+ /*
+ * HLV.W %[val], (%[addr])
+ * HLV.W t0, (t2)
+ * 0110100 00000 00111 100 00101 1110011
+ */
+ ".word 0x6803c2f3\n"
+#endif
+ ".option pop"
+ : [val] "=&r" (val),
+ [taddr] "+&r" (taddr), [ttmp] "+&r" (ttmp)
+ : [addr] "r" (addr) : "memory");
+ }
+
+ csr_write(CSR_STVEC, old_stvec);
+ csr_write(CSR_HSTATUS, old_hstatus);
+
+ local_irq_restore(flags);
+
+ return val;
+}
+
+/**
+ * kvm_riscv_vcpu_trap_redirect -- Redirect trap to Guest
+ *
+ * @vcpu: The VCPU pointer
+ * @trap: Trap details
+ */
+void kvm_riscv_vcpu_trap_redirect(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_trap *trap)
+{
+ unsigned long vsstatus = csr_read(CSR_VSSTATUS);
+
+ /* Change Guest SSTATUS.SPP bit */
+ vsstatus &= ~SR_SPP;
+ if (vcpu->arch.guest_context.sstatus & SR_SPP)
+ vsstatus |= SR_SPP;
+
+ /* Change Guest SSTATUS.SPIE bit */
+ vsstatus &= ~SR_SPIE;
+ if (vsstatus & SR_SIE)
+ vsstatus |= SR_SPIE;
+
+ /* Clear Guest SSTATUS.SIE bit */
+ vsstatus &= ~SR_SIE;
+
+ /* Update Guest SSTATUS */
+ csr_write(CSR_VSSTATUS, vsstatus);
+
+ /* Update Guest SCAUSE, STVAL, and SEPC */
+ csr_write(CSR_VSCAUSE, trap->scause);
+ csr_write(CSR_VSTVAL, trap->stval);
+ csr_write(CSR_VSEPC, trap->sepc);
+
+ /* Set Guest PC to Guest exception vector */
+ vcpu->arch.guest_context.sepc = csr_read(CSR_VSTVEC);
+}

/**
* kvm_riscv_vcpu_mmio_return -- Handle MMIO loads after user space emulation
@@ -19,7 +528,54 @@
*/
int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
- /* TODO: */
+ u8 data8;
+ u16 data16;
+ u32 data32;
+ u64 data64;
+ ulong insn;
+ int len, shift;
+
+ if (vcpu->arch.mmio_decode.return_handled)
+ return 0;
+
+ vcpu->arch.mmio_decode.return_handled = 1;
+ insn = vcpu->arch.mmio_decode.insn;
+
+ if (run->mmio.is_write)
+ goto done;
+
+ len = vcpu->arch.mmio_decode.len;
+ shift = vcpu->arch.mmio_decode.shift;
+
+ switch (len) {
+ case 1:
+ data8 = *((u8 *)run->mmio.data);
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data8 << shift >> shift);
+ break;
+ case 2:
+ data16 = *((u16 *)run->mmio.data);
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data16 << shift >> shift);
+ break;
+ case 4:
+ data32 = *((u32 *)run->mmio.data);
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data32 << shift >> shift);
+ break;
+ case 8:
+ data64 = *((u64 *)run->mmio.data);
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data64 << shift >> shift);
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+done:
+ /* Move to next instruction */
+ vcpu->arch.guest_context.sepc += vcpu->arch.mmio_decode.insn_len;
+
return 0;
}

@@ -30,6 +586,36 @@ int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
struct kvm_cpu_trap *trap)
{
- /* TODO: */
- return 0;
+ int ret;
+
+ /* If we got host interrupt then do nothing */
+ if (trap->scause & CAUSE_IRQ_FLAG)
+ return 1;
+
+ /* Handle guest traps */
+ ret = -EFAULT;
+ run->exit_reason = KVM_EXIT_UNKNOWN;
+ switch (trap->scause) {
+ case EXC_INST_GUEST_PAGE_FAULT:
+ case EXC_LOAD_GUEST_PAGE_FAULT:
+ case EXC_STORE_GUEST_PAGE_FAULT:
+ if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
+ ret = stage2_page_fault(vcpu, run, trap);
+ break;
+ default:
+ break;
+ }
+
+ /* Print details in-case of error */
+ if (ret < 0) {
+ kvm_err("VCPU exit error %d\n", ret);
+ kvm_err("SEPC=0x%lx SSTATUS=0x%lx HSTATUS=0x%lx\n",
+ vcpu->arch.guest_context.sepc,
+ vcpu->arch.guest_context.sstatus,
+ vcpu->arch.guest_context.hstatus);
+ kvm_err("SCAUSE=0x%lx STVAL=0x%lx HTVAL=0x%lx HTINST=0x%lx\n",
+ trap->scause, trap->stval, trap->htval, trap->htinst);
+ }
+
+ return ret;
}
diff --git a/arch/riscv/kvm/vcpu_switch.S b/arch/riscv/kvm/vcpu_switch.S
index 20237940db03..68d461729fd2 100644
--- a/arch/riscv/kvm/vcpu_switch.S
+++ b/arch/riscv/kvm/vcpu_switch.S
@@ -202,3 +202,26 @@ __kvm_switch_return:
/* Return to C code */
ret
ENDPROC(__kvm_riscv_switch_to)
+
+ENTRY(__kvm_riscv_unpriv_trap)
+ /*
+ * We assume that the faulting unprivileged load/store
+ * instruction is 4 bytes long and blindly increment SEPC by 4.
+ *
+ * The trap details will be saved at address pointed by 'A0'
+ * register and we use 'A1' register as temporary.
+ */
+ csrr a1, CSR_SEPC
+ REG_S a1, (KVM_ARCH_TRAP_SEPC)(a0)
+ addi a1, a1, 4
+ csrw CSR_SEPC, a1
+ csrr a1, CSR_SCAUSE
+ REG_S a1, (KVM_ARCH_TRAP_SCAUSE)(a0)
+ csrr a1, CSR_STVAL
+ REG_S a1, (KVM_ARCH_TRAP_STVAL)(a0)
+ csrr a1, CSR_HTVAL
+ REG_S a1, (KVM_ARCH_TRAP_HTVAL)(a0)
+ csrr a1, CSR_HTINST
+ REG_S a1, (KVM_ARCH_TRAP_HTINST)(a0)
+ sret
+ENDPROC(__kvm_riscv_unpriv_trap)
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index d6776b4819bb..496a86a74236 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -46,6 +46,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;

switch (ext) {
+ case KVM_CAP_IOEVENTFD:
case KVM_CAP_DEVICE_CTRL:
case KVM_CAP_USER_MEMORY:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
--
2.25.1


2021-05-19 19:07:44

by Anup Patel

Subject: [PATCH v18 03/18] RISC-V: KVM: Implement VCPU create, init and destroy functions

This patch implements the VCPU create, init and destroy functions
required by the generic KVM module. We don't have many dynamic
resources in struct kvm_vcpu_arch, so these functions are quite
simple for KVM RISC-V.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 69 +++++++++++++++++++++++++++++++
arch/riscv/kvm/vcpu.c | 55 ++++++++++++++++++++----
2 files changed, 115 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 2068475bd168..cf2a23bbd560 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -62,7 +62,76 @@ struct kvm_cpu_trap {
unsigned long htinst;
};

+struct kvm_cpu_context {
+ unsigned long zero;
+ unsigned long ra;
+ unsigned long sp;
+ unsigned long gp;
+ unsigned long tp;
+ unsigned long t0;
+ unsigned long t1;
+ unsigned long t2;
+ unsigned long s0;
+ unsigned long s1;
+ unsigned long a0;
+ unsigned long a1;
+ unsigned long a2;
+ unsigned long a3;
+ unsigned long a4;
+ unsigned long a5;
+ unsigned long a6;
+ unsigned long a7;
+ unsigned long s2;
+ unsigned long s3;
+ unsigned long s4;
+ unsigned long s5;
+ unsigned long s6;
+ unsigned long s7;
+ unsigned long s8;
+ unsigned long s9;
+ unsigned long s10;
+ unsigned long s11;
+ unsigned long t3;
+ unsigned long t4;
+ unsigned long t5;
+ unsigned long t6;
+ unsigned long sepc;
+ unsigned long sstatus;
+ unsigned long hstatus;
+};
+
+struct kvm_vcpu_csr {
+ unsigned long vsstatus;
+ unsigned long hie;
+ unsigned long vstvec;
+ unsigned long vsscratch;
+ unsigned long vsepc;
+ unsigned long vscause;
+ unsigned long vstval;
+ unsigned long hvip;
+ unsigned long vsatp;
+ unsigned long scounteren;
+};
+
struct kvm_vcpu_arch {
+ /* VCPU ran at least once */
+ bool ran_atleast_once;
+
+ /* ISA feature bits (similar to MISA) */
+ unsigned long isa;
+
+ /* CPU context of Guest VCPU */
+ struct kvm_cpu_context guest_context;
+
+ /* CPU CSR context of Guest VCPU */
+ struct kvm_vcpu_csr guest_csr;
+
+ /* CPU context upon Guest VCPU reset */
+ struct kvm_cpu_context guest_reset_context;
+
+ /* CPU CSR context upon Guest VCPU reset */
+ struct kvm_vcpu_csr guest_reset_csr;
+
/* Don't run the VCPU (blocked) */
bool pause;

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index d76cecf93de4..904d908a7544 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -35,6 +35,27 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};

+#define KVM_RISCV_ISA_ALLOWED (riscv_isa_extension_mask(a) | \
+ riscv_isa_extension_mask(c) | \
+ riscv_isa_extension_mask(d) | \
+ riscv_isa_extension_mask(f) | \
+ riscv_isa_extension_mask(i) | \
+ riscv_isa_extension_mask(m) | \
+ riscv_isa_extension_mask(s) | \
+ riscv_isa_extension_mask(u))
+
+static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ struct kvm_vcpu_csr *reset_csr = &vcpu->arch.guest_reset_csr;
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ struct kvm_cpu_context *reset_cntx = &vcpu->arch.guest_reset_context;
+
+ memcpy(csr, reset_csr, sizeof(*csr));
+
+ memcpy(cntx, reset_cntx, sizeof(*cntx));
+}
+
int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
{
return 0;
@@ -42,7 +63,25 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)

int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ struct kvm_cpu_context *cntx;
+
+ /* Mark this VCPU never ran */
+ vcpu->arch.ran_atleast_once = false;
+
+ /* Setup ISA features available to VCPU */
+ vcpu->arch.isa = riscv_isa_extension_base(NULL) & KVM_RISCV_ISA_ALLOWED;
+
+ /* Setup reset state of shadow SSTATUS and HSTATUS CSRs */
+ cntx = &vcpu->arch.guest_reset_context;
+ cntx->sstatus = SR_SPP | SR_SPIE;
+ cntx->hstatus = 0;
+ cntx->hstatus |= HSTATUS_VTW;
+ cntx->hstatus |= HSTATUS_SPVP;
+ cntx->hstatus |= HSTATUS_SPV;
+
+ /* Reset VCPU */
+ kvm_riscv_reset_vcpu(vcpu);
+
return 0;
}

@@ -50,15 +89,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
}

-int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
-{
- /* TODO: */
- return 0;
-}
-
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ /* Flush the pages pre-allocated for Stage2 page table mappings */
+ kvm_riscv_stage2_flush_cache(vcpu);
}

int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
@@ -194,6 +228,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
struct kvm_cpu_trap trap;
struct kvm_run *run = vcpu->run;

+ /* Mark this VCPU ran at least once */
+ vcpu->arch.ran_atleast_once = true;
+
vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

/* Process MMIO value returned from user-space */
@@ -267,7 +304,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
* get an interrupt between __kvm_riscv_switch_to() and
* local_irq_enable() which can potentially change CSRs.
*/
- trap.sepc = 0;
+ trap.sepc = vcpu->arch.guest_context.sepc;
trap.scause = csr_read(CSR_SCAUSE);
trap.stval = csr_read(CSR_STVAL);
trap.htval = csr_read(CSR_HTVAL);
--
2.25.1


2021-05-19 19:07:48

by Anup Patel

Subject: [PATCH v18 09/18] RISC-V: KVM: Implement VMID allocator

We implement a simple VMID allocator for Guests/VMs which:
1. Detects the number of VMID bits at boot-time
2. Uses an atomic counter to track the VMID version and increments
it whenever we run out of VMIDs
3. Flushes Guest TLBs on all host CPUs whenever we run out
of VMIDs
4. Force-updates the HW Stage2 VMID for each Guest VCPU, whenever the
VMID changes, using the VCPU request KVM_REQ_UPDATE_HGATP

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 24 ++++++
arch/riscv/kvm/Makefile | 3 +-
arch/riscv/kvm/main.c | 4 +
arch/riscv/kvm/tlb.S | 74 ++++++++++++++++++
arch/riscv/kvm/vcpu.c | 9 +++
arch/riscv/kvm/vm.c | 6 ++
arch/riscv/kvm/vmid.c | 120 ++++++++++++++++++++++++++++++
7 files changed, 239 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/kvm/tlb.S
create mode 100644 arch/riscv/kvm/vmid.c

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index bd6d49aeebd9..40449ab2916d 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
#define KVM_REQ_SLEEP \
KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_VCPU_RESET KVM_ARCH_REQ(1)
+#define KVM_REQ_UPDATE_HGATP KVM_ARCH_REQ(2)

struct kvm_vm_stat {
ulong remote_tlb_flush;
@@ -48,7 +49,19 @@ struct kvm_vcpu_stat {
struct kvm_arch_memory_slot {
};

+struct kvm_vmid {
+ /*
+ * Writes to vmid_version and vmid happen with vmid_lock held
+ * whereas reads happen without any lock held.
+ */
+ unsigned long vmid_version;
+ unsigned long vmid;
+};
+
struct kvm_arch {
+ /* stage2 vmid */
+ struct kvm_vmid vmid;
+
/* stage2 page table */
pgd_t *pgd;
phys_addr_t pgd_phys;
@@ -178,6 +191,11 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}

+void __kvm_riscv_hfence_gvma_vmid_gpa(unsigned long gpa, unsigned long vmid);
+void __kvm_riscv_hfence_gvma_vmid(unsigned long vmid);
+void __kvm_riscv_hfence_gvma_gpa(unsigned long gpa);
+void __kvm_riscv_hfence_gvma_all(void);
+
int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
gpa_t gpa, unsigned long hva, bool is_write);
@@ -186,6 +204,12 @@ int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);

+void kvm_riscv_stage2_vmid_detect(void);
+unsigned long kvm_riscv_stage2_vmid_bits(void);
+int kvm_riscv_stage2_vmid_init(struct kvm *kvm);
+bool kvm_riscv_stage2_vmid_ver_changed(struct kvm_vmid *vmid);
+void kvm_riscv_stage2_vmid_update(struct kvm_vcpu *vcpu);
+
void __kvm_riscv_unpriv_trap(void);

unsigned long kvm_riscv_vcpu_unpriv_read(struct kvm_vcpu *vcpu,
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index e121b940c9ec..98b294cbd96d 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -9,7 +9,8 @@ ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm

kvm-objs := $(common-objs-y)

-kvm-objs += main.o vm.o mmu.o vcpu.o vcpu_exit.o vcpu_switch.o
+kvm-objs += main.o vm.o vmid.o tlb.o mmu.o
+kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o

obj-$(CONFIG_KVM) += kvm.o

diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index c717d37fd87f..998110227d1e 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -79,8 +79,12 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}

+ kvm_riscv_stage2_vmid_detect();
+
kvm_info("hypervisor extension available\n");

+ kvm_info("VMID %ld bits available\n", kvm_riscv_stage2_vmid_bits());
+
return 0;
}

diff --git a/arch/riscv/kvm/tlb.S b/arch/riscv/kvm/tlb.S
new file mode 100644
index 000000000000..c858570f0856
--- /dev/null
+++ b/arch/riscv/kvm/tlb.S
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/linkage.h>
+#include <asm/asm.h>
+
+ .text
+ .altmacro
+ .option norelax
+
+ /*
+ * The possible assembly forms of HFENCE.GVMA are:
+ * HFENCE.GVMA rs1, rs2
+ * HFENCE.GVMA zero, rs2
+ * HFENCE.GVMA rs1
+ * HFENCE.GVMA
+ *
+ * rs1!=zero and rs2!=zero ==> HFENCE.GVMA rs1, rs2
+ * rs1==zero and rs2!=zero ==> HFENCE.GVMA zero, rs2
+ * rs1!=zero and rs2==zero ==> HFENCE.GVMA rs1
+ * rs1==zero and rs2==zero ==> HFENCE.GVMA
+ *
+ * Instruction encoding of HFENCE.GVMA is:
+ * 0110001 rs2(5) rs1(5) 000 00000 1110011
+ */
+
+ENTRY(__kvm_riscv_hfence_gvma_vmid_gpa)
+ /*
+ * rs1 = a0 (GPA)
+ * rs2 = a1 (VMID)
+ * HFENCE.GVMA a0, a1
+ * 0110001 01011 01010 000 00000 1110011
+ */
+ .word 0x62b50073
+ ret
+ENDPROC(__kvm_riscv_hfence_gvma_vmid_gpa)
+
+ENTRY(__kvm_riscv_hfence_gvma_vmid)
+ /*
+ * rs1 = zero
+ * rs2 = a0 (VMID)
+ * HFENCE.GVMA zero, a0
+ * 0110001 01010 00000 000 00000 1110011
+ */
+ .word 0x62a00073
+ ret
+ENDPROC(__kvm_riscv_hfence_gvma_vmid)
+
+ENTRY(__kvm_riscv_hfence_gvma_gpa)
+ /*
+ * rs1 = a0 (GPA)
+ * rs2 = zero
+ * HFENCE.GVMA a0
+ * 0110001 00000 01010 000 00000 1110011
+ */
+ .word 0x62050073
+ ret
+ENDPROC(__kvm_riscv_hfence_gvma_gpa)
+
+ENTRY(__kvm_riscv_hfence_gvma_all)
+ /*
+ * rs1 = zero
+ * rs2 = zero
+ * HFENCE.GVMA
+ * 0110001 00000 00000 000 00000 1110011
+ */
+ .word 0x62000073
+ ret
+ENDPROC(__kvm_riscv_hfence_gvma_all)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 654a4834a317..cbaf14502c25 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -622,6 +622,12 @@ static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)

if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
kvm_riscv_reset_vcpu(vcpu);
+
+ if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu))
+ kvm_riscv_stage2_update_hgatp(vcpu);
+
+ if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
+ __kvm_riscv_hfence_gvma_all();
}
}

@@ -667,6 +673,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
/* Check conditions before entering the guest */
cond_resched();

+ kvm_riscv_stage2_vmid_update(vcpu);
+
kvm_riscv_check_vcpu_requests(vcpu);

preempt_disable();
@@ -703,6 +711,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_riscv_update_hvip(vcpu);

if (ret <= 0 ||
+ kvm_riscv_stage2_vmid_ver_changed(&vcpu->kvm->arch.vmid) ||
kvm_request_pending(vcpu)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
local_irq_enable();
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 496a86a74236..282d67617229 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -26,6 +26,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (r)
return r;

+ r = kvm_riscv_stage2_vmid_init(kvm);
+ if (r) {
+ kvm_riscv_stage2_free_pgd(kvm);
+ return r;
+ }
+
return 0;
}

diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c
new file mode 100644
index 000000000000..aa643001bb6a
--- /dev/null
+++ b/arch/riscv/kvm/vmid.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Anup Patel <[email protected]>
+ */
+
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_csr.h>
+#include <asm/sbi.h>
+
+static unsigned long vmid_version = 1;
+static unsigned long vmid_next;
+static unsigned long vmid_bits;
+static DEFINE_SPINLOCK(vmid_lock);
+
+void kvm_riscv_stage2_vmid_detect(void)
+{
+ unsigned long old;
+
+ /* Figure out the number of VMID bits in HW */
+ old = csr_read(CSR_HGATP);
+ csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
+ vmid_bits = csr_read(CSR_HGATP);
+ vmid_bits = (vmid_bits & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;
+ vmid_bits = fls_long(vmid_bits);
+ csr_write(CSR_HGATP, old);
+
+ /* We polluted local TLB so flush all guest TLB */
+ __kvm_riscv_hfence_gvma_all();
+
+ /* We don't use VMID bits if they are not sufficient */
+ if ((1UL << vmid_bits) < num_possible_cpus())
+ vmid_bits = 0;
+}
+
+unsigned long kvm_riscv_stage2_vmid_bits(void)
+{
+ return vmid_bits;
+}
+
+int kvm_riscv_stage2_vmid_init(struct kvm *kvm)
+{
+ /* Mark the initial VMID and VMID version invalid */
+ kvm->arch.vmid.vmid_version = 0;
+ kvm->arch.vmid.vmid = 0;
+
+ return 0;
+}
+
+bool kvm_riscv_stage2_vmid_ver_changed(struct kvm_vmid *vmid)
+{
+ if (!vmid_bits)
+ return false;
+
+ return unlikely(READ_ONCE(vmid->vmid_version) !=
+ READ_ONCE(vmid_version));
+}
+
+void kvm_riscv_stage2_vmid_update(struct kvm_vcpu *vcpu)
+{
+ int i;
+ struct kvm_vcpu *v;
+ struct cpumask hmask;
+ struct kvm_vmid *vmid = &vcpu->kvm->arch.vmid;
+
+ if (!kvm_riscv_stage2_vmid_ver_changed(vmid))
+ return;
+
+ spin_lock(&vmid_lock);
+
+ /*
+ * We need to re-check the vmid_version here to see whether
+ * another VCPU has already allocated a valid VMID for this
+ * VM while we were waiting for vmid_lock.
+ */
+ if (!kvm_riscv_stage2_vmid_ver_changed(vmid)) {
+ spin_unlock(&vmid_lock);
+ return;
+ }
+
+ /* First user of a new VMID version? */
+ if (unlikely(vmid_next == 0)) {
+ WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1);
+ vmid_next = 1;
+
+ /*
+ * We ran out of VMIDs so we increment vmid_version and
+ * start assigning VMIDs from 1.
+ *
+ * This also means the existing VMID assignment to all Guest
+ * instances is invalid and we have to force VMID re-assignment
+ * for all Guest instances. Guest instances that were not
+ * running will automatically pick up new VMIDs because they
+ * will call kvm_riscv_stage2_vmid_update() whenever they enter
+ * the in-kernel run loop. For Guest instances that are already
+ * running, we force VM exits on all host CPUs using IPI and
+ * flush all Guest TLBs.
+ */
+ riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
+ sbi_remote_hfence_gvma(cpumask_bits(&hmask), 0, 0);
+ }
+
+ vmid->vmid = vmid_next;
+ vmid_next++;
+ vmid_next &= (1 << vmid_bits) - 1;
+
+ WRITE_ONCE(vmid->vmid_version, READ_ONCE(vmid_version));
+
+ spin_unlock(&vmid_lock);
+
+ /* Request stage2 page table update for all VCPUs */
+ kvm_for_each_vcpu(i, v, vcpu->kvm)
+ kvm_make_request(KVM_REQ_UPDATE_HGATP, v);
+}
--
2.25.1


2021-05-19 19:07:59

by Anup Patel

Subject: [PATCH v18 05/18] RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls

For KVM RISC-V, we use KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls to access
VCPU config and registers from user-space.

We have three types of VCPU registers:
1. CONFIG - these are VCPU config and capabilities
2. CORE - these are VCPU general purpose registers
3. CSR - these are VCPU control and status registers

The only CONFIG register currently available to user-space is ISA. The ISA
register is readable and writable, but user-space can only write the desired
VCPU ISA capabilities before running the VCPU for the first time.

The CORE registers available to user-space are PC, RA, SP, GP, TP, A0-A7,
T0-T6, S0-S11 and MODE. Most of these are RISC-V general purpose registers,
except PC and MODE. The PC register represents the program counter whereas
the MODE register represents the VCPU privilege mode (i.e. S/U-mode).

The CSRs available to user-space are SSTATUS, SIE, STVEC, SSCRATCH, SEPC,
SCAUSE, STVAL, SIP, and SATP. All of these are read/write registers.

In the future, more VCPU register types (such as FP) will be added for the
KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/riscv/include/uapi/asm/kvm.h | 53 ++++++-
arch/riscv/kvm/vcpu.c | 246 +++++++++++++++++++++++++++++-
2 files changed, 295 insertions(+), 4 deletions(-)

diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index 3d3d703713c6..f7e9dc388d54 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -41,10 +41,61 @@ struct kvm_guest_debug_arch {
struct kvm_sync_regs {
};

-/* dummy definition */
+/* for KVM_GET_SREGS and KVM_SET_SREGS */
struct kvm_sregs {
};

+/* CONFIG registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_config {
+ unsigned long isa;
+};
+
+/* CORE registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_core {
+ struct user_regs_struct regs;
+ unsigned long mode;
+};
+
+/* Possible privilege modes for kvm_riscv_core */
+#define KVM_RISCV_MODE_S 1
+#define KVM_RISCV_MODE_U 0
+
+/* CSR registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_csr {
+ unsigned long sstatus;
+ unsigned long sie;
+ unsigned long stvec;
+ unsigned long sscratch;
+ unsigned long sepc;
+ unsigned long scause;
+ unsigned long stval;
+ unsigned long sip;
+ unsigned long satp;
+ unsigned long scounteren;
+};
+
+#define KVM_REG_SIZE(id) \
+ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
+
+/* If you need to interpret the index values, here is the key: */
+#define KVM_REG_RISCV_TYPE_MASK 0x00000000FF000000
+#define KVM_REG_RISCV_TYPE_SHIFT 24
+
+/* Config registers are mapped as type 1 */
+#define KVM_REG_RISCV_CONFIG (0x01 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_CONFIG_REG(name) \
+ (offsetof(struct kvm_riscv_config, name) / sizeof(unsigned long))
+
+/* Core registers are mapped as type 2 */
+#define KVM_REG_RISCV_CORE (0x02 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_CORE_REG(name) \
+ (offsetof(struct kvm_riscv_core, name) / sizeof(unsigned long))
+
+/* Control and status registers are mapped as type 3 */
+#define KVM_REG_RISCV_CSR (0x03 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_CSR_REG(name) \
+ (offsetof(struct kvm_riscv_csr, name) / sizeof(unsigned long))
+
#endif

#endif /* __LINUX_KVM_RISCV_H */
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 1c3c3bd72df9..1df21f9a0d6a 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -18,7 +18,6 @@
#include <linux/fs.h>
#include <linux/kvm_host.h>
#include <asm/kvm_csr.h>
-#include <asm/delay.h>
#include <asm/hwcap.h>

struct kvm_stats_debugfs_item debugfs_entries[] = {
@@ -133,6 +132,225 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}

+static int kvm_riscv_vcpu_get_reg_config(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CONFIG);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+
+ switch (reg_num) {
+ case KVM_REG_RISCV_CONFIG_REG(isa):
+ reg_val = vcpu->arch.isa;
+ break;
+ default:
+ return -EINVAL;
+ };
+
+ if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_set_reg_config(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CONFIG);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+
+ if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ switch (reg_num) {
+ case KVM_REG_RISCV_CONFIG_REG(isa):
+ if (!vcpu->arch.ran_atleast_once) {
+ vcpu->arch.isa = reg_val;
+ vcpu->arch.isa &= riscv_isa_extension_base(NULL);
+ vcpu->arch.isa &= KVM_RISCV_ISA_ALLOWED;
+ } else {
+ return -EOPNOTSUPP;
+ }
+ break;
+ default:
+ return -EINVAL;
+ };
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_get_reg_core(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CORE);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_core) / sizeof(unsigned long))
+ return -EINVAL;
+
+ if (reg_num == KVM_REG_RISCV_CORE_REG(regs.pc))
+ reg_val = cntx->sepc;
+ else if (KVM_REG_RISCV_CORE_REG(regs.pc) < reg_num &&
+ reg_num <= KVM_REG_RISCV_CORE_REG(regs.t6))
+ reg_val = ((unsigned long *)cntx)[reg_num];
+ else if (reg_num == KVM_REG_RISCV_CORE_REG(mode))
+ reg_val = (cntx->sstatus & SR_SPP) ?
+ KVM_RISCV_MODE_S : KVM_RISCV_MODE_U;
+ else
+ return -EINVAL;
+
+ if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_set_reg_core(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CORE);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_core) / sizeof(unsigned long))
+ return -EINVAL;
+
+ if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ if (reg_num == KVM_REG_RISCV_CORE_REG(regs.pc))
+ cntx->sepc = reg_val;
+ else if (KVM_REG_RISCV_CORE_REG(regs.pc) < reg_num &&
+ reg_num <= KVM_REG_RISCV_CORE_REG(regs.t6))
+ ((unsigned long *)cntx)[reg_num] = reg_val;
+ else if (reg_num == KVM_REG_RISCV_CORE_REG(mode)) {
+ if (reg_val == KVM_RISCV_MODE_S)
+ cntx->sstatus |= SR_SPP;
+ else
+ cntx->sstatus &= ~SR_SPP;
+ } else
+ return -EINVAL;
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_get_reg_csr(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CSR);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_csr) / sizeof(unsigned long))
+ return -EINVAL;
+
+ if (reg_num == KVM_REG_RISCV_CSR_REG(sip)) {
+ kvm_riscv_vcpu_flush_interrupts(vcpu);
+ reg_val = csr->hvip >> VSIP_TO_HVIP_SHIFT;
+ reg_val = reg_val & VSIP_VALID_MASK;
+ } else if (reg_num == KVM_REG_RISCV_CSR_REG(sie)) {
+ reg_val = csr->hie >> VSIP_TO_HVIP_SHIFT;
+ reg_val = reg_val & VSIP_VALID_MASK;
+ } else
+ reg_val = ((unsigned long *)csr)[reg_num];
+
+ if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_set_reg_csr(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_CSR);
+ unsigned long reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_csr) / sizeof(unsigned long))
+ return -EINVAL;
+
+ if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ if (reg_num == KVM_REG_RISCV_CSR_REG(sip) ||
+ reg_num == KVM_REG_RISCV_CSR_REG(sie)) {
+ reg_val = reg_val & VSIP_VALID_MASK;
+ reg_val = reg_val << VSIP_TO_HVIP_SHIFT;
+ }
+
+ ((unsigned long *)csr)[reg_num] = reg_val;
+
+ if (reg_num == KVM_REG_RISCV_CSR_REG(sip))
+ WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CONFIG)
+ return kvm_riscv_vcpu_set_reg_config(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CORE)
+ return kvm_riscv_vcpu_set_reg_core(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CSR)
+ return kvm_riscv_vcpu_set_reg_csr(vcpu, reg);
+
+ return -EINVAL;
+}
+
+static int kvm_riscv_vcpu_get_reg(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CONFIG)
+ return kvm_riscv_vcpu_get_reg_config(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CORE)
+ return kvm_riscv_vcpu_get_reg_core(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CSR)
+ return kvm_riscv_vcpu_get_reg_csr(vcpu, reg);
+
+ return -EINVAL;
+}
+
long kvm_arch_vcpu_async_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -157,8 +375,30 @@ long kvm_arch_vcpu_async_ioctl(struct file *filp,
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
- /* TODO: */
- return -EINVAL;
+ struct kvm_vcpu *vcpu = filp->private_data;
+ void __user *argp = (void __user *)arg;
+ long r = -EINVAL;
+
+ switch (ioctl) {
+ case KVM_SET_ONE_REG:
+ case KVM_GET_ONE_REG: {
+ struct kvm_one_reg reg;
+
+ r = -EFAULT;
+ if (copy_from_user(&reg, argp, sizeof(reg)))
+ break;
+
+ if (ioctl == KVM_SET_ONE_REG)
+ r = kvm_riscv_vcpu_set_reg(vcpu, &reg);
+ else
+ r = kvm_riscv_vcpu_get_reg(vcpu, &reg);
+ break;
+ }
+ default:
+ break;
+ }
+
+ return r;
}

int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
--
2.25.1


2021-05-19 19:08:25

by Anup Patel

Subject: [PATCH v18 13/18] RISC-V: KVM: FP lazy save/restore

From: Atish Patra <[email protected]>

This patch adds floating point (F and D extension) context save/restore
for guest VCPUs. The FP context is saved and restored lazily, only when
the kernel enters/exits the in-kernel run loop and not during the KVM
world switch. This way, FP save/restore has minimal impact on KVM
performance.

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 5 +
arch/riscv/kvm/riscv_offsets.c | 72 +++++++++++++
arch/riscv/kvm/vcpu.c | 91 ++++++++++++++++
arch/riscv/kvm/vcpu_switch.S | 174 ++++++++++++++++++++++++++++++
4 files changed, 342 insertions(+)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 0134201afb8c..834c6986cc2d 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -130,6 +130,7 @@ struct kvm_cpu_context {
unsigned long sepc;
unsigned long sstatus;
unsigned long hstatus;
+ union __riscv_fp_state fp;
};

struct kvm_vcpu_csr {
@@ -244,6 +245,10 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
struct kvm_cpu_trap *trap);

void __kvm_riscv_switch_to(struct kvm_vcpu_arch *vcpu_arch);
+void __kvm_riscv_fp_f_save(struct kvm_cpu_context *context);
+void __kvm_riscv_fp_f_restore(struct kvm_cpu_context *context);
+void __kvm_riscv_fp_d_save(struct kvm_cpu_context *context);
+void __kvm_riscv_fp_d_restore(struct kvm_cpu_context *context);

int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
diff --git a/arch/riscv/kvm/riscv_offsets.c b/arch/riscv/kvm/riscv_offsets.c
index 3c92d2a1ee82..eafa51955dfb 100644
--- a/arch/riscv/kvm/riscv_offsets.c
+++ b/arch/riscv/kvm/riscv_offsets.c
@@ -94,5 +94,77 @@ int main(void)
OFFSET(KVM_ARCH_TRAP_HTVAL, kvm_cpu_trap, htval);
OFFSET(KVM_ARCH_TRAP_HTINST, kvm_cpu_trap, htinst);

+ /* F extension */
+
+ OFFSET(KVM_ARCH_FP_F_F0, kvm_cpu_context, fp.f.f[0]);
+ OFFSET(KVM_ARCH_FP_F_F1, kvm_cpu_context, fp.f.f[1]);
+ OFFSET(KVM_ARCH_FP_F_F2, kvm_cpu_context, fp.f.f[2]);
+ OFFSET(KVM_ARCH_FP_F_F3, kvm_cpu_context, fp.f.f[3]);
+ OFFSET(KVM_ARCH_FP_F_F4, kvm_cpu_context, fp.f.f[4]);
+ OFFSET(KVM_ARCH_FP_F_F5, kvm_cpu_context, fp.f.f[5]);
+ OFFSET(KVM_ARCH_FP_F_F6, kvm_cpu_context, fp.f.f[6]);
+ OFFSET(KVM_ARCH_FP_F_F7, kvm_cpu_context, fp.f.f[7]);
+ OFFSET(KVM_ARCH_FP_F_F8, kvm_cpu_context, fp.f.f[8]);
+ OFFSET(KVM_ARCH_FP_F_F9, kvm_cpu_context, fp.f.f[9]);
+ OFFSET(KVM_ARCH_FP_F_F10, kvm_cpu_context, fp.f.f[10]);
+ OFFSET(KVM_ARCH_FP_F_F11, kvm_cpu_context, fp.f.f[11]);
+ OFFSET(KVM_ARCH_FP_F_F12, kvm_cpu_context, fp.f.f[12]);
+ OFFSET(KVM_ARCH_FP_F_F13, kvm_cpu_context, fp.f.f[13]);
+ OFFSET(KVM_ARCH_FP_F_F14, kvm_cpu_context, fp.f.f[14]);
+ OFFSET(KVM_ARCH_FP_F_F15, kvm_cpu_context, fp.f.f[15]);
+ OFFSET(KVM_ARCH_FP_F_F16, kvm_cpu_context, fp.f.f[16]);
+ OFFSET(KVM_ARCH_FP_F_F17, kvm_cpu_context, fp.f.f[17]);
+ OFFSET(KVM_ARCH_FP_F_F18, kvm_cpu_context, fp.f.f[18]);
+ OFFSET(KVM_ARCH_FP_F_F19, kvm_cpu_context, fp.f.f[19]);
+ OFFSET(KVM_ARCH_FP_F_F20, kvm_cpu_context, fp.f.f[20]);
+ OFFSET(KVM_ARCH_FP_F_F21, kvm_cpu_context, fp.f.f[21]);
+ OFFSET(KVM_ARCH_FP_F_F22, kvm_cpu_context, fp.f.f[22]);
+ OFFSET(KVM_ARCH_FP_F_F23, kvm_cpu_context, fp.f.f[23]);
+ OFFSET(KVM_ARCH_FP_F_F24, kvm_cpu_context, fp.f.f[24]);
+ OFFSET(KVM_ARCH_FP_F_F25, kvm_cpu_context, fp.f.f[25]);
+ OFFSET(KVM_ARCH_FP_F_F26, kvm_cpu_context, fp.f.f[26]);
+ OFFSET(KVM_ARCH_FP_F_F27, kvm_cpu_context, fp.f.f[27]);
+ OFFSET(KVM_ARCH_FP_F_F28, kvm_cpu_context, fp.f.f[28]);
+ OFFSET(KVM_ARCH_FP_F_F29, kvm_cpu_context, fp.f.f[29]);
+ OFFSET(KVM_ARCH_FP_F_F30, kvm_cpu_context, fp.f.f[30]);
+ OFFSET(KVM_ARCH_FP_F_F31, kvm_cpu_context, fp.f.f[31]);
+ OFFSET(KVM_ARCH_FP_F_FCSR, kvm_cpu_context, fp.f.fcsr);
+
+ /* D extension */
+
+ OFFSET(KVM_ARCH_FP_D_F0, kvm_cpu_context, fp.d.f[0]);
+ OFFSET(KVM_ARCH_FP_D_F1, kvm_cpu_context, fp.d.f[1]);
+ OFFSET(KVM_ARCH_FP_D_F2, kvm_cpu_context, fp.d.f[2]);
+ OFFSET(KVM_ARCH_FP_D_F3, kvm_cpu_context, fp.d.f[3]);
+ OFFSET(KVM_ARCH_FP_D_F4, kvm_cpu_context, fp.d.f[4]);
+ OFFSET(KVM_ARCH_FP_D_F5, kvm_cpu_context, fp.d.f[5]);
+ OFFSET(KVM_ARCH_FP_D_F6, kvm_cpu_context, fp.d.f[6]);
+ OFFSET(KVM_ARCH_FP_D_F7, kvm_cpu_context, fp.d.f[7]);
+ OFFSET(KVM_ARCH_FP_D_F8, kvm_cpu_context, fp.d.f[8]);
+ OFFSET(KVM_ARCH_FP_D_F9, kvm_cpu_context, fp.d.f[9]);
+ OFFSET(KVM_ARCH_FP_D_F10, kvm_cpu_context, fp.d.f[10]);
+ OFFSET(KVM_ARCH_FP_D_F11, kvm_cpu_context, fp.d.f[11]);
+ OFFSET(KVM_ARCH_FP_D_F12, kvm_cpu_context, fp.d.f[12]);
+ OFFSET(KVM_ARCH_FP_D_F13, kvm_cpu_context, fp.d.f[13]);
+ OFFSET(KVM_ARCH_FP_D_F14, kvm_cpu_context, fp.d.f[14]);
+ OFFSET(KVM_ARCH_FP_D_F15, kvm_cpu_context, fp.d.f[15]);
+ OFFSET(KVM_ARCH_FP_D_F16, kvm_cpu_context, fp.d.f[16]);
+ OFFSET(KVM_ARCH_FP_D_F17, kvm_cpu_context, fp.d.f[17]);
+ OFFSET(KVM_ARCH_FP_D_F18, kvm_cpu_context, fp.d.f[18]);
+ OFFSET(KVM_ARCH_FP_D_F19, kvm_cpu_context, fp.d.f[19]);
+ OFFSET(KVM_ARCH_FP_D_F20, kvm_cpu_context, fp.d.f[20]);
+ OFFSET(KVM_ARCH_FP_D_F21, kvm_cpu_context, fp.d.f[21]);
+ OFFSET(KVM_ARCH_FP_D_F22, kvm_cpu_context, fp.d.f[22]);
+ OFFSET(KVM_ARCH_FP_D_F23, kvm_cpu_context, fp.d.f[23]);
+ OFFSET(KVM_ARCH_FP_D_F24, kvm_cpu_context, fp.d.f[24]);
+ OFFSET(KVM_ARCH_FP_D_F25, kvm_cpu_context, fp.d.f[25]);
+ OFFSET(KVM_ARCH_FP_D_F26, kvm_cpu_context, fp.d.f[26]);
+ OFFSET(KVM_ARCH_FP_D_F27, kvm_cpu_context, fp.d.f[27]);
+ OFFSET(KVM_ARCH_FP_D_F28, kvm_cpu_context, fp.d.f[28]);
+ OFFSET(KVM_ARCH_FP_D_F29, kvm_cpu_context, fp.d.f[29]);
+ OFFSET(KVM_ARCH_FP_D_F30, kvm_cpu_context, fp.d.f[30]);
+ OFFSET(KVM_ARCH_FP_D_F31, kvm_cpu_context, fp.d.f[31]);
+ OFFSET(KVM_ARCH_FP_D_FCSR, kvm_cpu_context, fp.d.fcsr);
+
return 0;
}
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index b6f19ca35562..f2f2321507e6 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -35,6 +35,86 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};

+#ifdef CONFIG_FPU
+static void kvm_riscv_vcpu_fp_reset(struct kvm_vcpu *vcpu)
+{
+ unsigned long isa = vcpu->arch.isa;
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+
+ cntx->sstatus &= ~SR_FS;
+ if (riscv_isa_extension_available(&isa, f) ||
+ riscv_isa_extension_available(&isa, d))
+ cntx->sstatus |= SR_FS_INITIAL;
+ else
+ cntx->sstatus |= SR_FS_OFF;
+}
+
+static void kvm_riscv_vcpu_fp_clean(struct kvm_cpu_context *cntx)
+{
+ cntx->sstatus &= ~SR_FS;
+ cntx->sstatus |= SR_FS_CLEAN;
+}
+
+static void kvm_riscv_vcpu_guest_fp_save(struct kvm_cpu_context *cntx,
+ unsigned long isa)
+{
+ if ((cntx->sstatus & SR_FS) == SR_FS_DIRTY) {
+ if (riscv_isa_extension_available(&isa, d))
+ __kvm_riscv_fp_d_save(cntx);
+ else if (riscv_isa_extension_available(&isa, f))
+ __kvm_riscv_fp_f_save(cntx);
+ kvm_riscv_vcpu_fp_clean(cntx);
+ }
+}
+
+static void kvm_riscv_vcpu_guest_fp_restore(struct kvm_cpu_context *cntx,
+ unsigned long isa)
+{
+ if ((cntx->sstatus & SR_FS) != SR_FS_OFF) {
+ if (riscv_isa_extension_available(&isa, d))
+ __kvm_riscv_fp_d_restore(cntx);
+ else if (riscv_isa_extension_available(&isa, f))
+ __kvm_riscv_fp_f_restore(cntx);
+ kvm_riscv_vcpu_fp_clean(cntx);
+ }
+}
+
+static void kvm_riscv_vcpu_host_fp_save(struct kvm_cpu_context *cntx)
+{
+ /* No need to check host sstatus as it can be modified outside */
+ if (riscv_isa_extension_available(NULL, d))
+ __kvm_riscv_fp_d_save(cntx);
+ else if (riscv_isa_extension_available(NULL, f))
+ __kvm_riscv_fp_f_save(cntx);
+}
+
+static void kvm_riscv_vcpu_host_fp_restore(struct kvm_cpu_context *cntx)
+{
+ if (riscv_isa_extension_available(NULL, d))
+ __kvm_riscv_fp_d_restore(cntx);
+ else if (riscv_isa_extension_available(NULL, f))
+ __kvm_riscv_fp_f_restore(cntx);
+}
+#else
+static void kvm_riscv_vcpu_fp_reset(struct kvm_vcpu *vcpu)
+{
+}
+static void kvm_riscv_vcpu_guest_fp_save(struct kvm_cpu_context *cntx,
+ unsigned long isa)
+{
+}
+static void kvm_riscv_vcpu_guest_fp_restore(struct kvm_cpu_context *cntx,
+ unsigned long isa)
+{
+}
+static void kvm_riscv_vcpu_host_fp_save(struct kvm_cpu_context *cntx)
+{
+}
+static void kvm_riscv_vcpu_host_fp_restore(struct kvm_cpu_context *cntx)
+{
+}
+#endif
+
#define KVM_RISCV_ISA_ALLOWED (riscv_isa_extension_mask(a) | \
riscv_isa_extension_mask(c) | \
riscv_isa_extension_mask(d) | \
@@ -55,6 +135,8 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)

memcpy(cntx, reset_cntx, sizeof(*cntx));

+ kvm_riscv_vcpu_fp_reset(vcpu);
+
kvm_riscv_vcpu_timer_reset(vcpu);

WRITE_ONCE(vcpu->arch.irqs_pending, 0);
@@ -189,6 +271,7 @@ static int kvm_riscv_vcpu_set_reg_config(struct kvm_vcpu *vcpu,
vcpu->arch.isa = reg_val;
vcpu->arch.isa &= riscv_isa_extension_base(NULL);
vcpu->arch.isa &= KVM_RISCV_ISA_ALLOWED;
+ kvm_riscv_vcpu_fp_reset(vcpu);
} else {
return -EOPNOTSUPP;
}
@@ -593,6 +676,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

kvm_riscv_vcpu_timer_restore(vcpu);

+ kvm_riscv_vcpu_host_fp_save(&vcpu->arch.host_context);
+ kvm_riscv_vcpu_guest_fp_restore(&vcpu->arch.guest_context,
+ vcpu->arch.isa);
+
vcpu->cpu = cpu;
}

@@ -602,6 +689,10 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)

vcpu->cpu = -1;

+ kvm_riscv_vcpu_guest_fp_save(&vcpu->arch.guest_context,
+ vcpu->arch.isa);
+ kvm_riscv_vcpu_host_fp_restore(&vcpu->arch.host_context);
+
csr_write(CSR_HGATP, 0);

csr->vsstatus = csr_read(CSR_VSSTATUS);
diff --git a/arch/riscv/kvm/vcpu_switch.S b/arch/riscv/kvm/vcpu_switch.S
index 68d461729fd2..cc81891c66d4 100644
--- a/arch/riscv/kvm/vcpu_switch.S
+++ b/arch/riscv/kvm/vcpu_switch.S
@@ -225,3 +225,177 @@ ENTRY(__kvm_riscv_unpriv_trap)
REG_S a1, (KVM_ARCH_TRAP_HTINST)(a0)
sret
ENDPROC(__kvm_riscv_unpriv_trap)
+
+#ifdef CONFIG_FPU
+ .align 3
+ .global __kvm_riscv_fp_f_save
+__kvm_riscv_fp_f_save:
+ csrr t2, CSR_SSTATUS
+ li t1, SR_FS
+ csrs CSR_SSTATUS, t1
+ frcsr t0
+ fsw f0, KVM_ARCH_FP_F_F0(a0)
+ fsw f1, KVM_ARCH_FP_F_F1(a0)
+ fsw f2, KVM_ARCH_FP_F_F2(a0)
+ fsw f3, KVM_ARCH_FP_F_F3(a0)
+ fsw f4, KVM_ARCH_FP_F_F4(a0)
+ fsw f5, KVM_ARCH_FP_F_F5(a0)
+ fsw f6, KVM_ARCH_FP_F_F6(a0)
+ fsw f7, KVM_ARCH_FP_F_F7(a0)
+ fsw f8, KVM_ARCH_FP_F_F8(a0)
+ fsw f9, KVM_ARCH_FP_F_F9(a0)
+ fsw f10, KVM_ARCH_FP_F_F10(a0)
+ fsw f11, KVM_ARCH_FP_F_F11(a0)
+ fsw f12, KVM_ARCH_FP_F_F12(a0)
+ fsw f13, KVM_ARCH_FP_F_F13(a0)
+ fsw f14, KVM_ARCH_FP_F_F14(a0)
+ fsw f15, KVM_ARCH_FP_F_F15(a0)
+ fsw f16, KVM_ARCH_FP_F_F16(a0)
+ fsw f17, KVM_ARCH_FP_F_F17(a0)
+ fsw f18, KVM_ARCH_FP_F_F18(a0)
+ fsw f19, KVM_ARCH_FP_F_F19(a0)
+ fsw f20, KVM_ARCH_FP_F_F20(a0)
+ fsw f21, KVM_ARCH_FP_F_F21(a0)
+ fsw f22, KVM_ARCH_FP_F_F22(a0)
+ fsw f23, KVM_ARCH_FP_F_F23(a0)
+ fsw f24, KVM_ARCH_FP_F_F24(a0)
+ fsw f25, KVM_ARCH_FP_F_F25(a0)
+ fsw f26, KVM_ARCH_FP_F_F26(a0)
+ fsw f27, KVM_ARCH_FP_F_F27(a0)
+ fsw f28, KVM_ARCH_FP_F_F28(a0)
+ fsw f29, KVM_ARCH_FP_F_F29(a0)
+ fsw f30, KVM_ARCH_FP_F_F30(a0)
+ fsw f31, KVM_ARCH_FP_F_F31(a0)
+ sw t0, KVM_ARCH_FP_F_FCSR(a0)
+ csrw CSR_SSTATUS, t2
+ ret
+
+ .align 3
+ .global __kvm_riscv_fp_d_save
+__kvm_riscv_fp_d_save:
+ csrr t2, CSR_SSTATUS
+ li t1, SR_FS
+ csrs CSR_SSTATUS, t1
+ frcsr t0
+ fsd f0, KVM_ARCH_FP_D_F0(a0)
+ fsd f1, KVM_ARCH_FP_D_F1(a0)
+ fsd f2, KVM_ARCH_FP_D_F2(a0)
+ fsd f3, KVM_ARCH_FP_D_F3(a0)
+ fsd f4, KVM_ARCH_FP_D_F4(a0)
+ fsd f5, KVM_ARCH_FP_D_F5(a0)
+ fsd f6, KVM_ARCH_FP_D_F6(a0)
+ fsd f7, KVM_ARCH_FP_D_F7(a0)
+ fsd f8, KVM_ARCH_FP_D_F8(a0)
+ fsd f9, KVM_ARCH_FP_D_F9(a0)
+ fsd f10, KVM_ARCH_FP_D_F10(a0)
+ fsd f11, KVM_ARCH_FP_D_F11(a0)
+ fsd f12, KVM_ARCH_FP_D_F12(a0)
+ fsd f13, KVM_ARCH_FP_D_F13(a0)
+ fsd f14, KVM_ARCH_FP_D_F14(a0)
+ fsd f15, KVM_ARCH_FP_D_F15(a0)
+ fsd f16, KVM_ARCH_FP_D_F16(a0)
+ fsd f17, KVM_ARCH_FP_D_F17(a0)
+ fsd f18, KVM_ARCH_FP_D_F18(a0)
+ fsd f19, KVM_ARCH_FP_D_F19(a0)
+ fsd f20, KVM_ARCH_FP_D_F20(a0)
+ fsd f21, KVM_ARCH_FP_D_F21(a0)
+ fsd f22, KVM_ARCH_FP_D_F22(a0)
+ fsd f23, KVM_ARCH_FP_D_F23(a0)
+ fsd f24, KVM_ARCH_FP_D_F24(a0)
+ fsd f25, KVM_ARCH_FP_D_F25(a0)
+ fsd f26, KVM_ARCH_FP_D_F26(a0)
+ fsd f27, KVM_ARCH_FP_D_F27(a0)
+ fsd f28, KVM_ARCH_FP_D_F28(a0)
+ fsd f29, KVM_ARCH_FP_D_F29(a0)
+ fsd f30, KVM_ARCH_FP_D_F30(a0)
+ fsd f31, KVM_ARCH_FP_D_F31(a0)
+ sw t0, KVM_ARCH_FP_D_FCSR(a0)
+ csrw CSR_SSTATUS, t2
+ ret
+
+ .align 3
+ .global __kvm_riscv_fp_f_restore
+__kvm_riscv_fp_f_restore:
+ csrr t2, CSR_SSTATUS
+ li t1, SR_FS
+ lw t0, KVM_ARCH_FP_F_FCSR(a0)
+ csrs CSR_SSTATUS, t1
+ flw f0, KVM_ARCH_FP_F_F0(a0)
+ flw f1, KVM_ARCH_FP_F_F1(a0)
+ flw f2, KVM_ARCH_FP_F_F2(a0)
+ flw f3, KVM_ARCH_FP_F_F3(a0)
+ flw f4, KVM_ARCH_FP_F_F4(a0)
+ flw f5, KVM_ARCH_FP_F_F5(a0)
+ flw f6, KVM_ARCH_FP_F_F6(a0)
+ flw f7, KVM_ARCH_FP_F_F7(a0)
+ flw f8, KVM_ARCH_FP_F_F8(a0)
+ flw f9, KVM_ARCH_FP_F_F9(a0)
+ flw f10, KVM_ARCH_FP_F_F10(a0)
+ flw f11, KVM_ARCH_FP_F_F11(a0)
+ flw f12, KVM_ARCH_FP_F_F12(a0)
+ flw f13, KVM_ARCH_FP_F_F13(a0)
+ flw f14, KVM_ARCH_FP_F_F14(a0)
+ flw f15, KVM_ARCH_FP_F_F15(a0)
+ flw f16, KVM_ARCH_FP_F_F16(a0)
+ flw f17, KVM_ARCH_FP_F_F17(a0)
+ flw f18, KVM_ARCH_FP_F_F18(a0)
+ flw f19, KVM_ARCH_FP_F_F19(a0)
+ flw f20, KVM_ARCH_FP_F_F20(a0)
+ flw f21, KVM_ARCH_FP_F_F21(a0)
+ flw f22, KVM_ARCH_FP_F_F22(a0)
+ flw f23, KVM_ARCH_FP_F_F23(a0)
+ flw f24, KVM_ARCH_FP_F_F24(a0)
+ flw f25, KVM_ARCH_FP_F_F25(a0)
+ flw f26, KVM_ARCH_FP_F_F26(a0)
+ flw f27, KVM_ARCH_FP_F_F27(a0)
+ flw f28, KVM_ARCH_FP_F_F28(a0)
+ flw f29, KVM_ARCH_FP_F_F29(a0)
+ flw f30, KVM_ARCH_FP_F_F30(a0)
+ flw f31, KVM_ARCH_FP_F_F31(a0)
+ fscsr t0
+ csrw CSR_SSTATUS, t2
+ ret
+
+ .align 3
+ .global __kvm_riscv_fp_d_restore
+__kvm_riscv_fp_d_restore:
+ csrr t2, CSR_SSTATUS
+ li t1, SR_FS
+ lw t0, KVM_ARCH_FP_D_FCSR(a0)
+ csrs CSR_SSTATUS, t1
+ fld f0, KVM_ARCH_FP_D_F0(a0)
+ fld f1, KVM_ARCH_FP_D_F1(a0)
+ fld f2, KVM_ARCH_FP_D_F2(a0)
+ fld f3, KVM_ARCH_FP_D_F3(a0)
+ fld f4, KVM_ARCH_FP_D_F4(a0)
+ fld f5, KVM_ARCH_FP_D_F5(a0)
+ fld f6, KVM_ARCH_FP_D_F6(a0)
+ fld f7, KVM_ARCH_FP_D_F7(a0)
+ fld f8, KVM_ARCH_FP_D_F8(a0)
+ fld f9, KVM_ARCH_FP_D_F9(a0)
+ fld f10, KVM_ARCH_FP_D_F10(a0)
+ fld f11, KVM_ARCH_FP_D_F11(a0)
+ fld f12, KVM_ARCH_FP_D_F12(a0)
+ fld f13, KVM_ARCH_FP_D_F13(a0)
+ fld f14, KVM_ARCH_FP_D_F14(a0)
+ fld f15, KVM_ARCH_FP_D_F15(a0)
+ fld f16, KVM_ARCH_FP_D_F16(a0)
+ fld f17, KVM_ARCH_FP_D_F17(a0)
+ fld f18, KVM_ARCH_FP_D_F18(a0)
+ fld f19, KVM_ARCH_FP_D_F19(a0)
+ fld f20, KVM_ARCH_FP_D_F20(a0)
+ fld f21, KVM_ARCH_FP_D_F21(a0)
+ fld f22, KVM_ARCH_FP_D_F22(a0)
+ fld f23, KVM_ARCH_FP_D_F23(a0)
+ fld f24, KVM_ARCH_FP_D_F24(a0)
+ fld f25, KVM_ARCH_FP_D_F25(a0)
+ fld f26, KVM_ARCH_FP_D_F26(a0)
+ fld f27, KVM_ARCH_FP_D_F27(a0)
+ fld f28, KVM_ARCH_FP_D_F28(a0)
+ fld f29, KVM_ARCH_FP_D_F29(a0)
+ fld f30, KVM_ARCH_FP_D_F30(a0)
+ fld f31, KVM_ARCH_FP_D_F31(a0)
+ fscsr t0
+ csrw CSR_SSTATUS, t2
+ ret
+#endif
--
2.25.1


2021-05-19 19:08:27

by Anup Patel

Subject: [PATCH v18 12/18] RISC-V: KVM: Add timer functionality

From: Atish Patra <[email protected]>

The RISC-V hypervisor specification doesn't have any virtual timer
feature.

Due to this, the guest VCPU timer will be programmed via SBI calls.
The host will use a separate hrtimer event for each guest VCPU to
provide timer functionality. We inject a virtual timer interrupt to
the guest VCPU whenever the guest VCPU hrtimer event expires.

This patch adds the guest VCPU timer implementation along with a ONE_REG
interface to access the VCPU timer state from user space.

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 7 +
arch/riscv/include/asm/kvm_vcpu_timer.h | 44 +++++
arch/riscv/include/uapi/asm/kvm.h | 17 ++
arch/riscv/kvm/Makefile | 2 +-
arch/riscv/kvm/vcpu.c | 14 ++
arch/riscv/kvm/vcpu_timer.c | 225 ++++++++++++++++++++++++
arch/riscv/kvm/vm.c | 2 +-
drivers/clocksource/timer-riscv.c | 9 +
include/clocksource/timer-riscv.h | 16 ++
9 files changed, 334 insertions(+), 2 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_timer.h
create mode 100644 arch/riscv/kvm/vcpu_timer.c
create mode 100644 include/clocksource/timer-riscv.h

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 51fe663b5093..0134201afb8c 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/kvm.h>
#include <linux/kvm_types.h>
+#include <asm/kvm_vcpu_timer.h>

#ifdef CONFIG_64BIT
#define KVM_MAX_VCPUS (1U << 16)
@@ -65,6 +66,9 @@ struct kvm_arch {
/* stage2 page table */
pgd_t *pgd;
phys_addr_t pgd_phys;
+
+ /* Guest Timer */
+ struct kvm_guest_timer timer;
};

struct kvm_mmio_decode {
@@ -180,6 +184,9 @@ struct kvm_vcpu_arch {
unsigned long irqs_pending;
unsigned long irqs_pending_mask;

+ /* VCPU Timer */
+ struct kvm_vcpu_timer timer;
+
/* MMIO instruction details */
struct kvm_mmio_decode mmio_decode;

diff --git a/arch/riscv/include/asm/kvm_vcpu_timer.h b/arch/riscv/include/asm/kvm_vcpu_timer.h
new file mode 100644
index 000000000000..375281eb49e0
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_vcpu_timer.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#ifndef __KVM_VCPU_RISCV_TIMER_H
+#define __KVM_VCPU_RISCV_TIMER_H
+
+#include <linux/hrtimer.h>
+
+struct kvm_guest_timer {
+ /* Mult & Shift values to get nanoseconds from cycles */
+ u32 nsec_mult;
+ u32 nsec_shift;
+ /* Time delta value */
+ u64 time_delta;
+};
+
+struct kvm_vcpu_timer {
+ /* Flag for whether init is done */
+ bool init_done;
+ /* Flag for whether timer event is configured */
+ bool next_set;
+ /* Next timer event cycles */
+ u64 next_cycles;
+ /* Underlying hrtimer instance */
+ struct hrtimer hrt;
+};
+
+int kvm_riscv_vcpu_timer_next_event(struct kvm_vcpu *vcpu, u64 ncycles);
+int kvm_riscv_vcpu_get_reg_timer(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg);
+int kvm_riscv_vcpu_set_reg_timer(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg);
+int kvm_riscv_vcpu_timer_init(struct kvm_vcpu *vcpu);
+int kvm_riscv_vcpu_timer_deinit(struct kvm_vcpu *vcpu);
+int kvm_riscv_vcpu_timer_reset(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_timer_restore(struct kvm_vcpu *vcpu);
+int kvm_riscv_guest_timer_init(struct kvm *kvm);
+
+#endif
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index f7e9dc388d54..08691dd27bcf 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -74,6 +74,18 @@ struct kvm_riscv_csr {
unsigned long scounteren;
};

+/* TIMER registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_timer {
+ __u64 frequency;
+ __u64 time;
+ __u64 compare;
+ __u64 state;
+};
+
+/* Possible states for kvm_riscv_timer */
+#define KVM_RISCV_TIMER_STATE_OFF 0
+#define KVM_RISCV_TIMER_STATE_ON 1
+
#define KVM_REG_SIZE(id) \
(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))

@@ -96,6 +108,11 @@ struct kvm_riscv_csr {
#define KVM_REG_RISCV_CSR_REG(name) \
(offsetof(struct kvm_riscv_csr, name) / sizeof(unsigned long))

+/* Timer registers are mapped as type 4 */
+#define KVM_REG_RISCV_TIMER (0x04 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_TIMER_REG(name) \
+ (offsetof(struct kvm_riscv_timer, name) / sizeof(__u64))
+
#endif

#endif /* __LINUX_KVM_RISCV_H */
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 98b294cbd96d..4f90443ab1ef 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -10,7 +10,7 @@ ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm
kvm-objs := $(common-objs-y)

kvm-objs += main.o vm.o vmid.o tlb.o mmu.o
-kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o
+kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o vcpu_timer.o

obj-$(CONFIG_KVM) += kvm.o

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index cbaf14502c25..b6f19ca35562 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -55,6 +55,8 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)

memcpy(cntx, reset_cntx, sizeof(*cntx));

+ kvm_riscv_vcpu_timer_reset(vcpu);
+
WRITE_ONCE(vcpu->arch.irqs_pending, 0);
WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
}
@@ -82,6 +84,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
cntx->hstatus |= HSTATUS_SPVP;
cntx->hstatus |= HSTATUS_SPV;

+ /* Setup VCPU timer */
+ kvm_riscv_vcpu_timer_init(vcpu);
+
/* Reset VCPU */
kvm_riscv_reset_vcpu(vcpu);

@@ -94,6 +99,9 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)

void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
{
+ /* Cleanup VCPU timer */
+ kvm_riscv_vcpu_timer_deinit(vcpu);
+
/* Flush the pages pre-allocated for Stage2 page table mappings */
kvm_riscv_stage2_flush_cache(vcpu);
}
@@ -334,6 +342,8 @@ static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
return kvm_riscv_vcpu_set_reg_core(vcpu, reg);
else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CSR)
return kvm_riscv_vcpu_set_reg_csr(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
+ return kvm_riscv_vcpu_set_reg_timer(vcpu, reg);

return -EINVAL;
}
@@ -347,6 +357,8 @@ static int kvm_riscv_vcpu_get_reg(struct kvm_vcpu *vcpu,
return kvm_riscv_vcpu_get_reg_core(vcpu, reg);
else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_CSR)
return kvm_riscv_vcpu_get_reg_csr(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
+ return kvm_riscv_vcpu_get_reg_timer(vcpu, reg);

return -EINVAL;
}
@@ -579,6 +591,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

kvm_riscv_stage2_update_hgatp(vcpu);

+ kvm_riscv_vcpu_timer_restore(vcpu);
+
vcpu->cpu = cpu;
}

diff --git a/arch/riscv/kvm/vcpu_timer.c b/arch/riscv/kvm/vcpu_timer.c
new file mode 100644
index 000000000000..ca08c420bf0a
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_timer.c
@@ -0,0 +1,225 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <linux/uaccess.h>
+#include <clocksource/timer-riscv.h>
+#include <asm/kvm_csr.h>
+#include <asm/delay.h>
+#include <asm/kvm_vcpu_timer.h>
+
+static u64 kvm_riscv_current_cycles(struct kvm_guest_timer *gt)
+{
+ return get_cycles64() + gt->time_delta;
+}
+
+static u64 kvm_riscv_delta_cycles2ns(u64 cycles,
+ struct kvm_guest_timer *gt,
+ struct kvm_vcpu_timer *t)
+{
+ unsigned long flags;
+ u64 cycles_now, cycles_delta, delta_ns;
+
+ local_irq_save(flags);
+ cycles_now = kvm_riscv_current_cycles(gt);
+ if (cycles_now < cycles)
+ cycles_delta = cycles - cycles_now;
+ else
+ cycles_delta = 0;
+ delta_ns = (cycles_delta * gt->nsec_mult) >> gt->nsec_shift;
+ local_irq_restore(flags);
+
+ return delta_ns;
+}
+
+static enum hrtimer_restart kvm_riscv_vcpu_hrtimer_expired(struct hrtimer *h)
+{
+ u64 delta_ns;
+ struct kvm_vcpu_timer *t = container_of(h, struct kvm_vcpu_timer, hrt);
+ struct kvm_vcpu *vcpu = container_of(t, struct kvm_vcpu, arch.timer);
+ struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+
+ if (kvm_riscv_current_cycles(gt) < t->next_cycles) {
+ delta_ns = kvm_riscv_delta_cycles2ns(t->next_cycles, gt, t);
+ hrtimer_forward_now(&t->hrt, ktime_set(0, delta_ns));
+ return HRTIMER_RESTART;
+ }
+
+ t->next_set = false;
+ kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_TIMER);
+
+ return HRTIMER_NORESTART;
+}
+
+static int kvm_riscv_vcpu_timer_cancel(struct kvm_vcpu_timer *t)
+{
+ if (!t->init_done || !t->next_set)
+ return -EINVAL;
+
+ hrtimer_cancel(&t->hrt);
+ t->next_set = false;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_timer_next_event(struct kvm_vcpu *vcpu, u64 ncycles)
+{
+ struct kvm_vcpu_timer *t = &vcpu->arch.timer;
+ struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+ u64 delta_ns;
+
+ if (!t->init_done)
+ return -EINVAL;
+
+ kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_TIMER);
+
+ delta_ns = kvm_riscv_delta_cycles2ns(ncycles, gt, t);
+ t->next_cycles = ncycles;
+ hrtimer_start(&t->hrt, ktime_set(0, delta_ns), HRTIMER_MODE_REL);
+ t->next_set = true;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_get_reg_timer(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_vcpu_timer *t = &vcpu->arch.timer;
+ struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+ u64 __user *uaddr = (u64 __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_TIMER);
+ u64 reg_val;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(u64))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_timer) / sizeof(u64))
+ return -EINVAL;
+
+ switch (reg_num) {
+ case KVM_REG_RISCV_TIMER_REG(frequency):
+ reg_val = riscv_timebase;
+ break;
+ case KVM_REG_RISCV_TIMER_REG(time):
+ reg_val = kvm_riscv_current_cycles(gt);
+ break;
+ case KVM_REG_RISCV_TIMER_REG(compare):
+ reg_val = t->next_cycles;
+ break;
+ case KVM_REG_RISCV_TIMER_REG(state):
+ reg_val = (t->next_set) ? KVM_RISCV_TIMER_STATE_ON :
+ KVM_RISCV_TIMER_STATE_OFF;
+ break;
+ default:
+ return -EINVAL;
+ };
+
+ if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_set_reg_timer(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ struct kvm_vcpu_timer *t = &vcpu->arch.timer;
+ struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+ u64 __user *uaddr = (u64 __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ KVM_REG_RISCV_TIMER);
+ u64 reg_val;
+ int ret = 0;
+
+ if (KVM_REG_SIZE(reg->id) != sizeof(u64))
+ return -EINVAL;
+ if (reg_num >= sizeof(struct kvm_riscv_timer) / sizeof(u64))
+ return -EINVAL;
+
+ if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ switch (reg_num) {
+ case KVM_REG_RISCV_TIMER_REG(frequency):
+ ret = -EOPNOTSUPP;
+ break;
+ case KVM_REG_RISCV_TIMER_REG(time):
+ gt->time_delta = reg_val - get_cycles64();
+ break;
+ case KVM_REG_RISCV_TIMER_REG(compare):
+ t->next_cycles = reg_val;
+ break;
+ case KVM_REG_RISCV_TIMER_REG(state):
+ if (reg_val == KVM_RISCV_TIMER_STATE_ON)
+ ret = kvm_riscv_vcpu_timer_next_event(vcpu, reg_val);
+ else
+ ret = kvm_riscv_vcpu_timer_cancel(t);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ };
+
+ return ret;
+}
+
+int kvm_riscv_vcpu_timer_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_timer *t = &vcpu->arch.timer;
+
+ if (t->init_done)
+ return -EINVAL;
+
+ hrtimer_init(&t->hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ t->hrt.function = kvm_riscv_vcpu_hrtimer_expired;
+ t->init_done = true;
+ t->next_set = false;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_timer_deinit(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ ret = kvm_riscv_vcpu_timer_cancel(&vcpu->arch.timer);
+ vcpu->arch.timer.init_done = false;
+
+ return ret;
+}
+
+int kvm_riscv_vcpu_timer_reset(struct kvm_vcpu *vcpu)
+{
+ return kvm_riscv_vcpu_timer_cancel(&vcpu->arch.timer);
+}
+
+void kvm_riscv_vcpu_timer_restore(struct kvm_vcpu *vcpu)
+{
+ struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+
+#ifdef CONFIG_64BIT
+ csr_write(CSR_HTIMEDELTA, gt->time_delta);
+#else
+ csr_write(CSR_HTIMEDELTA, (u32)(gt->time_delta));
+ csr_write(CSR_HTIMEDELTAH, (u32)(gt->time_delta >> 32));
+#endif
+}
+
+int kvm_riscv_guest_timer_init(struct kvm *kvm)
+{
+ struct kvm_guest_timer *gt = &kvm->arch.timer;
+
+ riscv_cs_get_mult_shift(&gt->nsec_mult, &gt->nsec_shift);
+ gt->time_delta = -get_cycles64();
+
+ return 0;
+}
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 00a1a88008be..253c45ee20f9 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -26,7 +26,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return r;
}

- return 0;
+ return kvm_riscv_guest_timer_init(kvm);
}

void kvm_arch_destroy_vm(struct kvm *kvm)
diff --git a/drivers/clocksource/timer-riscv.c b/drivers/clocksource/timer-riscv.c
index c51c5ed15aa7..1767f8bf2013 100644
--- a/drivers/clocksource/timer-riscv.c
+++ b/drivers/clocksource/timer-riscv.c
@@ -13,10 +13,12 @@
#include <linux/delay.h>
#include <linux/irq.h>
#include <linux/irqdomain.h>
+#include <linux/module.h>
#include <linux/sched_clock.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/interrupt.h>
#include <linux/of_irq.h>
+#include <clocksource/timer-riscv.h>
#include <asm/smp.h>
#include <asm/sbi.h>
#include <asm/timex.h>
@@ -79,6 +81,13 @@ static int riscv_timer_dying_cpu(unsigned int cpu)
return 0;
}

+void riscv_cs_get_mult_shift(u32 *mult, u32 *shift)
+{
+ *mult = riscv_clocksource.mult;
+ *shift = riscv_clocksource.shift;
+}
+EXPORT_SYMBOL_GPL(riscv_cs_get_mult_shift);
+
/* called directly from the low-level interrupt handler */
static irqreturn_t riscv_timer_interrupt(int irq, void *dev_id)
{
diff --git a/include/clocksource/timer-riscv.h b/include/clocksource/timer-riscv.h
new file mode 100644
index 000000000000..d7f455754e60
--- /dev/null
+++ b/include/clocksource/timer-riscv.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#ifndef __TIMER_RISCV_H
+#define __TIMER_RISCV_H
+
+#include <linux/types.h>
+
+extern void riscv_cs_get_mult_shift(u32 *mult, u32 *shift);
+
+#endif
--
2.25.1


2021-05-19 19:08:41

by Anup Patel

Subject: [PATCH v18 10/18] RISC-V: KVM: Implement stage2 page table programming

This patch implements all required functions for programming
the stage2 page table for each Guest/VM.

At a high level, the flow of the stage2-related functions is similar
to the KVM ARM/ARM64 implementation, but the stage2 page table
format is quite different for KVM RISC-V.

[jiangyifei: stage2 dirty log support]
Signed-off-by: Yifei Jiang <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 12 +
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/kvm/main.c | 19 +
arch/riscv/kvm/mmu.c | 654 +++++++++++++++++++++++++++++-
arch/riscv/kvm/vm.c | 6 -
5 files changed, 676 insertions(+), 16 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 40449ab2916d..d2a7d299d67c 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -75,6 +75,13 @@ struct kvm_mmio_decode {
int return_handled;
};

+#define KVM_MMU_PAGE_CACHE_NR_OBJS 32
+
+struct kvm_mmu_page_cache {
+ int nobjs;
+ void *objects[KVM_MMU_PAGE_CACHE_NR_OBJS];
+};
+
struct kvm_cpu_trap {
unsigned long sepc;
unsigned long scause;
@@ -176,6 +183,9 @@ struct kvm_vcpu_arch {
/* MMIO instruction details */
struct kvm_mmio_decode mmio_decode;

+ /* Cache pages needed to program page tables with spinlock held */
+ struct kvm_mmu_page_cache mmu_page_cache;
+
/* VCPU power-off state */
bool power_off;

@@ -203,6 +213,8 @@ void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);
+void kvm_riscv_stage2_mode_detect(void);
+unsigned long kvm_riscv_stage2_mode(void);

void kvm_riscv_stage2_vmid_detect(void);
unsigned long kvm_riscv_stage2_vmid_bits(void);
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index b42979f84042..633063edaee8 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
select PREEMPT_NOTIFIERS
select ANON_INODES
select KVM_MMIO
+ select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
select SRCU
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 998110227d1e..2860cb7b348d 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -64,6 +64,8 @@ void kvm_arch_hardware_disable(void)

int kvm_arch_init(void *opaque)
{
+ const char *str;
+
if (!riscv_isa_extension_available(NULL, h)) {
kvm_info("hypervisor extension not available\n");
return -ENODEV;
@@ -79,10 +81,27 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}

+ kvm_riscv_stage2_mode_detect();
+
kvm_riscv_stage2_vmid_detect();

kvm_info("hypervisor extension available\n");

+ switch (kvm_riscv_stage2_mode()) {
+ case HGATP_MODE_SV32X4:
+ str = "Sv32x4";
+ break;
+ case HGATP_MODE_SV39X4:
+ str = "Sv39x4";
+ break;
+ case HGATP_MODE_SV48X4:
+ str = "Sv48x4";
+ break;
+ default:
+ return -ENODEV;
+ }
+ kvm_info("using %s G-stage page table format\n", str);
+
kvm_info("VMID %ld bits available\n", kvm_riscv_stage2_vmid_bits());

return 0;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 8ec10ef861e7..fcf9967f4b29 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -15,13 +15,421 @@
#include <linux/vmalloc.h>
#include <linux/kvm_host.h>
#include <linux/sched/signal.h>
+#include <asm/kvm_csr.h>
#include <asm/page.h>
#include <asm/pgtable.h>
+#include <asm/sbi.h>
+
+#ifdef CONFIG_64BIT
+static unsigned long stage2_mode = (HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
+static unsigned long stage2_pgd_levels = 3;
+#define stage2_index_bits 9
+#else
+static unsigned long stage2_mode = (HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
+static unsigned long stage2_pgd_levels = 2;
+#define stage2_index_bits 10
+#endif
+
+#define stage2_pgd_xbits 2
+#define stage2_pgd_size (1UL << (HGATP_PAGE_SHIFT + stage2_pgd_xbits))
+#define stage2_gpa_bits (HGATP_PAGE_SHIFT + \
+ (stage2_pgd_levels * stage2_index_bits) + \
+ stage2_pgd_xbits)
+#define stage2_gpa_size ((gpa_t)(1ULL << stage2_gpa_bits))
+
+#define stage2_pte_leaf(__ptep) \
+ (pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
+
+static inline unsigned long stage2_pte_index(gpa_t addr, u32 level)
+{
+ unsigned long mask;
+ unsigned long shift = HGATP_PAGE_SHIFT + (stage2_index_bits * level);
+
+ if (level == (stage2_pgd_levels - 1))
+ mask = (PTRS_PER_PTE * (1UL << stage2_pgd_xbits)) - 1;
+ else
+ mask = PTRS_PER_PTE - 1;
+
+ return (addr >> shift) & mask;
+}
+
+static inline unsigned long stage2_pte_page_vaddr(pte_t pte)
+{
+ return (unsigned long)pfn_to_virt(pte_val(pte) >> _PAGE_PFN_SHIFT);
+}
+
+static int stage2_page_size_to_level(unsigned long page_size, u32 *out_level)
+{
+ u32 i;
+ unsigned long psz = 1UL << 12;
+
+ for (i = 0; i < stage2_pgd_levels; i++) {
+ if (page_size == (psz << (i * stage2_index_bits))) {
+ *out_level = i;
+ return 0;
+ }
+ }
+
+ return -EINVAL;
+}
+
+static int stage2_level_to_page_size(u32 level, unsigned long *out_pgsize)
+{
+ if (stage2_pgd_levels < level)
+ return -EINVAL;
+
+ *out_pgsize = 1UL << (12 + (level * stage2_index_bits));
+
+ return 0;
+}
+
+static int stage2_cache_topup(struct kvm_mmu_page_cache *pcache,
+ int min, int max)
+{
+ void *page;
+
+ BUG_ON(max > KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (pcache->nobjs >= min)
+ return 0;
+ while (pcache->nobjs < max) {
+ page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+ pcache->objects[pcache->nobjs++] = page;
+ }
+
+ return 0;
+}
+
+static void stage2_cache_flush(struct kvm_mmu_page_cache *pcache)
+{
+ while (pcache && pcache->nobjs)
+ free_page((unsigned long)pcache->objects[--pcache->nobjs]);
+}
+
+static void *stage2_cache_alloc(struct kvm_mmu_page_cache *pcache)
+{
+ void *p;
+
+ if (!pcache)
+ return NULL;
+
+ BUG_ON(!pcache->nobjs);
+ p = pcache->objects[--pcache->nobjs];
+
+ return p;
+}
+
+static bool stage2_get_leaf_entry(struct kvm *kvm, gpa_t addr,
+ pte_t **ptepp, u32 *ptep_level)
+{
+ pte_t *ptep;
+ u32 current_level = stage2_pgd_levels - 1;
+
+ *ptep_level = current_level;
+ ptep = (pte_t *)kvm->arch.pgd;
+ ptep = &ptep[stage2_pte_index(addr, current_level)];
+ while (ptep && pte_val(*ptep)) {
+ if (stage2_pte_leaf(ptep)) {
+ *ptep_level = current_level;
+ *ptepp = ptep;
+ return true;
+ }
+
+ if (current_level) {
+ current_level--;
+ *ptep_level = current_level;
+ ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ ptep = &ptep[stage2_pte_index(addr, current_level)];
+ } else {
+ ptep = NULL;
+ }
+ }
+
+ return false;
+}
+
+static void stage2_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)
+{
+ struct cpumask hmask;
+ unsigned long size = PAGE_SIZE;
+ struct kvm_vmid *vmid = &kvm->arch.vmid;
+
+ if (stage2_level_to_page_size(level, &size))
+ return;
+ addr &= ~(size - 1);
+
+ /*
+ * TODO: Instead of cpu_online_mask, we should only target CPUs
+ * where the Guest/VM is running.
+ */
+ preempt_disable();
+ riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
+ sbi_remote_hfence_gvma_vmid(cpumask_bits(&hmask), addr, size,
+ READ_ONCE(vmid->vmid));
+ preempt_enable();
+}
+
+static int stage2_set_pte(struct kvm *kvm, u32 level,
+ struct kvm_mmu_page_cache *pcache,
+ gpa_t addr, const pte_t *new_pte)
+{
+ u32 current_level = stage2_pgd_levels - 1;
+ pte_t *next_ptep = (pte_t *)kvm->arch.pgd;
+ pte_t *ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+
+ if (current_level < level)
+ return -EINVAL;
+
+ while (current_level != level) {
+ if (stage2_pte_leaf(ptep))
+ return -EEXIST;
+
+ if (!pte_val(*ptep)) {
+ next_ptep = stage2_cache_alloc(pcache);
+ if (!next_ptep)
+ return -ENOMEM;
+ *ptep = pfn_pte(PFN_DOWN(__pa(next_ptep)),
+ __pgprot(_PAGE_TABLE));
+ } else {
+ if (stage2_pte_leaf(ptep))
+ return -EEXIST;
+ next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ }
+
+ current_level--;
+ ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+ }
+
+ *ptep = *new_pte;
+ if (stage2_pte_leaf(ptep))
+ stage2_remote_tlb_flush(kvm, current_level, addr);
+
+ return 0;
+}
+
+static int stage2_map_page(struct kvm *kvm,
+ struct kvm_mmu_page_cache *pcache,
+ gpa_t gpa, phys_addr_t hpa,
+ unsigned long page_size,
+ bool page_rdonly, bool page_exec)
+{
+ int ret;
+ u32 level = 0;
+ pte_t new_pte;
+ pgprot_t prot;
+
+ ret = stage2_page_size_to_level(page_size, &level);
+ if (ret)
+ return ret;
+
+ /*
+ * A RISC-V implementation can choose to either:
+ * 1) Update 'A' and 'D' PTE bits in hardware
+ * 2) Generate page fault when 'A' and/or 'D' bits are not set
+ * PTE so that software can update these bits.
+ *
+ * We support both options mentioned above. To achieve this, we
+ * always set 'A' and 'D' PTE bits at time of creating stage2
+ * mapping. To support KVM dirty page logging with both options
+ * mentioned above, we will write-protect stage2 PTEs to track
+ * dirty pages.
+ */
+
+ if (page_exec) {
+ if (page_rdonly)
+ prot = PAGE_READ_EXEC;
+ else
+ prot = PAGE_WRITE_EXEC;
+ } else {
+ if (page_rdonly)
+ prot = PAGE_READ;
+ else
+ prot = PAGE_WRITE;
+ }
+ new_pte = pfn_pte(PFN_DOWN(hpa), prot);
+ new_pte = pte_mkdirty(new_pte);
+
+ return stage2_set_pte(kvm, level, pcache, gpa, &new_pte);
+}
+
+enum stage2_op {
+ STAGE2_OP_NOP = 0, /* Nothing */
+ STAGE2_OP_CLEAR, /* Clear/Unmap */
+ STAGE2_OP_WP, /* Write-protect */
+};
+
+static void stage2_op_pte(struct kvm *kvm, gpa_t addr,
+ pte_t *ptep, u32 ptep_level, enum stage2_op op)
+{
+ int i, ret;
+ pte_t *next_ptep;
+ u32 next_ptep_level;
+ unsigned long next_page_size, page_size;
+
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ return;
+
+ BUG_ON(addr & (page_size - 1));
+
+ if (!pte_val(*ptep))
+ return;
+
+ if (ptep_level && !stage2_pte_leaf(ptep)) {
+ next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ next_ptep_level = ptep_level - 1;
+ ret = stage2_level_to_page_size(next_ptep_level,
+ &next_page_size);
+ if (ret)
+ return;
+
+ if (op == STAGE2_OP_CLEAR)
+ set_pte(ptep, __pte(0));
+ for (i = 0; i < PTRS_PER_PTE; i++)
+ stage2_op_pte(kvm, addr + i * next_page_size,
+ &next_ptep[i], next_ptep_level, op);
+ if (op == STAGE2_OP_CLEAR)
+ put_page(virt_to_page(next_ptep));
+ } else {
+ if (op == STAGE2_OP_CLEAR)
+ set_pte(ptep, __pte(0));
+ else if (op == STAGE2_OP_WP)
+ set_pte(ptep, __pte(pte_val(*ptep) & ~_PAGE_WRITE));
+ stage2_remote_tlb_flush(kvm, ptep_level, addr);
+ }
+}
+
+static void stage2_unmap_range(struct kvm *kvm, gpa_t start, gpa_t size)
+{
+ int ret;
+ pte_t *ptep;
+ u32 ptep_level;
+ bool found_leaf;
+ unsigned long page_size;
+ gpa_t addr = start, end = start + size;
+
+ while (addr < end) {
+ found_leaf = stage2_get_leaf_entry(kvm, addr,
+ &ptep, &ptep_level);
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ break;
+
+ if (!found_leaf)
+ goto next;
+
+ if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
+ stage2_op_pte(kvm, addr, ptep,
+ ptep_level, STAGE2_OP_CLEAR);
+
+next:
+ addr += page_size;
+ }
+}
+
+static void stage2_wp_range(struct kvm *kvm, gpa_t start, gpa_t end)
+{
+ int ret;
+ pte_t *ptep;
+ u32 ptep_level;
+ bool found_leaf;
+ gpa_t addr = start;
+ unsigned long page_size;
+
+ while (addr < end) {
+ found_leaf = stage2_get_leaf_entry(kvm, addr,
+ &ptep, &ptep_level);
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ break;
+
+ if (!found_leaf)
+ goto next;
+
+ if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
+ stage2_op_pte(kvm, addr, ptep,
+ ptep_level, STAGE2_OP_WP);
+
+next:
+ addr += page_size;
+ }
+}
+
+static void stage2_wp_memory_region(struct kvm *kvm, int slot)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+ phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+ phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+ spin_lock(&kvm->mmu_lock);
+ stage2_wp_range(kvm, start, end);
+ spin_unlock(&kvm->mmu_lock);
+ kvm_flush_remote_tlbs(kvm);
+}
+
+static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
+ unsigned long size, bool writable)
+{
+ pte_t pte;
+ int ret = 0;
+ unsigned long pfn;
+ phys_addr_t addr, end;
+ struct kvm_mmu_page_cache pcache = { 0, };
+
+ end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
+ pfn = __phys_to_pfn(hpa);
+
+ for (addr = gpa; addr < end; addr += PAGE_SIZE) {
+ pte = pfn_pte(pfn, PAGE_KERNEL);
+
+ if (!writable)
+ pte = pte_wrprotect(pte);
+
+ ret = stage2_cache_topup(&pcache,
+ stage2_pgd_levels,
+ KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (ret)
+ goto out;
+
+ spin_lock(&kvm->mmu_lock);
+ ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+ spin_unlock(&kvm->mmu_lock);
+ if (ret)
+ goto out;
+
+ pfn++;
+ }
+
+out:
+ stage2_cache_flush(&pcache);
+ return ret;
+
+}
+
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset,
+ unsigned long mask)
+{
+ phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+ phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
+ phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+ stage2_wp_range(kvm, start, end);
+}

void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
}

+void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot)
+{
+ kvm_flush_remote_tlbs(kvm);
+}
+
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free)
{
}
@@ -32,7 +440,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- /* TODO: */
+ kvm_riscv_stage2_free_pgd(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
@@ -46,7 +454,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- /* TODO: */
+ /*
+ * At this point memslot has been committed and there is an
+ * allocated dirty_bitmap[], dirty pages will be tracked while
+ * the memory slot is write protected.
+ */
+ if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
+ stage2_wp_memory_region(kvm, mem->slot);
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -54,35 +468,255 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- /* TODO: */
- return 0;
+ hva_t hva = mem->userspace_addr;
+ hva_t reg_end = hva + mem->memory_size;
+ bool writable = !(mem->flags & KVM_MEM_READONLY);
+ int ret = 0;
+
+ if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
+ change != KVM_MR_FLAGS_ONLY)
+ return 0;
+
+ /*
+ * Prevent userspace from creating a memory region outside of the GPA
+ * space addressable by the KVM guest GPA space.
+ */
+ if ((memslot->base_gfn + memslot->npages) >=
+ (stage2_gpa_size >> PAGE_SHIFT))
+ return -EFAULT;
+
+ mmap_read_lock(current->mm);
+
+ /*
+ * A memory region could potentially cover multiple VMAs, and
+ * any holes between them, so iterate over all of them to find
+ * out if we can map any of them right now.
+ *
+ * +--------------------------------------------+
+ * +---------------+----------------+ +----------------+
+ * | : VMA 1 | VMA 2 | | VMA 3 : |
+ * +---------------+----------------+ +----------------+
+ * | memory region |
+ * +--------------------------------------------+
+ */
+ do {
+ struct vm_area_struct *vma = find_vma(current->mm, hva);
+ hva_t vm_start, vm_end;
+
+ if (!vma || vma->vm_start >= reg_end)
+ break;
+
+ /*
+ * Mapping a read-only VMA is only allowed if the
+ * memory region is configured as read-only.
+ */
+ if (writable && !(vma->vm_flags & VM_WRITE)) {
+ ret = -EPERM;
+ break;
+ }
+
+ /* Take the intersection of this VMA with the memory region */
+ vm_start = max(hva, vma->vm_start);
+ vm_end = min(reg_end, vma->vm_end);
+
+ if (vma->vm_flags & VM_PFNMAP) {
+ gpa_t gpa = mem->guest_phys_addr +
+ (vm_start - mem->userspace_addr);
+ phys_addr_t pa;
+
+ pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+ pa += vm_start - vma->vm_start;
+
+ /* IO region dirty page logging not allowed */
+ if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = stage2_ioremap(kvm, gpa, pa,
+ vm_end - vm_start, writable);
+ if (ret)
+ break;
+ }
+ hva = vm_end;
+ } while (hva < reg_end);
+
+ if (change == KVM_MR_FLAGS_ONLY)
+ goto out;
+
+ spin_lock(&kvm->mmu_lock);
+ if (ret)
+ stage2_unmap_range(kvm, mem->guest_phys_addr,
+ mem->memory_size);
+ spin_unlock(&kvm->mmu_lock);
+
+out:
+ mmap_read_unlock(current->mm);
+ return ret;
}

int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
gpa_t gpa, unsigned long hva, bool is_write)
{
- /* TODO: */
- return 0;
+ int ret;
+ kvm_pfn_t hfn;
+ bool writeable;
+ short vma_pageshift;
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct vm_area_struct *vma;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
+ bool logging = (memslot->dirty_bitmap &&
+ !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
+ unsigned long vma_pagesize;
+
+ mmap_read_lock(current->mm);
+
+ vma = find_vma_intersection(current->mm, hva, hva + 1);
+ if (unlikely(!vma)) {
+ kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+ mmap_read_unlock(current->mm);
+ return -EFAULT;
+ }
+
+ if (is_vm_hugetlb_page(vma))
+ vma_pageshift = huge_page_shift(hstate_vma(vma));
+ else
+ vma_pageshift = PAGE_SHIFT;
+ vma_pagesize = 1ULL << vma_pageshift;
+ if (logging || (vma->vm_flags & VM_PFNMAP))
+ vma_pagesize = PAGE_SIZE;
+
+ if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
+ gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
+
+ mmap_read_unlock(current->mm);
+
+ if (vma_pagesize != PGDIR_SIZE &&
+ vma_pagesize != PMD_SIZE &&
+ vma_pagesize != PAGE_SIZE) {
+ kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
+ return -EFAULT;
+ }
+
+ /* We need minimum second+third level pages */
+ ret = stage2_cache_topup(pcache, stage2_pgd_levels,
+ KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (ret) {
+ kvm_err("Failed to topup stage2 cache\n");
+ return ret;
+ }
+
+ hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
+ if (hfn == KVM_PFN_ERR_HWPOISON) {
+ send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
+ vma_pageshift, current);
+ return 0;
+ }
+ if (is_error_noslot_pfn(hfn))
+ return -EFAULT;
+
+ /*
+ * If logging is active then we allow writable pages only
+ * for write faults.
+ */
+ if (logging && !is_write)
+ writeable = false;
+
+ spin_lock(&kvm->mmu_lock);
+
+ if (writeable) {
+ kvm_set_pfn_dirty(hfn);
+ mark_page_dirty(kvm, gfn);
+ ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+ vma_pagesize, false, true);
+ } else {
+ ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+ vma_pagesize, true, true);
+ }
+
+ if (ret)
+ kvm_err("Failed to map in stage2\n");
+
+ spin_unlock(&kvm->mmu_lock);
+ kvm_set_pfn_accessed(hfn);
+ kvm_release_pfn_clean(hfn);
+ return ret;
}

void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ stage2_cache_flush(&vcpu->arch.mmu_page_cache);
}

int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm)
{
- /* TODO: */
+ struct page *pgd_page;
+
+ if (kvm->arch.pgd != NULL) {
+ kvm_err("kvm_arch already initialized?\n");
+ return -EINVAL;
+ }
+
+ pgd_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(stage2_pgd_size));
+ if (!pgd_page)
+ return -ENOMEM;
+ kvm->arch.pgd = page_to_virt(pgd_page);
+ kvm->arch.pgd_phys = page_to_phys(pgd_page);
+
return 0;
}

void kvm_riscv_stage2_free_pgd(struct kvm *kvm)
{
- /* TODO: */
+ void *pgd = NULL;
+
+ spin_lock(&kvm->mmu_lock);
+ if (kvm->arch.pgd) {
+ stage2_unmap_range(kvm, 0UL, stage2_gpa_size);
+ pgd = READ_ONCE(kvm->arch.pgd);
+ kvm->arch.pgd = NULL;
+ kvm->arch.pgd_phys = 0;
+ }
+ spin_unlock(&kvm->mmu_lock);
+
+ if (pgd)
+ free_pages((unsigned long)pgd, get_order(stage2_pgd_size));
}

void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ unsigned long hgatp = stage2_mode;
+ struct kvm_arch *k = &vcpu->kvm->arch;
+
+ hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) &
+ HGATP_VMID_MASK;
+ hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
+
+ csr_write(CSR_HGATP, hgatp);
+
+ if (!kvm_riscv_stage2_vmid_bits())
+ __kvm_riscv_hfence_gvma_all();
+}
+
+void kvm_riscv_stage2_mode_detect(void)
+{
+#ifdef CONFIG_64BIT
+ /* Try Sv48x4 stage2 mode */
+ csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
+ if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) {
+ stage2_mode = (HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
+ stage2_pgd_levels = 4;
+ }
+ csr_write(CSR_HGATP, 0);
+
+ __kvm_riscv_hfence_gvma_all();
+#endif
+}
+
+unsigned long kvm_riscv_stage2_mode(void)
+{
+ return stage2_mode >> HGATP_MODE_SHIFT;
}
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 282d67617229..6cde69a82252 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -12,12 +12,6 @@
#include <linux/uaccess.h>
#include <linux/kvm_host.h>

-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
- /* TODO: To be added later. */
- return -EOPNOTSUPP;
-}
-
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
int r;
--
2.25.1
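The kvm_riscv_stage2_update_hgatp() change above packs three fields into the hgatp CSR: the translation mode, the guest's VMID, and the PPN of the stage2 page directory. A minimal user-space sketch of that composition follows, using representative RV64 field offsets; the real constants live in asm/csr.h, so the values below are illustrative assumptions, not taken from the kernel headers:

```c
#include <stdint.h>

/* Representative RV64 hgatp field layout; illustrative assumptions,
 * not copied verbatim from the kernel headers. */
#define HGATP_MODE_SHIFT  60
#define HGATP_VMID_SHIFT  44
#define HGATP_VMID_MASK   (0x3FFFULL << HGATP_VMID_SHIFT)
#define HGATP_PPN         ((1ULL << 44) - 1)
#define HGATP_MODE_SV39X4 8ULL
#define HGATP_MODE_SV48X4 9ULL
#define PAGE_SHIFT        12

/* Mirrors the bitwise composition done in kvm_riscv_stage2_update_hgatp():
 * mode in the top bits, VMID in its field, PPN of the PGD in the low bits. */
static inline uint64_t compose_hgatp(uint64_t mode, uint64_t vmid,
				     uint64_t pgd_phys)
{
	uint64_t hgatp = mode << HGATP_MODE_SHIFT;

	hgatp |= (vmid << HGATP_VMID_SHIFT) & HGATP_VMID_MASK;
	hgatp |= (pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
	return hgatp;
}
```

With these assumed offsets, a PGD at physical address 0x80000000 under Sv39x4 with VMID 3 yields mode 8 in bits [63:60], VMID 3 in bits [57:44], and PPN 0x80000 in the low bits.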


2021-05-19 19:09:03

by Anup Patel

Subject: [PATCH v18 11/18] RISC-V: KVM: Implement MMU notifiers

This patch implements MMU notifiers for KVM RISC-V so that the guest
physical address space stays in sync with the host physical address
space.

This will allow swapping, page migration, and similar mechanisms to
work transparently with KVM RISC-V.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 2 +
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/kvm/mmu.c | 90 +++++++++++++++++++++++++++++--
arch/riscv/kvm/vm.c | 1 +
4 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index d2a7d299d67c..51fe663b5093 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -201,6 +201,8 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}

+#define KVM_ARCH_WANT_MMU_NOTIFIER
+
void __kvm_riscv_hfence_gvma_vmid_gpa(unsigned long gpa, unsigned long vmid);
void __kvm_riscv_hfence_gvma_vmid(unsigned long vmid);
void __kvm_riscv_hfence_gvma_gpa(unsigned long gpa);
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index 633063edaee8..a712bb910cda 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -20,6 +20,7 @@ if VIRTUALIZATION
config KVM
tristate "Kernel-based Virtual Machine (KVM) support (EXPERIMENTAL)"
depends on RISCV_SBI && MMU
+ select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select ANON_INODES
select KVM_MMIO
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index fcf9967f4b29..428bf8915a45 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -300,7 +300,8 @@ static void stage2_op_pte(struct kvm *kvm, gpa_t addr,
}
}

-static void stage2_unmap_range(struct kvm *kvm, gpa_t start, gpa_t size)
+static void stage2_unmap_range(struct kvm *kvm, gpa_t start,
+ gpa_t size, bool may_block)
{
int ret;
pte_t *ptep;
@@ -325,6 +326,13 @@ static void stage2_unmap_range(struct kvm *kvm, gpa_t start, gpa_t size)

next:
addr += page_size;
+
+ /*
+ * If the range is too large, release the kvm->mmu_lock
+ * to prevent starvation and lockup detector warnings.
+ */
+ if (may_block && addr < end)
+ cond_resched_lock(&kvm->mmu_lock);
}
}

@@ -405,7 +413,6 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
out:
stage2_cache_flush(&pcache);
return ret;
-
}

void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
@@ -547,7 +554,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
if (ret)
stage2_unmap_range(kvm, mem->guest_phys_addr,
- mem->memory_size);
+ mem->memory_size, false);
spin_unlock(&kvm->mmu_lock);

out:
@@ -555,6 +562,73 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return ret;
}

+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ if (!kvm->arch.pgd)
+ return 0;
+
+ stage2_unmap_range(kvm, range->start << PAGE_SHIFT,
+ (range->end - range->start) << PAGE_SHIFT,
+ range->may_block);
+ return 0;
+}
+
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ int ret;
+ kvm_pfn_t pfn = pte_pfn(range->pte);
+
+ if (!kvm->arch.pgd)
+ return 0;
+
+ WARN_ON(range->end - range->start != 1);
+
+ ret = stage2_map_page(kvm, NULL, range->start << PAGE_SHIFT,
+ __pfn_to_phys(pfn), PAGE_SIZE, true, true);
+ if (ret) {
+ kvm_err("Failed to map stage2 page (error %d)\n", ret);
+ return 1;
+ }
+
+ return 0;
+}
+
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ pte_t *ptep;
+ u32 ptep_level = 0;
+ u64 size = (range->end - range->start) << PAGE_SHIFT;
+
+ if (!kvm->arch.pgd)
+ return 0;
+
+ WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PGDIR_SIZE);
+
+ if (!stage2_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
+ &ptep, &ptep_level))
+ return 0;
+
+ return ptep_test_and_clear_young(NULL, 0, ptep);
+}
+
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ pte_t *ptep;
+ u32 ptep_level = 0;
+ u64 size = (range->end - range->start) << PAGE_SHIFT;
+
+ if (!kvm->arch.pgd)
+ return 0;
+
+ WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PGDIR_SIZE);
+
+ if (!stage2_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
+ &ptep, &ptep_level))
+ return 0;
+
+ return pte_young(*ptep);
+}
+
int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
gpa_t gpa, unsigned long hva, bool is_write)
@@ -569,7 +643,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
bool logging = (memslot->dirty_bitmap &&
!(memslot->flags & KVM_MEM_READONLY)) ? true : false;
- unsigned long vma_pagesize;
+ unsigned long vma_pagesize, mmu_seq;

mmap_read_lock(current->mm);

@@ -608,6 +682,8 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
return ret;
}

+ mmu_seq = kvm->mmu_notifier_seq;
+
hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
if (hfn == KVM_PFN_ERR_HWPOISON) {
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
@@ -626,6 +702,9 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,

spin_lock(&kvm->mmu_lock);

+ if (mmu_notifier_retry(kvm, mmu_seq))
+ goto out_unlock;
+
if (writeable) {
kvm_set_pfn_dirty(hfn);
mark_page_dirty(kvm, gfn);
@@ -639,6 +718,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
if (ret)
kvm_err("Failed to map in stage2\n");

+out_unlock:
spin_unlock(&kvm->mmu_lock);
kvm_set_pfn_accessed(hfn);
kvm_release_pfn_clean(hfn);
@@ -675,7 +755,7 @@ void kvm_riscv_stage2_free_pgd(struct kvm *kvm)

spin_lock(&kvm->mmu_lock);
if (kvm->arch.pgd) {
- stage2_unmap_range(kvm, 0UL, stage2_gpa_size);
+ stage2_unmap_range(kvm, 0UL, stage2_gpa_size, false);
pgd = READ_ONCE(kvm->arch.pgd);
kvm->arch.pgd = NULL;
kvm->arch.pgd_phys = 0;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 6cde69a82252..00a1a88008be 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -49,6 +49,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_IOEVENTFD:
case KVM_CAP_DEVICE_CTRL:
case KVM_CAP_USER_MEMORY:
+ case KVM_CAP_SYNC_MMU:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
case KVM_CAP_ONE_REG:
case KVM_CAP_READONLY_MEM:
--
2.25.1
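The hunk adding mmu_seq/mmu_notifier_retry() to kvm_riscv_stage2_map() closes a race: the sequence count is snapshotted before the host page is pinned, and the mapping is discarded under mmu_lock if a notifier invalidation ran in between. A toy model of that handshake is sketched below; the struct and helper names are stand-ins for illustration, not the real KVM API:

```c
/* Toy model of the mmu_notifier_seq handshake used by the fault path;
 * names here are illustrative stand-ins, not kernel symbols. */
struct toy_kvm {
	unsigned long mmu_notifier_seq;	/* bumped by every invalidation */
};

static int toy_notifier_retry(struct toy_kvm *kvm, unsigned long snap)
{
	/* Nonzero when an invalidation ran after the snapshot was taken. */
	return kvm->mmu_notifier_seq != snap;
}

/* Returns 0 when the PTE may be installed, -1 when the fault must be
 * retried because an invalidation raced with page pinning. */
static int toy_map_fault(struct toy_kvm *kvm, int invalidate_midway)
{
	unsigned long seq = kvm->mmu_notifier_seq;	/* snapshot first */

	if (invalidate_midway)
		kvm->mmu_notifier_seq++;	/* simulated concurrent unmap */

	/* ...the real code pins the host page here (gfn_to_pfn_prot)... */

	if (toy_notifier_retry(kvm, seq))
		return -1;
	return 0;
}
```

The design choice matches other KVM architectures: rather than holding mmu_lock across the (possibly sleeping) page pin, the fault path optimistically pins first and retries on a stale sequence count.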


2021-05-19 19:09:16

by Anup Patel

Subject: [PATCH v18 17/18] RISC-V: KVM: Move sources to drivers/staging directory

As per the Linux RISC-V patch acceptance policy, patches for unfrozen
specifications won't be accepted in the arch/riscv directory.

To unblock KVM RISC-V development, we move the KVM RISC-V sources to
the drivers/staging directory. Only the arch/riscv/include/uapi/asm/kvm.h
header remains in the arch/riscv directory because this KVM RISC-V
UAPI header is compliant with the ratified RISC-V privileged
specification and hence satisfies the Linux RISC-V patch acceptance
policy.

Signed-off-by: Anup Patel <[email protected]>
---
arch/riscv/Kconfig | 2 +-
arch/riscv/Makefile | 2 +-
{arch => drivers/staging}/riscv/kvm/Kconfig | 0
{arch => drivers/staging}/riscv/kvm/Makefile | 6 +++---
.../include => drivers/staging/riscv/kvm}/asm/kvm_csr.h | 0
.../include => drivers/staging/riscv/kvm}/asm/kvm_host.h | 0
.../include => drivers/staging/riscv/kvm}/asm/kvm_types.h | 0
.../staging/riscv/kvm}/asm/kvm_vcpu_timer.h | 0
{arch => drivers/staging}/riscv/kvm/main.c | 0
{arch => drivers/staging}/riscv/kvm/mmu.c | 0
{arch => drivers/staging}/riscv/kvm/riscv_offsets.c | 0
{arch => drivers/staging}/riscv/kvm/tlb.S | 0
{arch => drivers/staging}/riscv/kvm/vcpu.c | 0
{arch => drivers/staging}/riscv/kvm/vcpu_exit.c | 0
{arch => drivers/staging}/riscv/kvm/vcpu_sbi.c | 0
{arch => drivers/staging}/riscv/kvm/vcpu_switch.S | 0
{arch => drivers/staging}/riscv/kvm/vcpu_timer.c | 0
{arch => drivers/staging}/riscv/kvm/vm.c | 0
{arch => drivers/staging}/riscv/kvm/vmid.c | 0
19 files changed, 5 insertions(+), 5 deletions(-)
rename {arch => drivers/staging}/riscv/kvm/Kconfig (100%)
rename {arch => drivers/staging}/riscv/kvm/Makefile (69%)
rename {arch/riscv/include => drivers/staging/riscv/kvm}/asm/kvm_csr.h (100%)
rename {arch/riscv/include => drivers/staging/riscv/kvm}/asm/kvm_host.h (100%)
rename {arch/riscv/include => drivers/staging/riscv/kvm}/asm/kvm_types.h (100%)
rename {arch/riscv/include => drivers/staging/riscv/kvm}/asm/kvm_vcpu_timer.h (100%)
rename {arch => drivers/staging}/riscv/kvm/main.c (100%)
rename {arch => drivers/staging}/riscv/kvm/mmu.c (100%)
rename {arch => drivers/staging}/riscv/kvm/riscv_offsets.c (100%)
rename {arch => drivers/staging}/riscv/kvm/tlb.S (100%)
rename {arch => drivers/staging}/riscv/kvm/vcpu.c (100%)
rename {arch => drivers/staging}/riscv/kvm/vcpu_exit.c (100%)
rename {arch => drivers/staging}/riscv/kvm/vcpu_sbi.c (100%)
rename {arch => drivers/staging}/riscv/kvm/vcpu_switch.S (100%)
rename {arch => drivers/staging}/riscv/kvm/vcpu_timer.c (100%)
rename {arch => drivers/staging}/riscv/kvm/vm.c (100%)
rename {arch => drivers/staging}/riscv/kvm/vmid.c (100%)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index d0602ea394bc..e79a73ff86c0 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -555,5 +555,5 @@ source "kernel/power/Kconfig"

endmenu

-source "arch/riscv/kvm/Kconfig"
+source "drivers/staging/riscv/kvm/Kconfig"
source "drivers/firmware/Kconfig"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 05687d8b7b99..e8706c01733c 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -92,7 +92,7 @@ head-y := arch/riscv/kernel/head.o

core-y += arch/riscv/
core-$(CONFIG_RISCV_ERRATA_ALTERNATIVE) += arch/riscv/errata/
-core-$(CONFIG_KVM) += arch/riscv/kvm/
+core-$(CONFIG_KVM) += drivers/staging/riscv/kvm/

libs-y += arch/riscv/lib/
libs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/riscv/kvm/Kconfig b/drivers/staging/riscv/kvm/Kconfig
similarity index 100%
rename from arch/riscv/kvm/Kconfig
rename to drivers/staging/riscv/kvm/Kconfig
diff --git a/arch/riscv/kvm/Makefile b/drivers/staging/riscv/kvm/Makefile
similarity index 69%
rename from arch/riscv/kvm/Makefile
rename to drivers/staging/riscv/kvm/Makefile
index 938584254aad..3b876b6263e7 100644
--- a/arch/riscv/kvm/Makefile
+++ b/drivers/staging/riscv/kvm/Makefile
@@ -2,10 +2,10 @@
# Makefile for RISC-V KVM support
#

-common-objs-y = $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o)
-common-objs-y += $(addprefix ../../../virt/kvm/, eventfd.o)
+common-objs-y = $(addprefix ../../../../virt/kvm/, kvm_main.o coalesced_mmio.o)
+common-objs-y += $(addprefix ../../../../virt/kvm/, eventfd.o)

-ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm
+ccflags-y := -Ivirt/kvm -Idrivers/staging/riscv/kvm

kvm-objs := $(common-objs-y)

diff --git a/arch/riscv/include/asm/kvm_csr.h b/drivers/staging/riscv/kvm/asm/kvm_csr.h
similarity index 100%
rename from arch/riscv/include/asm/kvm_csr.h
rename to drivers/staging/riscv/kvm/asm/kvm_csr.h
diff --git a/arch/riscv/include/asm/kvm_host.h b/drivers/staging/riscv/kvm/asm/kvm_host.h
similarity index 100%
rename from arch/riscv/include/asm/kvm_host.h
rename to drivers/staging/riscv/kvm/asm/kvm_host.h
diff --git a/arch/riscv/include/asm/kvm_types.h b/drivers/staging/riscv/kvm/asm/kvm_types.h
similarity index 100%
rename from arch/riscv/include/asm/kvm_types.h
rename to drivers/staging/riscv/kvm/asm/kvm_types.h
diff --git a/arch/riscv/include/asm/kvm_vcpu_timer.h b/drivers/staging/riscv/kvm/asm/kvm_vcpu_timer.h
similarity index 100%
rename from arch/riscv/include/asm/kvm_vcpu_timer.h
rename to drivers/staging/riscv/kvm/asm/kvm_vcpu_timer.h
diff --git a/arch/riscv/kvm/main.c b/drivers/staging/riscv/kvm/main.c
similarity index 100%
rename from arch/riscv/kvm/main.c
rename to drivers/staging/riscv/kvm/main.c
diff --git a/arch/riscv/kvm/mmu.c b/drivers/staging/riscv/kvm/mmu.c
similarity index 100%
rename from arch/riscv/kvm/mmu.c
rename to drivers/staging/riscv/kvm/mmu.c
diff --git a/arch/riscv/kvm/riscv_offsets.c b/drivers/staging/riscv/kvm/riscv_offsets.c
similarity index 100%
rename from arch/riscv/kvm/riscv_offsets.c
rename to drivers/staging/riscv/kvm/riscv_offsets.c
diff --git a/arch/riscv/kvm/tlb.S b/drivers/staging/riscv/kvm/tlb.S
similarity index 100%
rename from arch/riscv/kvm/tlb.S
rename to drivers/staging/riscv/kvm/tlb.S
diff --git a/arch/riscv/kvm/vcpu.c b/drivers/staging/riscv/kvm/vcpu.c
similarity index 100%
rename from arch/riscv/kvm/vcpu.c
rename to drivers/staging/riscv/kvm/vcpu.c
diff --git a/arch/riscv/kvm/vcpu_exit.c b/drivers/staging/riscv/kvm/vcpu_exit.c
similarity index 100%
rename from arch/riscv/kvm/vcpu_exit.c
rename to drivers/staging/riscv/kvm/vcpu_exit.c
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/drivers/staging/riscv/kvm/vcpu_sbi.c
similarity index 100%
rename from arch/riscv/kvm/vcpu_sbi.c
rename to drivers/staging/riscv/kvm/vcpu_sbi.c
diff --git a/arch/riscv/kvm/vcpu_switch.S b/drivers/staging/riscv/kvm/vcpu_switch.S
similarity index 100%
rename from arch/riscv/kvm/vcpu_switch.S
rename to drivers/staging/riscv/kvm/vcpu_switch.S
diff --git a/arch/riscv/kvm/vcpu_timer.c b/drivers/staging/riscv/kvm/vcpu_timer.c
similarity index 100%
rename from arch/riscv/kvm/vcpu_timer.c
rename to drivers/staging/riscv/kvm/vcpu_timer.c
diff --git a/arch/riscv/kvm/vm.c b/drivers/staging/riscv/kvm/vm.c
similarity index 100%
rename from arch/riscv/kvm/vm.c
rename to drivers/staging/riscv/kvm/vm.c
diff --git a/arch/riscv/kvm/vmid.c b/drivers/staging/riscv/kvm/vmid.c
similarity index 100%
rename from arch/riscv/kvm/vmid.c
rename to drivers/staging/riscv/kvm/vmid.c
--
2.25.1


2021-05-19 19:09:19

by Anup Patel

Subject: [PATCH v18 18/18] RISC-V: KVM: Add MAINTAINERS entry

Add myself as maintainer for KVM RISC-V and Atish as designated reviewer.

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
MAINTAINERS | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 008fcad7ac00..8a54857a383c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10009,6 +10009,17 @@ F: arch/powerpc/include/uapi/asm/kvm*
F: arch/powerpc/kernel/kvm*
F: arch/powerpc/kvm/

+KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv)
+M: Anup Patel <[email protected]>
+R: Atish Patra <[email protected]>
+L: [email protected]
+L: [email protected]
+L: [email protected]
+S: Maintained
+T: git git://github.com/kvm-riscv/linux.git
+F: arch/riscv/include/uapi/asm/kvm*
+F: drivers/staging/riscv/kvm/
+
KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
M: Christian Borntraeger <[email protected]>
M: Janosch Frank <[email protected]>
--
2.25.1


2021-05-19 19:09:50

by Anup Patel

Subject: [PATCH v18 15/18] RISC-V: KVM: Add SBI v0.1 support

From: Atish Patra <[email protected]>

The KVM host kernel runs in HS-mode, so we need to handle the SBI calls
coming from the guest kernel running in VS-mode.

This patch adds SBI v0.1 support in KVM RISC-V. Almost all SBI v0.1
calls are implemented in the KVM kernel module, except the GETCHAR and
PUTCHAR calls, which are forwarded to user space because they cannot be
implemented in kernel space. In the future, when we implement SBI v0.2
for the Guest, we will forward SBI v0.2 experimental and vendor
extension calls to user space.
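The forwarding path can be sketched from the VMM side: user space sees a KVM_EXIT_RISCV_SBI exit, services the console call, and writes the result into ret[0] before re-entering the vCPU. The sketch below assumes the legacy SBI v0.1 extension IDs (CONSOLE_PUTCHAR = 0x01, CONSOLE_GETCHAR = 0x02) and a stand-in struct mirroring the kvm_run fields added by this patch:

```c
#include <stdint.h>
#include <stdio.h>

/* Legacy SBI v0.1 extension IDs (assumed here; see the SBI spec). */
#define SBI_EXT_0_1_CONSOLE_PUTCHAR	0x01
#define SBI_EXT_0_1_CONSOLE_GETCHAR	0x02

/* Stand-in for the run->riscv_sbi layout added by this patch. */
struct toy_riscv_sbi {
	uint64_t extension_id;
	uint64_t function_id;
	uint64_t args[6];
	uint64_t ret[2];
};

/* Returns 1 when the call was serviced, 0 for an unknown extension
 * (which a real VMM might report as an error to the guest). */
static int handle_sbi_exit(struct toy_riscv_sbi *sbi)
{
	switch (sbi->extension_id) {
	case SBI_EXT_0_1_CONSOLE_PUTCHAR:
		putchar((int)sbi->args[0]);	/* a0 carries the character */
		sbi->ret[0] = 0;
		return 1;
	case SBI_EXT_0_1_CONSOLE_GETCHAR:
		sbi->ret[0] = (uint64_t)getchar();
		return 1;
	default:
		return 0;
	}
}
```

After this handler runs, the VMM calls KVM_RUN again; kvm_riscv_vcpu_sbi_return() then copies ret[0]/ret[1] back into the guest's a0/a1 and advances sepc past the ecall.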

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 10 ++
arch/riscv/kvm/Makefile | 2 +-
arch/riscv/kvm/vcpu.c | 9 ++
arch/riscv/kvm/vcpu_exit.c | 4 +
arch/riscv/kvm/vcpu_sbi.c | 173 ++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 8 ++
6 files changed, 205 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/kvm/vcpu_sbi.c

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 834c6986cc2d..29cbdccfa65d 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -79,6 +79,10 @@ struct kvm_mmio_decode {
int return_handled;
};

+struct kvm_sbi_context {
+ int return_handled;
+};
+
#define KVM_MMU_PAGE_CACHE_NR_OBJS 32

struct kvm_mmu_page_cache {
@@ -191,6 +195,9 @@ struct kvm_vcpu_arch {
/* MMIO instruction details */
struct kvm_mmio_decode mmio_decode;

+ /* SBI context */
+ struct kvm_sbi_context sbi_context;
+
/* Cache pages needed to program page tables with spinlock held */
struct kvm_mmu_page_cache mmu_page_cache;

@@ -258,4 +265,7 @@ bool kvm_riscv_vcpu_has_interrupts(struct kvm_vcpu *vcpu, unsigned long mask);
void kvm_riscv_vcpu_power_off(struct kvm_vcpu *vcpu);
void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);

+int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
+
#endif /* __RISCV_KVM_HOST_H__ */
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 4f90443ab1ef..938584254aad 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -10,7 +10,7 @@ ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm
kvm-objs := $(common-objs-y)

kvm-objs += main.o vm.o vmid.o tlb.o mmu.o
-kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o vcpu_timer.o
+kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o vcpu_timer.o vcpu_sbi.o

obj-$(CONFIG_KVM) += kvm.o

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 7119158b370f..fe028d977745 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -867,6 +867,15 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
}
}

+ /* Process SBI value returned from user-space */
+ if (run->exit_reason == KVM_EXIT_RISCV_SBI) {
+ ret = kvm_riscv_vcpu_sbi_return(vcpu, vcpu->run);
+ if (ret) {
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
+ return ret;
+ }
+ }
+
if (run->immediate_exit) {
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
return -EINTR;
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 34d9bd9da585..6a97db14b7b2 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -678,6 +678,10 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
ret = stage2_page_fault(vcpu, run, trap);
break;
+ case EXC_SUPERVISOR_SYSCALL:
+ if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
+ ret = kvm_riscv_vcpu_sbi_ecall(vcpu, run);
+ break;
default:
break;
};
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
new file mode 100644
index 000000000000..a5f7da5f33c8
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019 Western Digital Corporation or its affiliates.
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_csr.h>
+#include <asm/sbi.h>
+#include <asm/kvm_vcpu_timer.h>
+
+#define SBI_VERSION_MAJOR 0
+#define SBI_VERSION_MINOR 1
+
+static void kvm_sbi_system_shutdown(struct kvm_vcpu *vcpu,
+ struct kvm_run *run, u32 type)
+{
+ int i;
+ struct kvm_vcpu *tmp;
+
+ kvm_for_each_vcpu(i, tmp, vcpu->kvm)
+ tmp->arch.power_off = true;
+ kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP);
+
+ memset(&run->system_event, 0, sizeof(run->system_event));
+ run->system_event.type = type;
+ run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+}
+
+static void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu,
+ struct kvm_run *run)
+{
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+
+ vcpu->arch.sbi_context.return_handled = 0;
+ vcpu->stat.ecall_exit_stat++;
+ run->exit_reason = KVM_EXIT_RISCV_SBI;
+ run->riscv_sbi.extension_id = cp->a7;
+ run->riscv_sbi.function_id = cp->a6;
+ run->riscv_sbi.args[0] = cp->a0;
+ run->riscv_sbi.args[1] = cp->a1;
+ run->riscv_sbi.args[2] = cp->a2;
+ run->riscv_sbi.args[3] = cp->a3;
+ run->riscv_sbi.args[4] = cp->a4;
+ run->riscv_sbi.args[5] = cp->a5;
+ run->riscv_sbi.ret[0] = cp->a0;
+ run->riscv_sbi.ret[1] = cp->a1;
+}
+
+int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+
+ /* Handle SBI return only once */
+ if (vcpu->arch.sbi_context.return_handled)
+ return 0;
+ vcpu->arch.sbi_context.return_handled = 1;
+
+ /* Update return values */
+ cp->a0 = run->riscv_sbi.ret[0];
+ cp->a1 = run->riscv_sbi.ret[1];
+
+ /* Move to next instruction */
+ vcpu->arch.guest_context.sepc += 4;
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ ulong hmask;
+ int i, ret = 1;
+ u64 next_cycle;
+ struct kvm_vcpu *rvcpu;
+ bool next_sepc = true;
+ struct cpumask cm, hm;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cpu_trap utrap = { 0 };
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+
+ if (!cp)
+ return -EINVAL;
+
+ switch (cp->a7) {
+ case SBI_EXT_0_1_CONSOLE_GETCHAR:
+ case SBI_EXT_0_1_CONSOLE_PUTCHAR:
+ /*
+ * The CONSOLE_GETCHAR/CONSOLE_PUTCHAR SBI calls cannot be
+ * handled in the kernel, so we forward them to user-space.
+ */
+ kvm_riscv_vcpu_sbi_forward(vcpu, run);
+ next_sepc = false;
+ ret = 0;
+ break;
+ case SBI_EXT_0_1_SET_TIMER:
+#if __riscv_xlen == 32
+ next_cycle = ((u64)cp->a1 << 32) | (u64)cp->a0;
+#else
+ next_cycle = (u64)cp->a0;
+#endif
+ kvm_riscv_vcpu_timer_next_event(vcpu, next_cycle);
+ break;
+ case SBI_EXT_0_1_CLEAR_IPI:
+ kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_SOFT);
+ break;
+ case SBI_EXT_0_1_SEND_IPI:
+ if (cp->a0)
+ hmask = kvm_riscv_vcpu_unpriv_read(vcpu, false, cp->a0,
+ &utrap);
+ else
+ hmask = (1UL << atomic_read(&kvm->online_vcpus)) - 1;
+ if (utrap.scause) {
+ utrap.sepc = cp->sepc;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+ next_sepc = false;
+ break;
+ }
+ for_each_set_bit(i, &hmask, BITS_PER_LONG) {
+ rvcpu = kvm_get_vcpu_by_id(vcpu->kvm, i);
+ kvm_riscv_vcpu_set_interrupt(rvcpu, IRQ_VS_SOFT);
+ }
+ break;
+ case SBI_EXT_0_1_SHUTDOWN:
+ kvm_sbi_system_shutdown(vcpu, run, KVM_SYSTEM_EVENT_SHUTDOWN);
+ next_sepc = false;
+ ret = 0;
+ break;
+ case SBI_EXT_0_1_REMOTE_FENCE_I:
+ case SBI_EXT_0_1_REMOTE_SFENCE_VMA:
+ case SBI_EXT_0_1_REMOTE_SFENCE_VMA_ASID:
+ if (cp->a0)
+ hmask = kvm_riscv_vcpu_unpriv_read(vcpu, false, cp->a0,
+ &utrap);
+ else
+ hmask = (1UL << atomic_read(&kvm->online_vcpus)) - 1;
+ if (utrap.scause) {
+ utrap.sepc = cp->sepc;
+ kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+ next_sepc = false;
+ break;
+ }
+ cpumask_clear(&cm);
+ for_each_set_bit(i, &hmask, BITS_PER_LONG) {
+ rvcpu = kvm_get_vcpu_by_id(vcpu->kvm, i);
+ if (rvcpu->cpu < 0)
+ continue;
+ cpumask_set_cpu(rvcpu->cpu, &cm);
+ }
+ riscv_cpuid_to_hartid_mask(&cm, &hm);
+ if (cp->a7 == SBI_EXT_0_1_REMOTE_FENCE_I)
+ sbi_remote_fence_i(cpumask_bits(&hm));
+ else if (cp->a7 == SBI_EXT_0_1_REMOTE_SFENCE_VMA)
+ sbi_remote_hfence_vvma(cpumask_bits(&hm),
+ cp->a1, cp->a2);
+ else
+ sbi_remote_hfence_vvma_asid(cpumask_bits(&hm),
+ cp->a1, cp->a2, cp->a3);
+ break;
+ default:
+ /* Return error for unsupported SBI calls */
+ cp->a0 = SBI_ERR_NOT_SUPPORTED;
+ break;
+ };
+
+ if (next_sepc)
+ cp->sepc += 4;
+
+ return ret;
+}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..ed5fd5863361 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -268,6 +268,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_AP_RESET_HOLD 32
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
+#define KVM_EXIT_RISCV_SBI 35

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -446,6 +447,13 @@ struct kvm_run {
} msr;
/* KVM_EXIT_XEN */
struct kvm_xen_exit xen;
+ /* KVM_EXIT_RISCV_SBI */
+ struct {
+ unsigned long extension_id;
+ unsigned long function_id;
+ unsigned long args[6];
+ unsigned long ret[2];
+ } riscv_sbi;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1


2021-05-19 19:10:11

by Anup Patel

[permalink] [raw]
Subject: [PATCH v18 04/18] RISC-V: KVM: Implement VCPU interrupts and requests handling

This patch implements VCPU interrupts and requests which are both
asynchronous events.

The VCPU interrupts can be set/unset using the KVM_INTERRUPT ioctl from
user-space. In the future, the in-kernel IRQCHIP emulation will use the
kvm_riscv_vcpu_set_interrupt() and kvm_riscv_vcpu_unset_interrupt()
functions to set/unset VCPU interrupts.

Important VCPU requests implemented by this patch are:
KVM_REQ_SLEEP - set whenever VCPU itself goes to sleep state
KVM_REQ_VCPU_RESET - set whenever VCPU reset is requested

The WFI trap-n-emulate (added later) will use the KVM_REQ_SLEEP request
and the kvm_riscv_vcpu_has_interrupts() function.

The KVM_REQ_VCPU_RESET request will be used by SBI emulation (added
later) to power up a VCPU that is in power-off state. User-space can use
the KVM_GET_MP_STATE/KVM_SET_MP_STATE ioctls to get/set the power state
of a VCPU.

Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 23 ++++
arch/riscv/include/uapi/asm/kvm.h | 3 +
arch/riscv/kvm/vcpu.c | 182 +++++++++++++++++++++++++++---
3 files changed, 195 insertions(+), 13 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index cf2a23bbd560..5e1c3140e49d 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -132,6 +132,21 @@ struct kvm_vcpu_arch {
/* CPU CSR context upon Guest VCPU reset */
struct kvm_vcpu_csr guest_reset_csr;

+ /*
+ * VCPU interrupts
+ *
+ * We have a lockless approach for tracking pending VCPU interrupts
+ * implemented using atomic bitops. The irqs_pending bitmap represent
+ * pending interrupts whereas irqs_pending_mask represent bits changed
+ * in irqs_pending. Our approach is modeled around multiple producer
+ * and single consumer problem where the consumer is the VCPU itself.
+ */
+ unsigned long irqs_pending;
+ unsigned long irqs_pending_mask;
+
+ /* VCPU power-off state */
+ bool power_off;
+
/* Don't run the VCPU (blocked) */
bool pause;

@@ -155,4 +170,12 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,

static inline void __kvm_riscv_switch_to(struct kvm_vcpu_arch *vcpu_arch) {}

+int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
+int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
+void kvm_riscv_vcpu_flush_interrupts(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_sync_interrupts(struct kvm_vcpu *vcpu);
+bool kvm_riscv_vcpu_has_interrupts(struct kvm_vcpu *vcpu, unsigned long mask);
+void kvm_riscv_vcpu_power_off(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);
+
#endif /* __RISCV_KVM_HOST_H__ */
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index 984d041a3e3b..3d3d703713c6 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -18,6 +18,9 @@

#define KVM_COALESCED_MMIO_PAGE_OFFSET 1

+#define KVM_INTERRUPT_SET -1U
+#define KVM_INTERRUPT_UNSET -2U
+
/* for KVM_GET_REGS and KVM_SET_REGS */
struct kvm_regs {
};
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 904d908a7544..1c3c3bd72df9 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -11,6 +11,7 @@
#include <linux/err.h>
#include <linux/kdebug.h>
#include <linux/module.h>
+#include <linux/percpu.h>
#include <linux/uaccess.h>
#include <linux/vmalloc.h>
#include <linux/sched/signal.h>
@@ -54,6 +55,9 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
memcpy(csr, reset_csr, sizeof(*csr));

memcpy(cntx, reset_cntx, sizeof(*cntx));
+
+ WRITE_ONCE(vcpu->arch.irqs_pending, 0);
+ WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0);
}

int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
@@ -97,8 +101,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)

int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
{
- /* TODO: */
- return 0;
+ return kvm_riscv_vcpu_has_interrupts(vcpu, 1UL << IRQ_VS_TIMER);
}

void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
@@ -111,20 +114,18 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
- /* TODO: */
- return 0;
+ return (kvm_riscv_vcpu_has_interrupts(vcpu, -1UL) &&
+ !vcpu->arch.power_off && !vcpu->arch.pause);
}

int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
{
- /* TODO: */
- return 0;
+ return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
}

bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
- /* TODO: */
- return false;
+ return (vcpu->arch.guest_context.sstatus & SR_SPP) ? true : false;
}

vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
@@ -135,7 +136,21 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
long kvm_arch_vcpu_async_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
- /* TODO; */
+ struct kvm_vcpu *vcpu = filp->private_data;
+ void __user *argp = (void __user *)arg;
+
+ if (ioctl == KVM_INTERRUPT) {
+ struct kvm_interrupt irq;
+
+ if (copy_from_user(&irq, argp, sizeof(irq)))
+ return -EFAULT;
+
+ if (irq.irq == KVM_INTERRUPT_SET)
+ return kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_EXT);
+ else
+ return kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_EXT);
+ }
+
return -ENOIOCTLCMD;
}

@@ -184,18 +199,121 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
return -EINVAL;
}

+void kvm_riscv_vcpu_flush_interrupts(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ unsigned long mask, val;
+
+ if (READ_ONCE(vcpu->arch.irqs_pending_mask)) {
+ mask = xchg_acquire(&vcpu->arch.irqs_pending_mask, 0);
+ val = READ_ONCE(vcpu->arch.irqs_pending) & mask;
+
+ csr->hvip &= ~mask;
+ csr->hvip |= val;
+ }
+}
+
+void kvm_riscv_vcpu_sync_interrupts(struct kvm_vcpu *vcpu)
+{
+ unsigned long hvip;
+ struct kvm_vcpu_arch *v = &vcpu->arch;
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+
+ /* Read current HVIP and HIE CSRs */
+ hvip = csr_read(CSR_HVIP);
+ csr->hie = csr_read(CSR_HIE);
+
+ /* Sync-up HVIP.VSSIP bit changes done by Guest */
+ if ((csr->hvip ^ hvip) & (1UL << IRQ_VS_SOFT)) {
+ if (hvip & (1UL << IRQ_VS_SOFT)) {
+ if (!test_and_set_bit(IRQ_VS_SOFT,
+ &v->irqs_pending_mask))
+ set_bit(IRQ_VS_SOFT, &v->irqs_pending);
+ } else {
+ if (!test_and_set_bit(IRQ_VS_SOFT,
+ &v->irqs_pending_mask))
+ clear_bit(IRQ_VS_SOFT, &v->irqs_pending);
+ }
+ }
+}
+
+int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
+{
+ if (irq != IRQ_VS_SOFT &&
+ irq != IRQ_VS_TIMER &&
+ irq != IRQ_VS_EXT)
+ return -EINVAL;
+
+ set_bit(irq, &vcpu->arch.irqs_pending);
+ smp_mb__before_atomic();
+ set_bit(irq, &vcpu->arch.irqs_pending_mask);
+
+ kvm_vcpu_kick(vcpu);
+
+ return 0;
+}
+
+int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq)
+{
+ if (irq != IRQ_VS_SOFT &&
+ irq != IRQ_VS_TIMER &&
+ irq != IRQ_VS_EXT)
+ return -EINVAL;
+
+ clear_bit(irq, &vcpu->arch.irqs_pending);
+ smp_mb__before_atomic();
+ set_bit(irq, &vcpu->arch.irqs_pending_mask);
+
+ return 0;
+}
+
+bool kvm_riscv_vcpu_has_interrupts(struct kvm_vcpu *vcpu, unsigned long mask)
+{
+ return (READ_ONCE(vcpu->arch.irqs_pending) &
+ vcpu->arch.guest_csr.hie & mask) ? true : false;
+}
+
+void kvm_riscv_vcpu_power_off(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.power_off = true;
+ kvm_make_request(KVM_REQ_SLEEP, vcpu);
+ kvm_vcpu_kick(vcpu);
+}
+
+void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.power_off = false;
+ kvm_vcpu_wake_up(vcpu);
+}
+
int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
{
- /* TODO: */
+ if (vcpu->arch.power_off)
+ mp_state->mp_state = KVM_MP_STATE_STOPPED;
+ else
+ mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
+
return 0;
}

int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
{
- /* TODO: */
- return 0;
+ int ret = 0;
+
+ switch (mp_state->mp_state) {
+ case KVM_MP_STATE_RUNNABLE:
+ vcpu->arch.power_off = false;
+ break;
+ case KVM_MP_STATE_STOPPED:
+ kvm_riscv_vcpu_power_off(vcpu);
+ break;
+ default:
+ ret = -EINVAL;
+ }
+
+ return ret;
}

int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
@@ -219,7 +337,33 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)

static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
+
+ if (kvm_request_pending(vcpu)) {
+ if (kvm_check_request(KVM_REQ_SLEEP, vcpu)) {
+ rcuwait_wait_event(wait,
+ (!vcpu->arch.power_off) && (!vcpu->arch.pause),
+ TASK_INTERRUPTIBLE);
+
+ if (vcpu->arch.power_off || vcpu->arch.pause) {
+ /*
+ * Awaken to handle a signal, request to
+ * sleep again later.
+ */
+ kvm_make_request(KVM_REQ_SLEEP, vcpu);
+ }
+ }
+
+ if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
+ kvm_riscv_reset_vcpu(vcpu);
+ }
+}
+
+static void kvm_riscv_update_hvip(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+
+ csr_write(CSR_HVIP, csr->hvip);
}

int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
@@ -283,6 +427,15 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
smp_mb__after_srcu_read_unlock();

+ /*
+ * We might have got VCPU interrupts updated asynchronously
+ * so update it in HW.
+ */
+ kvm_riscv_vcpu_flush_interrupts(vcpu);
+
+ /* Update HVIP CSR for current CPU */
+ kvm_riscv_update_hvip(vcpu);
+
if (ret <= 0 ||
kvm_request_pending(vcpu)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
@@ -310,6 +463,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
trap.htval = csr_read(CSR_HTVAL);
trap.htinst = csr_read(CSR_HTINST);

+ /* Sync up interrupt state with HW */
+ kvm_riscv_vcpu_sync_interrupts(vcpu);
+
/*
* We may have taken a host interrupt in VS/VU-mode (i.e.
* while executing the guest). This interrupt is still
--
2.25.1


2021-05-19 19:10:20

by Anup Patel

[permalink] [raw]
Subject: [PATCH v18 14/18] RISC-V: KVM: Implement ONE REG interface for FP registers

From: Atish Patra <[email protected]>

Add a KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctl interface for floating
point registers such as F0-F31 and FCSR. This support is added for
both 'F' and 'D' extensions.

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/uapi/asm/kvm.h | 10 +++
arch/riscv/kvm/vcpu.c | 104 ++++++++++++++++++++++++++++++
2 files changed, 114 insertions(+)

diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index 08691dd27bcf..f808ad1ce500 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -113,6 +113,16 @@ struct kvm_riscv_timer {
#define KVM_REG_RISCV_TIMER_REG(name) \
(offsetof(struct kvm_riscv_timer, name) / sizeof(__u64))

+/* F extension registers are mapped as type 5 */
+#define KVM_REG_RISCV_FP_F (0x05 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_FP_F_REG(name) \
+ (offsetof(struct __riscv_f_ext_state, name) / sizeof(__u32))
+
+/* D extension registers are mapped as type 6 */
+#define KVM_REG_RISCV_FP_D (0x06 << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_FP_D_REG(name) \
+ (offsetof(struct __riscv_d_ext_state, name) / sizeof(__u64))
+
#endif

#endif /* __LINUX_KVM_RISCV_H */
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index f2f2321507e6..7119158b370f 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -416,6 +416,98 @@ static int kvm_riscv_vcpu_set_reg_csr(struct kvm_vcpu *vcpu,
return 0;
}

+static int kvm_riscv_vcpu_get_reg_fp(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg,
+ unsigned long rtype)
+{
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ unsigned long isa = vcpu->arch.isa;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ rtype);
+ void *reg_val;
+
+ if ((rtype == KVM_REG_RISCV_FP_F) &&
+ riscv_isa_extension_available(&isa, f)) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u32))
+ return -EINVAL;
+ if (reg_num == KVM_REG_RISCV_FP_F_REG(fcsr))
+ reg_val = &cntx->fp.f.fcsr;
+ else if ((KVM_REG_RISCV_FP_F_REG(f[0]) <= reg_num) &&
+ reg_num <= KVM_REG_RISCV_FP_F_REG(f[31]))
+ reg_val = &cntx->fp.f.f[reg_num];
+ else
+ return -EINVAL;
+ } else if ((rtype == KVM_REG_RISCV_FP_D) &&
+ riscv_isa_extension_available(&isa, d)) {
+ if (reg_num == KVM_REG_RISCV_FP_D_REG(fcsr)) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u32))
+ return -EINVAL;
+ reg_val = &cntx->fp.d.fcsr;
+ } else if ((KVM_REG_RISCV_FP_D_REG(f[0]) <= reg_num) &&
+ reg_num <= KVM_REG_RISCV_FP_D_REG(f[31])) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u64))
+ return -EINVAL;
+ reg_val = &cntx->fp.d.f[reg_num];
+ } else
+ return -EINVAL;
+ } else
+ return -EINVAL;
+
+ if (copy_to_user(uaddr, reg_val, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_riscv_vcpu_set_reg_fp(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg,
+ unsigned long rtype)
+{
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ unsigned long isa = vcpu->arch.isa;
+ unsigned long __user *uaddr =
+ (unsigned long __user *)(unsigned long)reg->addr;
+ unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK |
+ rtype);
+ void *reg_val;
+
+ if ((rtype == KVM_REG_RISCV_FP_F) &&
+ riscv_isa_extension_available(&isa, f)) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u32))
+ return -EINVAL;
+ if (reg_num == KVM_REG_RISCV_FP_F_REG(fcsr))
+ reg_val = &cntx->fp.f.fcsr;
+ else if ((KVM_REG_RISCV_FP_F_REG(f[0]) <= reg_num) &&
+ reg_num <= KVM_REG_RISCV_FP_F_REG(f[31]))
+ reg_val = &cntx->fp.f.f[reg_num];
+ else
+ return -EINVAL;
+ } else if ((rtype == KVM_REG_RISCV_FP_D) &&
+ riscv_isa_extension_available(&isa, d)) {
+ if (reg_num == KVM_REG_RISCV_FP_D_REG(fcsr)) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u32))
+ return -EINVAL;
+ reg_val = &cntx->fp.d.fcsr;
+ } else if ((KVM_REG_RISCV_FP_D_REG(f[0]) <= reg_num) &&
+ reg_num <= KVM_REG_RISCV_FP_D_REG(f[31])) {
+ if (KVM_REG_SIZE(reg->id) != sizeof(u64))
+ return -EINVAL;
+ reg_val = &cntx->fp.d.f[reg_num];
+ } else
+ return -EINVAL;
+ } else
+ return -EINVAL;
+
+ if (copy_from_user(reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+ return -EFAULT;
+
+ return 0;
+}
+
static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
const struct kvm_one_reg *reg)
{
@@ -427,6 +519,12 @@ static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
return kvm_riscv_vcpu_set_reg_csr(vcpu, reg);
else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
return kvm_riscv_vcpu_set_reg_timer(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_F)
+ return kvm_riscv_vcpu_set_reg_fp(vcpu, reg,
+ KVM_REG_RISCV_FP_F);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D)
+ return kvm_riscv_vcpu_set_reg_fp(vcpu, reg,
+ KVM_REG_RISCV_FP_D);

return -EINVAL;
}
@@ -442,6 +540,12 @@ static int kvm_riscv_vcpu_get_reg(struct kvm_vcpu *vcpu,
return kvm_riscv_vcpu_get_reg_csr(vcpu, reg);
else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
return kvm_riscv_vcpu_get_reg_timer(vcpu, reg);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_F)
+ return kvm_riscv_vcpu_get_reg_fp(vcpu, reg,
+ KVM_REG_RISCV_FP_F);
+ else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D)
+ return kvm_riscv_vcpu_get_reg_fp(vcpu, reg,
+ KVM_REG_RISCV_FP_D);

return -EINVAL;
}
--
2.25.1


2021-05-19 19:11:16

by Anup Patel

[permalink] [raw]
Subject: [PATCH v18 16/18] RISC-V: KVM: Document RISC-V specific parts of KVM API

Document RISC-V specific parts of the KVM API, such as:
- The interrupt numbers passed to the KVM_INTERRUPT ioctl.
- The states supported by the KVM_{GET,SET}_MP_STATE ioctls.
- The registers supported by the KVM_{GET,SET}_ONE_REG interface
and the encoding of those register ids.
- The exit reason KVM_EXIT_RISCV_SBI for SBI calls forwarded to
the user-space tool.

CC: Jonathan Corbet <[email protected]>
CC: [email protected]
Signed-off-by: Anup Patel <[email protected]>
---
Documentation/virt/kvm/api.rst | 193 +++++++++++++++++++++++++++++++--
1 file changed, 184 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..642f858d3605 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -532,7 +532,7 @@ translation mode.
------------------

:Capability: basic
-:Architectures: x86, ppc, mips
+:Architectures: x86, ppc, mips, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_interrupt (in)
:Returns: 0 on success, negative on failure.
@@ -601,6 +601,23 @@ interrupt number dequeues the interrupt.

This is an asynchronous vcpu ioctl and can be invoked from any thread.

+RISC-V:
+^^^^^^^
+
+Queues an external interrupt to be injected into the virtual CPU. This ioctl
+is overloaded with 2 different irq values:
+
+a) KVM_INTERRUPT_SET
+
+ This sets an external interrupt for a virtual CPU, which will be
+ delivered once the VCPU is ready to receive it.
+
+b) KVM_INTERRUPT_UNSET
+
+ This clears the pending external interrupt for a virtual CPU.
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+

4.17 KVM_DEBUG_GUEST
--------------------
@@ -1394,7 +1411,7 @@ for vm-wide capabilities.
---------------------

:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm, arm64
+:Architectures: x86, s390, arm, arm64, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (out)
:Returns: 0 on success; -1 on error
@@ -1411,7 +1428,8 @@ uniprocessor guests).
Possible values are:

========================== ===============================================
- KVM_MP_STATE_RUNNABLE the vcpu is currently running [x86,arm/arm64]
+ KVM_MP_STATE_RUNNABLE the vcpu is currently running
+ [x86,arm/arm64,riscv]
KVM_MP_STATE_UNINITIALIZED the vcpu is an application processor (AP)
which has not yet received an INIT signal [x86]
KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is
@@ -1420,7 +1438,7 @@ Possible values are:
is waiting for an interrupt [x86]
KVM_MP_STATE_SIPI_RECEIVED the vcpu has just received a SIPI (vector
accessible via KVM_GET_VCPU_EVENTS) [x86]
- KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64]
+ KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64,riscv]
KVM_MP_STATE_CHECK_STOP the vcpu is in a special error state [s390]
KVM_MP_STATE_OPERATING the vcpu is operating (running or halted)
[s390]
@@ -1432,8 +1450,8 @@ On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
in-kernel irqchip, the multiprocessing state must be maintained by userspace on
these architectures.

-For arm/arm64:
-^^^^^^^^^^^^^^
+For arm/arm64/riscv:
+^^^^^^^^^^^^^^^^^^^^

The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
@@ -1442,7 +1460,7 @@ KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
---------------------

:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm, arm64
+:Architectures: x86, s390, arm, arm64, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (in)
:Returns: 0 on success; -1 on error
@@ -1454,8 +1472,8 @@ On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
in-kernel irqchip, the multiprocessing state must be maintained by userspace on
these architectures.

-For arm/arm64:
-^^^^^^^^^^^^^^
+For arm/arm64/riscv:
+^^^^^^^^^^^^^^^^^^^^

The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
@@ -2572,6 +2590,144 @@ following id bit patterns::

0x7020 0000 0003 02 <0:3> <reg:5>

+RISC-V registers are mapped using the lower 32 bits. The upper 8 bits of
+that are the register group type.
+
+RISC-V config registers are meant for configuring a Guest VCPU and have
+the following id bit patterns::
+
+ 0x8020 0000 01 <index into the kvm_riscv_config struct:24> (32bit Host)
+ 0x8030 0000 01 <index into the kvm_riscv_config struct:24> (64bit Host)
+
+Following are the RISC-V config registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0100 0000 isa ISA feature bitmap of Guest VCPU
+======================= ========= =============================================
+
+The isa config register can be read anytime but can only be written before
+a Guest VCPU runs. By default, it contains the ISA feature bits matching
+the underlying host.
+
+RISC-V core registers represent the general execution state of a Guest VCPU
+and have the following id bit patterns::
+
+ 0x8020 0000 02 <index into the kvm_riscv_core struct:24> (32bit Host)
+ 0x8030 0000 02 <index into the kvm_riscv_core struct:24> (64bit Host)
+
+Following are the RISC-V core registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0200 0000 regs.pc Program counter
+ 0x80x0 0000 0200 0001 regs.ra Return address
+ 0x80x0 0000 0200 0002 regs.sp Stack pointer
+ 0x80x0 0000 0200 0003 regs.gp Global pointer
+ 0x80x0 0000 0200 0004 regs.tp Task pointer
+ 0x80x0 0000 0200 0005 regs.t0 Caller saved register 0
+ 0x80x0 0000 0200 0006 regs.t1 Caller saved register 1
+ 0x80x0 0000 0200 0007 regs.t2 Caller saved register 2
+ 0x80x0 0000 0200 0008 regs.s0 Callee saved register 0
+ 0x80x0 0000 0200 0009 regs.s1 Callee saved register 1
+ 0x80x0 0000 0200 000a regs.a0 Function argument (or return value) 0
+ 0x80x0 0000 0200 000b regs.a1 Function argument (or return value) 1
+ 0x80x0 0000 0200 000c regs.a2 Function argument 2
+ 0x80x0 0000 0200 000d regs.a3 Function argument 3
+ 0x80x0 0000 0200 000e regs.a4 Function argument 4
+ 0x80x0 0000 0200 000f regs.a5 Function argument 5
+ 0x80x0 0000 0200 0010 regs.a6 Function argument 6
+ 0x80x0 0000 0200 0011 regs.a7 Function argument 7
+ 0x80x0 0000 0200 0012 regs.s2 Callee saved register 2
+ 0x80x0 0000 0200 0013 regs.s3 Callee saved register 3
+ 0x80x0 0000 0200 0014 regs.s4 Callee saved register 4
+ 0x80x0 0000 0200 0015 regs.s5 Callee saved register 5
+ 0x80x0 0000 0200 0016 regs.s6 Callee saved register 6
+ 0x80x0 0000 0200 0017 regs.s7 Callee saved register 7
+ 0x80x0 0000 0200 0018 regs.s8 Callee saved register 8
+ 0x80x0 0000 0200 0019 regs.s9 Callee saved register 9
+ 0x80x0 0000 0200 001a regs.s10 Callee saved register 10
+ 0x80x0 0000 0200 001b regs.s11 Callee saved register 11
+ 0x80x0 0000 0200 001c regs.t3 Caller saved register 3
+ 0x80x0 0000 0200 001d regs.t4 Caller saved register 4
+ 0x80x0 0000 0200 001e regs.t5 Caller saved register 5
+ 0x80x0 0000 0200 001f regs.t6 Caller saved register 6
+ 0x80x0 0000 0200 0020 mode Privilege mode (1 = S-mode or 0 = U-mode)
+======================= ========= =============================================
+
+RISC-V csr registers represent the supervisor mode control/status registers
+of a Guest VCPU and have the following id bit patterns::
+
+ 0x8020 0000 03 <index into the kvm_riscv_csr struct:24> (32bit Host)
+ 0x8030 0000 03 <index into the kvm_riscv_csr struct:24> (64bit Host)
+
+Following are the RISC-V csr registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0300 0000 sstatus Supervisor status
+ 0x80x0 0000 0300 0001 sie Supervisor interrupt enable
+ 0x80x0 0000 0300 0002 stvec Supervisor trap vector base
+ 0x80x0 0000 0300 0003 sscratch Supervisor scratch register
+ 0x80x0 0000 0300 0004 sepc Supervisor exception program counter
+ 0x80x0 0000 0300 0005 scause Supervisor trap cause
+ 0x80x0 0000 0300 0006 stval Supervisor bad address or instruction
+ 0x80x0 0000 0300 0007 sip Supervisor interrupt pending
+ 0x80x0 0000 0300 0008 satp Supervisor address translation and protection
+======================= ========= =============================================
+
+RISC-V timer registers represent the timer state of a Guest VCPU and have
+the following id bit patterns::
+
+ 0x8030 0000 04 <index into the kvm_riscv_timer struct:24>
+
+Following are the RISC-V timer registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8030 0000 0400 0000 frequency Time base frequency (read-only)
+ 0x8030 0000 0400 0001 time Time value visible to Guest
+ 0x8030 0000 0400 0002 compare Time compare programmed by Guest
+ 0x8030 0000 0400 0003 state Time compare state (1 = ON or 0 = OFF)
+======================= ========= =============================================
+
+RISC-V F-extension registers represent the single-precision floating-point
+state of a Guest VCPU and have the following id bit patterns::
+
+ 0x8020 0000 05 <index into the __riscv_f_ext_state struct:24>
+
+Following are the RISC-V F-extension registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8020 0000 0500 0000 f[0] Floating point register 0
+ ...
+ 0x8020 0000 0500 001f f[31] Floating point register 31
+ 0x8020 0000 0500 0020 fcsr Floating point control and status register
+======================= ========= =============================================
+
+RISC-V D-extension registers represent the double-precision floating-point
+state of a Guest VCPU and have the following id bit patterns::
+
+ 0x8020 0000 06 <index into the __riscv_d_ext_state struct:24> (fcsr)
+ 0x8030 0000 06 <index into the __riscv_d_ext_state struct:24> (non-fcsr)
+
+Following are the RISC-V D-extension registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8030 0000 0600 0000 f[0] Floating point register 0
+ ...
+ 0x8030 0000 0600 001f f[31] Floating point register 31
+ 0x8020 0000 0600 0020 fcsr Floating point control and status register
+======================= ========= =============================================
+

4.69 KVM_GET_ONE_REG
--------------------
@@ -5565,6 +5721,25 @@ Valid values for 'type' are:
Userspace is expected to place the hypercall result into the appropriate
field before invoking KVM_RUN again.

+::
+
+ /* KVM_EXIT_RISCV_SBI */
+ struct {
+ unsigned long extension_id;
+ unsigned long function_id;
+ unsigned long args[6];
+ unsigned long ret[2];
+ } riscv_sbi;
+
+If the exit reason is KVM_EXIT_RISCV_SBI, it indicates that the VCPU has
+made an SBI call which is not handled by the KVM RISC-V kernel module. The
+details of the SBI call are available in the 'riscv_sbi' member of the
+kvm_run structure. The 'extension_id' field of 'riscv_sbi' represents the
+SBI extension ID whereas the 'function_id' field represents the function ID
+of the given SBI extension. The 'args' array field of 'riscv_sbi' holds the
+parameters for the SBI call and the 'ret' array field holds its return
+values. User-space should update the return values of the SBI call before
+resuming the VCPU. For more details on the RISC-V SBI spec, refer to
+https://github.com/riscv/riscv-sbi-doc.
+
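To illustrate the forwarding flow described above, here is a hedged sketch of
the userspace side. The struct is a minimal mock of the 'riscv_sbi' fields
(the real layout comes from <linux/kvm.h>), and the extension id and error
code shown are assumptions based on the legacy SBI v0.1 console interface:

```c
#include <stdio.h>

/* Minimal mock of the kvm_run 'riscv_sbi' member shown above. */
struct riscv_sbi_exit {
	unsigned long extension_id;
	unsigned long function_id;
	unsigned long args[6];
	unsigned long ret[2];
};

/* Illustrative ids/codes; check the SBI spec for the real values. */
#define SBI_EXT_0_1_CONSOLE_PUTCHAR 0x01UL
#define SBI_ERR_NOT_SUPPORTED       (-2L)

/* On KVM_EXIT_RISCV_SBI, the VMM fills in ret[] before KVM_RUN resumes. */
static void handle_sbi_exit(struct riscv_sbi_exit *sbi)
{
	switch (sbi->extension_id) {
	case SBI_EXT_0_1_CONSOLE_PUTCHAR:
		putchar((int)sbi->args[0]);   /* emulate the guest console */
		sbi->ret[0] = 0;              /* success */
		break;
	default:
		sbi->ret[0] = (unsigned long)SBI_ERR_NOT_SUPPORTED;
		break;
	}
}
```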
::

/* Fix the size of the union. */
--
2.25.1


2021-05-19 19:11:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 09:05:35AM +0530, Anup Patel wrote:
> From: Anup Patel <[email protected]>
>
> This series adds initial KVM RISC-V support. Currently, we are able to boot
> Linux on RV64/RV32 Guest with multiple VCPUs.
>
> Key aspects of KVM RISC-V added by this series are:
> 1. No RISC-V specific KVM IOCTL
> 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
> 3. Both RV64 and RV32 host supported
> 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
> 5. KVM ONE_REG interface for VCPU register access from user-space
> 6. PLIC emulation is done in user-space
> 7. Timer and IPI emuation is done in-kernel
> 8. Both Sv39x4 and Sv48x4 supported for RV64 host
> 9. MMU notifiers supported
> 10. Generic dirtylog supported
> 11. FP lazy save/restore supported
> 12. SBI v0.1 emulation for KVM Guest available
> 13. Forward unhandled SBI calls to KVM userspace
> 14. Hugepage support for Guest/VM
> 15. IOEVENTFD support for Vhost
>
> Here's a brief TODO list which we will work upon after this series:
> 1. SBI v0.2 emulation in-kernel
> 2. SBI v0.2 hart state management emulation in-kernel
> 3. In-kernel PLIC emulation
> 4. ..... and more .....
>
> This series can be found in riscv_kvm_v18 branch at:
> https//github.com/avpatel/linux.git
>
> Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
> at: https//github.com/avpatel/kvmtool.git
>
> The QEMU RISC-V hypervisor emulation is done by Alistair and is available
> in master branch at: https://git.qemu.org/git/qemu.git
>
> To play around with KVM RISC-V, refer KVM RISC-V wiki at:
> https://github.com/kvm-riscv/howto/wiki
> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
>
> Changes since v17:
> - Rebased on Linux-5.13-rc2
> - Moved to new KVM MMU notifier APIs
> - Removed redundant kvm_arch_vcpu_uninit()
> - Moved KVM RISC-V sources to drivers/staging for compliance with
> Linux RISC-V patch acceptance policy

What is this new "patch acceptance policy" and what does it have to do
with drivers/staging?

What does drivers/staging/ have to do with this at all? Did anyone ask
the staging maintainer about this?

Not cool, and not something I'm about to take without some very good
reasons...

greg k-h

2021-05-19 19:13:09

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 10:28 AM Greg Kroah-Hartman
<[email protected]> wrote:
>
> On Wed, May 19, 2021 at 09:05:35AM +0530, Anup Patel wrote:
> > From: Anup Patel <[email protected]>
> >
> > This series adds initial KVM RISC-V support. Currently, we are able to boot
> > Linux on RV64/RV32 Guest with multiple VCPUs.
> >
> > Key aspects of KVM RISC-V added by this series are:
> > 1. No RISC-V specific KVM IOCTL
> > 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
> > 3. Both RV64 and RV32 host supported
> > 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
> > 5. KVM ONE_REG interface for VCPU register access from user-space
> > 6. PLIC emulation is done in user-space
> > 7. Timer and IPI emuation is done in-kernel
> > 8. Both Sv39x4 and Sv48x4 supported for RV64 host
> > 9. MMU notifiers supported
> > 10. Generic dirtylog supported
> > 11. FP lazy save/restore supported
> > 12. SBI v0.1 emulation for KVM Guest available
> > 13. Forward unhandled SBI calls to KVM userspace
> > 14. Hugepage support for Guest/VM
> > 15. IOEVENTFD support for Vhost
> >
> > Here's a brief TODO list which we will work upon after this series:
> > 1. SBI v0.2 emulation in-kernel
> > 2. SBI v0.2 hart state management emulation in-kernel
> > 3. In-kernel PLIC emulation
> > 4. ..... and more .....
> >
> > This series can be found in riscv_kvm_v18 branch at:
> > https//github.com/avpatel/linux.git
> >
> > Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
> > at: https//github.com/avpatel/kvmtool.git
> >
> > The QEMU RISC-V hypervisor emulation is done by Alistair and is available
> > in master branch at: https://git.qemu.org/git/qemu.git
> >
> > To play around with KVM RISC-V, refer KVM RISC-V wiki at:
> > https://github.com/kvm-riscv/howto/wiki
> > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
> > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
> >
> > Changes since v17:
> > - Rebased on Linux-5.13-rc2
> > - Moved to new KVM MMU notifier APIs
> > - Removed redundant kvm_arch_vcpu_uninit()
> > - Moved KVM RISC-V sources to drivers/staging for compliance with
> > Linux RISC-V patch acceptance policy
>
> What is this new "patch acceptance policy" and what does it have to do
> with drivers/staging?

The Linux RISC-V patch acceptance policy is here:
Documentation/riscv/patch-acceptance.rst

As-per this policy, the Linux RISC-V maintainers will only accept
patches for frozen/ratified RISC-V extensions. Basically, it links the
Linux RISC-V development process with the RISC-V foundation
process which is painfully slow.

The KVM RISC-V patches have been sitting on the lists for almost
2 years now. The requirements for freezing RISC-V H-extension
(hypervisor extension) keeps changing and we are not clear when
it will be frozen. In fact, quite a few people have already implemented
RISC-V H-extension in hardware as well and KVM RISC-V works
on real HW as well.

Rationale of moving KVM RISC-V to drivers/staging is to continue
KVM RISC-V development without breaking the Linux RISC-V patch
acceptance policy until RISC-V H-extension is frozen. Once, RISC-V
H-extension is frozen we will move KVM RISC-V back to arch/riscv
(like other architectures).

>
> What does drivers/staging/ have to do with this at all? Did anyone ask
> the staging maintainer about this?

Yes, Paolo (KVM maintainer) suggested having KVM RISC-V under
drivers/staging until RISC-V H-extension is frozen and continue the
KVM RISC-V development from there.

>
> Not cool, and not something I'm about to take without some very good
> reasons...

Regards,
Anup

2021-05-19 19:14:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 10:40:13AM +0530, Anup Patel wrote:
> On Wed, May 19, 2021 at 10:28 AM Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > On Wed, May 19, 2021 at 09:05:35AM +0530, Anup Patel wrote:
> > > From: Anup Patel <[email protected]>
> > >
> > > This series adds initial KVM RISC-V support. Currently, we are able to boot
> > > Linux on RV64/RV32 Guest with multiple VCPUs.
> > >
> > > Key aspects of KVM RISC-V added by this series are:
> > > 1. No RISC-V specific KVM IOCTL
> > > 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
> > > 3. Both RV64 and RV32 host supported
> > > 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
> > > 5. KVM ONE_REG interface for VCPU register access from user-space
> > > 6. PLIC emulation is done in user-space
> > > 7. Timer and IPI emuation is done in-kernel
> > > 8. Both Sv39x4 and Sv48x4 supported for RV64 host
> > > 9. MMU notifiers supported
> > > 10. Generic dirtylog supported
> > > 11. FP lazy save/restore supported
> > > 12. SBI v0.1 emulation for KVM Guest available
> > > 13. Forward unhandled SBI calls to KVM userspace
> > > 14. Hugepage support for Guest/VM
> > > 15. IOEVENTFD support for Vhost
> > >
> > > Here's a brief TODO list which we will work upon after this series:
> > > 1. SBI v0.2 emulation in-kernel
> > > 2. SBI v0.2 hart state management emulation in-kernel
> > > 3. In-kernel PLIC emulation
> > > 4. ..... and more .....
> > >
> > > This series can be found in riscv_kvm_v18 branch at:
> > > https//github.com/avpatel/linux.git
> > >
> > > Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
> > > at: https//github.com/avpatel/kvmtool.git
> > >
> > > The QEMU RISC-V hypervisor emulation is done by Alistair and is available
> > > in master branch at: https://git.qemu.org/git/qemu.git
> > >
> > > To play around with KVM RISC-V, refer KVM RISC-V wiki at:
> > > https://github.com/kvm-riscv/howto/wiki
> > > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
> > > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
> > >
> > > Changes since v17:
> > > - Rebased on Linux-5.13-rc2
> > > - Moved to new KVM MMU notifier APIs
> > > - Removed redundant kvm_arch_vcpu_uninit()
> > > - Moved KVM RISC-V sources to drivers/staging for compliance with
> > > Linux RISC-V patch acceptance policy
> >
> > What is this new "patch acceptance policy" and what does it have to do
> > with drivers/staging?
>
> The Linux RISC-V patch acceptance policy is here:
> Documentation/riscv/patch-acceptance.rst
>
> As-per this policy, the Linux RISC-V maintainers will only accept
> patches for frozen/ratified RISC-V extensions. Basically, it links the
> Linux RISC-V development process with the RISC-V foundation
> process which is painfully slow.
>
> The KVM RISC-V patches have been sitting on the lists for almost
> 2 years now. The requirements for freezing RISC-V H-extension
> (hypervisor extension) keeps changing and we are not clear when
> it will be frozen. In fact, quite a few people have already implemented
> RISC-V H-extension in hardware as well and KVM RISC-V works
> on real HW as well.
>
> Rationale of moving KVM RISC-V to drivers/staging is to continue
> KVM RISC-V development without breaking the Linux RISC-V patch
> acceptance policy until RISC-V H-extension is frozen. Once, RISC-V
> H-extension is frozen we will move KVM RISC-V back to arch/riscv
> (like other architectures).

Wait, no, this has nothing to do with what drivers/staging/ is for and
how it is used. Again, not ok.

> > What does drivers/staging/ have to do with this at all? Did anyone ask
> > the staging maintainer about this?
>
> Yes, Paolo (KVM maintainer) suggested having KVM RISC-V under
> drivers/staging until RISC-V H-extension is frozen and continue the
> KVM RISC-V development from there.

staging is not for stuff like this at all. It is for code that is
self-contained (not this) and needs work to get merged into the main
part of the kernel (listed in a TODO file, and is not this).

It is not a dumping ground for stuff that arch maintainers can not seem
to agree on, and it is not a place where you can just randomly play
around with user/kernel apis with no consequences.

So no, sorry, not going to take this code at all.

greg k-h

2021-05-19 19:27:45

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 02/18] RISC-V: Add initial skeletal KVM support

On Wed, May 19, 2021 at 09:05:37AM +0530, Anup Patel wrote:
> +int kvm_arch_hardware_enable(void)
> +{
> + unsigned long hideleg, hedeleg;
> +
> + hedeleg = 0;
> + hedeleg |= (1UL << EXC_INST_MISALIGNED);

You may as well use BIT_UL(EXC_INST_MISALIGNED) for all of these.
There is a Coccinelle script to convert these so please just make it
standard like everyone else.
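Applied mechanically, the suggested conversion looks like this. BIT_UL and
the cause numbers are repeated here (with their values from the RISC-V
privileged spec) only to keep the sketch self-contained:

```c
/* BIT_UL as in include/linux/bits.h (simplified for this sketch). */
#define BIT_UL(nr) (1UL << (nr))

/* Exception cause numbers per the RISC-V privileged spec. */
#define EXC_INST_MISALIGNED 0
#define EXC_BREAKPOINT      3
#define EXC_SYSCALL         8

/* Same bitmask as the open-coded shifts, built with BIT_UL(). */
static unsigned long build_hedeleg(void)
{
	unsigned long hedeleg = 0;

	hedeleg |= BIT_UL(EXC_INST_MISALIGNED);
	hedeleg |= BIT_UL(EXC_BREAKPOINT);
	hedeleg |= BIT_UL(EXC_SYSCALL);
	return hedeleg;
}
```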

> + hedeleg |= (1UL << EXC_BREAKPOINT);
> + hedeleg |= (1UL << EXC_SYSCALL);
> + hedeleg |= (1UL << EXC_INST_PAGE_FAULT);
> + hedeleg |= (1UL << EXC_LOAD_PAGE_FAULT);
> + hedeleg |= (1UL << EXC_STORE_PAGE_FAULT);
> + csr_write(CSR_HEDELEG, hedeleg);
> +
> + hideleg = 0;
> + hideleg |= (1UL << IRQ_VS_SOFT);
> + hideleg |= (1UL << IRQ_VS_TIMER);
> + hideleg |= (1UL << IRQ_VS_EXT);
> + csr_write(CSR_HIDELEG, hideleg);
> +
> + csr_write(CSR_HCOUNTEREN, -1UL);
> +
> + csr_write(CSR_HVIP, 0);
> +
> + return 0;
> +}
> +
> +void kvm_arch_hardware_disable(void)
> +{
> + csr_write(CSR_HEDELEG, 0);
> + csr_write(CSR_HIDELEG, 0);
> +}
> +
> +int kvm_arch_init(void *opaque)
> +{
> + if (!riscv_isa_extension_available(NULL, h)) {
> + kvm_info("hypervisor extension not available\n");
> + return -ENODEV;
> + }
> +
> + if (sbi_spec_is_0_1()) {
> + kvm_info("require SBI v0.2 or higher\n");
> + return -ENODEV;
> + }
> +
> + if (sbi_probe_extension(SBI_EXT_RFENCE) <= 0) {

sbi_probe_extension() never returns zero.


> + kvm_info("require SBI RFENCE extension\n");
> + return -ENODEV;
> + }
> +
> + kvm_info("hypervisor extension available\n");
> +
> + return 0;
> +}
> +
> +void kvm_arch_exit(void)
> +{
> +}
> +
> +static int riscv_kvm_init(void)
> +{
> + return kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
> +}
> +module_init(riscv_kvm_init);


[ snip ]

> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> + struct kvm_cpu_trap trap;
> + struct kvm_run *run = vcpu->run;
> +
> + vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> + /* Process MMIO value returned from user-space */
> + if (run->exit_reason == KVM_EXIT_MMIO) {
> + ret = kvm_riscv_vcpu_mmio_return(vcpu, vcpu->run);
> + if (ret) {
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
> + return ret;
> + }
> + }
> +
> + if (run->immediate_exit) {
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
> + return -EINTR;
> + }
> +
> + vcpu_load(vcpu);
> +
> + kvm_sigset_activate(vcpu);
> +
> + ret = 1;
> + run->exit_reason = KVM_EXIT_UNKNOWN;
> + while (ret > 0) {
> + /* Check conditions before entering the guest */
> + cond_resched();
> +
> + kvm_riscv_check_vcpu_requests(vcpu);
> +
> + preempt_disable();
> +
> + local_irq_disable();
> +
> + /*
> + * Exit if we have a signal pending so that we can deliver
> + * the signal to user space.
> + */
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + run->exit_reason = KVM_EXIT_INTR;
> + }
> +
> + /*
> + * Ensure we set mode to IN_GUEST_MODE after we disable
> + * interrupts and before the final VCPU requests check.
> + * See the comment in kvm_vcpu_exiting_guest_mode() and
> + * Documentation/virtual/kvm/vcpu-requests.rst
> + */
> + vcpu->mode = IN_GUEST_MODE;
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
> + smp_mb__after_srcu_read_unlock();
> +
> + if (ret <= 0 ||

ret can never be == 0 at this point.

> + kvm_request_pending(vcpu)) {
> + vcpu->mode = OUTSIDE_GUEST_MODE;
> + local_irq_enable();
> + preempt_enable();
> + vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + continue;
> + }
> +
> + guest_enter_irqoff();
> +
> + __kvm_riscv_switch_to(&vcpu->arch);
> +
> + vcpu->mode = OUTSIDE_GUEST_MODE;
> + vcpu->stat.exits++;
> +
> + /*
> + * Save SCAUSE, STVAL, HTVAL, and HTINST because we might
> + * get an interrupt between __kvm_riscv_switch_to() and
> + * local_irq_enable() which can potentially change CSRs.
> + */
> + trap.sepc = 0;
> + trap.scause = csr_read(CSR_SCAUSE);
> + trap.stval = csr_read(CSR_STVAL);
> + trap.htval = csr_read(CSR_HTVAL);
> + trap.htinst = csr_read(CSR_HTINST);
> +
> + /*
> + * We may have taken a host interrupt in VS/VU-mode (i.e.
> + * while executing the guest). This interrupt is still
> + * pending, as we haven't serviced it yet!
> + *
> + * We're now back in HS-mode with interrupts disabled
> + * so enabling the interrupts now will have the effect
> + * of taking the interrupt again, in HS-mode this time.
> + */
> + local_irq_enable();
> +
> + /*
> + * We do local_irq_enable() before calling guest_exit() so
> + * that if a timer interrupt hits while running the guest
> + * we account that tick as being spent in the guest. We
> + * enable preemption after calling guest_exit() so that if
> + * we get preempted we make sure ticks after that is not
> + * counted as guest time.
> + */
> + guest_exit();
> +
> + preempt_enable();
> +
> + vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> + ret = kvm_riscv_vcpu_exit(vcpu, run, &trap);
> + }
> +
> + kvm_sigset_deactivate(vcpu);
> +
> + vcpu_put(vcpu);
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->arch.srcu_idx);
> +
> + return ret;
> +}

regards,
dan carpenter

2021-05-19 19:28:04

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 02/18] RISC-V: Add initial skeletal KVM support

On Wed, May 19, 2021 at 09:05:37AM +0530, Anup Patel wrote:
> +void kvm_riscv_stage2_free_pgd(struct kvm *kvm)
> +{
> + /* TODO: */
> +}
> +

I was disappointed how many stub functions remained at the end of the
patchset... It's better to not publish those. How useful is this
patchset with the functionality that is implemented currently?

> +int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> +{
> + int r;
> +
> + r = kvm_riscv_stage2_alloc_pgd(kvm);
> + if (r)
> + return r;
> +
> + return 0;
> +}

Half the code uses "int ret;" and half uses "int r;". Make everything
int ret.

regards,
dan carpenter

2021-05-19 19:28:31

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 11/18] RISC-V: KVM: Implement MMU notifiers

On Wed, May 19, 2021 at 09:05:46AM +0530, Anup Patel wrote:
> int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> struct kvm_memory_slot *memslot,
> gpa_t gpa, unsigned long hva, bool is_write)
> @@ -569,7 +643,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
> bool logging = (memslot->dirty_bitmap &&
> !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
> - unsigned long vma_pagesize;
> + unsigned long vma_pagesize, mmu_seq;
>
> mmap_read_lock(current->mm);
>
> @@ -608,6 +682,8 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> return ret;
> }
>
> + mmu_seq = kvm->mmu_notifier_seq;
> +
> hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
> if (hfn == KVM_PFN_ERR_HWPOISON) {
> send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
> @@ -626,6 +702,9 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
>
> spin_lock(&kvm->mmu_lock);
>
> + if (mmu_notifier_retry(kvm, mmu_seq))
> + goto out_unlock;

Do we need an error code here or is it a success path? You would
expect from the name that mmu_notifier_retry() would retry something
and return an error code, but it's actually a boolean function.

regards,
dan carpenter

> +
> if (writeable) {
> kvm_set_pfn_dirty(hfn);
> mark_page_dirty(kvm, gfn);
> @@ -639,6 +718,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> if (ret)
> kvm_err("Failed to map in stage2\n");
>
> +out_unlock:
> spin_unlock(&kvm->mmu_lock);
> kvm_set_pfn_accessed(hfn);
> kvm_release_pfn_clean(hfn);
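For reference, a toy model of the seq-based check being discussed: the fault
handler samples mmu_notifier_seq before translating hva to pfn, and the
boolean result of mmu_notifier_retry() means "an invalidation ran in
between, drop the lock and let the fault be taken again". In most KVM ports
that retry is effectively a success path rather than an error (names here
mirror the kernel's but the types are mocked):

```c
#include <stdbool.h>

/* Mocked kvm struct carrying only the notifier sequence counter. */
struct kvm_mock { unsigned long mmu_notifier_seq; };

/* True means the caller raced with a notifier and must retry the fault. */
static bool mmu_notifier_retry_mock(struct kvm_mock *kvm, unsigned long seq)
{
	return kvm->mmu_notifier_seq != seq;
}
```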


2021-05-19 19:28:31

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 07:21:54AM +0200, Greg Kroah-Hartman wrote:
> On Wed, May 19, 2021 at 10:40:13AM +0530, Anup Patel wrote:
> > On Wed, May 19, 2021 at 10:28 AM Greg Kroah-Hartman
> > <[email protected]> wrote:
> > >
> > > On Wed, May 19, 2021 at 09:05:35AM +0530, Anup Patel wrote:
> > > > From: Anup Patel <[email protected]>
> > > >
> > > > This series adds initial KVM RISC-V support. Currently, we are able to boot
> > > > Linux on RV64/RV32 Guest with multiple VCPUs.
> > > >
> > > > Key aspects of KVM RISC-V added by this series are:
> > > > 1. No RISC-V specific KVM IOCTL
> > > > 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
> > > > 3. Both RV64 and RV32 host supported
> > > > 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
> > > > 5. KVM ONE_REG interface for VCPU register access from user-space
> > > > 6. PLIC emulation is done in user-space
> > > > 7. Timer and IPI emuation is done in-kernel
> > > > 8. Both Sv39x4 and Sv48x4 supported for RV64 host
> > > > 9. MMU notifiers supported
> > > > 10. Generic dirtylog supported
> > > > 11. FP lazy save/restore supported
> > > > 12. SBI v0.1 emulation for KVM Guest available
> > > > 13. Forward unhandled SBI calls to KVM userspace
> > > > 14. Hugepage support for Guest/VM
> > > > 15. IOEVENTFD support for Vhost
> > > >
> > > > Here's a brief TODO list which we will work upon after this series:
> > > > 1. SBI v0.2 emulation in-kernel
> > > > 2. SBI v0.2 hart state management emulation in-kernel
> > > > 3. In-kernel PLIC emulation
> > > > 4. ..... and more .....
> > > >
> > > > This series can be found in riscv_kvm_v18 branch at:
> > > > https//github.com/avpatel/linux.git
> > > >
> > > > Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
> > > > at: https//github.com/avpatel/kvmtool.git
> > > >
> > > > The QEMU RISC-V hypervisor emulation is done by Alistair and is available
> > > > in master branch at: https://git.qemu.org/git/qemu.git
> > > >
> > > > To play around with KVM RISC-V, refer KVM RISC-V wiki at:
> > > > https://github.com/kvm-riscv/howto/wiki
> > > > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
> > > > https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
> > > >
> > > > Changes since v17:
> > > > - Rebased on Linux-5.13-rc2
> > > > - Moved to new KVM MMU notifier APIs
> > > > - Removed redundant kvm_arch_vcpu_uninit()
> > > > - Moved KVM RISC-V sources to drivers/staging for compliance with
> > > > Linux RISC-V patch acceptance policy
> > >
> > > What is this new "patch acceptance policy" and what does it have to do
> > > with drivers/staging?
> >
> > The Linux RISC-V patch acceptance policy is here:
> > Documentation/riscv/patch-acceptance.rst
> >
> > As-per this policy, the Linux RISC-V maintainers will only accept
> > patches for frozen/ratified RISC-V extensions. Basically, it links the
> > Linux RISC-V development process with the RISC-V foundation
> > process which is painfully slow.
> >
> > The KVM RISC-V patches have been sitting on the lists for almost
> > 2 years now. The requirements for freezing RISC-V H-extension
> > (hypervisor extension) keeps changing and we are not clear when
> > it will be frozen. In fact, quite a few people have already implemented
> > RISC-V H-extension in hardware as well and KVM RISC-V works
> > on real HW as well.
> >
> > Rationale of moving KVM RISC-V to drivers/staging is to continue
> > KVM RISC-V development without breaking the Linux RISC-V patch
> > acceptance policy until RISC-V H-extension is frozen. Once, RISC-V
> > H-extension is frozen we will move KVM RISC-V back to arch/riscv
> > (like other architectures).
>
> Wait, no, this has nothing to do with what drivers/staging/ is for and
> how it is used. Again, not ok.
>
> > > What does drivers/staging/ have to do with this at all? Did anyone ask
> > > the staging maintainer about this?
> >
> > Yes, Paolo (KVM maintainer) suggested having KVM RISC-V under
> > drivers/staging until RISC-V H-extension is frozen and continue the
> > KVM RISC-V development from there.
>
> staging is not for stuff like this at all. It is for code that is
> self-contained (not this) and needs work to get merged into the main
> part of the kernel (listed in a TODO file, and is not this).
>
> It is not a dumping ground for stuff that arch maintainers can not seem
> to agree on, and it is not a place where you can just randomly play
> around with user/kernel apis with no consequences.
>
> So no, sorry, not going to take this code at all.

And to be a bit more clear about this, having other subsystem
maintainers drop their unwanted code on this subsystem, _without_ even
asking me first is just not very nice. All of a sudden I am now
responsible for this stuff, without me even being asked about it.
Should I start throwing random drivers into the kvm subsystem for them
to maintain because I don't want to? :)

If there's really no other way to do this, than to put it in staging,
let's talk about it. But saying "this must go here" is not a
conversation...

thanks,

greg k-h

2021-05-19 19:29:03

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 14/18] RISC-V: KVM: Implement ONE REG interface for FP registers

On Wed, May 19, 2021 at 09:05:49AM +0530, Anup Patel wrote:
> static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
> const struct kvm_one_reg *reg)
> {
> @@ -427,6 +519,12 @@ static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
> return kvm_riscv_vcpu_set_reg_csr(vcpu, reg);
> else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
> return kvm_riscv_vcpu_set_reg_timer(vcpu, reg);
> + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_F)
> + return kvm_riscv_vcpu_set_reg_fp(vcpu, reg,
> + KVM_REG_RISCV_FP_F);
> + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D)
> + return kvm_riscv_vcpu_set_reg_fp(vcpu, reg,
> + KVM_REG_RISCV_FP_D);
>
> return -EINVAL;
> }
> @@ -442,6 +540,12 @@ static int kvm_riscv_vcpu_get_reg(struct kvm_vcpu *vcpu,
> return kvm_riscv_vcpu_get_reg_csr(vcpu, reg);
> else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_TIMER)
> return kvm_riscv_vcpu_get_reg_timer(vcpu, reg);
> + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_F)
> + return kvm_riscv_vcpu_get_reg_fp(vcpu, reg,
> + KVM_REG_RISCV_FP_F);
> + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D)
> + return kvm_riscv_vcpu_get_reg_fp(vcpu, reg,
> + KVM_REG_RISCV_FP_D);

These have become unwieldy. Use a switch statement:

switch (reg->id & KVM_REG_RISCV_TYPE_MASK) {
case KVM_REG_RISCV_TIMER:
	return kvm_riscv_vcpu_get_reg_timer(vcpu, reg);

regards,
dan carpenter
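A self-contained sketch of the suggested switch-based dispatch. The mask and
type values are placeholders standing in for the uapi constants; only the
shape of the dispatch matters here:

```c
#include <stdint.h>

/* Placeholder values for the uapi constants (illustrative only). */
#define KVM_REG_RISCV_TYPE_MASK  0x00000000ff000000ULL
#define KVM_REG_RISCV_TIMER      (0x04ULL << 24)
#define KVM_REG_RISCV_FP_F       (0x05ULL << 24)
#define KVM_REG_RISCV_FP_D       (0x06ULL << 24)

/* Returns a tag identifying which handler a register id routes to,
 * mirroring the if/else chain in the patch. */
static int route_get_reg(uint64_t id)
{
	switch (id & KVM_REG_RISCV_TYPE_MASK) {
	case KVM_REG_RISCV_TIMER:
		return 4;  /* kvm_riscv_vcpu_get_reg_timer() */
	case KVM_REG_RISCV_FP_F:
		return 5;  /* kvm_riscv_vcpu_get_reg_fp(..., FP_F) */
	case KVM_REG_RISCV_FP_D:
		return 6;  /* kvm_riscv_vcpu_get_reg_fp(..., FP_D) */
	default:
		return -1; /* -EINVAL in the real code */
	}
}
```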


2021-05-19 19:30:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 19/05/21 12:47, Greg Kroah-Hartman wrote:
>> It is not a dumping ground for stuff that arch maintainers can not seem
>> to agree on, and it is not a place where you can just randomly play
>> around with user/kernel apis with no consequences.
>>
>> So no, sorry, not going to take this code at all.
>
> And to be a bit more clear about this, having other subsystem
> maintainers drop their unwanted code on this subsystem, _without_ even
> asking me first is just not very nice. All of a sudden I am now
> responsible for this stuff, without me even being asked about it.
> Should I start throwing random drivers into the kvm subsystem for them
> to maintain because I don't want to? :)

(I did see the smiley), I'm on board with throwing random drivers in
arch/riscv. :)

The situation here didn't seem very far from what process/2.Process.rst
says about staging:

- "a way to keep track of drivers that aren't up to standards", though
in this case the issue is not coding standards or quality---the code is
very good---and which people "may want to use"

- the code could be removed if there's no progress on either changing
the RISC-V acceptance policy or ratifying the spec

Of course there should have been a TODO file explaining the situation.
But if you think this is not the right place, I totally understand; if
my opinion had any weight in this, I would just place it in arch/riscv/kvm.

The RISC-V acceptance policy as is just doesn't work, and the fact that
people are trying to work around it is proving it. There are many ways
to improve it:

- get rid of it;

- provide a path to get an exception;

- provide a staging place so that people can do their job of contributing
code to Linux (e.g. arch/riscv/staging/kvm).

If everything else fails, I guess we can place it in
drivers/virt/riscv/kvm, even though that's just as silly a workaround.
It's a pity because the RISC-V virtualization architecture has a very
nice design, and the KVM code is also a very good example of how to do
things right.

Paolo

> If there's really no other way to do this, than to put it in staging,
> let's talk about it. But saying "this must go here" is not a
> conversation...


2021-05-19 19:35:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 19/05/21 14:23, Greg Kroah-Hartman wrote:
>> - the code could be removed if there's no progress on either changing the
>> RISC-V acceptance policy or ratifying the spec
>
> I really do not understand the issue here, why can this just not be
> merged normally?

Because the RISC-V people only want to merge code for "frozen" or
"ratified" processor extensions, and the RISC-V foundation is dragging
their feet in ratifying the hypervisor extension.

It's totally a self-inflicted pain on part of the RISC-V maintainers;
see Documentation/riscv/patch-acceptance.rst:

We'll only accept patches for new modules or extensions if the
specifications for those modules or extensions are listed as being
"Frozen" or "Ratified" by the RISC-V Foundation. (Developers may, of
course, maintain their own Linux kernel trees that contain code for
any draft extensions that they wish.)

(Link:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/riscv/patch-acceptance.rst)

> All staging drivers need a TODO list that shows what needs to be done in
> order to get it out of staging. All I can tell so far is that the riscv
> maintainers do not want to take this for "unknown reasons" so let's dump
> it over here for now where we don't have to see it.
>
> And that's not good for developers or users, so perhaps the riscv rules
> are not very good?

I agree wholeheartedly.

I have heard contrasting opinions on conflict of interest where the
employers of the maintainers benefit from slowing down the integration
of code in Linus's tree. I find these allegations believable, but even
if that weren't the case, the policy is (to put it kindly) showing its
limits.

>> Of course there should have been a TODO file explaining the situation. But
>> if you think this is not the right place, I totally understand; if my
>> opinion had any weight in this, I would just place it in arch/riscv/kvm.
>>
>> The RISC-V acceptance policy as is just doesn't work, and the fact that
>> people are trying to work around it is proving it. There are many ways to
>> improve it:
>
> What is this magical acceptance policy that is preventing working code
> from being merged? And why is it suddenly the rest of the kernel
> developer's problems because of this?

It is my problem because I am trying to help Anup merging some perfectly
good KVM code; when a new KVM port comes up, I coordinate merging the
first arch/*/kvm bits with the arch/ maintainers and from that point on
that directory becomes "mine" (or my submaintainers').

Paolo


2021-05-19 20:11:47

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

It's sort of frustrating that none of this information was in the commit
message.

"This code is not ready to be merged into the arch/riscv/ directory
because the RISC-V Foundation has not certified the hardware spec yet.
However, the following chips have implemented it ABC12345, ABC6789 and
they're already shipping to thousands of customers since blah blah blah
so we should support it."

I honestly thought it was an issue with the code or the userspace API.

regards,
dan carpenter


2021-05-19 20:12:36

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 19/05/21 17:08, Dan Carpenter wrote:
> It's sort of frustrating that none of this information was in the commit
> message.
>
> "This code is not ready to be merged into the arch/riscv/ directory
> because the RISC-V Foundation has not certified the hardware spec yet.
> However, the following chips have implemented it ABC12345, ABC6789 and
> they're already shipping to thousands of customers since blah blah blah
> so we should support it."
>
> I honestly thought it was an issue with the code or the userspace API.

Yes, I was expecting this to be in the staging TODO file - I should have
been more explicit with Anup.

Paolo


2021-05-19 21:12:19

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 01:18:44PM +0200, Paolo Bonzini wrote:
> On 19/05/21 12:47, Greg Kroah-Hartman wrote:
> > > It is not a dumping ground for stuff that arch maintainers can not seem
> > > to agree on, and it is not a place where you can just randomly play
> > > around with user/kernel apis with no consequences.
> > >
> > > So no, sorry, not going to take this code at all.
> >
> > And to be a bit more clear about this, having other subsystem
> > maintainers drop their unwanted code on this subsystem,_without_ even
> > asking me first is just not very nice. All of a sudden I am now
> > responsible for this stuff, without me even being asked about it.
> > Should I start throwing random drivers into the kvm subsystem for them
> > to maintain because I don't want to? :)
>
> (I did see the smiley), I'm on board with throwing random drivers in
> arch/riscv. :)
>
> The situation here didn't seem very far from what process/2.Process.rst says
> about staging:
>
> - "a way to keep track of drivers that aren't up to standards", though in
> this case the issue is not coding standards or quality---the code is very
> good---and which people "may want to use"

Exactly, this is different. And it's not self-contained, which is
another requirement for staging code that we have learned to enforce
over the years (makes it easier to rip out if no one is willing to
maintain it.)

> - the code could be removed if there's no progress on either changing the
> RISC-V acceptance policy or ratifying the spec

I really do not understand the issue here, why can this just not be
merged normally?

Is the code somehow not ok? Is it relying on hardware in ways that
breaks other users? Does it cause problems for different users? Is it
a user api that you don't like or think is the "proper" one?

All staging drivers need a TODO list that shows what needs to be done in
order to get it out of staging. All I can tell so far is that the riscv
maintainers do not want to take this for "unknown reasons" so let's dump
it over here for now where we don't have to see it.

And that's not good for developers or users, so perhaps the riscv rules
are not very good?

> Of course there should have been a TODO file explaining the situation. But
> if you think this is not the right place, I totally understand; if my
> opinion had any weight in this, I would just place it in arch/riscv/kvm.
>
> The RISC-V acceptance policy as is just doesn't work, and the fact that
> people are trying to work around it is proving it. There are many ways to
> improve it:

What is this magical acceptance policy that is preventing working code
from being merged? And why is it suddenly the rest of the kernel
developer's problems because of this?

thanks,

greg k-h

2021-05-19 21:16:41

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, May 19, 2021 at 03:29:24PM +0200, Paolo Bonzini wrote:
> On 19/05/21 14:23, Greg Kroah-Hartman wrote:
> > > - the code could be removed if there's no progress on either changing the
> > > RISC-V acceptance policy or ratifying the spec
> >
> > I really do not understand the issue here, why can this just not be
> > merged normally?
>
> Because the RISC-V people only want to merge code for "frozen" or "ratified"
> processor extensions, and the RISC-V foundation is dragging their feet in
> ratifying the hypervisor extension.
>
> It's totally a self-inflicted pain on part of the RISC-V maintainers; see
> Documentation/riscv/patch-acceptance.rst:
>
> We'll only accept patches for new modules or extensions if the
> specifications for those modules or extensions are listed as being
> "Frozen" or "Ratified" by the RISC-V Foundation. (Developers may, of
> course, maintain their own Linux kernel trees that contain code for
> any draft extensions that they wish.)
>
> (Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/riscv/patch-acceptance.rst)

Lovely, and how is that going to work for code that lives outside of the
riscv "arch" layer? Like all drivers?

And what exactly is "not ratified" that these patches take advantage of?
If there is hardware out there with these features, well, Linux needs to
run on it, so we need to support that. No external committee rules
should be relevant here.

Now if this is for hardware that is not "real", then that's a different
story. In that case, who cares, no one can use it, so why not take it?

So what exactly is this trying to "protect" Linux from?

> > All staging drivers need a TODO list that shows what needs to be done in
> > order to get it out of staging. All I can tell so far is that the riscv
> > maintainers do not want to take this for "unknown reasons" so let's dump
> > it over here for now where we don't have to see it.
> >
> > And that's not good for developers or users, so perhaps the riscv rules
> > are not very good?
>
> I agree wholeheartedly.
>
> I have heard contrasting opinions on conflict of interest where the
> employers of the maintainers benefit from slowing down the integration of
> code in Linus's tree. I find these allegations believable, but even if that
> weren't the case, the policy is (to put it kindly) showing its limits.

Slowing down code merges is horrible, again, if there's hardware out
there, and someone sends code to support it, and wants to maintain it,
then we should not be rejecting it.

Otherwise we are not doing our job as an operating system kernel, our
role is to make hardware work. We don't get to just ignore code because
we don't like the hardware (oh if only we could!), if a user wants to
use it, our role is to handle that.

> > > Of course there should have been a TODO file explaining the situation. But
> > > if you think this is not the right place, I totally understand; if my
> > > opinion had any weight in this, I would just place it in arch/riscv/kvm.
> > >
> > > The RISC-V acceptance policy as is just doesn't work, and the fact that
> > > people are trying to work around it is proving it. There are many ways to
> > > improve it:
> >
> > What is this magical acceptance policy that is preventing working code
> > from being merged? And why is it suddenly the rest of the kernel
> > developer's problems because of this?
>
> It is my problem because I am trying to help Anup merging some perfectly
> good KVM code; when a new KVM port comes up, I coordinate merging the first
> arch/*/kvm bits with the arch/ maintainers and from that point on that
> directory becomes "mine" (or my submaintainers').

Agreed, but the riscv maintainers should not be forcing this "problem"
onto all of us, like it seems is starting to happen :(

Ok, so, Paul, Palmer, and Albert, what do we do here? Why can't we take
working code like this into the kernel if someone is willing to support
and maintain it over time?

thanks,

greg k-h

2021-05-20 06:13:56

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v18 14/18] RISC-V: KVM: Implement ONE REG interface for FP registers

On Wed, May 19, 2021 at 09:05:49AM +0530, Anup Patel wrote:
> +static int kvm_riscv_vcpu_set_reg_fp(struct kvm_vcpu *vcpu,
> + const struct kvm_one_reg *reg,
> + unsigned long rtype)
> +{
> + struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
> + unsigned long isa = vcpu->arch.isa;
> + unsigned long __user *uaddr =
> + (unsigned long __user *)(unsigned long)reg->addr;
> + unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
> + KVM_REG_SIZE_MASK |
> + rtype);
> + void *reg_val;
> +
> + if ((rtype == KVM_REG_RISCV_FP_F) &&
> + riscv_isa_extension_available(&isa, f)) {
> + if (KVM_REG_SIZE(reg->id) != sizeof(u32))
> + return -EINVAL;
> + if (reg_num == KVM_REG_RISCV_FP_F_REG(fcsr))
> + reg_val = &cntx->fp.f.fcsr;
> + else if ((KVM_REG_RISCV_FP_F_REG(f[0]) <= reg_num) &&
> + reg_num <= KVM_REG_RISCV_FP_F_REG(f[31]))
> + reg_val = &cntx->fp.f.f[reg_num];
> + else
> + return -EINVAL;
> + } else if ((rtype == KVM_REG_RISCV_FP_D) &&
> + riscv_isa_extension_available(&isa, d)) {
> + if (reg_num == KVM_REG_RISCV_FP_D_REG(fcsr)) {
> + if (KVM_REG_SIZE(reg->id) != sizeof(u32))
> + return -EINVAL;
> + reg_val = &cntx->fp.d.fcsr;
> + } else if ((KVM_REG_RISCV_FP_D_REG(f[0]) <= reg_num) &&
> + reg_num <= KVM_REG_RISCV_FP_D_REG(f[31])) {
> + if (KVM_REG_SIZE(reg->id) != sizeof(u64))
> + return -EINVAL;
> + reg_val = &cntx->fp.d.f[reg_num];
> + } else
> + return -EINVAL;
> + } else
> + return -EINVAL;
> +
> + if (copy_from_user(reg_val, uaddr, KVM_REG_SIZE(reg->id)))
^^^^^^^
It sort of bothers me that if this copy fails then we have no idea
what garbage is in reg_val. It would be nicer to copy it to a temporary
buffer and then memcpy it when we know it's going to succeed.

> + return -EFAULT;
> +
> + return 0;
> +}

regards,
dan carpenter

2021-05-21 20:23:15

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Wed, 19 May 2021 06:58:05 PDT (-0700), Greg KH wrote:
> On Wed, May 19, 2021 at 03:29:24PM +0200, Paolo Bonzini wrote:
>> On 19/05/21 14:23, Greg Kroah-Hartman wrote:
>> > > - the code could be removed if there's no progress on either changing the
>> > > RISC-V acceptance policy or ratifying the spec
>> >
>> > I really do not understand the issue here, why can this just not be
>> > merged normally?
>>
>> Because the RISC-V people only want to merge code for "frozen" or "ratified"
>> processor extensions, and the RISC-V foundation is dragging their feet in
>> ratifying the hypervisor extension.
>>
>> It's totally a self-inflicted pain on part of the RISC-V maintainers; see
>> Documentation/riscv/patch-acceptance.rst:
>>
>> We'll only accept patches for new modules or extensions if the
>> specifications for those modules or extensions are listed as being
>> "Frozen" or "Ratified" by the RISC-V Foundation. (Developers may, of
>> course, maintain their own Linux kernel trees that contain code for
>> any draft extensions that they wish.)
>>
>> (Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/riscv/patch-acceptance.rst)
>
> Lovely, and how is that going to work for code that lives outside of the
> riscv "arch" layer? Like all drivers?
>
> And what exactly is "not ratified" that these patches take advantage of?
> If there is hardware out there with these features, well, Linux needs to
> run on it, so we need to support that. No external committee rules
> should be relevant here.
>
> Now if this is for hardware that is not "real", then that's a different
> story. In that case, who cares, no one can use it, so why not take it?
>
> So what exactly is this trying to "protect" Linux from?
>
>> > All staging drivers need a TODO list that shows what needs to be done in
>> > order to get it out of staging. All I can tell so far is that the riscv
>> > maintainers do not want to take this for "unknown reasons" so let's dump
>> > it over here for now where we don't have to see it.
>> >
>> > And that's not good for developers or users, so perhaps the riscv rules
>> > are not very good?
>>
>> I agree wholeheartedly.
>>
>> I have heard contrasting opinions on conflict of interest where the
>> employers of the maintainers benefit from slowing down the integration of
>> code in Linus's tree. I find these allegations believable, but even if that
>> weren't the case, the policy is (to put it kindly) showing its limits.
>
> Slowing down code merges is horrible, again, if there's hardware out
> there, and someone sends code to support it, and wants to maintain it,
> then we should not be rejecting it.
>
> Otherwise we are not doing our job as an operating system kernel, our
> role is to make hardware work. We don't get to just ignore code because
> we don't like the hardware (oh if only we could!), if a user wants to
> use it, our role is to handle that.
>
>> > > Of course there should have been a TODO file explaining the situation. But
>> > > if you think this is not the right place, I totally understand; if my
>> > > opinion had any weight in this, I would just place it in arch/riscv/kvm.
>> > >
>> > > The RISC-V acceptance policy as is just doesn't work, and the fact that
>> > > people are trying to work around it is proving it. There are many ways to
>> > > improve it:
>> >
>> > What is this magical acceptance policy that is preventing working code
>> > from being merged? And why is it suddenly the rest of the kernel
>> > developer's problems because of this?
>>
>> It is my problem because I am trying to help Anup merging some perfectly
>> good KVM code; when a new KVM port comes up, I coordinate merging the first
>> arch/*/kvm bits with the arch/ maintainers and from that point on that
>> directory becomes "mine" (or my submaintainers').
>
> Agreed, but the riscv maintainers should not be forcing this "problem"
> onto all of us, like it seems is starting to happen :(
>
> Ok, so, Paul, Palmer, and Albert, what do we do here? Why can't we take
> working code like this into the kernel if someone is willing to support
> and maintain it over time?

I don't view this code as being in a state where it can be maintained,
at least to the standards we generally set within the kernel. The ISA
extension in question is still subject to change, it says so right at
the top of the H extension
<https://github.com/riscv/riscv-isa-manual/blob/master/src/hypervisor.tex#L4>

{\bf Warning! This draft specification may change before being
accepted as standard by the RISC-V Foundation.}

That means we really can't rely on any of this to be compatible with
what is eventually ratified and (hopefully, because this is really
important stuff) widely implemented in hardware. We've already had
issues with other specifications where drafts were proposed as being
ready for implementation, software was ported, and the future drafts
were later incompatible -- we had this years ago with the debug support,
which was a huge headache to deal with, and we're running into it again
with these v-0.7.1 chips coming out. I don't want to get stuck in a
spot where we're forced to either deal with some old draft extension
forever or end up breaking users.

Ultimately the whole RISC-V thing is only going to work out if we can
get to the point where vendors can agree on a shared ISA. I understand
that there's been a lot of frustration WRT the timelines on the H
extension, it's been frustrating for me as well. There are clearly
issues with how the ISA development process is being run and while those
are coming to a head in other areas (the V extension and non-coherent
devices, for example) I really don't think that's the case here because
as far as I know we don't actually have any real hardware that
implements the H extension.

All I really care about is getting to the point where we have real
RISC-V systems running software that's as close to upstream as is
reasonable. As it currently stands, I don't know of anything this is
blocking: there's some RTL implementation floating around, but that's a
very long way from being real hardware. Something of this complexity
isn't suitable for a soft core, and RTL alone doesn't fix the
fundamental problem of having a stable platform to run on (it needs a
complex FPGA environment, and even then it's very limited in
functionality). I'm not sure where exactly the line for real hardware
is, but for something like this it would at least involve some chip that
is widely available and needs the H extension to be useful. Such a
system existing without a ratified extension would obviously be a major
failing on the specification side, and while I think that's happening
now for some systems (some of these V-0.7.1 chips, and the non-coherent
systems) I just don't see that as the case for the H extension. We've
got to get to the point where the ISA extensions can be ratified in a
timely fashion, but circumventing that process by merging code early
doesn't fix the problem. This really needs to be fixed at the RISC-V
foundation, not papered over in software.

We have lots of real RISC-V hardware right now that's going to require a
huge amount of work to support, trying to chase around a draft extension
that may not even end up in hardware is just going to make headaches we
don't have the time for.

2021-05-21 20:23:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 21/05/21 19:13, Palmer Dabbelt wrote:
>>
>
> I don't view this code as being in a state where it can be
> maintained, at least to the standards we generally set within the
> kernel. The ISA extension in question is still subject to change, it
> says so right at the top of the H extension
> <https://github.com/riscv/riscv-isa-manual/blob/master/src/hypervisor.tex#L4>
>
> {\bf Warning! This draft specification may change before being
> accepted as standard by the RISC-V Foundation.}

To give a complete picture, the last three relevant changes have been in
August 2019, November 2019 and May 2020. It seems pretty frozen to me.

In any case, I think it's clear from the experience with Android that
the acceptance policy cannot succeed. The only thing that such a policy
guarantees is that vendors will use more out-of-tree code. Keeping a
fully-developed feature out-of-tree for years is not how Linux is run.

> I'm not sure where exactly the line for real hardware is, but for
> something like this it would at least involve some chip that is
> widely available and needs the H extension to be useful

Anup said that "quite a few people have already implemented RISC-V
H-extension in hardware as well and KVM RISC-V works on real HW as
well". Those people would benefit from having KVM in the Linus tree.

Paolo

2021-05-21 20:25:52

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Fri, May 21, 2021 at 07:21:12PM +0200, Paolo Bonzini wrote:
> On 21/05/21 19:13, Palmer Dabbelt wrote:
> > >
> >
> > I don't view this code as being in a state where it can be
> > maintained, at least to the standards we generally set within the
> > kernel. The ISA extension in question is still subject to change, it
> > says so right at the top of the H extension <https://github.com/riscv/riscv-isa-manual/blob/master/src/hypervisor.tex#L4>
> >
> > {\bf Warning! This draft specification may change before being
> > accepted as standard by the RISC-V Foundation.}
>
> To give a complete picture, the last three relevant changes have been in
> August 2019, November 2019 and May 2020. It seems pretty frozen to me.
>
> In any case, I think it's clear from the experience with Android that
> the acceptance policy cannot succeed. The only thing that such a policy
> guarantees, is that vendors will use more out-of-tree code. Keeping a
> fully-developed feature out-of-tree for years is not how Linux is run.
>
> > I'm not sure where exactly the line for real hardware is, but for
> > something like this it would at least involve some chip that is
> > widely available and needs the H extension to be useful
>
> Anup said that "quite a few people have already implemented RISC-V
> H-extension in hardware as well and KVM RISC-V works on real HW as well".
> Those people would benefit from having KVM in the Linus tree.

Great, but is this really true? If so, what hardware has this? I have
a new RISC-V device right here next to me, what would I need to do to
see if this is supported in it or not?

If this isn't in any hardware outside of internal-to-company
prototypes, then let's wait until it really is in a device that people
can test this code on.

What's the rush to get this merged now if no one can use it?

thanks,

greg k-h

2021-05-21 20:26:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Fri, May 21, 2021 at 11:08:15AM -0700, Palmer Dabbelt wrote:
> On Fri, 21 May 2021 10:47:51 PDT (-0700), Greg KH wrote:
> > On Fri, May 21, 2021 at 07:21:12PM +0200, Paolo Bonzini wrote:
> > > On 21/05/21 19:13, Palmer Dabbelt wrote:
> > > > >
> > > >
> > > > I don't view this code as being in a state where it can be
> > > > maintained, at least to the standards we generally set within the
> > > > kernel. The ISA extension in question is still subject to change, it
> > > > says so right at the top of the H extension <https://github.com/riscv/riscv-isa-manual/blob/master/src/hypervisor.tex#L4>
> > > >
> > > > {\bf Warning! This draft specification may change before being
> > > > accepted as standard by the RISC-V Foundation.}
> > >
> > > To give a complete picture, the last three relevant changes have been in
> > > August 2019, November 2019 and May 2020. It seems pretty frozen to me.
> > >
> > > In any case, I think it's clear from the experience with Android that
> > > the acceptance policy cannot succeed. The only thing that such a policy
> > > guarantees, is that vendors will use more out-of-tree code. Keeping a
> > > fully-developed feature out-of-tree for years is not how Linux is run.
> > >
> > > > I'm not sure where exactly the line for real hardware is, but for
> > > > something like this it would at least involve some chip that is
> > > > widely available and needs the H extension to be useful
> > >
> > > Anup said that "quite a few people have already implemented RISC-V
> > > H-extension in hardware as well and KVM RISC-V works on real HW as well".
> > > Those people would benefit from having KVM in the Linus tree.
> >
> > Great, but is this really true? If so, what hardware has this? I have
> > a new RISC-V device right here next to me, what would I need to do to
> > see if this is supported in it or not?
>
> You can probe the misa register, it should have the H bit set if it supports
> the H extension.

To let everyone know, based on our private chat we had off-list, no, the
device I have does not support this extension, so unless someone can
point me at real hardware, I don't think this code needs to be
considered for merging anywhere just yet.

thanks,

greg k-h

2021-05-21 20:26:41

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Fri, 21 May 2021 10:47:51 PDT (-0700), Greg KH wrote:
> On Fri, May 21, 2021 at 07:21:12PM +0200, Paolo Bonzini wrote:
>> On 21/05/21 19:13, Palmer Dabbelt wrote:
>> > >
>> >
>> > I don't view this code as being in a state where it can be
>> > maintained, at least to the standards we generally set within the
>> > kernel. The ISA extension in question is still subject to change, it
>> > says so right at the top of the H extension <https://github.com/riscv/riscv-isa-manual/blob/master/src/hypervisor.tex#L4>
>> >
>> > {\bf Warning! This draft specification may change before being
>> > accepted as standard by the RISC-V Foundation.}
>>
>> To give a complete picture, the last three relevant changes have been in
>> August 2019, November 2019 and May 2020. It seems pretty frozen to me.
>>
>> In any case, I think it's clear from the experience with Android that
>> the acceptance policy cannot succeed. The only thing that such a policy
>> guarantees, is that vendors will use more out-of-tree code. Keeping a
>> fully-developed feature out-of-tree for years is not how Linux is run.
>>
>> > I'm not sure where exactly the line for real hardware is, but for
>> > something like this it would at least involve some chip that is
>> > widely available and needs the H extension to be useful
>>
>> Anup said that "quite a few people have already implemented RISC-V
>> H-extension in hardware as well and KVM RISC-V works on real HW as well".
>> Those people would benefit from having KVM in the Linus tree.
>
> Great, but is this really true? If so, what hardware has this? I have
> a new RISC-V device right here next to me, what would I need to do to
> see if this is supported in it or not?

You can probe the misa register, it should have the H bit set if it
supports the H extension.

> If this isn't in any hardware outside of internal-to-company
> prototypes, then let's wait until it really is in a
> device that people can test this code on.
>
> What's the rush to get this merged now if no one can use it?
>
> thanks,
>
> greg k-h

2021-05-21 20:30:51

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 21/05/21 19:47, Greg KH wrote:
> If this isn't in any hardware outside of internal-to-company
> prototypes, then let's wait until it really is in a
> device that people can test this code on.
>
> What's the rush to get this merged now if no one can use it?

There is not just hardware, there are simulators and emulators too (you
can use QEMU to test it for example), and it's not exactly a rush since
it's basically been ready for 2 years and has hardly seen any code
changes since v13, which was based on Linux 5.9.

Not having the code upstream is hindering further development so that
RISC-V KVM can be feature complete when hardware does come out. Missing
features and optimizations could be added on top, but they are harder to
review if they are integrated in a relatively large series instead of
being done incrementally. Not having the header files in Linus's tree
makes it harder to merge RISC-V KVM support in userspace (userspace is
shielded anyway from any future changes to the hypervisor specification,
so there's no risk of breaking the ABI).

At some point one has to say enough is enough; for me, that is after one
year with no changes to the spec and, especially, no deadline in sight
for freezing it. The last 5 versions of the patch set were just
adapting to changes in the generic KVM code. If the code is good, I
don't see why the onus of doing those changes should be on Anup, rather
than being shared amongst all KVM developers as is the case for all the
other architectures.

Paolo

2021-05-24 07:11:35

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

Thx Anup,

Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)

I'm following your KVM patchset and it's great work for the riscv
H-extension. I think hardware companies hope Linux KVM is ready
before the real chips are, so that we can ensure the hardware can run
mainline Linux.

Good luck!

On Wed, May 19, 2021 at 11:36 AM Anup Patel <[email protected]> wrote:
>
> From: Anup Patel <[email protected]>
>
> This series adds initial KVM RISC-V support. Currently, we are able to boot
> Linux on RV64/RV32 Guest with multiple VCPUs.
>
> Key aspects of KVM RISC-V added by this series are:
> 1. No RISC-V specific KVM IOCTL
> 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
> 3. Both RV64 and RV32 host supported
> 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
> 5. KVM ONE_REG interface for VCPU register access from user-space
> 6. PLIC emulation is done in user-space
> 7. Timer and IPI emulation is done in-kernel
> 8. Both Sv39x4 and Sv48x4 supported for RV64 host
> 9. MMU notifiers supported
> 10. Generic dirtylog supported
> 11. FP lazy save/restore supported
> 12. SBI v0.1 emulation for KVM Guest available
> 13. Forward unhandled SBI calls to KVM userspace
> 14. Hugepage support for Guest/VM
> 15. IOEVENTFD support for Vhost
>
> Here's a brief TODO list which we will work upon after this series:
> 1. SBI v0.2 emulation in-kernel
> 2. SBI v0.2 hart state management emulation in-kernel
> 3. In-kernel PLIC emulation
> 4. ..... and more .....
>
> This series can be found in riscv_kvm_v18 branch at:
> https://github.com/avpatel/linux.git
>
> Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
> at: https://github.com/avpatel/kvmtool.git
>
> The QEMU RISC-V hypervisor emulation is done by Alistair and is available
> in master branch at: https://git.qemu.org/git/qemu.git
>
> To play around with KVM RISC-V, refer KVM RISC-V wiki at:
> https://github.com/kvm-riscv/howto/wiki
> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
>
> Changes since v17:
> - Rebased on Linux-5.13-rc2
> - Moved to new KVM MMU notifier APIs
> - Removed redundant kvm_arch_vcpu_uninit()
> - Moved KVM RISC-V sources to drivers/staging for compliance with
> Linux RISC-V patch acceptance policy
>
> Changes since v16:
> - Rebased on Linux-5.12-rc5
> - Remove redundant kvm_arch_create_memslot(), kvm_arch_vcpu_setup(),
> kvm_arch_vcpu_init(), kvm_arch_has_vcpu_debugfs(), and
> kvm_arch_create_vcpu_debugfs() from PATCH5
> - Make stage2_wp_memory_region() and stage2_ioremap() as static
> in PATCH13
>
> Changes since v15:
> - Rebased on Linux-5.11-rc3
> - Fixed kvm_stage2_map() to use gfn_to_pfn_prot() for determining
> writeability of a host pfn.
> - Use "__u64" in-place of "u64" and "__u32" in-place of "u32" for
> uapi/asm/kvm.h
>
> Changes since v14:
> - Rebased on Linux-5.10-rc3
> - Fixed Stage2 (G-stage) PDG allocation to ensure it is 16KB aligned
>
> Changes since v13:
> - Rebased on Linux-5.9-rc3
> - Fixed kvm_riscv_vcpu_set_reg_csr() for SIP update in PATCH5
> - Fixed instruction length computation in PATCH7
> - Added ioeventfd support in PATCH7
> - Ensure HSTATUS.SPVP is set to correct value before using HLV/HSV
> instructions in PATCH7
> - Fixed stage2_map_page() to set PTE 'A' and 'D' bits correctly
> in PATCH10
> - Added stage2 dirty page logging in PATCH10
> - Allow KVM user-space to SET/GET SCOUNTER CSR in PATCH5
> - Save/restore SCOUNTEREN in PATCH6
> - Reduced quite a few instructions for __kvm_riscv_switch_to() by
> using CSR swap instruction in PATCH6
> - Detect and use Sv48x4 when available in PATCH10
>
> Changes since v12:
> - Rebased patches on Linux-5.8-rc4
> - By default enable all counters in HCOUNTEREN
> - RISC-V H-Extension v0.6.1 spec support
>
> Changes since v11:
> - Rebased patches on Linux-5.7-rc3
> - Fixed typo in typecast of stage2_map_size define
> - Introduced struct kvm_cpu_trap to represent trap details and
> use it as function parameter wherever applicable
> - Pass memslot to kvm_riscv_stage2_map() for supporting dirty page
> logging in future
> - RISC-V H-Extension v0.6 spec support
> - Send-out first three patches as separate series so that it can
> be taken by Palmer for Linux RISC-V
>
> Changes since v10:
> - Rebased patches on Linux-5.6-rc5
> - Reduce RISCV_ISA_EXT_MAX from 256 to 64
> - Separate PATCH for removing N-extension related defines
> - Added comments as requested by Palmer
> - Fixed HIDELEG CSR programming
>
> Changes since v9:
> - Rebased patches on Linux-5.5-rc3
> - Squash PATCH19 and PATCH20 into PATCH5
> - Squash PATCH18 into PATCH11
> - Squash PATCH17 into PATCH16
> - Added ONE_REG interface for VCPU timer in PATCH13
> - Use HTIMEDELTA for VCPU timer in PATCH13
> - Updated KVM RISC-V mailing list in MAINTAINERS entry
> - Update KVM kconfig option to depend on RISCV_SBI and MMU
> - Check for SBI v0.2 and SBI v0.2 RFENCE extension at boot-time
> - Use SBI v0.2 RFENCE extension in VMID implementation
> - Use SBI v0.2 RFENCE extension in Stage2 MMU implementation
> - Use SBI v0.2 RFENCE extension in SBI implementation
> - Moved to RISC-V Hypervisor v0.5 draft spec
> - Updated Documentation/virt/kvm/api.txt for timer ONE_REG interface
>
> Changes since v8:
> - Rebased series on Linux-5.4-rc3 and Atish's SBI v0.2 patches
> - Use HRTIMER_MODE_REL instead of HRTIMER_MODE_ABS in timer emulation
> - Fixed kvm_riscv_stage2_map() to handle hugepages
> - Added patch to forward unhandled SBI calls to user-space
> - Added patch for iterative/recursive stage2 page table programming
> - Added patch to remove per-CPU vsip_shadow variable
> - Added patch to fix race-condition in kvm_riscv_vcpu_sync_interrupts()
>
> Changes since v7:
> - Rebased series on Linux-5.4-rc1 and Atish's SBI v0.2 patches
> - Removed PATCH1, PATCH3, and PATCH20 because these already merged
> - Use kernel doc style comments for ISA bitmap functions
> - Don't parse X, Y, and Z extension in riscv_fill_hwcap() because they will
> be added in the future
> - Mark KVM RISC-V kconfig option as EXPERIMENTAL
> - Typo fix in commit description of PATCH6 of v7 series
> - Use separate structs for CORE and CSR registers of ONE_REG interface
> - Explicitly include asm/sbi.h in kvm/vcpu_sbi.c
> - Removed implicit switch-case fall-through in kvm_riscv_vcpu_exit()
> - No need to set VSSTATUS.MXR bit in kvm_riscv_vcpu_unpriv_read()
> - Removed register for instruction length in kvm_riscv_vcpu_unpriv_read()
> - Added defines for checking/decoding instruction length
> - Added separate patch to forward unhandled SBI calls to userspace tool
>
> Changes since v6:
> - Rebased patches on Linux-5.3-rc7
> - Added "return_handled" in struct kvm_mmio_decode to ensure that
> kvm_riscv_vcpu_mmio_return() updates SEPC only once
> - Removed trap_stval parameter from kvm_riscv_vcpu_unpriv_read()
> - Updated git repo URL in MAINTAINERS entry
>
> Changes since v5:
> - Renamed KVM_REG_RISCV_CONFIG_TIMEBASE register to
> KVM_REG_RISCV_CONFIG_TBFREQ register in ONE_REG interface
> - Update SEPC in kvm_riscv_vcpu_mmio_return() for MMIO exits
> - Use switch case instead of illegal instruction opcode table for simplicity
> - Improve comments in stage2_remote_tlb_flush() for a potential remote TLB
> flush optimization
> - Handle all unsupported SBI calls in default case of
> kvm_riscv_vcpu_sbi_ecall() function
> - Fixed kvm_riscv_vcpu_sync_interrupts() for software interrupts
> - Improved unprivileged reads to handle traps due to Guest stage1 page table
> - Added separate patch to document RISC-V specific things in
> Documentation/virt/kvm/api.txt
>
> Changes since v4:
> - Rebased patches on Linux-5.3-rc5
> - Added Paolo's Acked-by and Reviewed-by
> - Updated mailing list in MAINTAINERS entry
>
> Changes since v3:
> - Moved patch for ISA bitmap from KVM prep series to this series
> - Make vsip_shadow as run-time percpu variable instead of compile-time
> - Flush Guest TLBs on all Host CPUs whenever we run-out of VMIDs
>
> Changes since v2:
> - Removed references of KVM_REQ_IRQ_PENDING from all patches
> - Use kvm->srcu within in-kernel KVM run loop
> - Added percpu vsip_shadow to track last value programmed in VSIP CSR
> - Added comments about irqs_pending and irqs_pending_mask
> - Used kvm_arch_vcpu_runnable() in-place-of kvm_riscv_vcpu_has_interrupt()
> in system_opcode_insn()
> - Removed unwanted smp_wmb() in kvm_riscv_stage2_vmid_update()
> - Use kvm_flush_remote_tlbs() in kvm_riscv_stage2_vmid_update()
> - Use READ_ONCE() in kvm_riscv_stage2_update_hgatp() for vmid
>
> Changes since v1:
> - Fixed compile errors in building KVM RISC-V as module
> - Removed unused kvm_riscv_halt_guest() and kvm_riscv_resume_guest()
> - Set KVM_CAP_SYNC_MMU capability only after MMU notifiers are implemented
> - Made vmid_version as unsigned long instead of atomic
> - Renamed KVM_REQ_UPDATE_PGTBL to KVM_REQ_UPDATE_HGATP
> - Renamed kvm_riscv_stage2_update_pgtbl() to kvm_riscv_stage2_update_hgatp()
> - Configure HIDELEG and HEDELEG in kvm_arch_hardware_enable()
> - Updated ONE_REG interface for CSR access to user-space
> - Removed irqs_pending_lock and use atomic bitops instead
> - Added separate patch for FP ONE_REG interface
> - Added separate patch for updating MAINTAINERS file
>
> Anup Patel (14):
> RISC-V: Add hypervisor extension related CSR defines
> RISC-V: Add initial skeletal KVM support
> RISC-V: KVM: Implement VCPU create, init and destroy functions
> RISC-V: KVM: Implement VCPU interrupts and requests handling
> RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls
> RISC-V: KVM: Implement VCPU world-switch
> RISC-V: KVM: Handle MMIO exits for VCPU
> RISC-V: KVM: Handle WFI exits for VCPU
> RISC-V: KVM: Implement VMID allocator
> RISC-V: KVM: Implement stage2 page table programming
> RISC-V: KVM: Implement MMU notifiers
> RISC-V: KVM: Document RISC-V specific parts of KVM API
> RISC-V: KVM: Move sources to drivers/staging directory
> RISC-V: KVM: Add MAINTAINERS entry
>
> Atish Patra (4):
> RISC-V: KVM: Add timer functionality
> RISC-V: KVM: FP lazy save/restore
> RISC-V: KVM: Implement ONE REG interface for FP registers
> RISC-V: KVM: Add SBI v0.1 support
>
> Documentation/virt/kvm/api.rst | 193 +++-
> MAINTAINERS | 11 +
> arch/riscv/Kconfig | 1 +
> arch/riscv/Makefile | 1 +
> arch/riscv/include/uapi/asm/kvm.h | 128 +++
> drivers/clocksource/timer-riscv.c | 9 +
> drivers/staging/riscv/kvm/Kconfig | 36 +
> drivers/staging/riscv/kvm/Makefile | 23 +
> drivers/staging/riscv/kvm/asm/kvm_csr.h | 105 ++
> drivers/staging/riscv/kvm/asm/kvm_host.h | 271 +++++
> drivers/staging/riscv/kvm/asm/kvm_types.h | 7 +
> .../staging/riscv/kvm/asm/kvm_vcpu_timer.h | 44 +
> drivers/staging/riscv/kvm/main.c | 118 +++
> drivers/staging/riscv/kvm/mmu.c | 802 ++++++++++++++
> drivers/staging/riscv/kvm/riscv_offsets.c | 170 +++
> drivers/staging/riscv/kvm/tlb.S | 74 ++
> drivers/staging/riscv/kvm/vcpu.c | 997 ++++++++++++++++++
> drivers/staging/riscv/kvm/vcpu_exit.c | 701 ++++++++++++
> drivers/staging/riscv/kvm/vcpu_sbi.c | 173 +++
> drivers/staging/riscv/kvm/vcpu_switch.S | 401 +++++++
> drivers/staging/riscv/kvm/vcpu_timer.c | 225 ++++
> drivers/staging/riscv/kvm/vm.c | 81 ++
> drivers/staging/riscv/kvm/vmid.c | 120 +++
> include/clocksource/timer-riscv.h | 16 +
> include/uapi/linux/kvm.h | 8 +
> 25 files changed, 4706 insertions(+), 9 deletions(-)
> create mode 100644 arch/riscv/include/uapi/asm/kvm.h
> create mode 100644 drivers/staging/riscv/kvm/Kconfig
> create mode 100644 drivers/staging/riscv/kvm/Makefile
> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_csr.h
> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_host.h
> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_types.h
> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_vcpu_timer.h
> create mode 100644 drivers/staging/riscv/kvm/main.c
> create mode 100644 drivers/staging/riscv/kvm/mmu.c
> create mode 100644 drivers/staging/riscv/kvm/riscv_offsets.c
> create mode 100644 drivers/staging/riscv/kvm/tlb.S
> create mode 100644 drivers/staging/riscv/kvm/vcpu.c
> create mode 100644 drivers/staging/riscv/kvm/vcpu_exit.c
> create mode 100644 drivers/staging/riscv/kvm/vcpu_sbi.c
> create mode 100644 drivers/staging/riscv/kvm/vcpu_switch.S
> create mode 100644 drivers/staging/riscv/kvm/vcpu_timer.c
> create mode 100644 drivers/staging/riscv/kvm/vm.c
> create mode 100644 drivers/staging/riscv/kvm/vmid.c
> create mode 100644 include/clocksource/timer-riscv.h
>
> --
> 2.25.1
>


--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/

2021-05-24 22:59:09

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Mon, 24 May 2021 00:09:45 PDT (-0700), [email protected] wrote:
> Thx Anup,
>
> Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)
>
> I'm following your KVM patchset and it's great work for the riscv
> H-extension. I think hardware companies want Linux KVM to be ready
> before the real chip, so that we can be sure the hardware can run
> mainline Linux.

I understand that it would be wonderful for hardware vendors to have a
guarantee that their hardware will be supported by the software
ecosystem, but that's not what we're talking about here. Specifically,
the proposal for this code is to track the latest draft extension, which
would leave vendors who implement the current draft out in the cold
should something change. In practice that is the only way to
move forward with any draft extension that doesn't have hardware
available, as the software RISC-V implementations rapidly deprecate
draft extensions and without a way to test our code it is destined to
bit rot.

If vendors want to make sure their hardware is supported then the best
way to do that is to make sure specifications get ratified in a timely
fashion that describe the behavior required from their products. That
way we have an agreed upon interface that vendors can implement and
software can rely on. I understand that a lot of people are frustrated
with the pace of that process when it comes to the H extension, but
circumventing that process doesn't fix the fundamental problem. If
there really are products out there that people can't build because the
H extension isn't upstream then we need to have a serious discussion
about those, but without something specific to discuss this is just
going to devolve into speculation which isn't a good use of time.

I can't find any hardware that implements the draft H extension, at
least via poking around on Google and in my email. I'm very hesitant to
talk about private vendor roadmaps in public, as that's getting way too
close to my day job, but I've yet to have a vendor raise this as an
issue to me privately and I do try my best to make sure to talk to the
RISC-V hardware vendors whenever possible (though I try to stick to
public roadmaps there, to avoid issues around discussions like this and
conflicts with work). Anup is clearly in a much more privileged
position than I am here, given that he has real hardware and is able to
allude to vendor roadmaps that I can't find in public, but until we can
all get on the same page about that it's going to be difficult to have a
reasonable discussion -- if we all have different information we're
naturally going to arrive at different conclusions, which IMO is why
this argument just keeps coming up. It's totally possible I'm just
missing something here, in which case I'd love to be corrected as we can
be having a very different discussion.

I certainly hope that vendors understand that we're willing to work with
them when it comes to making the software run on their hardware. I've
always tried to be quite explicit that's our goal here, both by just
saying so and by demonstrating that we're willing to take code that
exhibits behavior not specified by the specifications but that is
necessary to make real hardware work. There's always a balance here and
I can't commit to making every chip with a RISC-V logo on it run Linux
well, as there will always be some implementations that are just
impossible to run real code on, but I'm always willing to do whatever I
can to try to make things work.

If anyone has concrete concerns about RISC-V hardware not being
supported by Linux then I'm happy to have a discussion about that.
Having a discussion in public is always best, as then everyone can be on
the same page, but as far as I know we're doing a good job supporting
the publicly available hardware -- not saying we're perfect, but given
the size of the RISC-V user base and how new much of the hardware is I
think we're well above average when it comes to upstream support of real
hardware. I have a feeling anyone's worries would be about unreleased
hardware, in which case I can understand it's difficult to have concrete
discussions in public. I'm always happy to at least make an attempt to
have private discussions about these (private discussion are tricky,
though, so I can't promise I can always participate), and while I don't
think those discussions should meaningfully sway the kernel's policies
one way or the other it could at least help alleviate any acute concerns
that vendors have. We've gotten to the point where some pretty serious
accusations are starting to get thrown around, and that sort of thing
really doesn't benefit anyone so I'm willing to do whatever I can to
help fix that.

IMO we're just trying to follow the standard Linux development policies
here, where the focus is on making real hardware work in a way that can
be sustainably maintained so we don't break users. If anything I think
we're a notch more liberal WRT the code we accept than standard with the
current policy, as accepting anything in a frozen extension doesn't even
require a commitment from a hardware vendor WRT implementing the code.
That obviously opens us up to behavior differences between the hardware
and the specification, which have historically been retrofitted back to
the specifications, but I'm willing to take on the extra work as it
helps lend weight to the specification development process.

If I'm just missing something here and there is publicly available
hardware that implements the H extension then I'd be happy to have that
pointed out and very much change the tune of this discussion, but until
hardware shows up or the ISA is frozen I just don't see any way to
maintain this code up to the standards generally set by Linux, or
specifically by arch/riscv, and therefore cannot support merging it.

>
> Good luck!
>
> On Wed, May 19, 2021 at 11:36 AM Anup Patel <[email protected]> wrote:
>>
>> [... full v18 cover letter and diffstat quoted in the original; snipped ...]
>>

2021-05-24 23:12:13

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 2021/05/25 7:57, Palmer Dabbelt wrote:
> On Mon, 24 May 2021 00:09:45 PDT (-0700), [email protected] wrote:
>> Thx Anup,
>>
>> Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)
>>
>> I'm following your KVM patchset and it's great work for the riscv
>> H-extension. I think hardware companies want Linux KVM to be ready
>> before the real chip, so that we can be sure the hardware can run
>> mainline Linux.
>
> I understand that it would be wonderful for hardware vendors to have a
> guarantee that their hardware will be supported by the software
> ecosystem, but that's not what we're talking about here. Specifically,
> the proposal for this code is to track the latest draft extension, which
> would leave vendors who implement the current draft out in the cold
> should something change. In practice that is the only way to
> move forward with any draft extension that doesn't have hardware
> available, as the software RISC-V implementations rapidly deprecate
> draft extensions and without a way to test our code it is destined to
> bit rot.

To facilitate the process of implementing, and updating, against draft
specifications, I proposed adding arch/riscv/staging. This would be the
place to put code based on drafts. Some simple rules can be put in place:
1) The code and eventual ABI may change at any time, with no guarantee of
backward compatibility
2) Once the specifications are frozen, the code is moved out of staging to
its final location
3) The code may be removed at any time if the specification proposal is
dropped, or for any other valid reason (I can't think of another right now)
4) ...

This way, the implementation process would be greatly facilitated, and
interactions between different extensions could be explored much more easily.

Thoughts?



>
> If vendors want to make sure their hardware is supported then the best
> way to do that is to make sure specifications get ratified in a timely
> fashion that describe the behavior required from their products. That
> way we have an agreed upon interface that vendors can implement and
> software can rely on. I understand that a lot of people are frustrated
> with the pace of that process when it comes to the H extension, but
> circumventing that process doesn't fix the fundamental problem. If
> there really are products out there that people can't build because the
> H extension isn't upstream then we need to have a serious discussion
> about those, but without something specific to discuss this is just
> going to devolve into speculation which isn't a good use of time.
>
> I can't find any hardware that implements the draft H extension, at
> least via poking around on Google and in my email. I'm very hesitant to
> talk about private vendor roadmaps in public, as that's getting way too
> close to my day job, but I've yet to have a vendor raise this as an
> issue to me privately and I do try my best to make sure to talk to the
> RISC-V hardware vendors whenever possible (though I try to stick to
> public roadmaps there, to avoid issues around discussions like this and
> conflicts with work). Anup is clearly in a much more privileged
> position than I am here, given that he has real hardware and is able to
> allude to vendor roadmaps that I can't find in public, but until we can
> all get on the same page about that it's going to be difficult to have a
> reasonable discussion -- if we all have different information we're
> naturally going to arrive at different conclusions, which IMO is why
> this argument just keeps coming up. It's totally possible I'm just
> missing something here, in which case I'd love to be corrected as we can
> be having a very different discussion.
>
> I certainly hope that vendors understand that we're willing to work with
> them when it comes to making the software run on their hardware. I've
> always tried to be quite explicit that's our goal here, both by just
> saying so and by demonstrating that we're willing to take code that
> exhibits behavior not specified by the specifications but that is
> necessary to make real hardware work. There's always a balance here and
> I can't commit to making every chip with a RISC-V logo on it run Linux
> well, as there will always be some implementations that are just
> impossible to run real code on, but I'm always willing to do whatever I
> can to try to make things work.
>
> If anyone has concrete concerns about RISC-V hardware not being
> supported by Linux then I'm happy to have a discussion about that.
> Having a discussion in public is always best, as then everyone can be on
> the same page, but as far as I know we're doing a good job supporting
> the publicly available hardware -- not saying we're perfect, but given
> the size of the RISC-V user base and how new much of the hardware is I
> think we're well above average when it comes to upstream support of real
> hardware. I have a feeling anyone's worries would be about unreleased
> hardware, in which case I can understand it's difficult to have concrete
> discussions in public. I'm always happy to at least make an attempt to
> have private discussions about these (private discussion are tricky,
> though, so I can't promise I can always participate), and while I don't
> think those discussions should meaningfully sway the kernel's policies
> one way or the other it could at least help alleviate any acute concerns
> that vendors have. We've gotten to the point where some pretty serious
> accusations are starting to get thrown around, and that sort of thing
> really doesn't benefit anyone so I'm willing to do whatever I can to
> help fix that.
>
> IMO we're just trying to follow the standard Linux development policies
> here, where the focus is on making real hardware work in a way that can
> be sustainably maintained so we don't break users. If anything I think
> we're a notch more liberal WRT the code we accept than standard with the
> current policy, as accepting anything in a frozen extension doesn't even
> require a commitment from a hardware vendor WRT implementing the code.
> That obviously opens us up to behavior differences between the hardware
> and the specification, which have historically been retrofitted back to
> the specifications, but I'm willing to take on the extra work as it
> helps lend weight to the specification development process.
>
> If I'm just missing something here and there is publicly available
> hardware that implements the H extension then I'd be happy to have that
> pointed out and very much change the tune of this discussion, but until
> hardware shows up or the ISA is frozen I just don't see any way to
> maintain this code up the standards generally set by Linux or
> specifically by arch/riscv and therefor cannot support merging it.
>
>>
>> Good luck!
>>
>> On Wed, May 19, 2021 at 11:36 AM Anup Patel <[email protected]> wrote:
>>>
>>> From: Anup Patel <[email protected]>
>>>
>>> This series adds initial KVM RISC-V support. Currently, we are able to boot
>>> Linux on RV64/RV32 Guest with multiple VCPUs.
>>>
>>> Key aspects of KVM RISC-V added by this series are:
>>> 1. No RISC-V specific KVM IOCTL
>>> 2. Minimal possible KVM world-switch which touches only GPRs and few CSRs
>>> 3. Both RV64 and RV32 host supported
>>> 4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
>>> 5. KVM ONE_REG interface for VCPU register access from user-space
>>> 6. PLIC emulation is done in user-space
>>> 7. Timer and IPI emuation is done in-kernel
>>> 8. Both Sv39x4 and Sv48x4 supported for RV64 host
>>> 9. MMU notifiers supported
>>> 10. Generic dirtylog supported
>>> 11. FP lazy save/restore supported
>>> 12. SBI v0.1 emulation for KVM Guest available
>>> 13. Forward unhandled SBI calls to KVM userspace
>>> 14. Hugepage support for Guest/VM
>>> 15. IOEVENTFD support for Vhost
>>>
>>> Here's a brief TODO list which we will work upon after this series:
>>> 1. SBI v0.2 emulation in-kernel
>>> 2. SBI v0.2 hart state management emulation in-kernel
>>> 3. In-kernel PLIC emulation
>>> 4. ..... and more .....
>>>
>>> This series can be found in riscv_kvm_v18 branch at:
>>> https://github.com/avpatel/linux.git
>>>
>>> Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v7 branch
>>> at: https://github.com/avpatel/kvmtool.git
>>>
>>> The QEMU RISC-V hypervisor emulation is done by Alistair and is available
>>> in master branch at: https://git.qemu.org/git/qemu.git
>>>
>>> To play around with KVM RISC-V, refer to the KVM RISC-V wiki at:
>>> https://github.com/kvm-riscv/howto/wiki
>>> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
>>> https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
>>>
>>> Changes since v17:
>>> - Rebased on Linux-5.13-rc2
>>> - Moved to new KVM MMU notifier APIs
>>> - Removed redundant kvm_arch_vcpu_uninit()
>>> - Moved KVM RISC-V sources to drivers/staging for compliance with
>>> Linux RISC-V patch acceptance policy
>>>
>>> Changes since v16:
>>> - Rebased on Linux-5.12-rc5
>>> - Remove redundant kvm_arch_create_memslot(), kvm_arch_vcpu_setup(),
>>> kvm_arch_vcpu_init(), kvm_arch_has_vcpu_debugfs(), and
>>> kvm_arch_create_vcpu_debugfs() from PATCH5
>>> - Make stage2_wp_memory_region() and stage2_ioremap() as static
>>> in PATCH13
>>>
>>> Changes since v15:
>>> - Rebased on Linux-5.11-rc3
>>> - Fixed kvm_stage2_map() to use gfn_to_pfn_prot() for determining
>>> writeability of a host pfn.
>>> - Use "__u64" in-place of "u64" and "__u32" in-place of "u32" for
>>> uapi/asm/kvm.h
>>>
>>> Changes since v14:
>>> - Rebased on Linux-5.10-rc3
>>> - Fixed Stage2 (G-stage) PGD allocation to ensure it is 16KB aligned
>>>
>>> Changes since v13:
>>> - Rebased on Linux-5.9-rc3
>>> - Fixed kvm_riscv_vcpu_set_reg_csr() for SIP update in PATCH5
>>> - Fixed instruction length computation in PATCH7
>>> - Added ioeventfd support in PATCH7
>>> - Ensure HSTATUS.SPVP is set to correct value before using HLV/HSV
>>> instructions in PATCH7
>>> - Fixed stage2_map_page() to set PTE 'A' and 'D' bits correctly
>>> in PATCH10
>>> - Added stage2 dirty page logging in PATCH10
>>> - Allow KVM user-space to SET/GET SCOUNTEREN CSR in PATCH5
>>> - Save/restore SCOUNTEREN in PATCH6
>>> - Reduced quite a few instructions for __kvm_riscv_switch_to() by
>>> using CSR swap instruction in PATCH6
>>> - Detect and use Sv48x4 when available in PATCH10
>>>
>>> Changes since v12:
>>> - Rebased patches on Linux-5.8-rc4
>>> - By default enable all counters in HCOUNTEREN
>>> - RISC-V H-Extension v0.6.1 spec support
>>>
>>> Changes since v11:
>>> - Rebased patches on Linux-5.7-rc3
>>> - Fixed typo in typecast of stage2_map_size define
>>> - Introduced struct kvm_cpu_trap to represent trap details and
>>> use it as function parameter wherever applicable
>>> - Pass memslot to kvm_riscv_stage2_map() for supporting dirty page
>>> logging in future
>>> - RISC-V H-Extension v0.6 spec support
>>> - Send-out first three patches as separate series so that it can
>>> be taken by Palmer for Linux RISC-V
>>>
>>> Changes since v10:
>>> - Rebased patches on Linux-5.6-rc5
>>> - Reduce RISCV_ISA_EXT_MAX from 256 to 64
>>> - Separate PATCH for removing N-extension related defines
>>> - Added comments as requested by Palmer
>>> - Fixed HIDELEG CSR programming
>>>
>>> Changes since v9:
>>> - Rebased patches on Linux-5.5-rc3
>>> - Squash PATCH19 and PATCH20 into PATCH5
>>> - Squash PATCH18 into PATCH11
>>> - Squash PATCH17 into PATCH16
>>> - Added ONE_REG interface for VCPU timer in PATCH13
>>> - Use HTIMEDELTA for VCPU timer in PATCH13
>>> - Updated KVM RISC-V mailing list in MAINTAINERS entry
>>> - Update KVM kconfig option to depend on RISCV_SBI and MMU
>>> - Check for SBI v0.2 and SBI v0.2 RFENCE extension at boot-time
>>> - Use SBI v0.2 RFENCE extension in VMID implementation
>>> - Use SBI v0.2 RFENCE extension in Stage2 MMU implementation
>>> - Use SBI v0.2 RFENCE extension in SBI implementation
>>> - Moved to RISC-V Hypervisor v0.5 draft spec
>>> - Updated Documentation/virt/kvm/api.txt for timer ONE_REG interface
>>>
>>> Changes since v8:
>>> - Rebased series on Linux-5.4-rc3 and Atish's SBI v0.2 patches
>>> - Use HRTIMER_MODE_REL instead of HRTIMER_MODE_ABS in timer emulation
>>> - Fixed kvm_riscv_stage2_map() to handle hugepages
>>> - Added patch to forward unhandled SBI calls to user-space
>>> - Added patch for iterative/recursive stage2 page table programming
>>> - Added patch to remove per-CPU vsip_shadow variable
>>> - Added patch to fix race-condition in kvm_riscv_vcpu_sync_interrupts()
>>>
>>> Changes since v7:
>>> - Rebased series on Linux-5.4-rc1 and Atish's SBI v0.2 patches
>>> - Removed PATCH1, PATCH3, and PATCH20 because these are already merged
>>> - Use kernel doc style comments for ISA bitmap functions
>>> - Don't parse X, Y, and Z extension in riscv_fill_hwcap() because it will
>>> be added in the future
>>> - Mark KVM RISC-V kconfig option as EXPERIMENTAL
>>> - Typo fix in commit description of PATCH6 of v7 series
>>> - Use separate structs for CORE and CSR registers of ONE_REG interface
>>> - Explicitly include asm/sbi.h in kvm/vcpu_sbi.c
>>> - Removed implicit switch-case fall-through in kvm_riscv_vcpu_exit()
>>> - No need to set VSSTATUS.MXR bit in kvm_riscv_vcpu_unpriv_read()
>>> - Removed register for instruction length in kvm_riscv_vcpu_unpriv_read()
>>> - Added defines for checking/decoding instruction length
>>> - Added separate patch to forward unhandled SBI calls to userspace tool
>>>
>>> Changes since v6:
>>> - Rebased patches on Linux-5.3-rc7
>>> - Added "return_handled" in struct kvm_mmio_decode to ensure that
>>> kvm_riscv_vcpu_mmio_return() updates SEPC only once
>>> - Removed trap_stval parameter from kvm_riscv_vcpu_unpriv_read()
>>> - Updated git repo URL in MAINTAINERS entry
>>>
>>> Changes since v5:
>>> - Renamed KVM_REG_RISCV_CONFIG_TIMEBASE register to
>>> KVM_REG_RISCV_CONFIG_TBFREQ register in ONE_REG interface
>>> - Update SEPC in kvm_riscv_vcpu_mmio_return() for MMIO exits
>>> - Use switch case instead of illegal instruction opcode table for simplicity
>>> - Improve comments in stage2_remote_tlb_flush() for a potential remote TLB
>>> flush optimization
>>> - Handle all unsupported SBI calls in default case of
>>> kvm_riscv_vcpu_sbi_ecall() function
>>> - Fixed kvm_riscv_vcpu_sync_interrupts() for software interrupts
>>> - Improved unprivileged reads to handle traps due to Guest stage1 page table
>>> - Added separate patch to document RISC-V specific things in
>>> Documentation/virt/kvm/api.txt
>>>
>>> Changes since v4:
>>> - Rebased patches on Linux-5.3-rc5
>>> - Added Paolo's Acked-by and Reviewed-by
>>> - Updated mailing list in MAINTAINERS entry
>>>
>>> Changes since v3:
>>> - Moved patch for ISA bitmap from KVM prep series to this series
>>> - Make vsip_shadow as run-time percpu variable instead of compile-time
>>> - Flush Guest TLBs on all Host CPUs whenever we run-out of VMIDs
>>>
>>> Changes since v2:
>>> - Removed references of KVM_REQ_IRQ_PENDING from all patches
>>> - Use kvm->srcu within in-kernel KVM run loop
>>> - Added percpu vsip_shadow to track last value programmed in VSIP CSR
>>> - Added comments about irqs_pending and irqs_pending_mask
>>> - Used kvm_arch_vcpu_runnable() in-place-of kvm_riscv_vcpu_has_interrupt()
>>> in system_opcode_insn()
>>> - Removed unwanted smp_wmb() in kvm_riscv_stage2_vmid_update()
>>> - Use kvm_flush_remote_tlbs() in kvm_riscv_stage2_vmid_update()
>>> - Use READ_ONCE() in kvm_riscv_stage2_update_hgatp() for vmid
>>>
>>> Changes since v1:
>>> - Fixed compile errors in building KVM RISC-V as module
>>> - Removed unused kvm_riscv_halt_guest() and kvm_riscv_resume_guest()
>>> - Set KVM_CAP_SYNC_MMU capability only after MMU notifiers are implemented
>>> - Made vmid_version as unsigned long instead of atomic
>>> - Renamed KVM_REQ_UPDATE_PGTBL to KVM_REQ_UPDATE_HGATP
>>> - Renamed kvm_riscv_stage2_update_pgtbl() to kvm_riscv_stage2_update_hgatp()
>>> - Configure HIDELEG and HEDELEG in kvm_arch_hardware_enable()
>>> - Updated ONE_REG interface for CSR access to user-space
>>> - Removed irqs_pending_lock and use atomic bitops instead
>>> - Added separate patch for FP ONE_REG interface
>>> - Added separate patch for updating MAINTAINERS file
>>>
>>> Anup Patel (14):
>>> RISC-V: Add hypervisor extension related CSR defines
>>> RISC-V: Add initial skeletal KVM support
>>> RISC-V: KVM: Implement VCPU create, init and destroy functions
>>> RISC-V: KVM: Implement VCPU interrupts and requests handling
>>> RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls
>>> RISC-V: KVM: Implement VCPU world-switch
>>> RISC-V: KVM: Handle MMIO exits for VCPU
>>> RISC-V: KVM: Handle WFI exits for VCPU
>>> RISC-V: KVM: Implement VMID allocator
>>> RISC-V: KVM: Implement stage2 page table programming
>>> RISC-V: KVM: Implement MMU notifiers
>>> RISC-V: KVM: Document RISC-V specific parts of KVM API
>>> RISC-V: KVM: Move sources to drivers/staging directory
>>> RISC-V: KVM: Add MAINTAINERS entry
>>>
>>> Atish Patra (4):
>>> RISC-V: KVM: Add timer functionality
>>> RISC-V: KVM: FP lazy save/restore
>>> RISC-V: KVM: Implement ONE REG interface for FP registers
>>> RISC-V: KVM: Add SBI v0.1 support
>>>
>>> Documentation/virt/kvm/api.rst | 193 +++-
>>> MAINTAINERS | 11 +
>>> arch/riscv/Kconfig | 1 +
>>> arch/riscv/Makefile | 1 +
>>> arch/riscv/include/uapi/asm/kvm.h | 128 +++
>>> drivers/clocksource/timer-riscv.c | 9 +
>>> drivers/staging/riscv/kvm/Kconfig | 36 +
>>> drivers/staging/riscv/kvm/Makefile | 23 +
>>> drivers/staging/riscv/kvm/asm/kvm_csr.h | 105 ++
>>> drivers/staging/riscv/kvm/asm/kvm_host.h | 271 +++++
>>> drivers/staging/riscv/kvm/asm/kvm_types.h | 7 +
>>> .../staging/riscv/kvm/asm/kvm_vcpu_timer.h | 44 +
>>> drivers/staging/riscv/kvm/main.c | 118 +++
>>> drivers/staging/riscv/kvm/mmu.c | 802 ++++++++++++++
>>> drivers/staging/riscv/kvm/riscv_offsets.c | 170 +++
>>> drivers/staging/riscv/kvm/tlb.S | 74 ++
>>> drivers/staging/riscv/kvm/vcpu.c | 997 ++++++++++++++++++
>>> drivers/staging/riscv/kvm/vcpu_exit.c | 701 ++++++++++++
>>> drivers/staging/riscv/kvm/vcpu_sbi.c | 173 +++
>>> drivers/staging/riscv/kvm/vcpu_switch.S | 401 +++++++
>>> drivers/staging/riscv/kvm/vcpu_timer.c | 225 ++++
>>> drivers/staging/riscv/kvm/vm.c | 81 ++
>>> drivers/staging/riscv/kvm/vmid.c | 120 +++
>>> include/clocksource/timer-riscv.h | 16 +
>>> include/uapi/linux/kvm.h | 8 +
>>> 25 files changed, 4706 insertions(+), 9 deletions(-)
>>> create mode 100644 arch/riscv/include/uapi/asm/kvm.h
>>> create mode 100644 drivers/staging/riscv/kvm/Kconfig
>>> create mode 100644 drivers/staging/riscv/kvm/Makefile
>>> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_csr.h
>>> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_host.h
>>> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_types.h
>>> create mode 100644 drivers/staging/riscv/kvm/asm/kvm_vcpu_timer.h
>>> create mode 100644 drivers/staging/riscv/kvm/main.c
>>> create mode 100644 drivers/staging/riscv/kvm/mmu.c
>>> create mode 100644 drivers/staging/riscv/kvm/riscv_offsets.c
>>> create mode 100644 drivers/staging/riscv/kvm/tlb.S
>>> create mode 100644 drivers/staging/riscv/kvm/vcpu.c
>>> create mode 100644 drivers/staging/riscv/kvm/vcpu_exit.c
>>> create mode 100644 drivers/staging/riscv/kvm/vcpu_sbi.c
>>> create mode 100644 drivers/staging/riscv/kvm/vcpu_switch.S
>>> create mode 100644 drivers/staging/riscv/kvm/vcpu_timer.c
>>> create mode 100644 drivers/staging/riscv/kvm/vm.c
>>> create mode 100644 drivers/staging/riscv/kvm/vmid.c
>>> create mode 100644 include/clocksource/timer-riscv.h
>>>
>>> --
>>> 2.25.1
>>>
>


--
Damien Le Moal
Western Digital Research

2021-05-25 07:39:47

by Greg Kroah-Hartman

Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Mon, May 24, 2021 at 11:08:30PM +0000, Damien Le Moal wrote:
> On 2021/05/25 7:57, Palmer Dabbelt wrote:
> > On Mon, 24 May 2021 00:09:45 PDT (-0700), [email protected] wrote:
> >> Thx Anup,
> >>
> >> Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)
> >>
> >> I'm following your KVM patchset and it's a great job for riscv
> >> H-extension. I think hardware companies hope Linux KVM is ready first,
> >> before the real chip. That way we can ensure the hardware can run
> >> mainline Linux.
> >
> > I understand that it would be wonderful for hardware vendors to have a
> > guarantee that their hardware will be supported by the software
> > ecosystem, but that's not what we're talking about here. Specifically,
> the proposal for this code is to track the latest draft extension, which
> would leave vendors who implement the current draft out in the cold were
> the draft to change. In practice that is the only way to
> > move forward with any draft extension that doesn't have hardware
> > available, as the software RISC-V implementations rapidly deprecate
> > draft extensions and without a way to test our code it is destined to
> > bit rot.
>
> To facilitate the process of implementing, and updating, against draft
> specifications, I proposed to have arch/riscv/staging added. This would be the
> place to put code based on drafts. Some simple rules can be put in place:
> 1) The code and eventual ABI may change any time, no guarantees of backward
> compatibility
> 2) Once the specifications are frozen, the code is moved out of staging
> somewhere else.
> 3) The code may be removed any time if the specification proposal is dropped, or
> any other valid reason (can't think of any other right now)
> 4) ...
>
> This way, the implementation process would be greatly facilitated and
> interactions between different extensions can be explored much more easily.
>
> Thoughts ?

It will not work, unless you are mean and ruthless and people will get
mad at you. I do not recommend it at all.

Once code shows up in the kernel tree, and people rely on it, you now
_have_ to support it. Users don't know the difference between "staging
or not staging" at all. We have reported problems of staging media
drivers breaking userspace apps and people having problems with that,
despite the media developers trying to tell the world, "DO NOT RELY ON
THESE!".

And if this can't be done with tiny simple single drivers, you are going
to have a world-of-hurt if you put arch/platform support into
arch/riscv/. Once it's there, you will never be able to delete it,
trust me.

If you REALLY wanted to do this, you could create drivers/staging/riscv/
and try to make the following rules:

- stand-alone code only, can not depend on ANYTHING outside of
the directory that is not also used by other in-kernel code
- does not expose any userspace apis
- interacts only with existing in-kernel code.
- can be deleted at any time, UNLESS someone is using it for
functionality on a system

But what use would that be? What could you put into there that anyone
would be able to actually use?

Remember the rule we made to our user community over 15 years ago:

We will not break userspace functionality*

With the caveat of "* - in a way that you notice".

That means we can remove and change things that no one relies on
anymore, as long as if someone pops up that does rely on it, we put it
back.

We do this because we never want anyone to be afraid to drop in a new
kernel, because they know we did not break their existing hardware and
userspace workloads. And if we did, we will work quickly to fix it.

So back to the original issue here, what is the problem that you are
trying to solve? Why do you want to have in-kernel code for hardware
that no one else can have access to, and that isn't part of a "finalized
spec" that ends up touching other subsystems and is not self-contained?

Why not take the energy here and go get that spec ratified so we aren't
having this argument anymore? What needs to be done to make that happen
and why hasn't anyone done that? There's nothing keeping kernel
developers from working on spec groups, right?

thanks,

greg k-h

2021-05-25 08:37:03

by Damien Le Moal

Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 2021/05/25 16:37, Greg KH wrote:
> On Mon, May 24, 2021 at 11:08:30PM +0000, Damien Le Moal wrote:
>> On 2021/05/25 7:57, Palmer Dabbelt wrote:
>>> On Mon, 24 May 2021 00:09:45 PDT (-0700), [email protected] wrote:
>>>> Thx Anup,
>>>>
>>>> Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)
>>>>
>>>> I'm following your KVM patchset and it's a great job for riscv
> >>>> H-extension. I think hardware companies hope Linux KVM is ready first,
> >>>> before the real chip. That way we can ensure the hardware can run
> >>>> mainline Linux.
>>>
>>> I understand that it would be wonderful for hardware vendors to have a
>>> guarantee that their hardware will be supported by the software
>>> ecosystem, but that's not what we're talking about here. Specifically,
> >>> the proposal for this code is to track the latest draft extension, which
> >>> would leave vendors who implement the current draft out in the cold were
> >>> the draft to change. In practice that is the only way to
>>> move forward with any draft extension that doesn't have hardware
>>> available, as the software RISC-V implementations rapidly deprecate
>>> draft extensions and without a way to test our code it is destined to
>>> bit rot.
>>
>> To facilitate the process of implementing, and updating, against draft
>> specifications, I proposed to have arch/riscv/staging added. This would be the
>> place to put code based on drafts. Some simple rules can be put in place:
>> 1) The code and eventual ABI may change any time, no guarantees of backward
>> compatibility
>> 2) Once the specifications are frozen, the code is moved out of staging
>> somewhere else.
>> 3) The code may be removed any time if the specification proposal is dropped, or
>> any other valid reason (can't think of any other right now)
>> 4) ...
>>
>> This way, the implementation process would be greatly facilitated and
>> interactions between different extensions can be explored much more easily.
>>
>> Thoughts ?
>
> It will not work, unless you are mean and ruthless and people will get
> mad at you. I do not recommend it at all.
>
> Once code shows up in the kernel tree, and people rely on it, you now
> _have_ to support it. Users don't know the difference between "staging
> or not staging" at all. We have reported problems of staging media
> drivers breaking userspace apps and people having problems with that,
> despite the media developers trying to tell the world, "DO NOT RELY ON
> THESE!".
>
> And if this can't be done with tiny simple single drivers, you are going
> to have a world-of-hurt if you put arch/platform support into
> arch/riscv/. Once it's there, you will never be able to delete it,
> trust me.

All very good points. Thank you for sharing.

> If you REALLY wanted to do this, you could create drivers/staging/riscv/
> and try to make the following rules:
>
> - stand-alone code only, can not depend on ANYTHING outside of
> the directory that is not also used by other in-kernel code
> - does not expose any userspace apis
> - interacts only with existing in-kernel code.
> - can be deleted at any time, UNLESS someone is using it for
> functionality on a system
>
> But what use would that be? What could you put into there that anyone
> would be able to actually use?

Yes, you already mentioned this and we were not thinking about this solution.
drivers/staging really is for device drivers and does not apply to arch code.

> Remember the rule we made to our user community over 15 years ago:
>
> We will not break userspace functionality*
>
> With the caveat of "* - in a way that you notice".
>
> That means we can remove and change things that no one relies on
> anymore, as long as if someone pops up that does rely on it, we put it
> back.
>
> We do this because we never want anyone to be afraid to drop in a new
> kernel, because they know we did not break their existing hardware and
> userspace workloads. And if we did, we will work quickly to fix it.

Yes, I am well aware of this rule.

> So back to the original issue here, what is the problem that you are
> trying to solve? Why do you want to have in-kernel code for hardware
> that no one else can have access to, and that isn't part of a "finalized
> spec" that ends up touching other subsystems and is not self-contained?

For the case at hand, the only thing that would be outside of the staging area
would be the ABI definition, but that one depends only on the ratified riscv ISA
specs. So having it outside of staging would be OK. The idea of the arch staging
area is 2 fold:
1) facilitate the development work overall, both for Paolo and Anup on the KVM
part, but also others to check that their changes do not break KVM support.
2) Provide feedback to the specs groups that their concerns are moot. E.g. one
reason the hypervisor specs are being delayed is concerns with interrupt
handling. With a working implementation based on current ratified specs for
other components (e.g. interrupt controller), the hope is that the specs group
can speed up freezing of the specs.

But your points about how users will likely end up using this potentially
creates a lot more problems than we are solving...

> Why not take the energy here and go get that spec ratified so we aren't
> having this argument anymore? What needs to be done to make that happen
> and why hasn't anyone done that? There's nothing keeping kernel
> developers from working on spec groups, right?

We are participating and giving arguments for freezing the specs. This is
however not working as we would like. But that is a problem to be addressed with
RISCV International and the processes governing the operation of specification
groups. The linux mailing lists are not the right place to discuss this, so I
will not go into more details.

Thank you for the feedback.

Best regards.

--
Damien Le Moal
Western Digital Research

2021-05-25 10:13:40

by Paolo Bonzini

Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On 25/05/21 10:11, Greg KH wrote:
>> 1) facilitate the development work overall, both for Paolo and Anup on the KVM
>> part, but also others to check that their changes do not break KVM support.
>
> Who are the "others" here? You can't force your code into the tree just
> to keep it up to date with internal apis that others are changing, if
> you have no real users for it yet. That's asking others to do your work
> for you:(

I don't know about changes that would break KVM support. However,
"other KVM developers" would be able to check that their changes do not
break the RISC-V implementation, and I would certainly either enforce
that or do the work myself.

Also, excluding simulators and emulators from the set of "real users"
ignores the needs of userspace developers, as well as other uses such as
education/academia. Linux for x86 (both KVM and bare metal) supports
features that are only available in emulators and simulators which are
not even free software. I am pretty sure that there would be more users
of KVM/RISC-V than of KVM/MIPS, despite the latter having support in
real hardware.

Paolo

2021-05-25 10:25:56

by Greg Kroah-Hartman

Subject: Re: [PATCH v18 00/18] KVM RISC-V Support

On Tue, May 25, 2021 at 08:01:01AM +0000, Damien Le Moal wrote:
> On 2021/05/25 16:37, Greg KH wrote:
> > On Mon, May 24, 2021 at 11:08:30PM +0000, Damien Le Moal wrote:
> >> On 2021/05/25 7:57, Palmer Dabbelt wrote:
> >>> On Mon, 24 May 2021 00:09:45 PDT (-0700), [email protected] wrote:
> >>>> Thx Anup,
> >>>>
> >>>> Tested-by: Guo Ren <[email protected]> (Just on qemu-rv64)
> >>>>
> >>>> I'm following your KVM patchset and it's a great job for riscv
> >>>> H-extension. I think hardware companies hope Linux KVM is ready first,
> >>>> before the real chip. That way we can ensure the hardware can run
> >>>> mainline Linux.
> >>>
> >>> I understand that it would be wonderful for hardware vendors to have a
> >>> guarantee that their hardware will be supported by the software
> >>> ecosystem, but that's not what we're talking about here. Specifically,
> >>> the proposal for this code is to track the latest draft extension, which
> >>> would leave vendors who implement the current draft out in the cold were
> >>> the draft to change. In practice that is the only way to
> >>> move forward with any draft extension that doesn't have hardware
> >>> available, as the software RISC-V implementations rapidly deprecate
> >>> draft extensions and without a way to test our code it is destined to
> >>> bit rot.
> >>
> >> To facilitate the process of implementing, and updating, against draft
> >> specifications, I proposed to have arch/riscv/staging added. This would be the
> >> place to put code based on drafts. Some simple rules can be put in place:
> >> 1) The code and eventual ABI may change any time, no guarantees of backward
> >> compatibility
> >> 2) Once the specifications are frozen, the code is moved out of staging
> >> somewhere else.
> >> 3) The code may be removed any time if the specification proposal is dropped, or
> >> any other valid reason (can't think of any other right now)
> >> 4) ...
> >>
> >> This way, the implementation process would be greatly facilitated and
> >> interactions between different extensions can be explored much more easily.
> >>
> >> Thoughts ?
> >
> > It will not work, unless you are mean and ruthless and people will get
> > mad at you. I do not recommend it at all.
> >
> > Once code shows up in the kernel tree, and people rely on it, you now
> > _have_ to support it. Users don't know the difference between "staging
> > or not staging" at all. We have reported problems of staging media
> > drivers breaking userspace apps and people having problems with that,
> > despite the media developers trying to tell the world, "DO NOT RELY ON
> > THESE!".
> >
> > And if this can't be done with tiny simple single drivers, you are going
> > to have a world-of-hurt if you put arch/platform support into
> > arch/riscv/. Once it's there, you will never be able to delete it,
> > trust me.
>
> All very good points. Thank you for sharing.
>
> > If you REALLY wanted to do this, you could create drivers/staging/riscv/
> > and try to make the following rules:
> >
> > - stand-alone code only, can not depend on ANYTHING outside of
> > the directory that is not also used by other in-kernel code
> > - does not expose any userspace apis
> > - interacts only with existing in-kernel code.
> > - can be deleted at any time, UNLESS someone is using it for
> > functionality on a system
> >
> > But what use would that be? What could you put into there that anyone
> > would be able to actually use?
>
> Yes, you already mentioned this and we were not thinking about this solution.
> drivers/staging really is for device drivers and does not apply to arch code.

Then you can not use the "staging model" anywhere else, especially in
arch code. We tried that many years ago, and it instantly failed and we
ripped it out. Learn from our mistakes please.

> > So back to the original issue here, what is the problem that you are
> > trying to solve? Why do you want to have in-kernel code for hardware
> > that no one else can have access to, and that isn't part of a "finalized
> > spec" that ends up touching other subsystems and is not self-contained?
>
> For the case at hand, the only thing that would be outside of the staging area
> would be the ABI definition, but that one depends only on the ratified riscv ISA
> specs. So having it outside of staging would be OK. The idea of the arch staging
> area is twofold:
> 1) facilitate the development work overall, both for Paolo and Anup on the KVM
> part, but also others to check that their changes do not break KVM support.

Who are the "others" here? You can't force your code into the tree just
to keep it up to date with internal apis that others are changing, if
you have no real users for it yet. That's asking others to do your work
for you :(

> 2) Provide feedback to the specs groups that their concerns are moot. E.g. one
> reason the hypervisor specs are being delayed is concerns with interrupt
> handling. With a working implementation based on current ratified specs for
> other components (e.g. interrupt controller), the hope is that the specs group
> can speed up freezing of the specs.

There is the issue of specs-without-working-code that can cause major
problems. But you have code, it does not have to be merged into the
kernel tree to prove/disprove specs, so don't push the inability of your
standards group to come to an agreement onto the kernel developer
community. Again, you are making us do your work for you here :(

> But your points about how users will likely end up using this potentially
> creates a lot more problems than we are solving...

Thank you for understanding.

good luck with your standards meetings!

greg k-h