This series adds initial KVM RISC-V support. Currently, we are able to boot
Linux on RV64/RV32 Guests with multiple VCPUs.
Key aspects of KVM RISC-V added by this series are:
1. No RISC-V specific KVM IOCTL
2. Minimal possible KVM world-switch which touches only GPRs and a few CSRs
3. Both RV64 and RV32 host supported
4. Full Guest/VM switch is done via vcpu_get/vcpu_put infrastructure
5. KVM ONE_REG interface for VCPU register access from user-space (see the sketch after this list)
6. PLIC emulation is done in user-space
7. Timer and IPI emulation is done in-kernel
8. Both Sv39x4 and Sv48x4 supported for RV64 host
9. MMU notifiers supported
10. Generic dirtylog supported
11. FP lazy save/restore supported
12. SBI v0.1 emulation for KVM Guest available
13. Forward unhandled SBI calls to KVM userspace
14. Hugepage support for Guest/VM
15. IOEVENTFD support for Vhost
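To illustrate points 1 and 5 above, here is a minimal user-space sketch
(not taken from this series or from KVMTOOL) of creating a Guest VCPU and
reading one of its registers using only generic KVM ioctls plus the
ONE_REG interface. The RISC-V register id encoding lives in the uapi
header added by this series, so the placeholder id passed below is an
assumption; a real VMM would build it from the KVM_REG_RISCV_* defines.

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Read one Guest VCPU register via KVM_GET_ONE_REG. */
  static int read_vcpu_reg(int vcpu_fd, uint64_t reg_id, uint64_t *val)
  {
          struct kvm_one_reg reg = {
                  .id   = reg_id,  /* e.g. a core GPR or CSR id */
                  .addr = (uint64_t)(unsigned long)val,
          };

          return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
  }

  int main(void)
  {
          int kvm_fd, vm_fd, vcpu_fd;
          uint64_t val = 0;

          kvm_fd = open("/dev/kvm", O_RDWR);
          if (kvm_fd < 0)
                  return 1;

          /* Only generic KVM ioctls; no RISC-V specific ones needed. */
          vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
          if (vm_fd < 0)
                  return 1;
          vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
          if (vcpu_fd < 0)
                  return 1;

          /* reg_id 0 is only a placeholder for a KVM_REG_RISCV_* id. */
          if (read_vcpu_reg(vcpu_fd, 0, &val) < 0)
                  perror("KVM_GET_ONE_REG");
          else
                  printf("reg = 0x%llx\n", (unsigned long long)val);

          return 0;
  }

This is only a sketch; the KVMTOOL port linked below is the reference
user-space for this series.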
Here's a brief TODO list which we will work upon after this series:
1. SBI v0.2 emulation in-kernel
2. SBI v0.2 hart state management emulation in-kernel
3. In-kernel PLIC emulation
4. ..... and more .....
This series can be found in riscv_kvm_v15 branch at:
https://github.com/avpatel/linux.git
Our work-in-progress KVMTOOL RISC-V port can be found in riscv_v5 branch
at: https://github.com/avpatel/kvmtool.git
The QEMU RISC-V hypervisor emulation is done by Alistair and is available
in master branch at: https://git.qemu.org/git/qemu.git
To play around with KVM RISC-V, refer to the KVM RISC-V wiki at:
https://github.com/kvm-riscv/howto/wiki
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
Changes since v14:
- Rebased on Linux-5.10-rc3
- Fixed Stage2 (G-stage) PGD allocation to ensure it is 16KB aligned
Changes since v13:
- Rebased on Linux-5.9-rc3
- Fixed kvm_riscv_vcpu_set_reg_csr() for SIP updates in PATCH5
- Fixed instruction length computation in PATCH7
- Added ioeventfd support in PATCH7
- Ensure HSTATUS.SPVP is set to the correct value before using HLV/HSV
instructions in PATCH7
- Fixed stage2_map_page() to set PTE 'A' and 'D' bits correctly
in PATCH10
- Added stage2 dirty page logging in PATCH10
- Allow KVM user-space to SET/GET the SCOUNTEREN CSR in PATCH5
- Save/restore SCOUNTEREN in PATCH6
- Reduced quite a few instructions for __kvm_riscv_switch_to() by
using CSR swap instruction in PATCH6
- Detect and use Sv48x4 when available in PATCH10
Changes since v12:
- Rebased patches on Linux-5.8-rc4
- By default enable all counters in HCOUNTEREN
- RISC-V H-Extension v0.6.1 spec support
Changes since v11:
- Rebased patches on Linux-5.7-rc3
- Fixed typo in typecast of stage2_map_size define
- Introduced struct kvm_cpu_trap to represent trap details and
use it as function parameter wherever applicable
- Pass memslot to kvm_riscv_stage2_map() to support dirty page
logging in the future
- RISC-V H-Extension v0.6 spec support
- Sent out the first three patches as a separate series so that they can
be taken by Palmer for Linux RISC-V
Changes since v10:
- Rebased patches on Linux-5.6-rc5
- Reduce RISCV_ISA_EXT_MAX from 256 to 64
- Separate PATCH for removing N-extension related defines
- Added comments as requested by Palmer
- Fixed HIDELEG CSR programming
Changes since v9:
- Rebased patches on Linux-5.5-rc3
- Squash PATCH19 and PATCH20 into PATCH5
- Squash PATCH18 into PATCH11
- Squash PATCH17 into PATCH16
- Added ONE_REG interface for VCPU timer in PATCH13
- Use HTIMEDELTA for VCPU timer in PATCH13
- Updated KVM RISC-V mailing list in MAINTAINERS entry
- Update KVM kconfig option to depend on RISCV_SBI and MMU
- Check for SBI v0.2 and SBI v0.2 RFENCE extension at boot-time
- Use SBI v0.2 RFENCE extension in VMID implementation
- Use SBI v0.2 RFENCE extension in Stage2 MMU implementation
- Use SBI v0.2 RFENCE extension in SBI implementation
- Moved to RISC-V Hypervisor v0.5 draft spec
- Updated Documentation/virt/kvm/api.txt for timer ONE_REG interface
Changes since v8:
- Rebased series on Linux-5.4-rc3 and Atish's SBI v0.2 patches
- Use HRTIMER_MODE_REL instead of HRTIMER_MODE_ABS in timer emulation
- Fixed kvm_riscv_stage2_map() to handle hugepages
- Added patch to forward unhandled SBI calls to user-space
- Added patch for iterative/recursive stage2 page table programming
- Added patch to remove per-CPU vsip_shadow variable
- Added patch to fix race-condition in kvm_riscv_vcpu_sync_interrupts()
Changes since v7:
- Rebased series on Linux-5.4-rc1 and Atish's SBI v0.2 patches
- Removed PATCH1, PATCH3, and PATCH20 because these already merged
- Use kernel doc style comments for ISA bitmap functions
- Don't parse X, Y, and Z extensions in riscv_fill_hwcap() because they
will be added in the future
- Mark KVM RISC-V kconfig option as EXPERIMENTAL
- Typo fix in commit description of PATCH6 of v7 series
- Use separate structs for CORE and CSR registers of ONE_REG interface
- Explicitly include asm/sbi.h in kvm/vcpu_sbi.c
- Removed implicit switch-case fall-through in kvm_riscv_vcpu_exit()
- No need to set VSSTATUS.MXR bit in kvm_riscv_vcpu_unpriv_read()
- Removed register for instruction length in kvm_riscv_vcpu_unpriv_read()
- Added defines for checking/decoding instruction length
- Added separate patch to forward unhandled SBI calls to userspace tool
Changes since v6:
- Rebased patches on Linux-5.3-rc7
- Added "return_handled" in struct kvm_mmio_decode to ensure that
kvm_riscv_vcpu_mmio_return() updates SEPC only once
- Removed trap_stval parameter from kvm_riscv_vcpu_unpriv_read()
- Updated git repo URL in MAINTAINERS entry
Changes since v5:
- Renamed KVM_REG_RISCV_CONFIG_TIMEBASE register to
KVM_REG_RISCV_CONFIG_TBFREQ register in ONE_REG interface
- Update SEPC in kvm_riscv_vcpu_mmio_return() for MMIO exits
- Use switch case instead of illegal instruction opcode table for simplicity
- Improve comments in stage2_remote_tlb_flush() for a potential remote TLB
flush optimization
- Handle all unsupported SBI calls in default case of
kvm_riscv_vcpu_sbi_ecall() function
- Fixed kvm_riscv_vcpu_sync_interrupts() for software interrupts
- Improved unprivileged reads to handle traps due to the Guest stage1 page table
- Added separate patch to document RISC-V specific things in
Documentation/virt/kvm/api.txt
Changes since v4:
- Rebased patches on Linux-5.3-rc5
- Added Paolo's Acked-by and Reviewed-by
- Updated mailing list in MAINTAINERS entry
Changes since v3:
- Moved patch for ISA bitmap from KVM prep series to this series
- Made vsip_shadow a run-time percpu variable instead of a compile-time one
- Flush Guest TLBs on all Host CPUs whenever we run out of VMIDs
Changes since v2:
- Removed references of KVM_REQ_IRQ_PENDING from all patches
- Use kvm->srcu within in-kernel KVM run loop
- Added percpu vsip_shadow to track last value programmed in VSIP CSR
- Added comments about irqs_pending and irqs_pending_mask
- Used kvm_arch_vcpu_runnable() in-place-of kvm_riscv_vcpu_has_interrupt()
in system_opcode_insn()
- Removed unwanted smp_wmb() in kvm_riscv_stage2_vmid_update()
- Use kvm_flush_remote_tlbs() in kvm_riscv_stage2_vmid_update()
- Use READ_ONCE() in kvm_riscv_stage2_update_hgatp() for vmid
Changes since v1:
- Fixed compile errors in building KVM RISC-V as module
- Removed unused kvm_riscv_halt_guest() and kvm_riscv_resume_guest()
- Set KVM_CAP_SYNC_MMU capability only after MMU notifiers are implemented
- Made vmid_version an unsigned long instead of an atomic
- Renamed KVM_REQ_UPDATE_PGTBL to KVM_REQ_UPDATE_HGATP
- Renamed kvm_riscv_stage2_update_pgtbl() to kvm_riscv_stage2_update_hgatp()
- Configure HIDELEG and HEDELEG in kvm_arch_hardware_enable()
- Updated ONE_REG interface for CSR access to user-space
- Removed irqs_pending_lock and use atomic bitops instead
- Added separate patch for FP ONE_REG interface
- Added separate patch for updating MAINTAINERS file
Anup Patel (13):
RISC-V: Add hypervisor extension related CSR defines
RISC-V: Add initial skeletal KVM support
RISC-V: KVM: Implement VCPU create, init and destroy functions
RISC-V: KVM: Implement VCPU interrupts and requests handling
RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls
RISC-V: KVM: Implement VCPU world-switch
RISC-V: KVM: Handle MMIO exits for VCPU
RISC-V: KVM: Handle WFI exits for VCPU
RISC-V: KVM: Implement VMID allocator
RISC-V: KVM: Implement stage2 page table programming
RISC-V: KVM: Implement MMU notifiers
RISC-V: KVM: Document RISC-V specific parts of KVM API
RISC-V: KVM: Add MAINTAINERS entry
Atish Patra (4):
RISC-V: KVM: Add timer functionality
RISC-V: KVM: FP lazy save/restore
RISC-V: KVM: Implement ONE REG interface for FP registers
RISC-V: KVM: Add SBI v0.1 support
Documentation/virt/kvm/api.rst | 193 ++++-
MAINTAINERS | 11 +
arch/riscv/Kconfig | 1 +
arch/riscv/Makefile | 2 +
arch/riscv/include/asm/csr.h | 89 ++
arch/riscv/include/asm/kvm_host.h | 279 +++++++
arch/riscv/include/asm/kvm_types.h | 7 +
arch/riscv/include/asm/kvm_vcpu_timer.h | 44 +
arch/riscv/include/asm/pgtable-bits.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 128 +++
arch/riscv/kernel/asm-offsets.c | 156 ++++
arch/riscv/kvm/Kconfig | 36 +
arch/riscv/kvm/Makefile | 15 +
arch/riscv/kvm/main.c | 118 +++
arch/riscv/kvm/mmu.c | 860 +++++++++++++++++++
arch/riscv/kvm/tlb.S | 74 ++
arch/riscv/kvm/vcpu.c | 1012 +++++++++++++++++++++++
arch/riscv/kvm/vcpu_exit.c | 701 ++++++++++++++++
arch/riscv/kvm/vcpu_sbi.c | 173 ++++
arch/riscv/kvm/vcpu_switch.S | 400 +++++++++
arch/riscv/kvm/vcpu_timer.c | 225 +++++
arch/riscv/kvm/vm.c | 81 ++
arch/riscv/kvm/vmid.c | 120 +++
drivers/clocksource/timer-riscv.c | 8 +
include/clocksource/timer-riscv.h | 16 +
include/uapi/linux/kvm.h | 8 +
26 files changed, 4749 insertions(+), 9 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_host.h
create mode 100644 arch/riscv/include/asm/kvm_types.h
create mode 100644 arch/riscv/include/asm/kvm_vcpu_timer.h
create mode 100644 arch/riscv/include/uapi/asm/kvm.h
create mode 100644 arch/riscv/kvm/Kconfig
create mode 100644 arch/riscv/kvm/Makefile
create mode 100644 arch/riscv/kvm/main.c
create mode 100644 arch/riscv/kvm/mmu.c
create mode 100644 arch/riscv/kvm/tlb.S
create mode 100644 arch/riscv/kvm/vcpu.c
create mode 100644 arch/riscv/kvm/vcpu_exit.c
create mode 100644 arch/riscv/kvm/vcpu_sbi.c
create mode 100644 arch/riscv/kvm/vcpu_switch.S
create mode 100644 arch/riscv/kvm/vcpu_timer.c
create mode 100644 arch/riscv/kvm/vm.c
create mode 100644 arch/riscv/kvm/vmid.c
create mode 100644 include/clocksource/timer-riscv.h
--
2.25.1
This patch implements all required functions for programming
the stage2 page table for each Guest/VM.
At a high level, the flow of the stage2-related functions is similar
to the KVM ARM/ARM64 implementation, but the stage2 page table
format is quite different for KVM RISC-V.
[jiangyifei: stage2 dirty log support]
Signed-off-by: Yifei Jiang <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 12 +
arch/riscv/include/asm/pgtable-bits.h | 1 +
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/kvm/main.c | 19 +
arch/riscv/kvm/mmu.c | 649 +++++++++++++++++++++++++-
arch/riscv/kvm/vm.c | 6 -
6 files changed, 672 insertions(+), 16 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 08681e58c695..1c4c71e06e5b 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -76,6 +76,13 @@ struct kvm_mmio_decode {
int return_handled;
};
+#define KVM_MMU_PAGE_CACHE_NR_OBJS 32
+
+struct kvm_mmu_page_cache {
+ int nobjs;
+ void *objects[KVM_MMU_PAGE_CACHE_NR_OBJS];
+};
+
struct kvm_cpu_trap {
unsigned long sepc;
unsigned long scause;
@@ -177,6 +184,9 @@ struct kvm_vcpu_arch {
/* MMIO instruction details */
struct kvm_mmio_decode mmio_decode;
+ /* Cache pages needed to program page tables with spinlock held */
+ struct kvm_mmu_page_cache mmu_page_cache;
+
/* VCPU power-off state */
bool power_off;
@@ -206,6 +216,8 @@ void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);
+void kvm_riscv_stage2_mode_detect(void);
+unsigned long kvm_riscv_stage2_mode(void);
void kvm_riscv_stage2_vmid_detect(void);
unsigned long kvm_riscv_stage2_vmid_bits(void);
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index bbaeb5d35842..be49d62fcc2b 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -26,6 +26,7 @@
#define _PAGE_SPECIAL _PAGE_SOFT
#define _PAGE_TABLE _PAGE_PRESENT
+#define _PAGE_LEAF (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
/*
* _PAGE_PROT_NONE is set on not-present pages (and ignored by the hardware) to
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index b42979f84042..633063edaee8 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
select PREEMPT_NOTIFIERS
select ANON_INODES
select KVM_MMIO
+ select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
select SRCU
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 49a4941e3838..421ecf4e6360 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -64,6 +64,8 @@ void kvm_arch_hardware_disable(void)
int kvm_arch_init(void *opaque)
{
+ const char *str;
+
if (!riscv_isa_extension_available(NULL, h)) {
kvm_info("hypervisor extension not available\n");
return -ENODEV;
@@ -79,10 +81,27 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}
+ kvm_riscv_stage2_mode_detect();
+
kvm_riscv_stage2_vmid_detect();
kvm_info("hypervisor extension available\n");
+ switch (kvm_riscv_stage2_mode()) {
+ case HGATP_MODE_SV32X4:
+ str = "Sv32x4";
+ break;
+ case HGATP_MODE_SV39X4:
+ str = "Sv39x4";
+ break;
+ case HGATP_MODE_SV48X4:
+ str = "Sv48x4";
+ break;
+ default:
+ return -ENODEV;
+ }
+ kvm_info("using %s G-stage page table format\n", str);
+
kvm_info("VMID %ld bits available\n", kvm_riscv_stage2_vmid_bits());
return 0;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f8153648a3bb..39063d1372cb 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -17,11 +17,415 @@
#include <linux/sched/signal.h>
#include <asm/page.h>
#include <asm/pgtable.h>
+#include <asm/sbi.h>
+
+#ifdef CONFIG_64BIT
+static unsigned long stage2_mode = (HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
+static unsigned long stage2_pgd_levels = 3;
+#define stage2_index_bits 9
+#else
+static unsigned long stage2_mode = (HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
+static unsigned long stage2_pgd_levels = 2;
+#define stage2_index_bits 10
+#endif
+
+#define stage2_pgd_xbits 2
+#define stage2_pgd_size (1UL << (HGATP_PAGE_SHIFT + stage2_pgd_xbits))
+#define stage2_gpa_bits (HGATP_PAGE_SHIFT + \
+ (stage2_pgd_levels * stage2_index_bits) + \
+ stage2_pgd_xbits)
+#define stage2_gpa_size ((gpa_t)(1ULL << stage2_gpa_bits))
+
+static inline unsigned long stage2_pte_index(gpa_t addr, u32 level)
+{
+ unsigned long mask;
+ unsigned long shift = HGATP_PAGE_SHIFT + (stage2_index_bits * level);
+
+ if (level == (stage2_pgd_levels - 1))
+ mask = (PTRS_PER_PTE * (1UL << stage2_pgd_xbits)) - 1;
+ else
+ mask = PTRS_PER_PTE - 1;
+
+ return (addr >> shift) & mask;
+}
+
+static inline unsigned long stage2_pte_page_vaddr(pte_t pte)
+{
+ return (unsigned long)pfn_to_virt(pte_val(pte) >> _PAGE_PFN_SHIFT);
+}
+
+static int stage2_page_size_to_level(unsigned long page_size, u32 *out_level)
+{
+ u32 i;
+ unsigned long psz = 1UL << 12;
+
+ for (i = 0; i < stage2_pgd_levels; i++) {
+ if (page_size == (psz << (i * stage2_index_bits))) {
+ *out_level = i;
+ return 0;
+ }
+ }
+
+ return -EINVAL;
+}
+
+static int stage2_level_to_page_size(u32 level, unsigned long *out_pgsize)
+{
+ if (stage2_pgd_levels < level)
+ return -EINVAL;
+
+ *out_pgsize = 1UL << (12 + (level * stage2_index_bits));
+
+ return 0;
+}
+
+static int stage2_cache_topup(struct kvm_mmu_page_cache *pcache,
+ int min, int max)
+{
+ void *page;
+
+ BUG_ON(max > KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (pcache->nobjs >= min)
+ return 0;
+ while (pcache->nobjs < max) {
+ page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+ pcache->objects[pcache->nobjs++] = page;
+ }
+
+ return 0;
+}
+
+static void stage2_cache_flush(struct kvm_mmu_page_cache *pcache)
+{
+ while (pcache && pcache->nobjs)
+ free_page((unsigned long)pcache->objects[--pcache->nobjs]);
+}
+
+static void *stage2_cache_alloc(struct kvm_mmu_page_cache *pcache)
+{
+ void *p;
+
+ if (!pcache)
+ return NULL;
+
+ BUG_ON(!pcache->nobjs);
+ p = pcache->objects[--pcache->nobjs];
+
+ return p;
+}
+
+static bool stage2_get_leaf_entry(struct kvm *kvm, gpa_t addr,
+ pte_t **ptepp, u32 *ptep_level)
+{
+ pte_t *ptep;
+ u32 current_level = stage2_pgd_levels - 1;
+
+ *ptep_level = current_level;
+ ptep = (pte_t *)kvm->arch.pgd;
+ ptep = &ptep[stage2_pte_index(addr, current_level)];
+ while (ptep && pte_val(*ptep)) {
+ if (pte_val(*ptep) & _PAGE_LEAF) {
+ *ptep_level = current_level;
+ *ptepp = ptep;
+ return true;
+ }
+
+ if (current_level) {
+ current_level--;
+ *ptep_level = current_level;
+ ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ ptep = &ptep[stage2_pte_index(addr, current_level)];
+ } else {
+ ptep = NULL;
+ }
+ }
+
+ return false;
+}
+
+static void stage2_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)
+{
+ struct cpumask hmask;
+ unsigned long size = PAGE_SIZE;
+ struct kvm_vmid *vmid = &kvm->arch.vmid;
+
+ if (stage2_level_to_page_size(level, &size))
+ return;
+ addr &= ~(size - 1);
+
+ /*
+ * TODO: Instead of cpu_online_mask, we should only target CPUs
+ * where the Guest/VM is running.
+ */
+ preempt_disable();
+ riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
+ sbi_remote_hfence_gvma_vmid(cpumask_bits(&hmask), addr, size,
+ READ_ONCE(vmid->vmid));
+ preempt_enable();
+}
+
+static int stage2_set_pte(struct kvm *kvm, u32 level,
+ struct kvm_mmu_page_cache *pcache,
+ gpa_t addr, const pte_t *new_pte)
+{
+ u32 current_level = stage2_pgd_levels - 1;
+ pte_t *next_ptep = (pte_t *)kvm->arch.pgd;
+ pte_t *ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+
+ if (current_level < level)
+ return -EINVAL;
+
+ while (current_level != level) {
+ if (pte_val(*ptep) & _PAGE_LEAF)
+ return -EEXIST;
+
+ if (!pte_val(*ptep)) {
+ next_ptep = stage2_cache_alloc(pcache);
+ if (!next_ptep)
+ return -ENOMEM;
+ *ptep = pfn_pte(PFN_DOWN(__pa(next_ptep)),
+ __pgprot(_PAGE_TABLE));
+ } else {
+ if (pte_val(*ptep) & _PAGE_LEAF)
+ return -EEXIST;
+ next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ }
+
+ current_level--;
+ ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+ }
+
+ *ptep = *new_pte;
+ if (pte_val(*ptep) & _PAGE_LEAF)
+ stage2_remote_tlb_flush(kvm, current_level, addr);
+
+ return 0;
+}
+
+static int stage2_map_page(struct kvm *kvm,
+ struct kvm_mmu_page_cache *pcache,
+ gpa_t gpa, phys_addr_t hpa,
+ unsigned long page_size,
+ bool page_rdonly, bool page_exec)
+{
+ int ret;
+ u32 level = 0;
+ pte_t new_pte;
+ pgprot_t prot;
+
+ ret = stage2_page_size_to_level(page_size, &level);
+ if (ret)
+ return ret;
+
+ /*
+ * A RISC-V implementation can choose to either:
+ * 1) Update 'A' and 'D' PTE bits in hardware
+ * 2) Generate page fault when 'A' and/or 'D' bits are not set in
+ * the PTE so that software can update these bits.
+ *
+ * We support both options mentioned above. To achieve this, we
+ * always set 'A' and 'D' PTE bits at time of creating stage2
+ * mapping. To support KVM dirty page logging with both options
+ * mentioned above, we will write-protect stage2 PTEs to track
+ * dirty pages.
+ */
+
+ if (page_exec) {
+ if (page_rdonly)
+ prot = PAGE_READ_EXEC;
+ else
+ prot = PAGE_WRITE_EXEC;
+ } else {
+ if (page_rdonly)
+ prot = PAGE_READ;
+ else
+ prot = PAGE_WRITE;
+ }
+ new_pte = pfn_pte(PFN_DOWN(hpa), prot);
+ new_pte = pte_mkdirty(new_pte);
+
+ return stage2_set_pte(kvm, level, pcache, gpa, &new_pte);
+}
+
+enum stage2_op {
+ STAGE2_OP_NOP = 0, /* Nothing */
+ STAGE2_OP_CLEAR, /* Clear/Unmap */
+ STAGE2_OP_WP, /* Write-protect */
+};
+
+static void stage2_op_pte(struct kvm *kvm, gpa_t addr,
+ pte_t *ptep, u32 ptep_level, enum stage2_op op)
+{
+ int i, ret;
+ pte_t *next_ptep;
+ u32 next_ptep_level;
+ unsigned long next_page_size, page_size;
+
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ return;
+
+ BUG_ON(addr & (page_size - 1));
+
+ if (!pte_val(*ptep))
+ return;
+
+ if (ptep_level && !(pte_val(*ptep) & _PAGE_LEAF)) {
+ next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+ next_ptep_level = ptep_level - 1;
+ ret = stage2_level_to_page_size(next_ptep_level,
+ &next_page_size);
+ if (ret)
+ return;
+
+ if (op == STAGE2_OP_CLEAR)
+ set_pte(ptep, __pte(0));
+ for (i = 0; i < PTRS_PER_PTE; i++)
+ stage2_op_pte(kvm, addr + i * next_page_size,
+ &next_ptep[i], next_ptep_level, op);
+ if (op == STAGE2_OP_CLEAR)
+ put_page(virt_to_page(next_ptep));
+ } else {
+ if (op == STAGE2_OP_CLEAR)
+ set_pte(ptep, __pte(0));
+ else if (op == STAGE2_OP_WP)
+ set_pte(ptep, __pte(pte_val(*ptep) & ~_PAGE_WRITE));
+ stage2_remote_tlb_flush(kvm, ptep_level, addr);
+ }
+}
+
+static void stage2_unmap_range(struct kvm *kvm, gpa_t start, gpa_t size)
+{
+ int ret;
+ pte_t *ptep;
+ u32 ptep_level;
+ bool found_leaf;
+ unsigned long page_size;
+ gpa_t addr = start, end = start + size;
+
+ while (addr < end) {
+ found_leaf = stage2_get_leaf_entry(kvm, addr,
+ &ptep, &ptep_level);
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ break;
+
+ if (!found_leaf)
+ goto next;
+
+ if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
+ stage2_op_pte(kvm, addr, ptep,
+ ptep_level, STAGE2_OP_CLEAR);
+
+next:
+ addr += page_size;
+ }
+}
+
+static void stage2_wp_range(struct kvm *kvm, gpa_t start, gpa_t end)
+{
+ int ret;
+ pte_t *ptep;
+ u32 ptep_level;
+ bool found_leaf;
+ gpa_t addr = start;
+ unsigned long page_size;
+
+ while (addr < end) {
+ found_leaf = stage2_get_leaf_entry(kvm, addr,
+ &ptep, &ptep_level);
+ ret = stage2_level_to_page_size(ptep_level, &page_size);
+ if (ret)
+ break;
+
+ if (!found_leaf)
+ goto next;
+
+ if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
+ stage2_op_pte(kvm, addr, ptep,
+ ptep_level, STAGE2_OP_WP);
+
+next:
+ addr += page_size;
+ }
+}
+
+void stage2_wp_memory_region(struct kvm *kvm, int slot)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+ phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+ phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+ spin_lock(&kvm->mmu_lock);
+ stage2_wp_range(kvm, start, end);
+ spin_unlock(&kvm->mmu_lock);
+ kvm_flush_remote_tlbs(kvm);
+}
+
+int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
+ unsigned long size, bool writable)
+{
+ pte_t pte;
+ int ret = 0;
+ unsigned long pfn;
+ phys_addr_t addr, end;
+ struct kvm_mmu_page_cache pcache = { 0, };
+
+ end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
+ pfn = __phys_to_pfn(hpa);
+
+ for (addr = gpa; addr < end; addr += PAGE_SIZE) {
+ pte = pfn_pte(pfn, PAGE_KERNEL);
+
+ if (!writable)
+ pte = pte_wrprotect(pte);
+
+ ret = stage2_cache_topup(&pcache,
+ stage2_pgd_levels,
+ KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (ret)
+ goto out;
+
+ spin_lock(&kvm->mmu_lock);
+ ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+ spin_unlock(&kvm->mmu_lock);
+ if (ret)
+ goto out;
+
+ pfn++;
+ }
+
+out:
+ stage2_cache_flush(&pcache);
+ return ret;
+
+}
+
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset,
+ unsigned long mask)
+{
+ phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+ phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
+ phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+ stage2_wp_range(kvm, start, end);
+}
void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
}
+void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *memslot)
+{
+ kvm_flush_remote_tlbs(kvm);
+}
+
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free)
{
}
@@ -38,7 +442,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- /* TODO: */
+ kvm_riscv_stage2_free_pgd(kvm);
}
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
@@ -52,7 +456,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- /* TODO: */
+ /*
+ * At this point memslot has been committed and there is an
+ * allocated dirty_bitmap[], dirty pages will be tracked while
+ * the memory slot is write protected.
+ */
+ if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
+ stage2_wp_memory_region(kvm, mem->slot);
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -60,8 +470,91 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- /* TODO: */
- return 0;
+ hva_t hva = mem->userspace_addr;
+ hva_t reg_end = hva + mem->memory_size;
+ bool writable = !(mem->flags & KVM_MEM_READONLY);
+ int ret = 0;
+
+ if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
+ change != KVM_MR_FLAGS_ONLY)
+ return 0;
+
+ /*
+ * Prevent userspace from creating a memory region outside of the
+ * GPA space addressable by the KVM guest.
+ */
+ if ((memslot->base_gfn + memslot->npages) >=
+ (stage2_gpa_size >> PAGE_SHIFT))
+ return -EFAULT;
+
+ mmap_read_lock(current->mm);
+
+ /*
+ * A memory region could potentially cover multiple VMAs, and
+ * any holes between them, so iterate over all of them to find
+ * out if we can map any of them right now.
+ *
+ * +--------------------------------------------+
+ * +---------------+----------------+ +----------------+
+ * | : VMA 1 | VMA 2 | | VMA 3 : |
+ * +---------------+----------------+ +----------------+
+ * | memory region |
+ * +--------------------------------------------+
+ */
+ do {
+ struct vm_area_struct *vma = find_vma(current->mm, hva);
+ hva_t vm_start, vm_end;
+
+ if (!vma || vma->vm_start >= reg_end)
+ break;
+
+ /*
+ * Mapping a read-only VMA is only allowed if the
+ * memory region is configured as read-only.
+ */
+ if (writable && !(vma->vm_flags & VM_WRITE)) {
+ ret = -EPERM;
+ break;
+ }
+
+ /* Take the intersection of this VMA with the memory region */
+ vm_start = max(hva, vma->vm_start);
+ vm_end = min(reg_end, vma->vm_end);
+
+ if (vma->vm_flags & VM_PFNMAP) {
+ gpa_t gpa = mem->guest_phys_addr +
+ (vm_start - mem->userspace_addr);
+ phys_addr_t pa;
+
+ pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+ pa += vm_start - vma->vm_start;
+
+ /* IO region dirty page logging not allowed */
+ if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = stage2_ioremap(kvm, gpa, pa,
+ vm_end - vm_start, writable);
+ if (ret)
+ break;
+ }
+ hva = vm_end;
+ } while (hva < reg_end);
+
+ if (change == KVM_MR_FLAGS_ONLY)
+ goto out;
+
+ spin_lock(&kvm->mmu_lock);
+ if (ret)
+ stage2_unmap_range(kvm, mem->guest_phys_addr,
+ mem->memory_size);
+ spin_unlock(&kvm->mmu_lock);
+
+out:
+ mmap_read_unlock(current->mm);
+ return ret;
}
int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
@@ -69,27 +562,163 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
gpa_t gpa, unsigned long hva,
bool writeable, bool is_write)
{
- /* TODO: */
- return 0;
+ int ret;
+ kvm_pfn_t hfn;
+ short vma_pageshift;
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct vm_area_struct *vma;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
+ bool logging = (memslot->dirty_bitmap &&
+ !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
+ unsigned long vma_pagesize;
+
+ mmap_read_lock(current->mm);
+
+ vma = find_vma_intersection(current->mm, hva, hva + 1);
+ if (unlikely(!vma)) {
+ kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+ mmap_read_unlock(current->mm);
+ return -EFAULT;
+ }
+
+ if (is_vm_hugetlb_page(vma))
+ vma_pageshift = huge_page_shift(hstate_vma(vma));
+ else
+ vma_pageshift = PAGE_SHIFT;
+ vma_pagesize = 1ULL << vma_pageshift;
+ if (logging || (vma->vm_flags & VM_PFNMAP))
+ vma_pagesize = PAGE_SIZE;
+
+ if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
+ gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
+
+ mmap_read_unlock(current->mm);
+
+ if (vma_pagesize != PGDIR_SIZE &&
+ vma_pagesize != PMD_SIZE &&
+ vma_pagesize != PAGE_SIZE) {
+ kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
+ return -EFAULT;
+ }
+
+ /* We need minimum second+third level pages */
+ ret = stage2_cache_topup(pcache, stage2_pgd_levels,
+ KVM_MMU_PAGE_CACHE_NR_OBJS);
+ if (ret) {
+ kvm_err("Failed to topup stage2 cache\n");
+ return ret;
+ }
+
+ hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
+ if (hfn == KVM_PFN_ERR_HWPOISON) {
+ send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
+ vma_pageshift, current);
+ return 0;
+ }
+ if (is_error_noslot_pfn(hfn))
+ return -EFAULT;
+
+ /*
+ * If logging is active then we allow writable pages only
+ * for write faults.
+ */
+ if (logging && !is_write)
+ writeable = false;
+
+ spin_lock(&kvm->mmu_lock);
+
+ if (writeable) {
+ kvm_set_pfn_dirty(hfn);
+ mark_page_dirty(kvm, gfn);
+ ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+ vma_pagesize, false, true);
+ } else {
+ ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
+ vma_pagesize, true, true);
+ }
+
+ if (ret)
+ kvm_err("Failed to map in stage2\n");
+
+ spin_unlock(&kvm->mmu_lock);
+ kvm_set_pfn_accessed(hfn);
+ kvm_release_pfn_clean(hfn);
+ return ret;
}
void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ stage2_cache_flush(&vcpu->arch.mmu_page_cache);
}
int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm)
{
- /* TODO: */
+ struct page *pgd_page;
+
+ if (kvm->arch.pgd != NULL) {
+ kvm_err("kvm_arch already initialized?\n");
+ return -EINVAL;
+ }
+
+ pgd_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(stage2_pgd_size));
+ if (!pgd_page)
+ return -ENOMEM;
+ kvm->arch.pgd = page_to_virt(pgd_page);
+ kvm->arch.pgd_phys = page_to_phys(pgd_page);
+
return 0;
}
void kvm_riscv_stage2_free_pgd(struct kvm *kvm)
{
- /* TODO: */
+ void *pgd = NULL;
+
+ spin_lock(&kvm->mmu_lock);
+ if (kvm->arch.pgd) {
+ stage2_unmap_range(kvm, 0UL, stage2_gpa_size);
+ pgd = READ_ONCE(kvm->arch.pgd);
+ kvm->arch.pgd = NULL;
+ kvm->arch.pgd_phys = 0;
+ }
+ spin_unlock(&kvm->mmu_lock);
+
+ if (pgd)
+ free_pages((unsigned long)pgd, get_order(stage2_pgd_size));
}
void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ unsigned long hgatp = stage2_mode;
+ struct kvm_arch *k = &vcpu->kvm->arch;
+
+ hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) &
+ HGATP_VMID_MASK;
+ hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
+
+ csr_write(CSR_HGATP, hgatp);
+
+ if (!kvm_riscv_stage2_vmid_bits())
+ __kvm_riscv_hfence_gvma_all();
+}
+
+void kvm_riscv_stage2_mode_detect(void)
+{
+#ifdef CONFIG_64BIT
+ /* Try Sv48x4 stage2 mode */
+ csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
+ if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) {
+ stage2_mode = (HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
+ stage2_pgd_levels = 4;
+ }
+ csr_write(CSR_HGATP, 0);
+
+ __kvm_riscv_hfence_gvma_all();
+#endif
+}
+
+unsigned long kvm_riscv_stage2_mode(void)
+{
+ return stage2_mode >> HGATP_MODE_SHIFT;
}
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 282d67617229..6cde69a82252 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -12,12 +12,6 @@
#include <linux/uaccess.h>
#include <linux/kvm_host.h>
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
- /* TODO: To be added later. */
- return -EOPNOTSUPP;
-}
-
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
int r;
--
2.25.1
This patch implements VCPU create, init and destroy functions
required by the generic KVM module. We don't have many dynamic
resources in struct kvm_vcpu_arch, so these functions are quite
simple for KVM RISC-V.
Signed-off-by: Anup Patel <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 69 +++++++++++++++++++++++++++++++
arch/riscv/kvm/vcpu.c | 55 ++++++++++++++++++++----
2 files changed, 115 insertions(+), 9 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index fdd63bafb714..43e85523a07e 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -63,7 +63,76 @@ struct kvm_cpu_trap {
unsigned long htinst;
};
+struct kvm_cpu_context {
+ unsigned long zero;
+ unsigned long ra;
+ unsigned long sp;
+ unsigned long gp;
+ unsigned long tp;
+ unsigned long t0;
+ unsigned long t1;
+ unsigned long t2;
+ unsigned long s0;
+ unsigned long s1;
+ unsigned long a0;
+ unsigned long a1;
+ unsigned long a2;
+ unsigned long a3;
+ unsigned long a4;
+ unsigned long a5;
+ unsigned long a6;
+ unsigned long a7;
+ unsigned long s2;
+ unsigned long s3;
+ unsigned long s4;
+ unsigned long s5;
+ unsigned long s6;
+ unsigned long s7;
+ unsigned long s8;
+ unsigned long s9;
+ unsigned long s10;
+ unsigned long s11;
+ unsigned long t3;
+ unsigned long t4;
+ unsigned long t5;
+ unsigned long t6;
+ unsigned long sepc;
+ unsigned long sstatus;
+ unsigned long hstatus;
+};
+
+struct kvm_vcpu_csr {
+ unsigned long vsstatus;
+ unsigned long hie;
+ unsigned long vstvec;
+ unsigned long vsscratch;
+ unsigned long vsepc;
+ unsigned long vscause;
+ unsigned long vstval;
+ unsigned long hvip;
+ unsigned long vsatp;
+ unsigned long scounteren;
+};
+
struct kvm_vcpu_arch {
+ /* VCPU ran at least once */
+ bool ran_atleast_once;
+
+ /* ISA feature bits (similar to MISA) */
+ unsigned long isa;
+
+ /* CPU context of Guest VCPU */
+ struct kvm_cpu_context guest_context;
+
+ /* CPU CSR context of Guest VCPU */
+ struct kvm_vcpu_csr guest_csr;
+
+ /* CPU context upon Guest VCPU reset */
+ struct kvm_cpu_context guest_reset_context;
+
+ /* CPU CSR context upon Guest VCPU reset */
+ struct kvm_vcpu_csr guest_reset_csr;
+
/* Don't run the VCPU (blocked) */
bool pause;
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 8d8d140a0caf..84deeddbffbe 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -35,6 +35,27 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};
+#define KVM_RISCV_ISA_ALLOWED (riscv_isa_extension_mask(a) | \
+ riscv_isa_extension_mask(c) | \
+ riscv_isa_extension_mask(d) | \
+ riscv_isa_extension_mask(f) | \
+ riscv_isa_extension_mask(i) | \
+ riscv_isa_extension_mask(m) | \
+ riscv_isa_extension_mask(s) | \
+ riscv_isa_extension_mask(u))
+
+static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ struct kvm_vcpu_csr *reset_csr = &vcpu->arch.guest_reset_csr;
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ struct kvm_cpu_context *reset_cntx = &vcpu->arch.guest_reset_context;
+
+ memcpy(csr, reset_csr, sizeof(*csr));
+
+ memcpy(cntx, reset_cntx, sizeof(*cntx));
+}
+
int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
{
return 0;
@@ -42,7 +63,25 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ struct kvm_cpu_context *cntx;
+
+ /* Mark this VCPU never ran */
+ vcpu->arch.ran_atleast_once = false;
+
+ /* Setup ISA features available to VCPU */
+ vcpu->arch.isa = riscv_isa_extension_base(NULL) & KVM_RISCV_ISA_ALLOWED;
+
+ /* Setup reset state of shadow SSTATUS and HSTATUS CSRs */
+ cntx = &vcpu->arch.guest_reset_context;
+ cntx->sstatus = SR_SPP | SR_SPIE;
+ cntx->hstatus = 0;
+ cntx->hstatus |= HSTATUS_VTW;
+ cntx->hstatus |= HSTATUS_SPVP;
+ cntx->hstatus |= HSTATUS_SPV;
+
+ /* Reset VCPU */
+ kvm_riscv_reset_vcpu(vcpu);
+
return 0;
}
@@ -55,15 +94,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
}
-int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
-{
- /* TODO: */
- return 0;
-}
-
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
{
- /* TODO: */
+ /* Flush the pages pre-allocated for Stage2 page table mappings */
+ kvm_riscv_stage2_flush_cache(vcpu);
}
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
@@ -209,6 +243,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
struct kvm_cpu_trap trap;
struct kvm_run *run = vcpu->run;
+ /* Mark this VCPU ran at least once */
+ vcpu->arch.ran_atleast_once = true;
+
vcpu->arch.srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
/* Process MMIO value returned from user-space */
@@ -282,7 +319,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
* get an interrupt between __kvm_riscv_switch_to() and
* local_irq_enable() which can potentially change CSRs.
*/
- trap.sepc = 0;
+ trap.sepc = vcpu->arch.guest_context.sepc;
trap.scause = csr_read(CSR_SCAUSE);
trap.stval = csr_read(CSR_STVAL);
trap.htval = csr_read(CSR_HTVAL);
--
2.25.1
> -----Original Message-----
> From: Anup Patel [mailto:[email protected]]
> Sent: Monday, November 9, 2020 7:33 PM
> To: Palmer Dabbelt <[email protected]>; Palmer Dabbelt
> <[email protected]>; Paul Walmsley <[email protected]>;
> Albert Ou <[email protected]>; Paolo Bonzini <[email protected]>
> Cc: Alexander Graf <[email protected]>; Atish Patra <[email protected]>;
> Alistair Francis <[email protected]>; Damien Le Moal
> <[email protected]>; Anup Patel <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Anup Patel
> <[email protected]>; Jiangyifei <[email protected]>
> Subject: [PATCH v15 10/17] RISC-V: KVM: Implement stage2 page table
> programming
>
> This patch implements all required functions for programming the stage2 page
> table for each Guest/VM.
>
> At high-level, the flow of stage2 related functions is similar from KVM
> ARM/ARM64 implementation but the stage2 page table format is quite
> different for KVM RISC-V.
>
> [jiangyifei: stage2 dirty log support]
> Signed-off-by: Yifei Jiang <[email protected]>
> Signed-off-by: Anup Patel <[email protected]>
> Acked-by: Paolo Bonzini <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/riscv/include/asm/kvm_host.h | 12 +
> arch/riscv/include/asm/pgtable-bits.h | 1 +
> arch/riscv/kvm/Kconfig | 1 +
> arch/riscv/kvm/main.c | 19 +
> arch/riscv/kvm/mmu.c | 649
> +++++++++++++++++++++++++-
> arch/riscv/kvm/vm.c | 6 -
> 6 files changed, 672 insertions(+), 16 deletions(-)
>
......
>
> int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu, @@ -69,27 +562,163 @@
> int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> gpa_t gpa, unsigned long hva,
> bool writeable, bool is_write)
> {
> - /* TODO: */
> - return 0;
> + int ret;
> + kvm_pfn_t hfn;
> + short vma_pageshift;
> + gfn_t gfn = gpa >> PAGE_SHIFT;
> + struct vm_area_struct *vma;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
> + bool logging = (memslot->dirty_bitmap &&
> + !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
> + unsigned long vma_pagesize;
> +
> + mmap_read_lock(current->mm);
> +
> + vma = find_vma_intersection(current->mm, hva, hva + 1);
> + if (unlikely(!vma)) {
> + kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> + mmap_read_unlock(current->mm);
> + return -EFAULT;
> + }
> +
> + if (is_vm_hugetlb_page(vma))
> + vma_pageshift = huge_page_shift(hstate_vma(vma));
> + else
> + vma_pageshift = PAGE_SHIFT;
> + vma_pagesize = 1ULL << vma_pageshift;
> + if (logging || (vma->vm_flags & VM_PFNMAP))
> + vma_pagesize = PAGE_SIZE;
> +
> + if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
> + gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
> +
> + mmap_read_unlock(current->mm);
> +
> + if (vma_pagesize != PGDIR_SIZE &&
> + vma_pagesize != PMD_SIZE &&
> + vma_pagesize != PAGE_SIZE) {
> + kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
> + return -EFAULT;
> + }
> +
> + /* We need minimum second+third level pages */
> + ret = stage2_cache_topup(pcache, stage2_pgd_levels,
> + KVM_MMU_PAGE_CACHE_NR_OBJS);
> + if (ret) {
> + kvm_err("Failed to topup stage2 cache\n");
> + return ret;
> + }
> +
> + hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> + if (hfn == KVM_PFN_ERR_HWPOISON) {
> + send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
> + vma_pageshift, current);
> + return 0;
> + }
> + if (is_error_noslot_pfn(hfn))
> + return -EFAULT;
> +
> + /*
> + * If logging is active then we allow writable pages only
> + * for write faults.
> + */
> + if (logging && !is_write)
> + writeable = false;
> +
> + spin_lock(&kvm->mmu_lock);
> +
> + if (writeable) {
Hi Anup,
What is the purpose of "writable = !memslot_is_readonly(slot)" in this series?
When mapping the HVA to HPA above, it doesn't know that the stage2 PTE writability is "!memslot_is_readonly(slot)".
This may cause a difference between the writability of HVA->HPA and GPA->HPA.
For example, GPA->HPA is writeable, but HVA->HPA is not writeable.
Is it better that the writability of HVA->HPA is also determined by whether the memslot is readonly in this change?
Like this:
- hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
+ hfn = gfn_to_pfn_prot(kvm, gfn, writeable, NULL);
Regards,
Yifei
> + kvm_set_pfn_dirty(hfn);
> + mark_page_dirty(kvm, gfn);
> + ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
> + vma_pagesize, false, true);
> + } else {
> + ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
> + vma_pagesize, true, true);
> + }
> +
> + if (ret)
> + kvm_err("Failed to map in stage2\n");
> +
> + spin_unlock(&kvm->mmu_lock);
> + kvm_set_pfn_accessed(hfn);
> + kvm_release_pfn_clean(hfn);
> + return ret;
> }
>
......
On Mon, Nov 16, 2020 at 2:59 PM Jiangyifei <[email protected]> wrote:
>
>
> > -----Original Message-----
> > From: Anup Patel [mailto:[email protected]]
> > Sent: Monday, November 9, 2020 7:33 PM
> > To: Palmer Dabbelt <[email protected]>; Palmer Dabbelt
> > <[email protected]>; Paul Walmsley <[email protected]>;
> > Albert Ou <[email protected]>; Paolo Bonzini <[email protected]>
> > Cc: Alexander Graf <[email protected]>; Atish Patra <[email protected]>;
> > Alistair Francis <[email protected]>; Damien Le Moal
> > <[email protected]>; Anup Patel <[email protected]>;
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Anup Patel
> > <[email protected]>; Jiangyifei <[email protected]>
> > Subject: [PATCH v15 10/17] RISC-V: KVM: Implement stage2 page table
> > programming
> >
> > This patch implements all required functions for programming the stage2 page
> > table for each Guest/VM.
> >
> > At high-level, the flow of stage2 related functions is similar from KVM
> > ARM/ARM64 implementation but the stage2 page table format is quite
> > different for KVM RISC-V.
> >
> > [jiangyifei: stage2 dirty log support]
> > Signed-off-by: Yifei Jiang <[email protected]>
> > Signed-off-by: Anup Patel <[email protected]>
> > Acked-by: Paolo Bonzini <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/riscv/include/asm/kvm_host.h | 12 +
> > arch/riscv/include/asm/pgtable-bits.h | 1 +
> > arch/riscv/kvm/Kconfig | 1 +
> > arch/riscv/kvm/main.c | 19 +
> > arch/riscv/kvm/mmu.c | 649
> > +++++++++++++++++++++++++-
> > arch/riscv/kvm/vm.c | 6 -
> > 6 files changed, 672 insertions(+), 16 deletions(-)
> >
>
> ......
>
> >
> > int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu, @@ -69,27 +562,163 @@
> > int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> > gpa_t gpa, unsigned long hva,
> > bool writeable, bool is_write)
> > {
> > - /* TODO: */
> > - return 0;
> > + int ret;
> > + kvm_pfn_t hfn;
> > + short vma_pageshift;
> > + gfn_t gfn = gpa >> PAGE_SHIFT;
> > + struct vm_area_struct *vma;
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
> > + bool logging = (memslot->dirty_bitmap &&
> > + !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
> > + unsigned long vma_pagesize;
> > +
> > + mmap_read_lock(current->mm);
> > +
> > + vma = find_vma_intersection(current->mm, hva, hva + 1);
> > + if (unlikely(!vma)) {
> > + kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > + mmap_read_unlock(current->mm);
> > + return -EFAULT;
> > + }
> > +
> > + if (is_vm_hugetlb_page(vma))
> > + vma_pageshift = huge_page_shift(hstate_vma(vma));
> > + else
> > + vma_pageshift = PAGE_SHIFT;
> > + vma_pagesize = 1ULL << vma_pageshift;
> > + if (logging || (vma->vm_flags & VM_PFNMAP))
> > + vma_pagesize = PAGE_SIZE;
> > +
> > + if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
> > + gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
> > +
> > + mmap_read_unlock(current->mm);
> > +
> > + if (vma_pagesize != PGDIR_SIZE &&
> > + vma_pagesize != PMD_SIZE &&
> > + vma_pagesize != PAGE_SIZE) {
> > + kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
> > + return -EFAULT;
> > + }
> > +
> > + /* We need minimum second+third level pages */
> > + ret = stage2_cache_topup(pcache, stage2_pgd_levels,
> > + KVM_MMU_PAGE_CACHE_NR_OBJS);
> > + if (ret) {
> > + kvm_err("Failed to topup stage2 cache\n");
> > + return ret;
> > + }
> > +
> > + hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> > + if (hfn == KVM_PFN_ERR_HWPOISON) {
> > + send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
> > + vma_pageshift, current);
> > + return 0;
> > + }
> > + if (is_error_noslot_pfn(hfn))
> > + return -EFAULT;
> > +
> > + /*
> > + * If logging is active then we allow writable pages only
> > + * for write faults.
> > + */
> > + if (logging && !is_write)
> > + writeable = false;
> > +
> > + spin_lock(&kvm->mmu_lock);
> > +
> > + if (writeable) {
>
> Hi Anup,
>
> What is the purpose of "writable = !memslot_is_readonly(slot)" in this series?
Where ? I don't see this line in any of the patches.
>
> When mapping the HVA to HPA above, it doesn't know that the stage2 PTE writability is "!memslot_is_readonly(slot)".
> This may cause a difference between the writability of HVA->HPA and GPA->HPA.
> For example, GPA->HPA is writeable, but HVA->HPA is not writeable.
Yes, this is possible particularly when Host kernel is updating writability
of HVA->HPA mappings for swapping in/out pages.
>
> Is it better that the writability of HVA->HPA is also determined by whether the memslot is readonly in this change?
> Like this:
> - hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> + hfn = gfn_to_pfn_prot(kvm, gfn, writeable, NULL);
The gfn_to_pfn_prot() needs to know what type of fault we
got (i.e. a read or write fault). The rest of the information (such as
whether the slot is writable or not) is already available to gfn_to_pfn_prot().
The question here is whether we should pass "&writeable" or NULL as the
last parameter to gfn_to_pfn_prot(). The recent JUMP label
support in Linux RISC-V causes problems on HW where the PTE
'A' and 'D' bits are not updated by HW, so I had to change the
last parameter of gfn_to_pfn_prot() from "&writeable" to NULL.
I am still investigating this.
Regards,
Anup
>
> Regards,
> Yifei
>
> > + kvm_set_pfn_dirty(hfn);
> > + mark_page_dirty(kvm, gfn);
> > + ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
> > + vma_pagesize, false, true);
> > + } else {
> > + ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT,
> > + vma_pagesize, true, true);
> > + }
> > +
> > + if (ret)
> > + kvm_err("Failed to map in stage2\n");
> > +
> > + spin_unlock(&kvm->mmu_lock);
> > + kvm_set_pfn_accessed(hfn);
> > + kvm_release_pfn_clean(hfn);
> > + return ret;
> > }
> >
>
> ......
>
On Tue, Nov 24, 2020 at 2:56 PM Anup Patel <[email protected]> wrote:
>
> On Mon, Nov 16, 2020 at 2:59 PM Jiangyifei <[email protected]> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Anup Patel [mailto:[email protected]]
> > > Sent: Monday, November 9, 2020 7:33 PM
> > > To: Palmer Dabbelt <[email protected]>; Palmer Dabbelt
> > > <[email protected]>; Paul Walmsley <[email protected]>;
> > > Albert Ou <[email protected]>; Paolo Bonzini <[email protected]>
> > > Cc: Alexander Graf <[email protected]>; Atish Patra <[email protected]>;
> > > Alistair Francis <[email protected]>; Damien Le Moal
> > > <[email protected]>; Anup Patel <[email protected]>;
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Anup Patel
> > > <[email protected]>; Jiangyifei <[email protected]>
> > > Subject: [PATCH v15 10/17] RISC-V: KVM: Implement stage2 page table
> > > programming
> > >
> > > This patch implements all required functions for programming the stage2 page
> > > table for each Guest/VM.
> > >
> > > At high-level, the flow of stage2 related functions is similar from KVM
> > > ARM/ARM64 implementation but the stage2 page table format is quite
> > > different for KVM RISC-V.
> > >
> > > [jiangyifei: stage2 dirty log support]
> > > Signed-off-by: Yifei Jiang <[email protected]>
> > > Signed-off-by: Anup Patel <[email protected]>
> > > Acked-by: Paolo Bonzini <[email protected]>
> > > Reviewed-by: Paolo Bonzini <[email protected]>
> > > ---
> > > arch/riscv/include/asm/kvm_host.h | 12 +
> > > arch/riscv/include/asm/pgtable-bits.h | 1 +
> > > arch/riscv/kvm/Kconfig | 1 +
> > > arch/riscv/kvm/main.c | 19 +
> > > arch/riscv/kvm/mmu.c | 649
> > > +++++++++++++++++++++++++-
> > > arch/riscv/kvm/vm.c | 6 -
> > > 6 files changed, 672 insertions(+), 16 deletions(-)
> > >
> >
> > ......
> >
> > >
> > > int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu, @@ -69,27 +562,163 @@
> > > int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> > > gpa_t gpa, unsigned long hva,
> > > bool writeable, bool is_write)
> > > {
> > > - /* TODO: */
> > > - return 0;
> > > + int ret;
> > > + kvm_pfn_t hfn;
> > > + short vma_pageshift;
> > > + gfn_t gfn = gpa >> PAGE_SHIFT;
> > > + struct vm_area_struct *vma;
> > > + struct kvm *kvm = vcpu->kvm;
> > > + struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
> > > + bool logging = (memslot->dirty_bitmap &&
> > > + !(memslot->flags & KVM_MEM_READONLY)) ? true : false;
> > > + unsigned long vma_pagesize;
> > > +
> > > + mmap_read_lock(current->mm);
> > > +
> > > + vma = find_vma_intersection(current->mm, hva, hva + 1);
> > > + if (unlikely(!vma)) {
> > > + kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > + mmap_read_unlock(current->mm);
> > > + return -EFAULT;
> > > + }
> > > +
> > > + if (is_vm_hugetlb_page(vma))
> > > + vma_pageshift = huge_page_shift(hstate_vma(vma));
> > > + else
> > > + vma_pageshift = PAGE_SHIFT;
> > > + vma_pagesize = 1ULL << vma_pageshift;
> > > + if (logging || (vma->vm_flags & VM_PFNMAP))
> > > + vma_pagesize = PAGE_SIZE;
> > > +
> > > + if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
> > > + gfn = (gpa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
> > > +
> > > + mmap_read_unlock(current->mm);
> > > +
> > > + if (vma_pagesize != PGDIR_SIZE &&
> > > + vma_pagesize != PMD_SIZE &&
> > > + vma_pagesize != PAGE_SIZE) {
> > > + kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
> > > + return -EFAULT;
> > > + }
> > > +
> > > + /* We need minimum second+third level pages */
> > > + ret = stage2_cache_topup(pcache, stage2_pgd_levels,
> > > + KVM_MMU_PAGE_CACHE_NR_OBJS);
> > > + if (ret) {
> > > + kvm_err("Failed to topup stage2 cache\n");
> > > + return ret;
> > > + }
> > > +
> > > + hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> > > + if (hfn == KVM_PFN_ERR_HWPOISON) {
> > > + send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
> > > + vma_pageshift, current);
> > > + return 0;
> > > + }
> > > + if (is_error_noslot_pfn(hfn))
> > > + return -EFAULT;
> > > +
> > > + /*
> > > + * If logging is active then we allow writable pages only
> > > + * for write faults.
> > > + */
> > > + if (logging && !is_write)
> > > + writeable = false;
> > > +
> > > + spin_lock(&kvm->mmu_lock);
> > > +
> > > + if (writeable) {
> >
> > Hi Anup,
> >
> > What is the purpose of "writable = !memslot_is_readonly(slot)" in this series?
>
> Where ? I don't see this line in any of the patches.
>
> >
> > When mapping the HVA to HPA above, it doesn't know that the stage2 PTE writability is "!memslot_is_readonly(slot)".
> > This may cause a difference between the writability of HVA->HPA and GPA->HPA.
> > For example, GPA->HPA is writeable, but HVA->HPA is not writeable.
>
> Yes, this is possible particularly when Host kernel is updating writability
> of HVA->HPA mappings for swapping in/out pages.
>
> >
> > Is it better that the writability of HVA->HPA is also determined by whether the memslot is readonly in this change?
> > Like this:
> > - hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> > + hfn = gfn_to_pfn_prot(kvm, gfn, writeable, NULL);
>
> The gfn_to_pfn_prot() needs to know what type of fault we
> got (i.e. a read or write fault). The rest of the information (such as
> whether the slot is writable or not) is already available to gfn_to_pfn_prot().
>
> The question here is whether we should pass "&writeable" or NULL as the
> last parameter to gfn_to_pfn_prot(). The recent JUMP label
> support in Linux RISC-V causes problems on HW where the PTE
> 'A' and 'D' bits are not updated by HW, so I had to change the
> last parameter of gfn_to_pfn_prot() from "&writeable" to NULL.
>
> I am still investigating this.
This turned out to be a bug in Spike which is not fixed.
I will include the following change in the v16 patch series:
diff --git a/arch/riscv/include/asm/kvm_host.h
b/arch/riscv/include/asm/kvm_host.h
index 241030956d47..dc2666b4180b 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -232,8 +232,7 @@ void __kvm_riscv_hfence_gvma_all(void);
int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
- gpa_t gpa, unsigned long hva,
- bool writeable, bool is_write);
+ gpa_t gpa, unsigned long hva, bool is_write);
void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index fcaeadc9b34d..56fda9ef70fd 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -689,11 +689,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
- gpa_t gpa, unsigned long hva,
- bool writeable, bool is_write)
+ gpa_t gpa, unsigned long hva, bool is_write)
{
int ret;
kvm_pfn_t hfn;
+ bool writeable;
short vma_pageshift;
gfn_t gfn = gpa >> PAGE_SHIFT;
struct vm_area_struct *vma;
@@ -742,7 +742,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
mmu_seq = kvm->mmu_notifier_seq;
- hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
+ hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
if (hfn == KVM_PFN_ERR_HWPOISON) {
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
vma_pageshift, current);
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index f054406792a6..058cfa168abe 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -445,7 +445,7 @@ static int stage2_page_fault(struct kvm_vcpu
*vcpu, struct kvm_run *run,
};
}
- ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva, writeable,
+ ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva,
(trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
if (ret < 0)
return ret;
Regards,
Anup
> -----Original Message-----
> From: Anup Patel [mailto:[email protected]]
> Sent: Monday, November 30, 2020 6:22 PM
> To: Jiangyifei <[email protected]>
> Cc: Anup Patel <[email protected]>; Palmer Dabbelt
> <[email protected]>; Palmer Dabbelt <[email protected]>; Paul
> Walmsley <[email protected]>; Albert Ou <[email protected]>;
> Paolo Bonzini <[email protected]>; Alexander Graf <[email protected]>;
> Atish Patra <[email protected]>; Alistair Francis
> <[email protected]>; Damien Le Moal <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Zhangxiaofeng (F)
> <[email protected]>; Wubin (H) <[email protected]>;
> dengkai (A) <[email protected]>; yinyipeng <[email protected]>
> Subject: Re: [PATCH v15 10/17] RISC-V: KVM: Implement stage2 page table
> programming
>
> On Tue, Nov 24, 2020 at 2:56 PM Anup Patel <[email protected]> wrote:
> >
> > On Mon, Nov 16, 2020 at 2:59 PM Jiangyifei <[email protected]> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Anup Patel [mailto:[email protected]]
> > > > Sent: Monday, November 9, 2020 7:33 PM
> > > > To: Palmer Dabbelt <[email protected]>; Palmer Dabbelt
> > > > <[email protected]>; Paul Walmsley
> > > > <[email protected]>; Albert Ou <[email protected]>;
> > > > Paolo Bonzini <[email protected]>
> > > > Cc: Alexander Graf <[email protected]>; Atish Patra
> > > > <[email protected]>; Alistair Francis
> > > > <[email protected]>; Damien Le Moal
> > > > <[email protected]>; Anup Patel <[email protected]>;
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected];
> > > > Anup Patel <[email protected]>; Jiangyifei
> > > > <[email protected]>
> > > > Subject: [PATCH v15 10/17] RISC-V: KVM: Implement stage2 page
> > > > table programming
> > > >
> > > > This patch implements all required functions for programming the
> > > > stage2 page table for each Guest/VM.
> > > >
> > > > At a high level, the flow of the stage2 related functions is similar
> > > > to the KVM ARM/ARM64 implementation, but the stage2 page table
> > > > format is quite different for KVM RISC-V.
> > > >
> > > > [jiangyifei: stage2 dirty log support]
> > > > Signed-off-by: Yifei Jiang <[email protected]>
> > > > Signed-off-by: Anup Patel <[email protected]>
> > > > Acked-by: Paolo Bonzini <[email protected]>
> > > > Reviewed-by: Paolo Bonzini <[email protected]>
> > > > ---
> > > >  arch/riscv/include/asm/kvm_host.h     |  12 +
> > > >  arch/riscv/include/asm/pgtable-bits.h |   1 +
> > > >  arch/riscv/kvm/Kconfig                |   1 +
> > > >  arch/riscv/kvm/main.c                 |  19 +
> > > >  arch/riscv/kvm/mmu.c                  | 649 +++++++++++++++++++++++++-
> > > >  arch/riscv/kvm/vm.c                   |   6 -
> > > >  6 files changed, 672 insertions(+), 16 deletions(-)
> > > >
> > >
> > > ......
> > >
> > > >
> > > >  int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> > > > @@ -69,27 +562,163 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
> > > >  			 gpa_t gpa, unsigned long hva,
> > > >  			 bool writeable, bool is_write)
> > > >  {
> > > > -	/* TODO: */
> > > > -	return 0;
> > > > +	int ret;
> > > > +	kvm_pfn_t hfn;
> > > > +	short vma_pageshift;
> > > > +	gfn_t gfn = gpa >> PAGE_SHIFT;
> > > > +	struct vm_area_struct *vma;
> > > > +	struct kvm *kvm = vcpu->kvm;
> > > > +	struct kvm_mmu_page_cache *pcache = &vcpu->arch.mmu_page_cache;
> > > > +	bool logging = (memslot->dirty_bitmap &&
> > > > +			!(memslot->flags & KVM_MEM_READONLY)) ? true : false;
> > > > +	unsigned long vma_pagesize;
> > > > +
> > > > +	mmap_read_lock(current->mm);
> > > > +
> > > > +	vma = find_vma_intersection(current->mm, hva, hva + 1);
> > > > +	if (unlikely(!vma)) {
> > > > +		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > > +		mmap_read_unlock(current->mm);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	if (is_vm_hugetlb_page(vma))
> > > > +		vma_pageshift = huge_page_shift(hstate_vma(vma));
> > > > +	else
> > > > +		vma_pageshift = PAGE_SHIFT;
> > > > +	vma_pagesize = 1ULL << vma_pageshift;
> > > > +	if (logging || (vma->vm_flags & VM_PFNMAP))
> > > > +		vma_pagesize = PAGE_SIZE;
> > > > +
> > > > +	if (vma_pagesize == PMD_SIZE || vma_pagesize == PGDIR_SIZE)
> > > > +		gfn = (gpa & huge_page_mask(hstate_vma(vma))) >>
> > > > +		      PAGE_SHIFT;
> > > > +
> > > > +	mmap_read_unlock(current->mm);
> > > > +
> > > > +	if (vma_pagesize != PGDIR_SIZE &&
> > > > +	    vma_pagesize != PMD_SIZE &&
> > > > +	    vma_pagesize != PAGE_SIZE) {
> > > > +		kvm_err("Invalid VMA page size 0x%lx\n", vma_pagesize);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	/* We need minimum second+third level pages */
> > > > +	ret = stage2_cache_topup(pcache, stage2_pgd_levels,
> > > > +				 KVM_MMU_PAGE_CACHE_NR_OBJS);
> > > > +	if (ret) {
> > > > +		kvm_err("Failed to topup stage2 cache\n");
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> > > > +	if (hfn == KVM_PFN_ERR_HWPOISON) {
> > > > +		send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
> > > > +				vma_pageshift, current);
> > > > +		return 0;
> > > > +	}
> > > > +	if (is_error_noslot_pfn(hfn))
> > > > +		return -EFAULT;
> > > > +
> > > > +	/*
> > > > +	 * If logging is active then we allow writable pages only
> > > > +	 * for write faults.
> > > > +	 */
> > > > +	if (logging && !is_write)
> > > > +		writeable = false;
> > > > +
> > > > +	spin_lock(&kvm->mmu_lock);
> > > > +
> > > > +	if (writeable) {
> > >
> > > Hi Anup,
> > >
> > > What is the purpose of "writable = !memslot_is_readonly(slot)" in this
> > > series?
> >
> > Where? I don't see this line in any of the patches.
> >
> > >
> > > When mapping the HVA to HPA above, it doesn't know that the stage2
> > > PTE writability is "!memslot_is_readonly(slot)".
> > > This may cause a difference between the writability of HVA->HPA and
> > > GPA->HPA.
> > > For example, GPA->HPA is writeable, but HVA->HPA is not writeable.
> >
> > Yes, this is possible, particularly when the Host kernel is updating
> > the writability of HVA->HPA mappings while swapping pages in/out.
> >
> > >
> > > Would it be better, in this change, for the writability of HVA->HPA to
> > > also be determined by whether the memslot is read-only?
> > > Like this:
> > > - hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> > > + hfn = gfn_to_pfn_prot(kvm, gfn, writeable, NULL);
> >
> > The gfn_to_pfn_prot() function needs to know what type of fault we got
> > (i.e. a read or a write fault). The rest of the information (such as
> > whether the slot is writable or not) is already available to
> > gfn_to_pfn_prot().
> >
> > The question here is whether we should pass "&writeable" or NULL as the
> > last parameter to gfn_to_pfn_prot(). The recent JUMP label support in
> > Linux RISC-V causes problems on HW where the PTE 'A' and 'D' bits are
> > not updated by HW, so I had to change the last parameter of
> > gfn_to_pfn_prot() from "&writeable" to NULL.
> >
> > I am still investigating this.
>
> This turned out to be a bug in Spike, which is now fixed.
>
> I will include the following change in the v16 patch series:
>
>
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 241030956d47..dc2666b4180b 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -232,8 +232,7 @@ void __kvm_riscv_hfence_gvma_all(void);
>
>  int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
>  			 struct kvm_memory_slot *memslot,
> -			 gpa_t gpa, unsigned long hva,
> -			 bool writeable, bool is_write);
> +			 gpa_t gpa, unsigned long hva, bool is_write);
>  void kvm_riscv_stage2_flush_cache(struct kvm_vcpu *vcpu);
>  int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
>  void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index fcaeadc9b34d..56fda9ef70fd 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -689,11 +689,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
>
>  int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
>  			 struct kvm_memory_slot *memslot,
> -			 gpa_t gpa, unsigned long hva,
> -			 bool writeable, bool is_write)
> +			 gpa_t gpa, unsigned long hva, bool is_write)
>  {
>  	int ret;
>  	kvm_pfn_t hfn;
> +	bool writeable;
>  	short vma_pageshift;
>  	gfn_t gfn = gpa >> PAGE_SHIFT;
>  	struct vm_area_struct *vma;
> @@ -742,7 +742,7 @@ int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
>
>  	mmu_seq = kvm->mmu_notifier_seq;
>
> -	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, NULL);
> +	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
>  	if (hfn == KVM_PFN_ERR_HWPOISON) {
>  		send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
>  				vma_pageshift, current);
> diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
> index f054406792a6..058cfa168abe 100644
> --- a/arch/riscv/kvm/vcpu_exit.c
> +++ b/arch/riscv/kvm/vcpu_exit.c
> @@ -445,7 +445,7 @@ static int stage2_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
>  		};
>  	}
>
> -	ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva, writeable,
> +	ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva,
>  		(trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
>  	if (ret < 0)
>  		return ret;
>
> Regards,
> Anup
This change looks good.
Yifei