This patch series adds RISC-V Confidential VM Extension (CoVE) support to the
Linux kernel. The RISC-V CoVE specification introduces non-ISA SBI APIs. These
APIs enable a confidential environment in which a guest VM's data can be isolated
from the host while the host retains control of guest VM management and platform
resources (memory, CPU, I/O).
This is very early WIP work. We are sharing it with the community to get
feedback on the overall architecture and direction. Any other feedback is welcome too.
The detailed CoVE architecture document can be found at [0]. It was previously
called AP-TEE and was recently renamed to CoVE to avoid overloading the term TEE in
general. The specification is in a draft stage and is subject to change based
on feedback from the community.
The CoVE specification introduces 3 new SBI extensions:
COVH - CoVE Host side interface
COVG - CoVE Guest side interface
COVI - CoVE Secure Interrupt management extension
Some key acronyms introduced:
TSM - TEE Security Manager
TVM - TEE VM (aka Confidential VM)
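The extension IDs follow the same ASCII-encoded scheme used by other SBI extensions.
A minimal sketch (the COVH value is taken from this series; the COVG and COVI values
are assumptions derived from the same ASCII encoding, not quoted from the spec):

  /* Sketch only: the COVG/COVI values below are assumed, not quoted from the spec. */
  enum sbi_ext_cove_id {
          SBI_EXT_COVH = 0x434F5648,  /* "COVH": host interface (defined later in this series) */
          SBI_EXT_COVG = 0x434F5647,  /* "COVG": guest interface (assumed) */
          SBI_EXT_COVI = 0x434F5649,  /* "COVI": interrupt management (assumed) */
  };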
CoVE Architecture:
====================
The CoVE APIs are designed to be implementation and architecture agnostic,
allowing for different deployment models while retaining common host and guest
kernel code. Two examples are shown in Figure 1 and Figure 2.
As shown in both figures, the architecture introduces a new software component
called the "TEE Security Manager" (TSM) that runs in HS mode. The TSM has a minimal,
hardware-attested footprint in the TCB, as it is a passive component that doesn't
support scheduling or timer interrupts. Both example deployment models provide memory
isolation between the host and the TEE VM (TVM).
Non secure world | Secure world |
| |
Non | |
Virtualized | Virtualized | Virtualized Virtualized |
Env | Env | Env Env |
+----------+ | +----------+ | +----------+ +----------+ | --------------
| | | | | | | | | | |
| Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
| (VMM) | | | | | | | | | |
+----------+ | +----------+ | +----------+ +----------+ | --------------
| | +----------+ | +----------+ +----------+ |
| | | | | | | | | |
| | | | | | TVM | | TVM | |
| | | Guest | | | Guest | | Guest | | VS-Mode
Syscalls | +----------+ | +----------+ +----------+ |
| | | | |
| SBI | SBI(COVG + COVI) |
| | | | |
+--------------------------+ | +---------------------------+ --------------
| Host (Linux) | | | TSM (Salus) |
+--------------------------+ | +---------------------------+
| | | HS-Mode
SBI (COVH + COVI) | SBI (COVH + COVI)
| | |
+-----------------------------------------------------------+ --------------
| Firmware(OpenSBI) + TSM Driver | M-Mode
+-----------------------------------------------------------+ --------------
+-----------------------------------------------------------------------------
| Hardware (RISC-V CPU + RoT + IOMMU)
+----------------------------------------------------------------------------
Figure 1: Host in HS model
The deployment model shown in Figure 1 runs the host in HS mode, where it is a peer
of the TSM, which also runs in HS mode. It requires another component, known as the
TSM driver, running in a higher privilege mode than the host/TSM. It is responsible
for switching the context between the host and the TSM. The TSM driver also manages
the platform-specific hardware mechanism (the confidential domain bit described in the
specification [0]) to provide the required memory isolation.
Non secure world | Secure world
|
Virtualized Env | Virtualized Virtualized |
Env Env |
+-------------------------+ | +----------+ +----------+ | ------------
| | | | | | | | | | |
| Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
+----------+ | +----------+ | +----------+ +----------+ | ------------
| | | | |
Syscalls SBI | | | |
| | | | |
+--------------------------+ | +-----------+ +-----------+ |
| Host (Linux) | | | TVM Guest| | TVM Guest| | VS-Mode
+--------------------------+ | +-----------+ +-----------+ |
| | | | |
SBI (COVH + COVI) | SBI SBI |
| | (COVG + COVI) (COVG + COVI)|
| | | | |
+-----------------------------------------------------------+ --------------
| TSM(Salus) | HS-Mode
+-----------------------------------------------------------+ --------------
|
SBI
|
+---------------------------------------------------------+ --------------
| Firmware(OpenSBI) | M-Mode
+---------------------------------------------------------+ --------------
+-----------------------------------------------------------------------------
| Hardware (RISC-V CPU + RoT + IOMMU)
+----------------------------------------------------------------------------
Figure 2: Host in VS model
The deployment model shown in Figure 2 simplifies the context switch and memory isolation
by running the host in VS mode as a guest of the TSM. Thus, memory isolation is
achieved by the G-stage mapping maintained by the TSM. No additional hardware confidential
domain bit is needed to provide memory isolation. The downside of this model is that the
host has to run non-confidential VMs in a nested environment, which may have lower
performance (yet to be measured). The current TSM implementation (Salus) doesn't support
full nested virtualization yet.
The platform must have a RoT to provide attestation in either model.
This patch series implements the APIs defined by CoVE. The current TSM implementation
allows the host to run TVMs as shown in Figure 2. We are working on deployment
model 1 in parallel. We do not expect any significant changes in either the host- or
guest-side ABI because of that.
Shared memory between the host & TSM:
=====================================
To accelerate H-mode CSR/GPR access, CoVE also reuses the Nested Acceleration (NACL)
SBI extension [1]. NACL defines a per-physical-CPU shared memory area that is allocated
at boot. It allows the host running in VS mode to access H-mode CSRs/GPRs easily
without trapping into the TSM. The CoVE specification clearly defines the exact
state of the shared memory, with r/w permissions, at every call.
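Conceptually, instead of trapping on every CSR access, the host stages values in the
per-CPU shared memory and the TSM consumes them on the next switch. A rough sketch,
assuming hypothetical accessor names (the real layout and helpers come from the NACL
series [1]):

  /*
   * Hedged sketch: nacl_shmem() and nacl_shmem_csr_write() are assumed helper
   * names standing in for the accessors defined by the NACL series.
   */
  static void cove_stage_hvip(unsigned long hvip_val)
  {
          void *shmem = nacl_shmem();     /* per-physical-CPU area registered at boot */

          if (shmem)
                  /* Stage the value; the TSM applies it when it next enters the TVM. */
                  nacl_shmem_csr_write(shmem, CSR_HVIP, hvip_val);
          else
                  /* Without NACL, each H-mode CSR access traps and is emulated by the TSM. */
                  csr_write(CSR_HVIP, hvip_val);
  }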
Secure Interrupt management:
===========================
The CoVE specification relies on the MSI-based interrupt scheme defined in the Advanced
Interrupt Architecture specification [2]. The COVI SBI extension adds functions to bind
a guest interrupt file to a TVM. After that, only TCB components (TSM, TVM, TSM driver)
can modify it. The host can inject an interrupt only via the TSM.
A TVM is also in complete control of which interrupts it can receive. By default,
all interrupts are denied. In this proof-of-concept implementation, the guest allows
all interrupts at boot time to keep things simple.
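As an illustration of the host-side flow, binding an IMSIC guest interrupt file to a
TVM vcpu would look roughly like this (a sketch only: the function ID name and argument
order are assumptions; the real definitions live in the COVI patches of this series):

  /* Hedged sketch: SBI_EXT_COVI_BIND_VCPU_IMSIC and its arguments are placeholder names. */
  static int cove_bind_imsic(unsigned long tvm_gid, unsigned long vcpu_id,
                             unsigned long imsic_file_pa)
  {
          struct sbiret ret;

          ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_BIND_VCPU_IMSIC,
                          tvm_gid, vcpu_id, imsic_file_pa, 0, 0, 0);
          if (ret.error)
                  return sbi_err_map_linux_errno(ret.error);

          /* From here on, only TCB components (TSM/TVM/TSM driver) may modify the file. */
          return 0;
  }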
Device I/O:
===========
In order to support paravirt I/O devices, a SWIOTLB bounce buffer must be used by the
guest. As the host cannot access confidential memory, this buffer memory
must be shared with the host via the share/unshare functions defined in the COVG SBI
extension. The RISC-V implementation achieves this by generalizing mem_encrypt_init(),
similar to TDX/SEV/CCA. That is why a CoVE guest is only allowed to use virtio devices
with VIRTIO_F_ACCESS_PLATFORM and VIRTIO_F_VERSION_1, as these force the virtio drivers
to use the DMA API.
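A rough sketch of the guest-side wiring through the mem_encrypt hooks
(sbi_covg_share_memory() here is an assumed wrapper for the COVG share call; the real
implementation is in the guest patches of this series):

  /* Hedged sketch of a CoVE-guest set_memory_decrypted(); the SWIOTLB pool is
   * shared this way once at boot so all bounce-buffered I/O is host visible. */
  int set_memory_decrypted(unsigned long addr, int numpages)
  {
          if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
                  return 0;

          /* Ask the TSM (via COVG) to mark these pages shared with the host. */
          return sbi_covg_share_memory(__pa(addr), numpages * PAGE_SIZE);
  }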
MMIO emulation:
======================
A TVM can register regions of its address space as MMIO regions to be emulated by
the host. The TSM provides explicit SBI functions, i.e. SBI_EXT_COVG_[ADD/REMOVE]_MMIO_REGION,
to add/remove MMIO regions. Any reads or writes to those MMIO regions after the
SBI_EXT_COVG_ADD_MMIO_REGION call are forwarded to the host for emulation.
This series allows any ioremapped memory to be emulated as an MMIO region with the
above APIs via arch hooks inspired by the pKVM work. We are aware that this model
doesn't address all the threat vectors. We have also implemented the device
filtering/authorization approach adopted by TDX [4]. However, those patches are not
part of this series as the base TDX patches are still under active development.
RISC-V CoVE will also adopt the revamped device filtering work once it is accepted
by the Linux community in the future.
Direct assignment of devices is a work in progress and will be added in the future [4].
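For illustration, the arch hooks mentioned above could forward newly ioremapped ranges
to the host roughly as follows (a sketch: is_cove_guest() and
sbi_covg_add/remove_mmio_region() are assumed names wrapping the COVG calls; the hook
signatures match the vmalloc patch later in this series):

  /* Hedged sketch of the arch-specific ioremap hooks for a CoVE guest. */
  void ioremap_phys_range_hook(phys_addr_t phys_addr, size_t size, pgprot_t prot)
  {
          if (!is_cove_guest())
                  return;

          /* Register the range so host MMIO emulation is allowed for it. */
          sbi_covg_add_mmio_region(phys_addr, size);
  }

  void iounmap_phys_range_hook(phys_addr_t phys_addr, size_t size)
  {
          if (!is_cove_guest())
                  return;

          sbi_covg_remove_mmio_region(phys_addr, size);
  }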
VMM support:
============
This series has only been tested with kvmtool. Other VMM support (qemu-kvm, crossvm/rust-vmm)
will be added later.
Test cases:
===========
We are working on kvm selftests for CoVE and will post them as soon as they are ready.
We haven't started any work on kvm-unit-tests, as RISC-V doesn't have the basic
infrastructure to support them. Once the kvm-unit-tests infrastructure is in place,
we will add support for CoVE there as well.
Open design questions:
======================
1. The current implementation has two separate configs for the guest (CONFIG_RISCV_COVE_GUEST)
and the host (CONFIG_RISCV_COVE_HOST). The default defconfig will enable both so that the
same unified image works as both host & guest. Most likely, distros will prefer this to
minimize the maintenance burden, but some may want a minimal CoVE guest image
that has only hardened drivers. In addition to that, Android runs a microdroid instance
in its confidential guests. A separate config will help in those cases. Please let us
know if there is any concern with having two configs.
2. Lazy gstage page allocation vs upfront allocation with a page pool.
Currently, all gstage mappings happen at runtime during the fault. This is expensive
as we need to convert each such page to confidential memory as well. A page pool framework
that holds pre-allocated, already-converted confidential pages may be a better choice.
Would a generic page pool infrastructure benefit other CC solutions as well?
3. In order to allow both confidential and non-confidential VMs, the series
uses regular branching instead of static branches for CoVE VM specific cases throughout
KVM. That may cause a few extra branch penalties while running regular VMs.
The alternative is to use function pointers for any function that needs to
take a different path. As per our understanding, that would be worse than branches
(a comparison is sketched below).
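To make the trade-off concrete, here is a sketch of the pattern the series uses today
versus a static-branch alternative (the static key and the *_normal helper are
hypothetical, not part of this series):

  /* (a) What the series does today: a plain conditional on the hot path. */
  static void vcpu_load_dispatch(struct kvm_vcpu *vcpu)
  {
          if (is_cove_vcpu(vcpu))
                  kvm_riscv_cove_vcpu_load(vcpu);
          else
                  kvm_riscv_vcpu_load_normal(vcpu);       /* hypothetical name for the existing path */
  }

  /* (b) Static-branch alternative: no branch cost on hosts that never enable CoVE. */
  DEFINE_STATIC_KEY_FALSE(kvm_riscv_cove_active);         /* flipped once during CoVE init */

  static void vcpu_load_dispatch_static(struct kvm_vcpu *vcpu)
  {
          if (static_branch_unlikely(&kvm_riscv_cove_active) && is_cove_vcpu(vcpu))
                  kvm_riscv_cove_vcpu_load(vcpu);
          else
                  kvm_riscv_vcpu_load_normal(vcpu);
  }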
Patch organization:
===================
This series depends on quite a few RISC-V patches that are not upstream yet.
Here are the dependencies:
1. RISC-V IPI improvement series
2. RISC-V AIA support series
3. RISC-V NACL support series
In this series, PATCH [0-5] are generic improvement and cleanup patches that
can be merged independently.
PATCH [6-26, 34-37] add the host side support for CoVE.
PATCH [27-33] add the interrupt-related changes.
PATCH [34-49] add the guest-side changes for CoVE.
The TSM project is written in Rust and can be found here:
https://github.com/rivosinc/salus
Running the stack
====================
To run/test the stack, you would need the following components:
1) Qemu
2) Common Host & Guest Kernel
3) kvmtool
4) Host RootFS with KVMTOOL and Guest Kernel
5) Salus
The detailed steps are available at [6].
The Linux kernel patches are also available at [7] and the kvmtool patches
are available at [8].
TODOs
=======
As this is very early work, the TODO list is quite long :).
Here are some of the items (in no particular order):
1. Support fd based private memory interface proposed in
https://lkml.org/lkml/2022/1/18/395
2. Align with updated guest runtime device filtering approach.
3. IOMMU integration
4. Dedicated device assignment via TDISP & SPDM [4]
5. Support huge pages
6. Page pool allocator to avoid convert/reclaim at every fault
7. Other VMM support (qemu-kvm, crossvm)
8. Complete the PoC for the deployment model 1 where host runs in HS mode
9. Attestation integration
10. Harden the interrupt allowed list
11. kvm selftests support for CoVE
12. kvm-unit-tests support for CoVE
13. Guest hardening
14. Port pKVM on RISC-V using CoVE
15. Any other?
Links
============
[0] CoVE architecture Specification.
https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-aptee-spec.pdf
[1] https://lists.riscv.org/g/sig-hypervisors/message/260
[2] https://github.com/riscv/riscv-aia/releases/download/1.0-RC2/riscv-interrupts-1.0-RC2.pdf
[3] https://github.com/rivosinc/linux/tree/cove_integration_device_filtering1
[4] https://github.com/intel/tdx/commits/guest-filter-upstream
[5] https://lists.riscv.org/g/tech-ap-tee/message/83
[6] https://github.com/rivosinc/cove/wiki/CoVE-KVM-RISCV64-on-QEMU
[7] https://github.com/rivosinc/linux/commits/cove-integration
[8] https://github.com/rivosinc/kvmtool/tree/cove-integration-03072023
Atish Patra (33):
RISC-V: KVM: Improve KVM error reporting to the user space
RISC-V: KVM: Invoke aia_update with preempt disabled/irq enabled
RISC-V: KVM: Add a helper function to get pgd size
RISC-V: Add COVH SBI extensions definitions
RISC-V: KVM: Implement COVH SBI extension
RISC-V: KVM: Add a barebone CoVE implementation
RISC-V: KVM: Add UABI to support static memory region attestation
RISC-V: KVM: Add CoVE related nacl helpers
RISC-V: KVM: Implement static memory region measurement
RISC-V: KVM: Use the new VM IOCTL for measuring pages
RISC-V: KVM: Exit to the user space for trap redirection
RISC-V: KVM: Return early for gstage modifications
RISC-V: KVM: Skip dirty logging updates for TVM
RISC-V: KVM: Add a helper function to trigger fence ops
RISC-V: KVM: Skip most VCPU requests for TVMs
RISC-V : KVM: Skip vmid/hgatp management for TVMs
RISC-V: KVM: Skip TLB management for TVMs
RISC-V: KVM: Register memory regions as confidential for TVMs
RISC-V: KVM: Add gstage mapping for TVMs
RISC-V: KVM: Handle SBI call forward from the TSM
RISC-V: KVM: Implement vcpu load/put functions for CoVE guests
RISC-V: KVM: Wireup TVM world switch
RISC-V: KVM: Skip HVIP update for TVMs
RISC-V: KVM: Implement COVI SBI extension
RISC-V: KVM: Add interrupt management functions for TVM
RISC-V: KVM: Skip AIA CSR updates for TVMs
RISC-V: KVM: Perform limited operations in hardware enable/disable
RISC-V: KVM: Indicate no support user space emulated IRQCHIP
RISC-V: KVM: Add AIA support for TVMs
RISC-V: KVM: Hookup TVM VCPU init/destroy
RISC-V: KVM: Initialize CoVE
RISC-V: KVM: Add TVM init/destroy calls
drivers/hvc: sbi: Disable HVC console for TVMs
Rajnesh Kanwal (15):
mm/vmalloc: Introduce arch hooks to notify ioremap/unmap changes
RISC-V: KVM: Update timer functionality for TVMs.
RISC-V: Add COVI extension definitions
RISC-V: KVM: Read/write gprs from/to shmem in case of TVM VCPU.
RISC-V: Add COVG SBI extension definitions
RISC-V: Add CoVE guest config and helper functions
RISC-V: Implement COVG SBI extension
RISC-V: COVE: Add COVH invalidate, validate, promote, demote and
remove APIs.
RISC-V: KVM: Add host side support to handle COVG SBI calls.
RISC-V: Allow host to inject any ext interrupt id to a CoVE guest.
RISC-V: Add base memory encryption functions.
RISC-V: Add cc_platform_has() for RISC-V for CoVE
RISC-V: ioremap: Implement for arch specific ioremap hooks
riscv/virtio: Have CoVE guests enforce restricted virtio memory
access.
RISC-V: Add shared bounce buffer to support DBCN for CoVE Guest.
arch/riscv/Kbuild | 2 +
arch/riscv/Kconfig | 27 +
arch/riscv/cove/Makefile | 2 +
arch/riscv/cove/core.c | 40 +
arch/riscv/cove/cove_guest_sbi.c | 109 +++
arch/riscv/include/asm/cove.h | 27 +
arch/riscv/include/asm/covg_sbi.h | 38 +
arch/riscv/include/asm/csr.h | 2 +
arch/riscv/include/asm/kvm_cove.h | 206 +++++
arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
arch/riscv/include/asm/kvm_host.h | 10 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
arch/riscv/include/asm/mem_encrypt.h | 26 +
arch/riscv/include/asm/sbi.h | 107 +++
arch/riscv/include/uapi/asm/kvm.h | 17 +
arch/riscv/kernel/irq.c | 12 +
arch/riscv/kernel/setup.c | 2 +
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/aia.c | 101 ++-
arch/riscv/kvm/aia_device.c | 41 +-
arch/riscv/kvm/aia_imsic.c | 127 ++-
arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
arch/riscv/kvm/main.c | 30 +-
arch/riscv/kvm/mmu.c | 45 +-
arch/riscv/kvm/tlb.c | 11 +-
arch/riscv/kvm/vcpu.c | 69 +-
arch/riscv/kvm/vcpu_exit.c | 34 +-
arch/riscv/kvm/vcpu_insn.c | 115 ++-
arch/riscv/kvm/vcpu_sbi.c | 16 +
arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
arch/riscv/kvm/vcpu_timer.c | 26 +-
arch/riscv/kvm/vm.c | 34 +-
arch/riscv/kvm/vmid.c | 17 +-
arch/riscv/mm/Makefile | 3 +
arch/riscv/mm/init.c | 17 +-
arch/riscv/mm/ioremap.c | 45 +
arch/riscv/mm/mem_encrypt.c | 61 ++
drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
include/uapi/linux/kvm.h | 8 +
mm/vmalloc.c | 16 +
42 files changed, 3222 insertions(+), 109 deletions(-)
create mode 100644 arch/riscv/cove/Makefile
create mode 100644 arch/riscv/cove/core.c
create mode 100644 arch/riscv/cove/cove_guest_sbi.c
create mode 100644 arch/riscv/include/asm/cove.h
create mode 100644 arch/riscv/include/asm/covg_sbi.h
create mode 100644 arch/riscv/include/asm/kvm_cove.h
create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
create mode 100644 arch/riscv/include/asm/mem_encrypt.h
create mode 100644 arch/riscv/kvm/cove.c
create mode 100644 arch/riscv/kvm/cove_sbi.c
create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
create mode 100644 arch/riscv/mm/ioremap.c
create mode 100644 arch/riscv/mm/mem_encrypt.c
--
2.25.1
This patch adds a RISC-V specific cause for KVM_RUN ioctl failure.
For now, it will be used for the following two cases:
1. Insufficient IMSIC files if the VM is configured to run in HWACCEL mode
2. The TSM is unable to complete the run_vcpu SBI call for a TVM
KVM also uses a custom scause bit (48) to distinguish this case
from regular vcpu exit causes.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/csr.h | 2 ++
arch/riscv/include/uapi/asm/kvm.h | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 3176355..e78503a 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -96,6 +96,8 @@
#define EXC_VIRTUAL_INST_FAULT 22
#define EXC_STORE_GUEST_PAGE_FAULT 23
+#define EXC_CUSTOM_KVM_COVE_RUN_FAIL 48
+
/* PMP configuration */
#define PMP_R 0x01
#define PMP_W 0x02
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index b41d0e7..11440df 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -245,6 +245,10 @@ enum KVM_RISCV_SBI_EXT_ID {
/* One single KVM irqchip, ie. the AIA */
#define KVM_NR_IRQCHIPS 1
+/* run->fail_entry.hardware_entry_failure_reason codes. */
+#define KVM_EXIT_FAIL_ENTRY_IMSIC_FILE_UNAVAILABLE (1ULL << 0)
+#define KVM_EXIT_FAIL_ENTRY_COVE_RUN_VCPU (1ULL << 1)
+
#endif
#endif /* __LINUX_KVM_RISCV_H */
--
2.25.1
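For context, these reason codes surface to user space through the generic
KVM_EXIT_FAIL_ENTRY exit. A rough sketch of how a failed run_vcpu call could be
reported (the actual call site is in a later patch of this series, so the surrounding
error check here is an assumption):

  /* Hedged sketch: report a TSM run_vcpu failure to the VMM. */
  if (rc) {
          vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
          vcpu->run->fail_entry.hardware_entry_failure_reason =
                                          KVM_EXIT_FAIL_ENTRY_COVE_RUN_VCPU;
          vcpu->run->fail_entry.cpu = vcpu->cpu;
          /* The exit path also tags the synthesized trap cause with
           * EXC_CUSTOM_KVM_COVE_RUN_FAIL so it is not mistaken for an
           * architectural vcpu exit cause. */
  }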
Some of the aia_update operations need to send IPIs, which requires
interrupts to be enabled. Currently, the entire aia_update is called
from an irqs-disabled context, while only disabling preemption is
necessary.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index e65852d..c53bf98 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -1247,15 +1247,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_riscv_check_vcpu_requests(vcpu);
- local_irq_disable();
-
/* Update AIA HW state before entering guest */
+ preempt_disable();
ret = kvm_riscv_vcpu_aia_update(vcpu);
if (ret <= 0) {
- local_irq_enable();
+ preempt_enable();
continue;
}
+ preempt_enable();
+ local_irq_disable();
/*
* Ensure we set mode to IN_GUEST_MODE after we disable
* interrupts and before the final VCPU requests check.
--
2.25.1
The CoVE support will need to know the pgd size to be used for
the gstage page table directory. Export the value via an additional
helper.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 1 +
arch/riscv/kvm/mmu.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 8714325..63c46af 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -313,6 +313,7 @@ void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu);
void __init kvm_riscv_gstage_mode_detect(void);
unsigned long __init kvm_riscv_gstage_mode(void);
int kvm_riscv_gstage_gpa_bits(void);
+unsigned long kvm_riscv_gstage_pgd_size(void);
void __init kvm_riscv_gstage_vmid_detect(void);
unsigned long kvm_riscv_gstage_vmid_bits(void);
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f0fff56..6b037f7 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -797,3 +797,8 @@ int kvm_riscv_gstage_gpa_bits(void)
{
return gstage_gpa_bits;
}
+
+unsigned long kvm_riscv_gstage_pgd_size(void)
+{
+ return gstage_pgd_size;
+}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
In virtualization, the guest may need to notify the host about its ioremap
regions. This is a common use case in confidential computing, where the
host only provides MMIO emulation for the regions specified by the guest.
Add a pair of arch-specific callbacks to track the ioremapped regions.
This patch is based on the pKVM patches. A generic arch config can be added,
similar to pKVM, if this is going to be the final solution. The device
authorization/filtering approach is very different from this, and we
may prefer that one as it provides more flexibility in terms of which
devices are allowed for the confidential guests.
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
mm/vmalloc.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bef6cf2..023630e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -304,6 +304,14 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
return err;
}
+__weak void ioremap_phys_range_hook(phys_addr_t phys_addr, size_t size, pgprot_t prot)
+{
+}
+
+__weak void iounmap_phys_range_hook(phys_addr_t phys_addr, size_t size)
+{
+}
+
int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
{
@@ -315,6 +323,10 @@ int ioremap_page_range(unsigned long addr, unsigned long end,
if (!err)
kmsan_ioremap_page_range(addr, end, phys_addr, prot,
ioremap_max_page_shift);
+
+ if (!err)
+ ioremap_phys_range_hook(phys_addr, end - addr, prot);
+
return err;
}
@@ -2772,6 +2784,10 @@ void vunmap(const void *addr)
addr);
return;
}
+
+ if (vm->flags & VM_IOREMAP)
+ iounmap_phys_range_hook(vm->phys_addr, get_vm_area_size(vm));
+
kfree(vm);
}
EXPORT_SYMBOL(vunmap);
--
2.25.1
The RISC-V Confidential VM Extension (CoVE) specification defines the
following 3 SBI extensions:
COVH (Host side interface)
COVG (Guest side interface)
COVI (Interrupt management interface)
A few acronyms introduced in this patch:
TSM - TEE Security Manager
TVM - TEE VM
This patch adds the definitions for the COVH extension only.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 61 ++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 62d00c7..c5a5526 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -32,6 +32,7 @@ enum sbi_ext_id {
SBI_EXT_PMU = 0x504D55,
SBI_EXT_DBCN = 0x4442434E,
SBI_EXT_NACL = 0x4E41434C,
+ SBI_EXT_COVH = 0x434F5648,
/* Experimentals extensions must lie within this range */
SBI_EXT_EXPERIMENTAL_START = 0x08000000,
@@ -348,6 +349,66 @@ enum sbi_ext_nacl_feature {
#define SBI_NACL_SHMEM_SRET_X(__i) ((__riscv_xlen / 8) * (__i))
#define SBI_NACL_SHMEM_SRET_X_LAST 31
+/* SBI COVH extension data structures */
+enum sbi_ext_covh_fid {
+ SBI_EXT_COVH_TSM_GET_INFO = 0,
+ SBI_EXT_COVH_TSM_CONVERT_PAGES,
+ SBI_EXT_COVH_TSM_RECLAIM_PAGES,
+ SBI_EXT_COVH_TSM_INITIATE_FENCE,
+ SBI_EXT_COVH_TSM_LOCAL_FENCE,
+ SBI_EXT_COVH_CREATE_TVM,
+ SBI_EXT_COVH_FINALIZE_TVM,
+ SBI_EXT_COVH_DESTROY_TVM,
+ SBI_EXT_COVH_TVM_ADD_MEMORY_REGION,
+ SBI_EXT_COVH_TVM_ADD_PGT_PAGES,
+ SBI_EXT_COVH_TVM_ADD_MEASURED_PAGES,
+ SBI_EXT_COVH_TVM_ADD_ZERO_PAGES,
+ SBI_EXT_COVH_TVM_ADD_SHARED_PAGES,
+ SBI_EXT_COVH_TVM_CREATE_VCPU,
+ SBI_EXT_COVH_TVM_VCPU_RUN,
+ SBI_EXT_COVH_TVM_INITIATE_FENCE,
+};
+
+enum sbi_cove_page_type {
+ SBI_COVE_PAGE_4K,
+ SBI_COVE_PAGE_2MB,
+ SBI_COVE_PAGE_1GB,
+ SBI_COVE_PAGE_512GB,
+};
+
+enum sbi_cove_tsm_state {
+ /* TSM has not been loaded yet */
+ TSM_NOT_LOADED,
+ /* TSM has been loaded but not initialized yet */
+ TSM_LOADED,
+ /* TSM has been initialized and ready to run */
+ TSM_READY,
+};
+
+struct sbi_cove_tsm_info {
+ /* Current state of the TSM */
+ enum sbi_cove_tsm_state tstate;
+
+ /* Version of the loaded TSM */
+ uint32_t version;
+
+ /* Number of 4K pages required per TVM */
+ unsigned long tvm_pages_needed;
+
+ /* Maximum VCPUs supported per TVM */
+ unsigned long tvm_max_vcpus;
+
+ /* Number of 4K pages each vcpu per TVM */
+ unsigned long tvcpu_pages_needed;
+};
+
+struct sbi_cove_tvm_create_params {
+ /* Root page directory for TVM's page table management */
+ unsigned long tvm_page_directory_addr;
+ /* Confidential memory address used to store TVM state information. Must be page aligned */
+ unsigned long tvm_state_addr;
+};
+
#define SBI_SPEC_VERSION_DEFAULT 0x1
#define SBI_SPEC_VERSION_MAJOR_SHIFT 24
#define SBI_SPEC_VERSION_MAJOR_MASK 0x7f
--
2.25.1
This patch just adds a barebones implementation of the CoVE functionality
that exercises the COVH functions to create/manage pages for various
boot-time operations such as page directory and page table management,
vcpu/vm state management, etc.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 154 ++++++++++++
arch/riscv/include/asm/kvm_host.h | 7 +
arch/riscv/kvm/Makefile | 2 +-
arch/riscv/kvm/cove.c | 401 ++++++++++++++++++++++++++++++
arch/riscv/kvm/cove_sbi.c | 2 -
include/uapi/linux/kvm.h | 6 +
6 files changed, 569 insertions(+), 3 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_cove.h
create mode 100644 arch/riscv/kvm/cove.c
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
new file mode 100644
index 0000000..3bf1bcd
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -0,0 +1,154 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * COVE related header file.
+ *
+ * Copyright (c) 2023 RivosInc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#ifndef __KVM_RISCV_COVE_H
+#define __KVM_RISCV_COVE_H
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/list.h>
+#include <asm/csr.h>
+#include <asm/sbi.h>
+
+#define KVM_COVE_PAGE_SIZE_4K (1UL << 12)
+#define KVM_COVE_PAGE_SIZE_2MB (1UL << 21)
+#define KVM_COVE_PAGE_SIZE_1GB (1UL << 30)
+#define KVM_COVE_PAGE_SIZE_512GB (1UL << 39)
+
+#define bytes_to_pages(n) ((n + PAGE_SIZE - 1) >> PAGE_SHIFT)
+
+/* Allocate 2MB(i.e. 512 pages) for the page table pool */
+#define KVM_COVE_PGTABLE_SIZE_MAX ((1UL << 10) * PAGE_SIZE)
+
+#define get_order_num_pages(n) (get_order(n << PAGE_SHIFT))
+
+/* Describe a confidential or shared memory region */
+struct kvm_riscv_cove_mem_region {
+ unsigned long hva;
+ unsigned long gpa;
+ unsigned long npages;
+};
+
+/* Page management structure for the host */
+struct kvm_riscv_cove_page {
+ struct list_head link;
+
+ /* Pointer to page allocated */
+ struct page *page;
+
+ /* number of pages allocated for page */
+ unsigned long npages;
+
+ /* Described the page type */
+ unsigned long ptype;
+
+ /* set if the page is mapped in guest physical address */
+ bool is_mapped;
+
+ /* The below two fileds are only valid if is_mapped is true */
+ /* host virtual address for the mapping */
+ unsigned long hva;
+ /* guest physical address for the mapping */
+ unsigned long gpa;
+};
+
+struct kvm_cove_tvm_vcpu_context {
+ struct kvm_vcpu *vcpu;
+ /* Pages storing each vcpu state of the TVM in TSM */
+ struct kvm_riscv_cove_page vcpu_state;
+};
+
+struct kvm_cove_tvm_context {
+ struct kvm *kvm;
+
+ /* TODO: This is not really a VMID as TSM returns the page owner ID instead of VMID */
+ unsigned long tvm_guest_id;
+
+ /* Pages where TVM page table is stored */
+ struct kvm_riscv_cove_page pgtable;
+
+ /* Pages storing the TVM state in TSM */
+ struct kvm_riscv_cove_page tvm_state;
+
+ /* Keep track of zero pages */
+ struct list_head zero_pages;
+
+ /* Pages where TVM image is measured & loaded */
+ struct list_head measured_pages;
+
+ /* keep track of shared pages */
+ struct list_head shared_pages;
+
+ /* keep track of pending reclaim confidential pages */
+ struct list_head reclaim_pending_pages;
+
+ struct kvm_riscv_cove_mem_region shared_region;
+ struct kvm_riscv_cove_mem_region confidential_region;
+
+ /* spinlock to protect the tvm fence sequence */
+ spinlock_t tvm_fence_lock;
+
+ /* Track TVM state */
+ bool finalized_done;
+};
+
+static inline bool is_cove_vm(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_VM_TYPE_RISCV_COVE;
+}
+
+static inline bool is_cove_vcpu(struct kvm_vcpu *vcpu)
+{
+ return is_cove_vm(vcpu->kvm);
+}
+
+#ifdef CONFIG_RISCV_COVE_HOST
+
+bool kvm_riscv_cove_enabled(void);
+int kvm_riscv_cove_init(void);
+
+/* TVM related functions */
+void kvm_riscv_cove_vm_destroy(struct kvm *kvm);
+int kvm_riscv_cove_vm_init(struct kvm *kvm);
+
+/* TVM VCPU related functions */
+void kvm_riscv_cove_vcpu_destroy(struct kvm_vcpu *vcpu);
+int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu);
+void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu);
+void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu);
+void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap);
+
+int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size);
+int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva);
+#else
+static inline bool kvm_riscv_cove_enabled(void) {return false; };
+static inline int kvm_riscv_cove_init(void) { return -1; }
+static inline void kvm_riscv_cove_hardware_disable(void) {}
+static inline int kvm_riscv_cove_hardware_enable(void) {return 0; }
+
+/* TVM related functions */
+static inline void kvm_riscv_cove_vm_destroy(struct kvm *kvm) {}
+static inline int kvm_riscv_cove_vm_init(struct kvm *kvm) {return -1; }
+
+/* TVM VCPU related functions */
+static inline void kvm_riscv_cove_vcpu_destroy(struct kvm_vcpu *vcpu) {}
+static inline int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu) {return -1; }
+static inline void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu) {}
+static inline void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap) {}
+static inline int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa,
+ unsigned long size) {return -1; }
+static inline int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu,
+ gpa_t gpa, unsigned long hva) {return -1; }
+#endif /* CONFIG_RISCV_COVE_HOST */
+
+#endif /* __KVM_RISCV_COVE_H */
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 63c46af..ca2ebe3 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -88,6 +88,8 @@ struct kvm_vmid {
};
struct kvm_arch {
+ unsigned long vm_type;
+
/* G-stage vmid */
struct kvm_vmid vmid;
@@ -100,6 +102,9 @@ struct kvm_arch {
/* AIA Guest/VM context */
struct kvm_aia aia;
+
+ /* COVE guest/VM context */
+ struct kvm_cove_tvm_context *tvmc;
};
struct kvm_cpu_trap {
@@ -242,6 +247,8 @@ struct kvm_vcpu_arch {
/* Performance monitoring context */
struct kvm_pmu pmu_context;
+
+ struct kvm_cove_tvm_vcpu_context *tc;
};
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 40dee04..8c91551 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -31,4 +31,4 @@ kvm-y += aia.o
kvm-y += aia_device.o
kvm-y += aia_aplic.o
kvm-y += aia_imsic.o
-kvm-$(CONFIG_RISCV_COVE_HOST) += cove_sbi.o
+kvm-$(CONFIG_RISCV_COVE_HOST) += cove_sbi.o cove.o
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
new file mode 100644
index 0000000..d001e36
--- /dev/null
+++ b/arch/riscv/kvm/cove.c
@@ -0,0 +1,401 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * COVE related helper functions.
+ *
+ * Copyright (c) 2023 RivosInc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <linux/smp.h>
+#include <asm/csr.h>
+#include <asm/sbi.h>
+#include <asm/kvm_nacl.h>
+#include <asm/kvm_cove.h>
+#include <asm/kvm_cove_sbi.h>
+#include <asm/asm-offsets.h>
+
+static struct sbi_cove_tsm_info tinfo;
+struct sbi_cove_tvm_create_params params;
+
+/* We need a global lock as initiate fence can be invoked once per host */
+static DEFINE_SPINLOCK(cove_fence_lock);
+
+static bool riscv_cove_enabled;
+
+static void kvm_cove_local_fence(void *info)
+{
+ int rc;
+
+ rc = sbi_covh_tsm_local_fence();
+
+ if (rc)
+ kvm_err("local fence for TSM failed %d on cpu %d\n", rc, smp_processor_id());
+}
+
+static void cove_delete_page_list(struct kvm *kvm, struct list_head *tpages, bool unpin)
+{
+ struct kvm_riscv_cove_page *tpage, *temp;
+ int rc;
+
+ list_for_each_entry_safe(tpage, temp, tpages, link) {
+ rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tpage->page), tpage->npages);
+ if (rc)
+ kvm_err("Reclaiming page %llx failed\n", page_to_phys(tpage->page));
+ if (unpin)
+ unpin_user_pages_dirty_lock(&tpage->page, 1, true);
+ list_del(&tpage->link);
+ kfree(tpage);
+ }
+}
+
+static int kvm_riscv_cove_fence(void)
+{
+ int rc;
+
+ spin_lock(&cove_fence_lock);
+
+ rc = sbi_covh_tsm_initiate_fence();
+ if (rc) {
+ kvm_err("initiate fence for tsm failed %d\n", rc);
+ goto done;
+ }
+
+ /* initiate local fence on each online hart */
+ on_each_cpu(kvm_cove_local_fence, NULL, 1);
+done:
+ spin_unlock(&cove_fence_lock);
+ return rc;
+}
+
+static int cove_convert_pages(unsigned long phys_addr, unsigned long npages, bool fence)
+{
+ int rc;
+
+ if (!IS_ALIGNED(phys_addr, PAGE_SIZE))
+ return -EINVAL;
+
+ rc = sbi_covh_tsm_convert_pages(phys_addr, npages);
+ if (rc)
+ return rc;
+
+ /* Conversion was successful. Flush the TLB if caller requested */
+ if (fence)
+ rc = kvm_riscv_cove_fence();
+
+ return rc;
+}
+
+__always_inline bool kvm_riscv_cove_enabled(void)
+{
+ return riscv_cove_enabled;
+}
+
+void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu)
+{
+ /* TODO */
+}
+
+void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ /* TODO */
+}
+
+int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva)
+{
+ /* TODO */
+ return 0;
+}
+
+void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap)
+{
+ /* TODO */
+}
+
+void kvm_riscv_cove_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cove_tvm_vcpu_context *tvcpuc = vcpu->arch.tc;
+ struct kvm *kvm = vcpu->kvm;
+
+ /*
+ * Just add the vcpu state pages to a list at this point as these can not
+ * be claimed until tvm is destroyed. *
+ */
+ list_add(&tvcpuc->vcpu_state.link, &kvm->arch.tvmc->reclaim_pending_pages);
+}
+
+int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu)
+{
+ int rc;
+ struct kvm *kvm;
+ struct kvm_cove_tvm_vcpu_context *tvcpuc;
+ struct kvm_cove_tvm_context *tvmc;
+ struct page *vcpus_page;
+ unsigned long vcpus_phys_addr;
+
+ if (!vcpu)
+ return -EINVAL;
+
+ kvm = vcpu->kvm;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ if (tvmc->finalized_done) {
+ kvm_err("vcpu init must not happen after finalize\n");
+ return -EINVAL;
+ }
+
+ tvcpuc = kzalloc(sizeof(*tvcpuc), GFP_KERNEL);
+ if (!tvcpuc)
+ return -ENOMEM;
+
+ vcpus_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order_num_pages(tinfo.tvcpu_pages_needed));
+ if (!vcpus_page) {
+ rc = -ENOMEM;
+ goto alloc_page_failed;
+ }
+
+ tvcpuc->vcpu = vcpu;
+ tvcpuc->vcpu_state.npages = tinfo.tvcpu_pages_needed;
+ tvcpuc->vcpu_state.page = vcpus_page;
+ vcpus_phys_addr = page_to_phys(vcpus_page);
+
+ rc = cove_convert_pages(vcpus_phys_addr, tvcpuc->vcpu_state.npages, true);
+ if (rc)
+ goto convert_failed;
+
+ rc = sbi_covh_create_tvm_vcpu(tvmc->tvm_guest_id, vcpu->vcpu_idx, vcpus_phys_addr);
+ if (rc)
+ goto vcpu_create_failed;
+
+ vcpu->arch.tc = tvcpuc;
+
+ return 0;
+
+vcpu_create_failed:
+ /* Reclaim all the pages or return to the confidential page pool */
+ sbi_covh_tsm_reclaim_pages(vcpus_phys_addr, tvcpuc->vcpu_state.npages);
+
+convert_failed:
+ __free_pages(vcpus_page, get_order_num_pages(tinfo.tvcpu_pages_needed));
+
+alloc_page_failed:
+ kfree(tvcpuc);
+ return rc;
+}
+
+int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size)
+{
+ int rc;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+
+ if (!tvmc)
+ return -EFAULT;
+
+ if (tvmc->finalized_done) {
+ kvm_err("Memory region can not be added after finalize\n");
+ return -EINVAL;
+ }
+
+ tvmc->confidential_region.gpa = gpa;
+ tvmc->confidential_region.npages = bytes_to_pages(size);
+
+ rc = sbi_covh_add_memory_region(tvmc->tvm_guest_id, gpa, size);
+ if (rc) {
+ kvm_err("Registering confidential memory region failed with rc %d\n", rc);
+ return rc;
+ }
+
+ kvm_info("%s: Success with gpa %lx size %lx\n", __func__, gpa, size);
+
+ return 0;
+}
+
+/*
+ * Destroying A TVM is expensive because we need to reclaim all the pages by iterating over it.
+ * Few ideas to improve:
+ * 1. At least do the reclaim part in a worker thread in the background
+ * 2. Define a page pool which can contain a pre-allocated/converted pages.
+ * In this step, we just return to the confidential page pool. Thus, some other TVM
+ * can use it.
+ */
+void kvm_riscv_cove_vm_destroy(struct kvm *kvm)
+{
+ int rc;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+ unsigned long pgd_npages;
+
+ if (!tvmc)
+ return;
+
+ /* Release all the confidential pages using COVH SBI call */
+ rc = sbi_covh_tsm_destroy_tvm(tvmc->tvm_guest_id);
+ if (rc) {
+ kvm_err("TVM %ld destruction failed with rc = %d\n", tvmc->tvm_guest_id, rc);
+ return;
+ }
+
+ cove_delete_page_list(kvm, &tvmc->reclaim_pending_pages, false);
+
+ /* Reclaim and Free the pages for tvm state management */
+ rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tvmc->tvm_state.page), tvmc->tvm_state.npages);
+ if (rc)
+ goto reclaim_failed;
+
+ __free_pages(tvmc->tvm_state.page, get_order_num_pages(tvmc->tvm_state.npages));
+
+ /* Reclaim and Free the pages for gstage page table management */
+ rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tvmc->pgtable.page), tvmc->pgtable.npages);
+ if (rc)
+ goto reclaim_failed;
+
+ __free_pages(tvmc->pgtable.page, get_order_num_pages(tvmc->pgtable.npages));
+
+ /* Reclaim the confidential page for pgd */
+ pgd_npages = kvm_riscv_gstage_pgd_size() >> PAGE_SHIFT;
+ rc = sbi_covh_tsm_reclaim_pages(kvm->arch.pgd_phys, pgd_npages);
+ if (rc)
+ goto reclaim_failed;
+
+ kfree(tvmc);
+
+ return;
+
+reclaim_failed:
+ kvm_err("Memory reclaim failed with rc %d\n", rc);
+}
+
+int kvm_riscv_cove_vm_init(struct kvm *kvm)
+{
+ struct kvm_cove_tvm_context *tvmc;
+ struct page *tvms_page, *pgt_page;
+ unsigned long tvm_gid, pgt_phys_addr, tvms_phys_addr;
+ unsigned long gstage_pgd_size = kvm_riscv_gstage_pgd_size();
+ int rc = 0;
+
+ tvmc = kzalloc(sizeof(*tvmc), GFP_KERNEL);
+ if (!tvmc)
+ return -ENOMEM;
+
+ /* Allocate the pages required for gstage page table management */
+ /* TODO: Just give enough pages for page table pool for now */
+ pgt_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(KVM_COVE_PGTABLE_SIZE_MAX));
+ if (!pgt_page)
+ return -ENOMEM;
+
+ /* pgd is always 16KB aligned */
+ rc = cove_convert_pages(kvm->arch.pgd_phys, gstage_pgd_size >> PAGE_SHIFT, false);
+ if (rc)
+ goto done;
+
+ /* Convert the gstage page table pages */
+ tvmc->pgtable.page = pgt_page;
+ tvmc->pgtable.npages = KVM_COVE_PGTABLE_SIZE_MAX >> PAGE_SHIFT;
+ pgt_phys_addr = page_to_phys(pgt_page);
+
+ rc = cove_convert_pages(pgt_phys_addr, tvmc->pgtable.npages, false);
+ if (rc) {
+ kvm_err("%s: page table pool conversion failed rc %d\n", __func__, rc);
+ goto pgt_convert_failed;
+ }
+
+ /* Allocate and convert the pages required for TVM state management */
+ tvms_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order_num_pages(tinfo.tvm_pages_needed));
+ if (!tvms_page) {
+ rc = -ENOMEM;
+ goto tvms_alloc_failed;
+ }
+
+ tvmc->tvm_state.page = tvms_page;
+ tvmc->tvm_state.npages = tinfo.tvm_pages_needed;
+ tvms_phys_addr = page_to_phys(tvms_page);
+
+ rc = cove_convert_pages(tvms_phys_addr, tinfo.tvm_pages_needed, false);
+ if (rc) {
+ kvm_err("%s: tvm state page conversion failed rc %d\n", __func__, rc);
+ goto tvms_convert_failed;
+ }
+
+ rc = kvm_riscv_cove_fence();
+ if (rc)
+ goto tvm_init_failed;
+
+ INIT_LIST_HEAD(&tvmc->measured_pages);
+ INIT_LIST_HEAD(&tvmc->zero_pages);
+ INIT_LIST_HEAD(&tvmc->shared_pages);
+ INIT_LIST_HEAD(&tvmc->reclaim_pending_pages);
+
+ /* The required pages have been converted to confidential memory. Create the TVM now */
+ params.tvm_page_directory_addr = kvm->arch.pgd_phys;
+ params.tvm_state_addr = tvms_phys_addr;
+
+ rc = sbi_covh_tsm_create_tvm(¶ms, &tvm_gid);
+ if (rc)
+ goto tvm_init_failed;
+
+ tvmc->tvm_guest_id = tvm_gid;
+ spin_lock_init(&tvmc->tvm_fence_lock);
+ kvm->arch.tvmc = tvmc;
+
+ rc = sbi_covh_add_pgt_pages(tvm_gid, pgt_phys_addr, tvmc->pgtable.npages);
+ if (rc)
+ goto tvm_init_failed;
+
+ tvmc->kvm = kvm;
+ kvm_info("Guest VM creation successful with guest id %lx\n", tvm_gid);
+
+ return 0;
+
+tvm_init_failed:
+ /* Reclaim tvm state pages */
+ sbi_covh_tsm_reclaim_pages(tvms_phys_addr, tvmc->tvm_state.npages);
+
+tvms_convert_failed:
+ __free_pages(tvms_page, get_order_num_pages(tinfo.tvm_pages_needed));
+
+tvms_alloc_failed:
+ /* Reclaim pgtable pages */
+ sbi_covh_tsm_reclaim_pages(pgt_phys_addr, tvmc->pgtable.npages);
+
+pgt_convert_failed:
+ __free_pages(pgt_page, get_order(KVM_COVE_PGTABLE_SIZE_MAX));
+ /* Reclaim pgd pages */
+ sbi_covh_tsm_reclaim_pages(kvm->arch.pgd_phys, gstage_pgd_size >> PAGE_SHIFT);
+
+done:
+ kfree(tvmc);
+ return rc;
+}
+
+int kvm_riscv_cove_init(void)
+{
+ int rc;
+
+ /* We currently support host in VS mode. Thus, NACL is mandatory */
+ if (sbi_probe_extension(SBI_EXT_COVH) <= 0 || !kvm_riscv_nacl_available())
+ return -EOPNOTSUPP;
+
+ rc = sbi_covh_tsm_get_info(&tinfo);
+ if (rc < 0)
+ return -EINVAL;
+
+ if (tinfo.tstate != TSM_READY) {
+ kvm_err("TSM is not ready yet. Can't run TVMs\n");
+ return -EAGAIN;
+ }
+
+ riscv_cove_enabled = true;
+ kvm_info("The platform has confidential computing feature enabled\n");
+ kvm_info("TSM version %d is loaded and ready to run\n", tinfo.version);
+
+ return 0;
+}
diff --git a/arch/riscv/kvm/cove_sbi.c b/arch/riscv/kvm/cove_sbi.c
index c8c63fe..bf037f6 100644
--- a/arch/riscv/kvm/cove_sbi.c
+++ b/arch/riscv/kvm/cove_sbi.c
@@ -82,9 +82,7 @@ int sbi_covh_tsm_create_tvm(struct sbi_cove_tvm_create_params *tparam, unsigned
goto done;
}
- kvm_info("%s: create_tvm tvmid %lx\n", __func__, ret.value);
*tvmid = ret.value;
-
done:
return rc;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8923319..a55a6a5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -914,6 +914,12 @@ struct kvm_ppc_resize_hpt {
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+/*
+ * RISCV-V Confidential VM type. The large bit shift is chosen on purpose
+ * to allow other architectures to have their specific VM types if required.
+ */
+#define KVM_VM_TYPE_RISCV_COVE (1UL << 9)
/*
* ioctls for /dev/kvm fds:
*/
--
2.25.1
The newly introduced VM IOCTL allows the VMM to measure the pages used
to load the blobs. Hook up the VM ioctl.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vm.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 29d3221..1b59a8f 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -11,6 +11,7 @@
#include <linux/module.h>
#include <linux/uaccess.h>
#include <linux/kvm_host.h>
+#include <asm/kvm_cove.h>
const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
KVM_GENERIC_VM_STATS()
@@ -209,5 +210,20 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
long kvm_arch_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
- return -EINVAL;
+ struct kvm *kvm = filp->private_data;
+ void __user *argp = (void __user *)arg;
+ struct kvm_riscv_cove_measure_region mr;
+
+ switch (ioctl) {
+ case KVM_RISCV_COVE_MEASURE_REGION:
+ if (!is_cove_vm(kvm))
+ return -EINVAL;
+ if (copy_from_user(&mr, argp, sizeof(mr)))
+ return -EFAULT;
+
+ return kvm_riscv_cove_vm_measure_pages(kvm, &mr);
+ default:
+ return -EINVAL;
+ }
+
}
--
2.25.1
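For illustration, a VMM would invoke the new ioctl roughly as follows (a hedged
sketch: KVM_RISCV_COVE_MEASURE_REGION and the struct name come from this series, but
the uapi struct layout is defined in a separate patch, so the field set and types
shown here are assumptions based on later usage):

  /* Hedged userspace sketch: measure a guest blob before finalizing the TVM. */
  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int measure_blob(int vm_fd, void *host_va, __u64 gpa, __u64 size)
  {
          struct kvm_riscv_cove_measure_region mr = {
                  .userspace_addr = (unsigned long)host_va, /* where the VMM loaded the blob */
                  .gpa            = gpa,                    /* guest physical destination */
                  .size           = size,                   /* must be page aligned */
          };

          return ioctl(vm_fd, KVM_RISCV_COVE_MEASURE_REGION, &mr);
  }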
To support attestation of any images loaded by the VMM, the COVH extension
allows measuring these memory regions. Currently, it will be used for the
kernel image, device tree and initrd images.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 6 ++
arch/riscv/kvm/cove.c | 110 ++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index 3bf1bcd..4ea1df1 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -127,6 +127,7 @@ void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu);
void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu);
void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap);
+int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr);
int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size);
int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva);
#else
@@ -147,6 +148,11 @@ static inline void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap) {}
static inline int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa,
unsigned long size) {return -1; }
+static inline int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm,
+ struct kvm_riscv_cove_measure_region *mr)
+{
+ return -1;
+}
static inline int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu,
gpa_t gpa, unsigned long hva) {return -1; }
#endif /* CONFIG_RISCV_COVE_HOST */
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index d001e36..5b4d9ba 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -27,6 +27,12 @@ static DEFINE_SPINLOCK(cove_fence_lock);
static bool riscv_cove_enabled;
+static inline bool cove_is_within_region(unsigned long addr1, unsigned long size1,
+ unsigned long addr2, unsigned long size2)
+{
+ return ((addr1 <= addr2) && ((addr1 + size1) >= (addr2 + size2)));
+}
+
static void kvm_cove_local_fence(void *info)
{
int rc;
@@ -192,6 +198,109 @@ int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu)
return rc;
}
+int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr)
+{
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+ int rc = 0, idx, num_pages;
+ struct kvm_riscv_cove_mem_region *conf;
+ struct page *pinned_page, *conf_page;
+ struct kvm_riscv_cove_page *cpage;
+
+ if (!tvmc)
+ return -EFAULT;
+
+ if (tvmc->finalized_done) {
+ kvm_err("measured_mr pages can not be added after finalize\n");
+ return -EINVAL;
+ }
+
+ num_pages = bytes_to_pages(mr->size);
+ conf = &tvmc->confidential_region;
+
+ if (!IS_ALIGNED(mr->userspace_addr, PAGE_SIZE) ||
+ !IS_ALIGNED(mr->gpa, PAGE_SIZE) || !mr->size ||
+ !cove_is_within_region(conf->gpa, conf->npages << PAGE_SHIFT, mr->gpa, mr->size))
+ return -EINVAL;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ /*TODO: Iterate one page at a time as pinning multiple pages fail with unmapped panic
+ * with a virtual address range belonging to vmalloc region for some reason.
+ */
+ while (num_pages) {
+ if (signal_pending(current)) {
+ rc = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ rc = get_user_pages_fast(mr->userspace_addr, 1, 0, &pinned_page);
+ if (rc < 0) {
+ kvm_err("Pinning the userpsace addr %lx failed\n", mr->userspace_addr);
+ break;
+ }
+
+ /* Enough pages are not available to be pinned */
+ if (rc != 1) {
+ rc = -ENOMEM;
+ break;
+ }
+ conf_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!conf_page) {
+ rc = -ENOMEM;
+ break;
+ }
+
+ rc = cove_convert_pages(page_to_phys(conf_page), 1, true);
+ if (rc)
+ break;
+
+ /*TODO: Support other pages sizes */
+ rc = sbi_covh_add_measured_pages(tvmc->tvm_guest_id, page_to_phys(pinned_page),
+ page_to_phys(conf_page), SBI_COVE_PAGE_4K,
+ 1, mr->gpa);
+ if (rc)
+ break;
+
+ /* Unpin the page now */
+ put_page(pinned_page);
+
+ cpage = kmalloc(sizeof(*cpage), GFP_KERNEL_ACCOUNT);
+ if (!cpage) {
+ rc = -ENOMEM;
+ break;
+ }
+
+ cpage->page = conf_page;
+ cpage->npages = 1;
+ cpage->gpa = mr->gpa;
+ cpage->hva = mr->userspace_addr;
+ cpage->is_mapped = true;
+ INIT_LIST_HEAD(&cpage->link);
+ list_add(&cpage->link, &tvmc->measured_pages);
+
+ mr->userspace_addr += PAGE_SIZE;
+ mr->gpa += PAGE_SIZE;
+ num_pages--;
+ conf_page = NULL;
+
+ continue;
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ if (rc < 0) {
+ /* We don't to need unpin pages as it is allocated by the hypervisor itself */
+ cove_delete_page_list(kvm, &tvmc->measured_pages, false);
+ /* Free the last allocated page for which conversion/measurement failed */
+ kfree(conf_page);
+ kvm_err("Adding/Converting measured pages failed %d\n", num_pages);
+ }
+
+ return rc;
+}
+
int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size)
{
int rc;
@@ -244,6 +353,7 @@ void kvm_riscv_cove_vm_destroy(struct kvm *kvm)
}
cove_delete_page_list(kvm, &tvmc->reclaim_pending_pages, false);
+ cove_delete_page_list(kvm, &tvmc->measured_pages, false);
/* Reclaim and Free the pages for tvm state management */
rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tvmc->tvm_state.page), tvmc->tvm_state.npages);
--
2.25.1
Currently, trap redirection to the guest happens in the
following cases:
1. Illegal instruction trap
2. Virtual instruction trap
3. Unsuccessful unpriv read
Allowing the host to cause traps in the TVM directly is problematic.
The TSM doesn't support trap redirection yet. Ideally, the host should not end
up in one of these situations where it has to redirect the trap. If it
happens, exit to user space with an error, as the trap can't be forwarded to
the TVM.
If any such use case arises in the future, it has to be coordinated
through the TSM.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_exit.c | 9 ++++++++-
arch/riscv/kvm/vcpu_insn.c | 17 +++++++++++++++++
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 4ea101a..0d0c895 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -9,6 +9,7 @@
#include <linux/kvm_host.h>
#include <asm/csr.h>
#include <asm/insn-def.h>
+#include <asm/kvm_cove.h>
static int gstage_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
struct kvm_cpu_trap *trap)
@@ -135,8 +136,14 @@ unsigned long kvm_riscv_vcpu_unpriv_read(struct kvm_vcpu *vcpu,
void kvm_riscv_vcpu_trap_redirect(struct kvm_vcpu *vcpu,
struct kvm_cpu_trap *trap)
{
- unsigned long vsstatus = csr_read(CSR_VSSTATUS);
+ unsigned long vsstatus;
+ if (is_cove_vcpu(vcpu)) {
+ kvm_err("RISC-V KVM do not support redirect to CoVE guest yet\n");
+ return;
+ }
+
+ vsstatus = csr_read(CSR_VSSTATUS);
/* Change Guest SSTATUS.SPP bit */
vsstatus &= ~SR_SPP;
if (vcpu->arch.guest_context.sstatus & SR_SPP)
diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
index 7a6abed..331489f 100644
--- a/arch/riscv/kvm/vcpu_insn.c
+++ b/arch/riscv/kvm/vcpu_insn.c
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/kvm_host.h>
+#include <asm/kvm_cove.h>
#define INSN_OPCODE_MASK 0x007c
#define INSN_OPCODE_SHIFT 2
@@ -153,6 +154,10 @@ static int truly_illegal_insn(struct kvm_vcpu *vcpu, struct kvm_run *run,
{
struct kvm_cpu_trap utrap = { 0 };
+ /* The host can not redirect any illegal instruction trap to TVM */
+ if (unlikely(is_cove_vcpu(vcpu)))
+ return -EPERM;
+
/* Redirect trap to Guest VCPU */
utrap.sepc = vcpu->arch.guest_context.sepc;
utrap.scause = EXC_INST_ILLEGAL;
@@ -169,6 +174,10 @@ static int truly_virtual_insn(struct kvm_vcpu *vcpu, struct kvm_run *run,
{
struct kvm_cpu_trap utrap = { 0 };
+ /* The host can not redirect any virtual instruction trap to TVM */
+ if (unlikely(is_cove_vcpu(vcpu)))
+ return -EPERM;
+
/* Redirect trap to Guest VCPU */
utrap.sepc = vcpu->arch.guest_context.sepc;
utrap.scause = EXC_VIRTUAL_INST_FAULT;
@@ -417,6 +426,10 @@ int kvm_riscv_vcpu_virtual_insn(struct kvm_vcpu *vcpu, struct kvm_run *run,
if (unlikely(INSN_IS_16BIT(insn))) {
if (insn == 0) {
ct = &vcpu->arch.guest_context;
+
+ if (unlikely(is_cove_vcpu(vcpu)))
+ return -EPERM;
+
insn = kvm_riscv_vcpu_unpriv_read(vcpu, true,
ct->sepc,
&utrap);
@@ -469,6 +482,8 @@ int kvm_riscv_vcpu_mmio_load(struct kvm_vcpu *vcpu, struct kvm_run *run,
insn = htinst | INSN_16BIT_MASK;
insn_len = (htinst & BIT(1)) ? INSN_LEN(insn) : 2;
} else {
+ if (unlikely(is_cove_vcpu(vcpu)))
+ return -EFAULT;
/*
* Bit[0] == 0 implies trapped instruction value is
* zero or special value.
@@ -595,6 +610,8 @@ int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run,
insn = htinst | INSN_16BIT_MASK;
insn_len = (htinst & BIT(1)) ? INSN_LEN(insn) : 2;
} else {
+ if (unlikely(is_cove_vcpu(vcpu)))
+ return -EFAULT;
/*
* Bit[0] == 0 implies trapped instruction value is
* zero or special value.
--
2.25.1
The COVH SBI extension defines the SBI functions that the host
invokes to configure/create/destroy a TEE VM (TVM).
Implement all the COVH SBI extension functions.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/Kconfig | 13 ++
arch/riscv/include/asm/kvm_cove_sbi.h | 46 +++++
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/cove_sbi.c | 245 ++++++++++++++++++++++++++
4 files changed, 305 insertions(+)
create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
create mode 100644 arch/riscv/kvm/cove_sbi.c
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 4044080..8462941 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -501,6 +501,19 @@ config FPU
If you don't know what to do here, say Y.
+menu "Confidential VM Extension(CoVE) Support"
+
+config RISCV_COVE_HOST
+ bool "Host(KVM) support for Confidential VM Extension(CoVE)"
+ depends on KVM
+ default n
+ help
+ Enable this if the platform supports confidential vm extension.
+ That means the platform should be capable of running TEE VM (TVM)
+ using KVM and TEE Security Manager (TSM).
+
+endmenu # "Confidential VM Extension(CoVE) Support"
+
endmenu # "Platform type"
menu "Kernel features"
diff --git a/arch/riscv/include/asm/kvm_cove_sbi.h b/arch/riscv/include/asm/kvm_cove_sbi.h
new file mode 100644
index 0000000..24562df
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_cove_sbi.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * COVE SBI extension related header file.
+ *
+ * Copyright (c) 2023 RivosInc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#ifndef __KVM_COVE_SBI_H
+#define __KVM_COVE_SBI_H
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <asm/csr.h>
+#include <asm/sbi.h>
+
+int sbi_covh_tsm_get_info(struct sbi_cove_tsm_info *tinfo_addr);
+int sbi_covh_tvm_initiate_fence(unsigned long tvmid);
+int sbi_covh_tsm_initiate_fence(void);
+int sbi_covh_tsm_local_fence(void);
+int sbi_covh_tsm_create_tvm(struct sbi_cove_tvm_create_params *tparam, unsigned long *tvmid);
+int sbi_covh_tsm_finalize_tvm(unsigned long tvmid, unsigned long sepc, unsigned long entry_arg);
+int sbi_covh_tsm_destroy_tvm(unsigned long tvmid);
+int sbi_covh_add_memory_region(unsigned long tvmid, unsigned long tgpadr, unsigned long rlen);
+
+int sbi_covh_tsm_reclaim_pages(unsigned long phys_addr, unsigned long npages);
+int sbi_covh_tsm_convert_pages(unsigned long phys_addr, unsigned long npages);
+int sbi_covh_tsm_reclaim_page(unsigned long page_addr_phys);
+int sbi_covh_add_pgt_pages(unsigned long tvmid, unsigned long page_addr_phys, unsigned long npages);
+
+int sbi_covh_add_measured_pages(unsigned long tvmid, unsigned long src_addr,
+ unsigned long dest_addr, enum sbi_cove_page_type ptype,
+ unsigned long npages, unsigned long tgpa);
+int sbi_covh_add_zero_pages(unsigned long tvmid, unsigned long page_addr_phys,
+ enum sbi_cove_page_type ptype, unsigned long npages,
+ unsigned long tvm_base_page_addr);
+
+int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid,
+ unsigned long vpus_page_addr);
+
+int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid);
+
+#endif
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 6986d3c..40dee04 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -31,3 +31,4 @@ kvm-y += aia.o
kvm-y += aia_device.o
kvm-y += aia_aplic.o
kvm-y += aia_imsic.o
+kvm-$(CONFIG_RISCV_COVE_HOST) += cove_sbi.o
diff --git a/arch/riscv/kvm/cove_sbi.c b/arch/riscv/kvm/cove_sbi.c
new file mode 100644
index 0000000..c8c63fe
--- /dev/null
+++ b/arch/riscv/kvm/cove_sbi.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * COVE SBI extensions related helper functions.
+ *
+ * Copyright (c) 2023 RivosInc
+ *
+ * Authors:
+ * Atish Patra <[email protected]>
+ */
+
+#include <linux/align.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <asm/csr.h>
+#include <asm/kvm_cove_sbi.h>
+#include <asm/sbi.h>
+
+#define RISCV_COVE_ALIGN_4KB (1UL << 12)
+
+int sbi_covh_tsm_get_info(struct sbi_cove_tsm_info *tinfo_addr)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_GET_INFO, __pa(tinfo_addr),
+ sizeof(*tinfo_addr), 0, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tvm_initiate_fence(unsigned long tvmid)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_INITIATE_FENCE, tvmid, 0, 0, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tsm_initiate_fence(void)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_INITIATE_FENCE, 0, 0, 0, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tsm_local_fence(void)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_LOCAL_FENCE, 0, 0, 0, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tsm_create_tvm(struct sbi_cove_tvm_create_params *tparam, unsigned long *tvmid)
+{
+ struct sbiret ret;
+ int rc = 0;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_CREATE_TVM, __pa(tparam),
+ sizeof(*tparam), 0, 0, 0, 0);
+
+ if (ret.error) {
+ rc = sbi_err_map_linux_errno(ret.error);
+ if (rc == -EFAULT)
+ kvm_err("Invalid phsyical address for tvm params structure\n");
+ goto done;
+ }
+
+ kvm_info("%s: create_tvm tvmid %lx\n", __func__, ret.value);
+ *tvmid = ret.value;
+
+done:
+ return rc;
+}
+
+int sbi_covh_tsm_finalize_tvm(unsigned long tvmid, unsigned long sepc, unsigned long entry_arg)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_FINALIZE_TVM, tvmid,
+ sepc, entry_arg, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tsm_destroy_tvm(unsigned long tvmid)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_DESTROY_TVM, tvmid,
+ 0, 0, 0, 0, 0);
+
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_add_memory_region(unsigned long tvmid, unsigned long tgpaddr, unsigned long rlen)
+{
+ struct sbiret ret;
+
+ if (!IS_ALIGNED(tgpaddr, RISCV_COVE_ALIGN_4KB) || !IS_ALIGNED(rlen, RISCV_COVE_ALIGN_4KB))
+ return -EINVAL;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_ADD_MEMORY_REGION, tvmid,
+ tgpaddr, rlen, 0, 0, 0);
+ if (ret.error) {
+ kvm_err("Add memory region failed with sbi error code %ld\n", ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+
+ return 0;
+}
+
+int sbi_covh_tsm_convert_pages(unsigned long phys_addr, unsigned long npages)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_CONVERT_PAGES, phys_addr,
+ npages, 0, 0, 0, 0);
+ if (ret.error) {
+ kvm_err("Convert pages failed ret %ld\n", ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+ return 0;
+}
+
+int sbi_covh_tsm_reclaim_page(unsigned long page_addr_phys)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_RECLAIM_PAGES, page_addr_phys,
+ 1, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tsm_reclaim_pages(unsigned long phys_addr, unsigned long npages)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TSM_RECLAIM_PAGES, phys_addr,
+ npages, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_add_pgt_pages(unsigned long tvmid, unsigned long page_addr_phys, unsigned long npages)
+{
+ struct sbiret ret;
+
+ if (!PAGE_ALIGNED(page_addr_phys))
+ return -EINVAL;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_ADD_PGT_PAGES, tvmid, page_addr_phys,
+ npages, 0, 0, 0);
+ if (ret.error) {
+ kvm_err("Adding page table pages at %lx failed %ld\n", page_addr_phys, ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+
+ return 0;
+}
+
+int sbi_covh_add_measured_pages(unsigned long tvmid, unsigned long src_addr,
+ unsigned long dest_addr, enum sbi_cove_page_type ptype,
+ unsigned long npages, unsigned long tgpa)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_ADD_MEASURED_PAGES, tvmid, src_addr,
+ dest_addr, ptype, npages, tgpa);
+ if (ret.error) {
+ kvm_err("Adding measued pages failed ret %ld\n", ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+
+ return 0;
+}
+
+int sbi_covh_add_zero_pages(unsigned long tvmid, unsigned long page_addr_phys,
+ enum sbi_cove_page_type ptype, unsigned long npages,
+ unsigned long tvm_base_page_addr)
+{
+ struct sbiret ret;
+
+ if (!PAGE_ALIGNED(page_addr_phys))
+ return -EINVAL;
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_ADD_ZERO_PAGES, tvmid, page_addr_phys,
+ ptype, npages, tvm_base_page_addr, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long vcpuid,
+ unsigned long vcpu_state_paddr)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_CREATE_VCPU, tvmid, vcpuid,
+ vcpu_state_paddr, 0, 0, 0);
+ if (ret.error) {
+ kvm_err("create vcpu failed ret %ld\n", ret.error);
+ return sbi_err_map_linux_errno(ret.error);
+ }
+ return 0;
+}
+
+int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long vcpuid)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_VCPU_RUN, tvmid, vcpuid, 0, 0, 0, 0);
+ /* A non-zero return value indicates that the vcpu is already terminated */
+ if (ret.error || !ret.value)
+ return ret.error ? sbi_err_map_linux_errno(ret.error) : ret.value;
+
+ return 0;
+}
--
2.25.1
The NACL SBI extension allows the scratch area layout to be customized
per SBI extension. For the COVH SBI extension, the scratch
area stores the guest GPR state.
Add a few helpers to read/write the GPRs easily.
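As a rough usage sketch (assuming the NACL shared memory has already been
set up, as done elsewhere in this series), a forwarded ECALL would be
decoded and answered with these helpers like so:

	/* Sketch: read the TVM's ECALL arguments from the NACL scratch area
	 * and post the SBI return value back. */
	void *nshmem = nacl_shmem();
	unsigned long ext_id  = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A7);
	unsigned long func_id = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A6);

	/* ... handle the (ext_id, func_id) call ... */

	nacl_shmem_gpr_write_cove(nshmem, KVM_ARCH_GUEST_A0, 0 /* SBI_SUCCESS */);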
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove_sbi.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove_sbi.h b/arch/riscv/include/asm/kvm_cove_sbi.h
index 24562df..df7d88c 100644
--- a/arch/riscv/include/asm/kvm_cove_sbi.h
+++ b/arch/riscv/include/asm/kvm_cove_sbi.h
@@ -17,6 +17,21 @@
#include <asm/csr.h>
#include <asm/sbi.h>
+#include <asm/asm-offsets.h>
+
+/*
+ * The CoVE SBI extension defines the NACL scratch memory layout as:
+ * uint64_t gprs[32]
+ * uint64_t reserved[224]
+ */
+#define get_scratch_gpr_offset(goffset) (goffset - KVM_ARCH_GUEST_ZERO)
+
+#define nacl_shmem_gpr_write_cove(__s, __g, __o) \
+ nacl_shmem_scratch_write_long(__s, get_scratch_gpr_offset(__g), __o)
+
+#define nacl_shmem_gpr_read_cove(__s, __g) \
+ nacl_shmem_scratch_read_long(__s, get_scratch_gpr_offset(__g))
+
int sbi_covh_tsm_get_info(struct sbi_cove_tsm_info *tinfo_addr);
int sbi_covh_tvm_initiate_fence(unsigned long tvmid);
int sbi_covh_tsm_initiate_fence(void);
--
2.25.1
To initialize a TVM, the TSM must ensure that all the static memory regions
that contain the device tree, the kernel image or initrd for the TVM
are attested. Some of this information is not usually available to the host;
only the VMM is aware of it.
Introduce a new ioctl, as part of the uABI, to support this.
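From the VMM side, the new ioctl would be issued on the VM fd roughly as
below (a sketch only; vm_fd, load_buf and load_size are placeholders and
the kernel-side handler is wired up in a later patch):

	struct kvm_riscv_cove_measure_region mr = {
		.userspace_addr = (unsigned long)load_buf,	/* VMM buffer holding kernel/initrd */
		.gpa		= 0x80200000,			/* example guest physical address */
		.size		= load_size,
	};

	if (ioctl(vm_fd, KVM_RISCV_COVE_MEASURE_REGION, &mr) < 0)
		err(1, "KVM_RISCV_COVE_MEASURE_REGION");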
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/uapi/asm/kvm.h | 12 ++++++++++++
include/uapi/linux/kvm.h | 2 ++
2 files changed, 14 insertions(+)
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index 11440df..ac3def0 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -98,6 +98,18 @@ struct kvm_riscv_timer {
__u64 state;
};
+/* Memory region details of a CoVE guest that is measured at boot time */
+struct kvm_riscv_cove_measure_region {
+ /* Address of the user space where the VM code/data resides */
+ unsigned long userspace_addr;
+
+ /* The guest physical address where VM code/data should be mapped */
+ unsigned long gpa;
+
+ /* Size of the region */
+ unsigned long size;
+};
+
/*
* ISA extension IDs specific to KVM. This is not the same as the host ISA
* extension IDs as that is internal to the host and should not be exposed
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a55a6a5..84a73b5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1552,6 +1552,8 @@ struct kvm_s390_ucas_mapping {
#define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3)
#define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags)
+#define KVM_RISCV_COVE_MEASURE_REGION _IOR(KVMIO, 0xb5, struct kvm_riscv_cove_measure_region)
+
/* ioctl for vm fd */
#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
--
2.25.1
CoVE doesn't support dirty logging for TVMs yet.
Skip it for now.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 9693897..1d5e4ed 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -457,6 +457,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
+ /* We don't support dirty logging for CoVE guests yet */
+ if (is_cove_vm(kvm))
+ return;
/*
* At this point memslot has been committed and there is an
* allocated dirty_bitmap[], dirty pages will be tracked while
--
2.25.1
The gstage entries for a CoVE VM are managed by the TSM. Return
early from any gstage PTE modification operation.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/mmu.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 6b037f7..9693897 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -16,6 +16,9 @@
#include <linux/kvm_host.h>
#include <linux/sched/signal.h>
#include <asm/kvm_nacl.h>
+#include <asm/csr.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_cove.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -356,6 +359,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
.gfp_zero = __GFP_ZERO,
};
+ if (is_cove_vm(kvm)) {
+ kvm_debug("%s: KVM doesn't support ioremap for TVM io regions\n", __func__);
+ return -EPERM;
+ }
+
end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
@@ -385,6 +393,10 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
void kvm_riscv_gstage_iounmap(struct kvm *kvm, gpa_t gpa, unsigned long size)
{
+ /* KVM doesn't map any IO region in gstage for TVM */
+ if (is_cove_vm(kvm))
+ return;
+
spin_lock(&kvm->mmu_lock);
gstage_unmap_range(kvm, gpa, size, false);
spin_unlock(&kvm->mmu_lock);
@@ -431,6 +443,10 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
gpa_t gpa = slot->base_gfn << PAGE_SHIFT;
phys_addr_t size = slot->npages << PAGE_SHIFT;
+ /* No need to unmap gstage as it is managed by TSM */
+ if (is_cove_vm(kvm))
+ return;
+
spin_lock(&kvm->mmu_lock);
gstage_unmap_range(kvm, gpa, size, false);
spin_unlock(&kvm->mmu_lock);
@@ -547,7 +563,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
- if (!kvm->arch.pgd)
+ if (!kvm->arch.pgd || is_cove_vm(kvm))
return false;
gstage_unmap_range(kvm, range->start << PAGE_SHIFT,
@@ -561,7 +577,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
int ret;
kvm_pfn_t pfn = pte_pfn(range->pte);
- if (!kvm->arch.pgd)
+ if (!kvm->arch.pgd || is_cove_vm(kvm))
return false;
WARN_ON(range->end - range->start != 1);
@@ -582,7 +598,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
- if (!kvm->arch.pgd)
+ if (!kvm->arch.pgd || is_cove_vm(kvm))
return false;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
@@ -600,7 +616,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
- if (!kvm->arch.pgd)
+ if (!kvm->arch.pgd || is_cove_vm(kvm))
return false;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
@@ -737,6 +753,10 @@ void kvm_riscv_gstage_free_pgd(struct kvm *kvm)
{
void *pgd = NULL;
+ /* PGD is mapped in TSM */
+ if (is_cove_vm(kvm))
+ return;
+
spin_lock(&kvm->mmu_lock);
if (kvm->arch.pgd) {
gstage_unmap_range(kvm, 0UL, gstage_gpa_size, false);
--
2.25.1
The TSM manages the TLB entries for TVMs. Thus, the host can ignore
all hfence requests and TLB updates for confidential guests.
Most hfence requests happen through vcpu requests, which
are skipped for TVMs. Thus, we only need to take care of the
invocations from the TLB management code here.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/tlb.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/kvm/tlb.c b/arch/riscv/kvm/tlb.c
index dff37b57..b007c02 100644
--- a/arch/riscv/kvm/tlb.c
+++ b/arch/riscv/kvm/tlb.c
@@ -15,6 +15,7 @@
#include <asm/hwcap.h>
#include <asm/insn-def.h>
#include <asm/kvm_nacl.h>
+#include <asm/kvm_cove.h>
#define has_svinval() riscv_has_extension_unlikely(RISCV_ISA_EXT_SVINVAL)
@@ -72,6 +73,14 @@ void kvm_riscv_local_hfence_gvma_gpa(gpa_t gpa, gpa_t gpsz,
void kvm_riscv_local_hfence_gvma_all(void)
{
+ /* For TVMs, TSM will take care of hfence.
+ * TODO: We can't skip unconditionally if cove is enabled
+ * as the host may be running in HS-mode and need to issue hfence
+ * for legacy VMs.
+ */
+ if (kvm_riscv_cove_enabled())
+ return;
+
asm volatile(HFENCE_GVMA(zero, zero) : : : "memory");
}
@@ -160,7 +169,7 @@ void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu)
{
unsigned long vmid;
- if (!kvm_riscv_gstage_vmid_bits() ||
+ if (is_cove_vcpu(vcpu) || !kvm_riscv_gstage_vmid_bits() ||
vcpu->arch.last_exit_cpu == vcpu->cpu)
return;
--
2.25.1
Currently, KVM manages TLB shootdown, hgatp updates and
fence.i through vcpu requests.
TLB shootdown for TVMs happens in coordination with the TSM.
The fence.i & hgatp updates are directly managed by the TSM.
There is no need to issue these requests directly for TVMs.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index c53bf98..3b600c6 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -22,6 +22,7 @@
#include <asm/kvm_nacl.h>
#include <asm/hwcap.h>
#include <asm/sbi.h>
+#include <asm/kvm_cove.h>
const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
KVM_GENERIC_VCPU_STATS(),
@@ -1078,6 +1079,15 @@ static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
kvm_riscv_reset_vcpu(vcpu);
+ if (is_cove_vcpu(vcpu)) {
+ /*
+ * KVM doesn't need to do anything special here
+ * as the TSM is expected to track the tlb version and issue
+ * hfence when vcpu is scheduled again.
+ */
+ return;
+ }
+
if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu))
kvm_riscv_gstage_update_hgatp(vcpu);
--
2.25.1
The TSM manages the vmid for the guests running in CoVE. The host
doesn't need to update the vmid at all. As a result, the host doesn't
need to update the hgatp either.
Return early from the vmid/hgatp management functions for confidential
guests.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_host.h | 2 +-
arch/riscv/kvm/mmu.c | 4 ++++
arch/riscv/kvm/vcpu.c | 2 +-
arch/riscv/kvm/vmid.c | 17 ++++++++++++-----
4 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index ca2ebe3..047e046 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -325,7 +325,7 @@ unsigned long kvm_riscv_gstage_pgd_size(void);
void __init kvm_riscv_gstage_vmid_detect(void);
unsigned long kvm_riscv_gstage_vmid_bits(void);
int kvm_riscv_gstage_vmid_init(struct kvm *kvm);
-bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid);
+bool kvm_riscv_gstage_vmid_ver_changed(struct kvm *kvm);
void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu);
int kvm_riscv_setup_default_irq_routing(struct kvm *kvm, u32 lines);
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1d5e4ed..4b0f09e 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -778,6 +778,10 @@ void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu)
unsigned long hgatp = gstage_mode;
struct kvm_arch *k = &vcpu->kvm->arch;
+ /* COVE VCPU hgatp is managed by TSM. */
+ if (is_cove_vcpu(vcpu))
+ return;
+
hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID;
hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 3b600c6..8cf462c 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -1288,7 +1288,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_riscv_update_hvip(vcpu);
if (ret <= 0 ||
- kvm_riscv_gstage_vmid_ver_changed(&vcpu->kvm->arch.vmid) ||
+ kvm_riscv_gstage_vmid_ver_changed(vcpu->kvm) ||
kvm_request_pending(vcpu) ||
xfer_to_guest_mode_work_pending()) {
vcpu->mode = OUTSIDE_GUEST_MODE;
diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c
index ddc9871..dc03601 100644
--- a/arch/riscv/kvm/vmid.c
+++ b/arch/riscv/kvm/vmid.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/kvm_host.h>
#include <asm/csr.h>
+#include <asm/kvm_cove.h>
static unsigned long vmid_version = 1;
static unsigned long vmid_next;
@@ -54,12 +55,13 @@ int kvm_riscv_gstage_vmid_init(struct kvm *kvm)
return 0;
}
-bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid)
+bool kvm_riscv_gstage_vmid_ver_changed(struct kvm *kvm)
{
- if (!vmid_bits)
+ /* VMID version can't be changed by the host for TVMs */
+ if (!vmid_bits || is_cove_vm(kvm))
return false;
- return unlikely(READ_ONCE(vmid->vmid_version) !=
+ return unlikely(READ_ONCE(kvm->arch.vmid.vmid_version) !=
READ_ONCE(vmid_version));
}
@@ -72,9 +74,14 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu)
{
unsigned long i;
struct kvm_vcpu *v;
+ struct kvm *kvm = vcpu->kvm;
struct kvm_vmid *vmid = &vcpu->kvm->arch.vmid;
- if (!kvm_riscv_gstage_vmid_ver_changed(vmid))
+ /* No VMID management for TVMs by the host */
+ if (is_cove_vcpu(vcpu))
+ return;
+
+ if (!kvm_riscv_gstage_vmid_ver_changed(kvm))
return;
spin_lock(&vmid_lock);
@@ -83,7 +90,7 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu)
* We need to re-check the vmid_version here to ensure that if
* another vcpu already allocated a valid vmid for this vm.
*/
- if (!kvm_riscv_gstage_vmid_ver_changed(vmid)) {
+ if (!kvm_riscv_gstage_vmid_ver_changed(kvm)) {
spin_unlock(&vmid_lock);
return;
}
--
2.25.1
The entire DRAM region of a TVM running in CoVE must be confidential by
default. If a TVM wishes to share any sub-region, the TVM has to
request it explicitly with the memory share APIs.
Mark the memory region as confidential during VM creation itself.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/mmu.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 4b0f09e..63889d9 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -499,6 +499,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
+ if (is_cove_vm(kvm)) {
+ ret = kvm_riscv_cove_vm_add_memreg(kvm, base_gpa, size);
+ if (ret)
+ return ret;
+ }
/*
* A memory region could potentially cover multiple VMAs, and
* any holes between them, so iterate over all of them to find
--
2.25.1
When CoVE is enabled in RISC-V, the TLB shootdown happens in coordination
with the TSM. The host must not issue hfence directly; it relies on the TSM
to do that instead. The host just needs to initiate the process and make
sure that all the running vcpus exit guest mode. As a result, each vcpu
traps to the TSM and the TSM issues the hfence on behalf of the host.
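In short, the host-side TVM fence added below boils down to the following
sequence (condensed sketch of kvm_riscv_cove_tvm_fence()):

	/* Condensed sketch of the TVM fence flow */
	spin_lock(&tvmc->tvm_fence_lock);
	ret = sbi_covh_tvm_initiate_fence(tvmc->tvm_guest_id);	/* ask the TSM to start the fence */
	if (!ret)
		/* kick the other vcpus out of guest mode; the TSM issues the
		 * hfence when they re-enter */
		kvm_make_vcpus_request_mask(kvm, KVM_REQ_OUTSIDE_GUEST_MODE, vcpu_mask);
	spin_unlock(&tvmc->tvm_fence_lock);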
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 2 ++
arch/riscv/kvm/cove.c | 36 +++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index 4ea1df1..fc8633d 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -130,6 +130,8 @@ void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *tr
int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr);
int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size);
int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva);
+/* Fence related function */
+int kvm_riscv_cove_tvm_fence(struct kvm_vcpu *vcpu);
#else
static inline bool kvm_riscv_cove_enabled(void) {return false; };
static inline int kvm_riscv_cove_init(void) { return -1; }
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 5b4d9ba..4efcae3 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -78,6 +78,42 @@ static int kvm_riscv_cove_fence(void)
return rc;
}
+int kvm_riscv_cove_tvm_fence(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+ DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
+ unsigned long i;
+ struct kvm_vcpu *temp_vcpu;
+ int ret;
+
+ if (!tvmc)
+ return -EINVAL;
+
+ spin_lock(&tvmc->tvm_fence_lock);
+ ret = sbi_covh_tvm_initiate_fence(tvmc->tvm_guest_id);
+ if (ret) {
+ spin_unlock(&tvmc->tvm_fence_lock);
+ return ret;
+ }
+
+ bitmap_clear(vcpu_mask, 0, KVM_MAX_VCPUS);
+ kvm_for_each_vcpu(i, temp_vcpu, kvm) {
+ if (temp_vcpu != vcpu)
+ bitmap_set(vcpu_mask, i, 1);
+ }
+
+ /*
+ * The host just needs to make sure that the running vcpus exit the
+ * guest mode and traps into TSM so that it can issue hfence.
+ */
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_OUTSIDE_GUEST_MODE, vcpu_mask);
+ spin_unlock(&tvmc->tvm_fence_lock);
+
+ return 0;
+}
+
+
static int cove_convert_pages(unsigned long phys_addr, unsigned long npages, bool fence)
{
int rc;
--
2.25.1
For a TVM, the gstage mapping is managed by the TSM via COVH SBI
calls. The host is responsible for allocating the page, which must be pinned
to avoid swapping. The page is converted to confidential before being
handed over to the TSM for gstage mapping.
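The resulting flow for faulting a zero page into a TVM's gstage is, in
essence (condensed sketch of kvm_riscv_cove_gstage_map() added below):

	/* Condensed sketch: pin the host page, convert it to confidential,
	 * then ask the TSM to map it into the TVM's gstage at @gpa. */
	rc = pin_user_pages(hva, 1, FOLL_LONGTERM | FOLL_WRITE | FOLL_HWPOISON, &page, NULL);
	if (rc == 1) {
		rc = cove_convert_pages(page_to_phys(page), 1, true);
		if (!rc)
			rc = sbi_covh_add_zero_pages(tvmc->tvm_guest_id, page_to_phys(page),
						     SBI_COVE_PAGE_4K, 1, gpa);
	}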
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/cove.c | 63 +++++++++++++++++++++++++++++++++++++-
arch/riscv/kvm/vcpu_exit.c | 9 ++++--
2 files changed, 69 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 4efcae3..44095f6 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -149,8 +149,68 @@ void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu)
int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva)
{
- /* TODO */
+ struct kvm_riscv_cove_page *tpage;
+ struct mm_struct *mm = current->mm;
+ struct kvm *kvm = vcpu->kvm;
+ unsigned int flags = FOLL_LONGTERM | FOLL_WRITE | FOLL_HWPOISON;
+ struct page *page;
+ int rc;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+
+ tpage = kmalloc(sizeof(*tpage), GFP_KERNEL_ACCOUNT);
+ if (!tpage)
+ return -ENOMEM;
+
+ mmap_read_lock(mm);
+ rc = pin_user_pages(hva, 1, flags, &page, NULL);
+ mmap_read_unlock(mm);
+
+ if (rc == -EHWPOISON) {
+ send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva,
+ PAGE_SHIFT, current);
+ rc = 0;
+ goto free_tpage;
+ } else if (rc != 1) {
+ rc = -EFAULT;
+ goto free_tpage;
+ } else if (!PageSwapBacked(page)) {
+ rc = -EIO;
+ goto free_tpage;
+ }
+
+ rc = cove_convert_pages(page_to_phys(page), 1, true);
+ if (rc)
+ goto unpin_page;
+
+ rc = sbi_covh_add_zero_pages(tvmc->tvm_guest_id, page_to_phys(page),
+ SBI_COVE_PAGE_4K, 1, gpa);
+ if (rc) {
+ pr_err("%s: Adding zero pages failed %d\n", __func__, rc);
+ goto zero_page_failed;
+ }
+ tpage->page = page;
+ tpage->npages = 1;
+ tpage->is_mapped = true;
+ tpage->gpa = gpa;
+ tpage->hva = hva;
+ INIT_LIST_HEAD(&tpage->link);
+
+ spin_lock(&kvm->mmu_lock);
+ list_add(&tpage->link, &kvm->arch.tvmc->zero_pages);
+ spin_unlock(&kvm->mmu_lock);
+
return 0;
+
+zero_page_failed:
+ //TODO: Should we reclaim the page now or wait until the VM is destroyed?
+
+unpin_page:
+ unpin_user_pages(&page, 1);
+
+free_tpage:
+ kfree(tpage);
+
+ return rc;
}
void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap)
@@ -390,6 +450,7 @@ void kvm_riscv_cove_vm_destroy(struct kvm *kvm)
cove_delete_page_list(kvm, &tvmc->reclaim_pending_pages, false);
cove_delete_page_list(kvm, &tvmc->measured_pages, false);
+ cove_delete_page_list(kvm, &tvmc->zero_pages, true);
/* Reclaim and Free the pages for tvm state management */
rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tvmc->tvm_state.page), tvmc->tvm_state.npages);
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 0d0c895..d00b9ee5 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -41,8 +41,13 @@ static int gstage_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
};
}
- ret = kvm_riscv_gstage_map(vcpu, memslot, fault_addr, hva,
- (trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
+ if (is_cove_vcpu(vcpu)) {
+ /* CoVE doesn't care about PTE prots now. No need to compute the prots */
+ ret = kvm_riscv_cove_gstage_map(vcpu, fault_addr, hva);
+ } else {
+ ret = kvm_riscv_gstage_map(vcpu, memslot, fault_addr, hva,
+ (trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
+ }
if (ret < 0)
return ret;
--
2.25.1
The TSM may forward some SBI calls to the host, as the host
is the best place to handle them. Any call related to hart
state management, console, or the guest side interface (COVG) falls under
this category.
Add a CoVE-specific ecall handler to take appropriate action upon
receiving these SBI calls.
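The handler added below effectively does the following (condensed sketch):

	/* Condensed sketch: fetch the forwarded ECALL arguments from the NACL
	 * scratch area, dispatch only the allowed extensions and write the
	 * result back for the TSM to return to the TVM. */
	cp->a7 = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A7);
	if (cp->a7 == SBI_EXT_DBCN || cp->a7 == SBI_EXT_HSM || cp->a7 == SBI_EXT_SRST)
		ret = sbi_ext->handler(vcpu, run, &sbi_ret);
	else
		sbi_ret.err_val = SBI_ERR_NOT_SUPPORTED;
	nacl_shmem_gpr_write_cove(nshmem, KVM_ARCH_GUEST_A0, sbi_ret.err_val);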
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 5 +++
arch/riscv/kvm/cove.c | 54 +++++++++++++++++++++++++++++++
arch/riscv/kvm/vcpu_exit.c | 6 +++-
arch/riscv/kvm/vcpu_sbi.c | 2 ++
4 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index fc8633d..b63682f 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -126,6 +126,7 @@ int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu);
void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu);
void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu);
void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap);
+int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr);
int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size);
@@ -148,6 +149,10 @@ static inline int kvm_riscv_cove_vcpu_init(struct kvm_vcpu *vcpu) {return -1; }
static inline void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu) {}
static inline void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap) {}
+static inline int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ return -1;
+}
static inline int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa,
unsigned long size) {return -1; }
static inline int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm,
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 44095f6..87fa04b 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -147,6 +147,60 @@ void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu)
/* TODO */
}
+int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ void *nshmem;
+ const struct kvm_vcpu_sbi_extension *sbi_ext;
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+ struct kvm_cpu_trap utrap = { 0 };
+ struct kvm_vcpu_sbi_return sbi_ret = {
+ .out_val = 0,
+ .err_val = 0,
+ .utrap = &utrap,
+ };
+ bool ext_is_01 = false;
+ int ret = 1;
+
+ nshmem = nacl_shmem();
+ cp->a0 = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A0);
+ cp->a1 = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A1);
+ cp->a6 = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A6);
+ cp->a7 = nacl_shmem_gpr_read_cove(nshmem, KVM_ARCH_GUEST_A7);
+
+ /* TSM will only forward legacy console to the host */
+#ifdef CONFIG_RISCV_SBI_V01
+ if (cp->a7 == SBI_EXT_0_1_CONSOLE_PUTCHAR)
+ ext_is_01 = true;
+#endif
+
+ sbi_ext = kvm_vcpu_sbi_find_ext(vcpu, cp->a7);
+ if ((sbi_ext && sbi_ext->handler) && ((cp->a7 == SBI_EXT_DBCN) ||
+ (cp->a7 == SBI_EXT_HSM) || (cp->a7 == SBI_EXT_SRST) || ext_is_01)) {
+ ret = sbi_ext->handler(vcpu, run, &sbi_ret);
+ } else {
+ kvm_err("%s: SBI EXT %lx not supported for TVM\n", __func__, cp->a7);
+ /* Return error for unsupported SBI calls */
+ sbi_ret.err_val = SBI_ERR_NOT_SUPPORTED;
+ goto ecall_done;
+ }
+
+ if (ret < 0)
+ goto ecall_done;
+
+ ret = (sbi_ret.uexit) ? 0 : 1;
+
+ecall_done:
+ /*
+ * No need to update the sepc as TSM will take care of SEPC increment
+ * for ECALLS that won't be forwarded to the user space (e.g. console)
+ */
+ nacl_shmem_gpr_write_cove(nshmem, KVM_ARCH_GUEST_A0, sbi_ret.err_val);
+ if (!ext_is_01)
+ nacl_shmem_gpr_write_cove(nshmem, KVM_ARCH_GUEST_A1, sbi_ret.out_val);
+
+ return ret;
+}
+
int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva)
{
struct kvm_riscv_cove_page *tpage;
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index d00b9ee5..8944e29 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -207,11 +207,15 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
case EXC_INST_GUEST_PAGE_FAULT:
case EXC_LOAD_GUEST_PAGE_FAULT:
case EXC_STORE_GUEST_PAGE_FAULT:
+ //TODO: If the host runs in HS mode, this won't work as we don't
+ //read hstatus from the shared memory yet
if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
ret = gstage_page_fault(vcpu, run, trap);
break;
case EXC_SUPERVISOR_SYSCALL:
- if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
+ if (is_cove_vcpu(vcpu))
+ ret = kvm_riscv_cove_vcpu_sbi_ecall(vcpu, run);
+ else if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
ret = kvm_riscv_vcpu_sbi_ecall(vcpu, run);
break;
default:
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index 047ba10..d2f43bc 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -10,6 +10,8 @@
#include <linux/err.h>
#include <linux/kvm_host.h>
#include <asm/sbi.h>
+#include <asm/kvm_nacl.h>
+#include <asm/kvm_cove_sbi.h>
#include <asm/kvm_vcpu_sbi.h>
#ifndef CONFIG_RISCV_SBI_V01
--
2.25.1
The TSM takes care of most of the H extension CSR/fp save/restore
for any guest running in CoVE. It may choose to do the fp save/restore
lazily as well. The host has to do only minimal operations, such as timer
save/restore and interrupt state restore, during vcpu load/put.
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/cove.c | 12 ++++++++++--
arch/riscv/kvm/vcpu.c | 12 +++++++++++-
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 87fa04b..c93de9b 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -139,12 +139,20 @@ __always_inline bool kvm_riscv_cove_enabled(void)
void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu)
{
- /* TODO */
+ kvm_riscv_vcpu_timer_restore(vcpu);
}
void kvm_riscv_cove_vcpu_put(struct kvm_vcpu *vcpu)
{
- /* TODO */
+ void *nshmem;
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+
+ kvm_riscv_vcpu_timer_save(vcpu);
+ /* NACL is mandatory for CoVE */
+ nshmem = nacl_shmem();
+
+ /* Only VSIE needs to be read to manage the interrupt stuff */
+ csr->vsie = nacl_shmem_csr_read(nshmem, CSR_VSIE);
}
int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 8cf462c..3e04b78 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -972,6 +972,11 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
u64 henvcfg = kvm_riscv_vcpu_get_henvcfg(vcpu->arch.isa);
struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ if (is_cove_vcpu(vcpu)) {
+ kvm_riscv_cove_vcpu_load(vcpu);
+ goto skip_load;
+ }
+
if (kvm_riscv_nacl_sync_csr_available()) {
nshmem = nacl_shmem();
nacl_shmem_csr_write(nshmem, CSR_VSSTATUS, csr->vsstatus);
@@ -1010,9 +1015,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_riscv_vcpu_host_fp_save(&vcpu->arch.host_context);
kvm_riscv_vcpu_guest_fp_restore(&vcpu->arch.guest_context,
vcpu->arch.isa);
-
kvm_riscv_vcpu_aia_load(vcpu, cpu);
+skip_load:
vcpu->cpu = cpu;
}
@@ -1023,6 +1028,11 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
+ if (is_cove_vcpu(vcpu)) {
+ kvm_riscv_cove_vcpu_put(vcpu);
+ return;
+ }
+
kvm_riscv_vcpu_aia_put(vcpu);
kvm_riscv_vcpu_guest_fp_save(&vcpu->arch.guest_context,
--
2.25.1
TVM world switch takes a different path from the regular VM world switch, as
it needs to make an ecall to the TSM, and the TSM actually does the world switch.
The host doesn't need to save/restore any context as the TSM is expected
to do that on behalf of the host. The TSM updates the trap
information in the shared memory, which the host uses to figure out the
cause of the guest exit.
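For a TVM the vcpu entry path therefore reduces roughly to (condensed
sketch of kvm_riscv_cove_vcpu_switchto() below):

	/* Condensed sketch of the TVM entry path */
	if (unlikely(!tvmc->finalized_done)) {
		/* one-time finalize before the first run */
		if (!sbi_covh_tsm_finalize_tvm(tvmc->tvm_guest_id, cntx->sepc, cntx->a1))
			tvmc->finalized_done = true;
	}
	rc = sbi_covh_run_tvm_vcpu(tvmc->tvm_guest_id, vcpu->vcpu_idx); /* TSM does the world switch */
	if (rc)
		trap->scause = EXC_CUSTOM_KVM_COVE_RUN_FAIL;		/* surfaced as a custom exit */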
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/cove.c | 31 +++++++++++++++++++++++++++++--
arch/riscv/kvm/vcpu.c | 11 +++++++++++
arch/riscv/kvm/vcpu_exit.c | 10 ++++++++++
3 files changed, 50 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index c93de9b..c11db7a 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -275,9 +275,36 @@ int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hv
return rc;
}
-void kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap)
+void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap)
{
- /* TODO */
+ int rc;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_context *tvmc;
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ void *nshmem;
+
+ if (!kvm->arch.tvmc)
+ return;
+
+ tvmc = kvm->arch.tvmc;
+
+ nshmem = nacl_shmem();
+ /* Invoke finalize to mark the TVM as ready to run for the first time */
+ if (unlikely(!tvmc->finalized_done)) {
+
+ rc = sbi_covh_tsm_finalize_tvm(tvmc->tvm_guest_id, cntx->sepc, cntx->a1);
+ if (rc) {
+ kvm_err("TVM Finalized failed with %d\n", rc);
+ return;
+ }
+ tvmc->finalized_done = true;
+ }
+
+ rc = sbi_covh_run_tvm_vcpu(tvmc->tvm_guest_id, vcpu->vcpu_idx);
+ if (rc) {
+ trap->scause = EXC_CUSTOM_KVM_COVE_RUN_FAIL;
+ return;
+ }
}
void kvm_riscv_cove_vcpu_destroy(struct kvm_vcpu *vcpu)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 3e04b78..43a0b8c 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -1042,6 +1042,11 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_riscv_vcpu_timer_save(vcpu);
if (kvm_riscv_nacl_available()) {
+ /*
+ * For TVMs, we don't need a separate case as the TSM only updates
+ * the required CSRs during the world switch. All other CSR
+ * values should be zeroed out by the TSM anyway.
+ */
nshmem = nacl_shmem();
csr->vsstatus = nacl_shmem_csr_read(nshmem, CSR_VSSTATUS);
csr->vsie = nacl_shmem_csr_read(nshmem, CSR_VSIE);
@@ -1191,6 +1196,12 @@ static void noinstr kvm_riscv_vcpu_enter_exit(struct kvm_vcpu *vcpu,
gcntx->hstatus = csr_swap(CSR_HSTATUS, hcntx->hstatus);
}
+ trap->htval = nacl_shmem_csr_read(nshmem, CSR_HTVAL);
+ trap->htinst = nacl_shmem_csr_read(nshmem, CSR_HTINST);
+ } else if (is_cove_vcpu(vcpu)) {
+ nshmem = nacl_shmem();
+ kvm_riscv_cove_vcpu_switchto(vcpu, trap);
+
trap->htval = nacl_shmem_csr_read(nshmem, CSR_HTVAL);
trap->htinst = nacl_shmem_csr_read(nshmem, CSR_HTINST);
} else {
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index 8944e29..c46e7f2 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -218,6 +218,15 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
else if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
ret = kvm_riscv_vcpu_sbi_ecall(vcpu, run);
break;
+ case EXC_CUSTOM_KVM_COVE_RUN_FAIL:
+ if (likely(is_cove_vcpu(vcpu))) {
+ ret = -EACCES;
+ run->fail_entry.hardware_entry_failure_reason =
+ KVM_EXIT_FAIL_ENTRY_COVE_RUN_VCPU;
+ run->fail_entry.cpu = vcpu->cpu;
+ run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ }
+ break;
default:
break;
}
@@ -225,6 +234,7 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
/* Print details in-case of error */
if (ret < 0) {
kvm_err("VCPU exit error %d\n", ret);
+ //TODO: These values are bogus/stale for a TVM. Improve it
kvm_err("SEPC=0x%lx SSTATUS=0x%lx HSTATUS=0x%lx\n",
vcpu->arch.guest_context.sepc,
vcpu->arch.guest_context.sstatus,
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
The TSM manages the htimedelta/vstimecmp for the TVM and shares them
with the host so that it can properly schedule the hrtimer to keep the timer
interrupt ticking. The TSM only sets htimedelta when the first VCPU is run,
to make sure the host is not able to control the start time of the VM. The TSM
updates vstimecmp at every vmexit and ignores any write to vstimecmp from the host.
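Concretely, the host only learns the TVM's htimedelta by reading it back
from the NACL shared memory after the first vcpu run (condensed sketch of
the change below):

	/* Condensed sketch: pick up the TSM-chosen htimedelta once,
	 * after the first successful TVM vcpu run. */
	if (unlikely(!gt->time_delta))
		gt->time_delta = nacl_shmem_csr_read(nshmem, CSR_HTIMEDELTA);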
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/cove.c | 8 ++++++++
arch/riscv/kvm/vcpu_timer.c | 26 +++++++++++++++++++++++++-
2 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index c11db7a..4a8a8db 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -282,6 +282,7 @@ void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_
struct kvm_cove_tvm_context *tvmc;
struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
void *nshmem;
+ struct kvm_guest_timer *gt = &kvm->arch.timer;
if (!kvm->arch.tvmc)
return;
@@ -305,6 +306,13 @@ void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_
trap->scause = EXC_CUSTOM_KVM_COVE_RUN_FAIL;
return;
}
+
+ /* Read htimedelta from shmem. Given it's written by TSM only when we
+ * run first VCPU, we need to update this here rather than in timer
+ * init.
+ */
+ if (unlikely(!gt->time_delta))
+ gt->time_delta = nacl_shmem_csr_read(nshmem, CSR_HTIMEDELTA);
}
void kvm_riscv_cove_vcpu_destroy(struct kvm_vcpu *vcpu)
diff --git a/arch/riscv/kvm/vcpu_timer.c b/arch/riscv/kvm/vcpu_timer.c
index 71a4560..f059e14 100644
--- a/arch/riscv/kvm/vcpu_timer.c
+++ b/arch/riscv/kvm/vcpu_timer.c
@@ -14,6 +14,7 @@
#include <asm/delay.h>
#include <asm/kvm_nacl.h>
#include <asm/kvm_vcpu_timer.h>
+#include <asm/kvm_cove.h>
static u64 kvm_riscv_current_cycles(struct kvm_guest_timer *gt)
{
@@ -71,6 +72,10 @@ static int kvm_riscv_vcpu_timer_cancel(struct kvm_vcpu_timer *t)
static int kvm_riscv_vcpu_update_vstimecmp(struct kvm_vcpu *vcpu, u64 ncycles)
{
+ /* Host is not allowed to update the vstimecmp for the TVM */
+ if (is_cove_vcpu(vcpu))
+ return 0;
+
#if defined(CONFIG_32BIT)
nacl_csr_write(CSR_VSTIMECMP, ncycles & 0xFFFFFFFF);
nacl_csr_write(CSR_VSTIMECMPH, ncycles >> 32);
@@ -221,6 +226,11 @@ int kvm_riscv_vcpu_set_reg_timer(struct kvm_vcpu *vcpu,
ret = -EOPNOTSUPP;
break;
case KVM_REG_RISCV_TIMER_REG(time):
+ /* For trusted VMs we cannot update htimedelta. We can just
+ * read it from shared memory.
+ */
+ if (is_cove_vcpu(vcpu))
+ return -EOPNOTSUPP;
gt->time_delta = reg_val - get_cycles64();
break;
case KVM_REG_RISCV_TIMER_REG(compare):
@@ -287,6 +297,7 @@ static void kvm_riscv_vcpu_update_timedelta(struct kvm_vcpu *vcpu)
{
struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer;
+
#if defined(CONFIG_32BIT)
nacl_csr_write(CSR_HTIMEDELTA, (u32)(gt->time_delta));
nacl_csr_write(CSR_HTIMEDELTAH, (u32)(gt->time_delta >> 32));
@@ -299,6 +310,10 @@ void kvm_riscv_vcpu_timer_restore(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_timer *t = &vcpu->arch.timer;
+ /* While in CoVE, HOST must not manage HTIMEDELTA or VSTIMECMP for TVM */
+ if (is_cove_vcpu(vcpu))
+ goto skip_hcsr_update;
+
kvm_riscv_vcpu_update_timedelta(vcpu);
if (!t->sstc_enabled)
@@ -311,6 +326,7 @@ void kvm_riscv_vcpu_timer_restore(struct kvm_vcpu *vcpu)
nacl_csr_write(CSR_VSTIMECMP, t->next_cycles);
#endif
+skip_hcsr_update:
/* timer should be enabled for the remaining operations */
if (unlikely(!t->init_done))
return;
@@ -358,5 +374,13 @@ void kvm_riscv_guest_timer_init(struct kvm *kvm)
struct kvm_guest_timer *gt = &kvm->arch.timer;
riscv_cs_get_mult_shift(&gt->nsec_mult, &gt->nsec_shift);
- gt->time_delta = -get_cycles64();
+ if (is_cove_vm(kvm)) {
+ /* For TVMs htimedelta is managed by TSM and it's communicated using
+ * NACL shmem interface when first time VCPU is run. so we read it in
+ * kvm_riscv_cove_vcpu_switchto() where we enter VCPUs.
+ */
+ gt->time_delta = 0;
+ } else {
+ gt->time_delta = -get_cycles64();
+ }
}
--
2.25.1
Skip the HVIP update as the host shouldn't be able to inject
an interrupt directly into a TVM.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 43a0b8c..20d4800 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -822,7 +822,10 @@ void kvm_riscv_vcpu_sync_interrupts(struct kvm_vcpu *vcpu)
/* Read current HVIP and VSIE CSRs */
csr->vsie = nacl_csr_read(CSR_VSIE);
- /* Sync-up HVIP.VSSIP bit changes does by Guest */
+ /*
+ * Sync up the HVIP.VSSIP bit changes done by the Guest. For TVMs,
+ * the HVIP is not updated by the TSM. Expect it to be zero.
+ */
hvip = nacl_csr_read(CSR_HVIP);
if ((csr->hvip ^ hvip) & (1UL << IRQ_VS_SOFT)) {
if (hvip & (1UL << IRQ_VS_SOFT)) {
@@ -1305,8 +1308,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
*/
kvm_riscv_vcpu_flush_interrupts(vcpu);
- /* Update HVIP CSR for current CPU */
- kvm_riscv_update_hvip(vcpu);
+ /* Update HVIP CSR for current CPU only for non TVMs */
+ if (!is_cove_vcpu(vcpu))
+ kvm_riscv_update_hvip(vcpu);
if (ret <= 0 ||
kvm_riscv_gstage_vmid_ver_changed(vcpu->kvm) ||
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
This patch adds the CoVE interrupt management extension (COVI) details
to the SBI header file.
Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Rajnesh Kanwal <[email protected]>
---
arch/riscv/include/asm/sbi.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index c5a5526..bbea922 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -33,6 +33,7 @@ enum sbi_ext_id {
SBI_EXT_DBCN = 0x4442434E,
SBI_EXT_NACL = 0x4E41434C,
SBI_EXT_COVH = 0x434F5648,
+ SBI_EXT_COVI = 0x434F5649,
/* Experimentals extensions must lie within this range */
SBI_EXT_EXPERIMENTAL_START = 0x08000000,
@@ -369,6 +370,20 @@ enum sbi_ext_covh_fid {
SBI_EXT_COVH_TVM_INITIATE_FENCE,
};
+enum sbi_ext_covi_fid {
+ SBI_EXT_COVI_TVM_AIA_INIT,
+ SBI_EXT_COVI_TVM_CPU_SET_IMSIC_ADDR,
+ SBI_EXT_COVI_TVM_CONVERT_IMSIC,
+ SBI_EXT_COVI_TVM_RECLAIM_IMSIC,
+ SBI_EXT_COVI_TVM_CPU_BIND_IMSIC,
+ SBI_EXT_COVI_TVM_CPU_UNBIND_IMSIC_BEGIN,
+ SBI_EXT_COVI_TVM_CPU_UNBIND_IMSIC_END,
+ SBI_EXT_COVI_TVM_CPU_INJECT_EXT_INTERRUPT,
+ SBI_EXT_COVI_TVM_REBIND_IMSIC_BEGIN,
+ SBI_EXT_COVI_TVM_REBIND_IMSIC_CLONE,
+ SBI_EXT_COVI_TVM_REBIND_IMSIC_END,
+};
+
enum sbi_cove_page_type {
SBI_COVE_PAGE_4K,
SBI_COVE_PAGE_2MB,
@@ -409,6 +424,21 @@ struct sbi_cove_tvm_create_params {
unsigned long tvm_state_addr;
};
+struct sbi_cove_tvm_aia_params {
+ /* The base address is the address of the IMSIC with group ID, hart ID, and guest ID of 0 */
+ uint64_t imsic_base_addr;
+ /* The number of group index bits in an IMSIC address */
+ uint32_t group_index_bits;
+ /* The location of the group index in an IMSIC address. Must be >= 24. */
+ uint32_t group_index_shift;
+ /* The number of hart index bits in an IMSIC address */
+ uint32_t hart_index_bits;
+ /* The number of guest index bits in an IMSIC address. Must be >= log2(guests/hart + 1) */
+ uint32_t guest_index_bits;
+ /* The number of guest interrupt files to be implemented per vCPU */
+ uint32_t guests_per_hart;
+};
+
#define SBI_SPEC_VERSION_DEFAULT 0x1
#define SBI_SPEC_VERSION_MAJOR_SHIFT 24
#define SBI_SPEC_VERSION_MAJOR_MASK 0x7f
--
2.25.1
The CoVE specification defines a separate SBI extension to manage interrupts
in a TVM. This extension is known as COVI, as both the host & guest
interfaces access these functions.
This patch implements the functions defined by COVI.
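For reference, moving a TVM vcpu's interrupt file between guest interrupt
files is expected to follow the begin/fence/clone/end sequence below (a
sketch only; tvm_gid, vcpu_id and new_hgei are placeholders, and the KVM
glue that drives this comes in the next patch):

	/* Sketch: rebind a TVM vcpu's IMSIC guest file */
	rc = sbi_covi_rebind_vcpu_imsic_begin(tvm_gid, vcpu_id, BIT(new_hgei));
	if (!rc) {
		/* a TVM TLB invalidation (fence) sequence goes here */
		rc = sbi_covi_rebind_vcpu_imsic_clone(tvm_gid, vcpu_id); /* on the old pCPU */
		if (!rc)
			rc = sbi_covi_rebind_vcpu_imsic_end(tvm_gid, vcpu_id);
	}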
Co-developed-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove_sbi.h | 20 ++++
arch/riscv/kvm/cove_sbi.c | 164 ++++++++++++++++++++++++++
2 files changed, 184 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove_sbi.h b/arch/riscv/include/asm/kvm_cove_sbi.h
index df7d88c..0759f70 100644
--- a/arch/riscv/include/asm/kvm_cove_sbi.h
+++ b/arch/riscv/include/asm/kvm_cove_sbi.h
@@ -32,6 +32,7 @@
#define nacl_shmem_gpr_read_cove(__s, __g) \
nacl_shmem_scratch_read_long(__s, get_scratch_gpr_offset(__g))
+/* Functions related to CoVE Host Interface (COVH) Extension */
int sbi_covh_tsm_get_info(struct sbi_cove_tsm_info *tinfo_addr);
int sbi_covh_tvm_initiate_fence(unsigned long tvmid);
int sbi_covh_tsm_initiate_fence(void);
@@ -58,4 +59,23 @@ int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid,
int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid);
+/* Functions related to CoVE Interrupt Management(COVI) Extension */
+int sbi_covi_tvm_aia_init(unsigned long tvm_gid, struct sbi_cove_tvm_aia_params *tvm_aia_params);
+int sbi_covi_set_vcpu_imsic_addr(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long imsic_addr);
+int sbi_covi_convert_imsic(unsigned long imsic_addr);
+int sbi_covi_reclaim_imsic(unsigned long imsic_addr);
+int sbi_covi_bind_vcpu_imsic(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long imsic_mask);
+int sbi_covi_unbind_vcpu_imsic_begin(unsigned long tvm_gid, unsigned long vcpu_id);
+int sbi_covi_unbind_vcpu_imsic_end(unsigned long tvm_gid, unsigned long vcpu_id);
+int sbi_covi_inject_external_interrupt(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long interrupt_id);
+int sbi_covi_rebind_vcpu_imsic_begin(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long imsic_mask);
+int sbi_covi_rebind_vcpu_imsic_clone(unsigned long tvm_gid, unsigned long vcpu_id);
+int sbi_covi_rebind_vcpu_imsic_end(unsigned long tvm_gid, unsigned long vcpu_id);
+
+
+
#endif
diff --git a/arch/riscv/kvm/cove_sbi.c b/arch/riscv/kvm/cove_sbi.c
index bf037f6..a8901ac 100644
--- a/arch/riscv/kvm/cove_sbi.c
+++ b/arch/riscv/kvm/cove_sbi.c
@@ -18,6 +18,170 @@
#define RISCV_COVE_ALIGN_4KB (1UL << 12)
+int sbi_covi_tvm_aia_init(unsigned long tvm_gid,
+ struct sbi_cove_tvm_aia_params *tvm_aia_params)
+{
+ struct sbiret ret;
+
+ unsigned long pa = __pa(tvm_aia_params);
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_AIA_INIT, tvm_gid, pa,
+ sizeof(*tvm_aia_params), 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covi_set_vcpu_imsic_addr(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long imsic_addr)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CPU_SET_IMSIC_ADDR,
+ tvm_gid, vcpu_id, imsic_addr, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+/*
+ * Converts the guest interrupt file at `imsic_addr` for use with a TVM.
+ * The guest interrupt file must not be used by the caller until reclaim.
+ */
+int sbi_covi_convert_imsic(unsigned long imsic_addr)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CONVERT_IMSIC,
+ imsic_addr, 0, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covi_reclaim_imsic(unsigned long imsic_addr)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_RECLAIM_IMSIC,
+ imsic_addr, 0, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+/*
+ * Binds a vCPU to this physical CPU and the specified set of confidential guest
+ * interrupt files.
+ */
+int sbi_covi_bind_vcpu_imsic(unsigned long tvm_gid, unsigned long vcpu_id,
+ unsigned long imsic_mask)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CPU_BIND_IMSIC, tvm_gid,
+ vcpu_id, imsic_mask, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+/*
+ * Begins the unbind process for the specified vCPU from this physical CPU and its guest
+ * interrupt files. The host must complete a TLB invalidation sequence for the TVM before
+ * completing the unbind with `unbind_vcpu_imsic_end()`.
+ */
+int sbi_covi_unbind_vcpu_imsic_begin(unsigned long tvm_gid,
+ unsigned long vcpu_id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CPU_UNBIND_IMSIC_BEGIN,
+ tvm_gid, vcpu_id, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+/*
+ * Completes the unbind process for the specified vCPU from this physical CPU and its guest
+ * interrupt files.
+ */
+int sbi_covi_unbind_vcpu_imsic_end(unsigned long tvm_gid, unsigned long vcpu_id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CPU_UNBIND_IMSIC_END,
+ tvm_gid, vcpu_id, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+/*
+ * Injects an external interrupt into the specified vCPU. The interrupt ID must
+ * have been allowed with `allow_external_interrupt()` by the guest.
+ */
+int sbi_covi_inject_external_interrupt(unsigned long tvm_gid,
+ unsigned long vcpu_id,
+ unsigned long interrupt_id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_CPU_INJECT_EXT_INTERRUPT,
+ tvm_gid, vcpu_id, interrupt_id, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covi_rebind_vcpu_imsic_begin(unsigned long tvm_gid,
+ unsigned long vcpu_id,
+ unsigned long imsic_mask)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_REBIND_IMSIC_BEGIN,
+ tvm_gid, vcpu_id, imsic_mask, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covi_rebind_vcpu_imsic_clone(unsigned long tvm_gid,
+ unsigned long vcpu_id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_REBIND_IMSIC_CLONE,
+ tvm_gid, vcpu_id, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covi_rebind_vcpu_imsic_end(unsigned long tvm_gid, unsigned long vcpu_id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVI, SBI_EXT_COVI_TVM_REBIND_IMSIC_END,
+ tvm_gid, vcpu_id, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
int sbi_covh_tsm_get_info(struct sbi_cove_tsm_info *tinfo_addr)
{
struct sbiret ret;
--
2.25.1
The COVI SBI extension defines the functions related to interrupt
management for TVMs. These functions are the glue logic between the
AIA code and the actual CoVE interrupt SBI extension (COVI).
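One detail worth calling out: the unbind sequence has to execute on the
physical CPU the interrupt file was bound to, which the code below achieves
with an IPI (condensed sketch):

	/* Condensed sketch: run the unbind sequence on the old physical CPU */
	cpumask_clear(&tmpmask);
	cpumask_set_cpu(old_pcpu, &tmpmask);
	on_each_cpu_mask(&tmpmask, kvm_cove_imsic_unbind, vcpu, 1);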
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 34 ++++
arch/riscv/kvm/cove.c | 256 ++++++++++++++++++++++++++++++
2 files changed, 290 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index b63682f..74bad2f 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -61,10 +61,19 @@ struct kvm_riscv_cove_page {
unsigned long gpa;
};
+struct imsic_tee_state {
+ bool bind_required;
+ bool bound;
+ int vsfile_hgei;
+};
+
struct kvm_cove_tvm_vcpu_context {
struct kvm_vcpu *vcpu;
/* Pages storing each vcpu state of the TVM in TSM */
struct kvm_riscv_cove_page vcpu_state;
+
+ /* Per VCPU imsic state */
+ struct imsic_tee_state imsic;
};
struct kvm_cove_tvm_context {
@@ -133,6 +142,16 @@ int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned lo
int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva);
/* Fence related function */
int kvm_riscv_cove_tvm_fence(struct kvm_vcpu *vcpu);
+
+/* AIA related CoVE functions */
+int kvm_riscv_cove_aia_init(struct kvm *kvm);
+int kvm_riscv_cove_vcpu_inject_interrupt(struct kvm_vcpu *vcpu, unsigned long iid);
+int kvm_riscv_cove_vcpu_imsic_unbind(struct kvm_vcpu *vcpu, int old_cpu);
+int kvm_riscv_cove_vcpu_imsic_bind(struct kvm_vcpu *vcpu, unsigned long imsic_mask);
+int kvm_riscv_cove_vcpu_imsic_rebind(struct kvm_vcpu *vcpu, int old_pcpu);
+int kvm_riscv_cove_aia_claim_imsic(struct kvm_vcpu *vcpu, phys_addr_t imsic_pa);
+int kvm_riscv_cove_aia_convert_imsic(struct kvm_vcpu *vcpu, phys_addr_t imsic_pa);
+int kvm_riscv_cove_vcpu_imsic_addr(struct kvm_vcpu *vcpu);
#else
static inline bool kvm_riscv_cove_enabled(void) {return false; };
static inline int kvm_riscv_cove_init(void) { return -1; }
@@ -162,6 +181,21 @@ static inline int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm,
}
static inline int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu,
gpa_t gpa, unsigned long hva) {return -1; }
+/* AIA related TEE functions */
+static inline int kvm_riscv_cove_aia_init(struct kvm *kvm) { return -1; }
+static inline int kvm_riscv_cove_vcpu_inject_interrupt(struct kvm_vcpu *vcpu,
+ unsigned long iid) { return -1; }
+static inline int kvm_riscv_cove_vcpu_imsic_unbind(struct kvm_vcpu *vcpu,
+ int old_cpu) { return -1; }
+static inline int kvm_riscv_cove_vcpu_imsic_bind(struct kvm_vcpu *vcpu,
+ unsigned long imsic_mask) { return -1; }
+static inline int kvm_riscv_cove_aia_claim_imsic(struct kvm_vcpu *vcpu,
+ phys_addr_t imsic_pa) { return -1; }
+static inline int kvm_riscv_cove_aia_convert_imsic(struct kvm_vcpu *vcpu,
+ phys_addr_t imsic_pa) { return -1; }
+static inline int kvm_riscv_cove_vcpu_imsic_addr(struct kvm_vcpu *vcpu) { return -1; }
+static inline int kvm_riscv_cove_vcpu_imsic_rebind(struct kvm_vcpu *vcpu,
+ int old_pcpu) { return -1; }
#endif /* CONFIG_RISCV_COVE_HOST */
#endif /* __KVM_RISCV_COVE_H */
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 4a8a8db..154b01a 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -8,6 +8,7 @@
* Atish Patra <[email protected]>
*/
+#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/err.h>
#include <linux/kvm_host.h>
@@ -137,6 +138,247 @@ __always_inline bool kvm_riscv_cove_enabled(void)
return riscv_cove_enabled;
}
+static void kvm_cove_imsic_clone(void *info)
+{
+ int rc;
+ struct kvm_vcpu *vcpu = info;
+ struct kvm *kvm = vcpu->kvm;
+
+ rc = sbi_covi_rebind_vcpu_imsic_clone(kvm->arch.tvmc->tvm_guest_id, vcpu->vcpu_idx);
+ if (rc)
+ kvm_err("Imsic clone failed guest %ld vcpu %d pcpu %d\n",
+ kvm->arch.tvmc->tvm_guest_id, vcpu->vcpu_idx, smp_processor_id());
+}
+
+static void kvm_cove_imsic_unbind(void *info)
+{
+ struct kvm_vcpu *vcpu = info;
+ struct kvm_cove_tvm_context *tvmc = vcpu->kvm->arch.tvmc;
+
+ /*TODO: We probably want to return but the remote function call doesn't allow any return */
+ if (sbi_covi_unbind_vcpu_imsic_begin(tvmc->tvm_guest_id, vcpu->vcpu_idx))
+ return;
+
+ /* This may issue IPIs to running vcpus. */
+ if (kvm_riscv_cove_tvm_fence(vcpu))
+ return;
+
+ if (sbi_covi_unbind_vcpu_imsic_end(tvmc->tvm_guest_id, vcpu->vcpu_idx))
+ return;
+
+ kvm_info("Unbind success for guest %ld vcpu %d pcpu %d\n",
+ tvmc->tvm_guest_id, vcpu->vcpu_idx, smp_processor_id());
+}
+
+int kvm_riscv_cove_vcpu_imsic_addr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cove_tvm_context *tvmc;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_vcpu_aia *vaia = &vcpu->arch.aia_context;
+ int ret;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ ret = sbi_covi_set_vcpu_imsic_addr(tvmc->tvm_guest_id, vcpu->vcpu_idx, vaia->imsic_addr);
+ if (ret)
+ return -EPERM;
+
+ return 0;
+}
+
+int kvm_riscv_cove_aia_convert_imsic(struct kvm_vcpu *vcpu, phys_addr_t imsic_pa)
+{
+ struct kvm *kvm = vcpu->kvm;
+ int ret;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ ret = sbi_covi_convert_imsic(imsic_pa);
+ if (ret)
+ return -EPERM;
+
+ ret = kvm_riscv_cove_fence();
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+int kvm_riscv_cove_aia_claim_imsic(struct kvm_vcpu *vcpu, phys_addr_t imsic_pa)
+{
+ int ret;
+ struct kvm *kvm = vcpu->kvm;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ ret = sbi_covi_reclaim_imsic(imsic_pa);
+ if (ret)
+ return -EPERM;
+
+ return 0;
+}
+
+int kvm_riscv_cove_vcpu_imsic_rebind(struct kvm_vcpu *vcpu, int old_pcpu)
+{
+ struct kvm_cove_tvm_context *tvmc;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_vcpu_context *tvcpu = vcpu->arch.tc;
+ int ret;
+ cpumask_t tmpmask;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ ret = sbi_covi_rebind_vcpu_imsic_begin(tvmc->tvm_guest_id, vcpu->vcpu_idx,
+ BIT(tvcpu->imsic.vsfile_hgei));
+ if (ret) {
+ kvm_err("Imsic rebind begin failed guest %ld vcpu %d pcpu %d\n",
+ tvmc->tvm_guest_id, vcpu->vcpu_idx, smp_processor_id());
+ return ret;
+ }
+
+ ret = kvm_riscv_cove_tvm_fence(vcpu);
+ if (ret)
+ return ret;
+
+ cpumask_clear(&tmpmask);
+ cpumask_set_cpu(old_pcpu, &tmpmask);
+ on_each_cpu_mask(&tmpmask, kvm_cove_imsic_clone, vcpu, 1);
+
+ ret = sbi_covi_rebind_vcpu_imsic_end(tvmc->tvm_guest_id, vcpu->vcpu_idx);
+ if (ret) {
+ kvm_err("Imsic rebind end failed guest %ld vcpu %d pcpu %d\n",
+ tvmc->tvm_guest_id, vcpu->vcpu_idx, smp_processor_id());
+ return ret;
+ }
+
+ tvcpu->imsic.bound = true;
+
+ return 0;
+}
+
+int kvm_riscv_cove_vcpu_imsic_bind(struct kvm_vcpu *vcpu, unsigned long imsic_mask)
+{
+ struct kvm_cove_tvm_context *tvmc;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_vcpu_context *tvcpu = vcpu->arch.tc;
+ int ret;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ ret = sbi_covi_bind_vcpu_imsic(tvmc->tvm_guest_id, vcpu->vcpu_idx, imsic_mask);
+ if (ret) {
+ kvm_err("Imsic bind failed for imsic %lx guest %ld vcpu %d pcpu %d\n",
+ imsic_mask, tvmc->tvm_guest_id, vcpu->vcpu_idx, smp_processor_id());
+ return ret;
+ }
+ tvcpu->imsic.bound = true;
+	pr_debug("%s: bind success vcpu %d hgei %d pcpu %d\n", __func__,
+		 vcpu->vcpu_idx, tvcpu->imsic.vsfile_hgei, smp_processor_id());
+
+ return 0;
+}
+
+int kvm_riscv_cove_vcpu_imsic_unbind(struct kvm_vcpu *vcpu, int old_pcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_vcpu_context *tvcpu = vcpu->arch.tc;
+ cpumask_t tmpmask;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ /* No need to unbind if it is not bound already */
+ if (!tvcpu->imsic.bound)
+ return 0;
+
+	/* Clear the flag first so that a failure does not lead to repeated unbind attempts */
+ tvcpu->imsic.bound = false;
+
+ if (smp_processor_id() == old_pcpu) {
+ kvm_cove_imsic_unbind(vcpu);
+ } else {
+ /* Unbind can be invoked from a different physical cpu */
+ cpumask_clear(&tmpmask);
+ cpumask_set_cpu(old_pcpu, &tmpmask);
+ on_each_cpu_mask(&tmpmask, kvm_cove_imsic_unbind, vcpu, 1);
+ }
+
+ return 0;
+}
+
+int kvm_riscv_cove_vcpu_inject_interrupt(struct kvm_vcpu *vcpu, unsigned long iid)
+{
+ struct kvm_cove_tvm_context *tvmc;
+ struct kvm *kvm = vcpu->kvm;
+ int ret;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ ret = sbi_covi_inject_external_interrupt(tvmc->tvm_guest_id, vcpu->vcpu_idx, iid);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+int kvm_riscv_cove_aia_init(struct kvm *kvm)
+{
+ struct kvm_aia *aia = &kvm->arch.aia;
+ struct sbi_cove_tvm_aia_params *tvm_aia;
+ struct kvm_vcpu *vcpu;
+ struct kvm_cove_tvm_context *tvmc;
+ int ret;
+
+ if (!kvm->arch.tvmc)
+ return -EINVAL;
+
+ tvmc = kvm->arch.tvmc;
+
+ /* Sanity Check */
+ if (aia->aplic_addr != KVM_RISCV_AIA_UNDEF_ADDR)
+ return -EINVAL;
+
+	/* TVMs must have a physical guest interrupt file */
+ if (aia->mode != KVM_DEV_RISCV_AIA_MODE_HWACCEL)
+ return -ENODEV;
+
+ tvm_aia = kzalloc(sizeof(*tvm_aia), GFP_KERNEL);
+ if (!tvm_aia)
+ return -ENOMEM;
+
+	/* The IMSIC base address corresponds to group ID, hart ID & guest ID of 0 */
+ vcpu = kvm_get_vcpu_by_id(kvm, 0);
+ tvm_aia->imsic_base_addr = vcpu->arch.aia_context.imsic_addr;
+
+ tvm_aia->group_index_bits = aia->nr_group_bits;
+ tvm_aia->group_index_shift = aia->nr_group_shift;
+ tvm_aia->hart_index_bits = aia->nr_hart_bits;
+ tvm_aia->guest_index_bits = aia->nr_guest_bits;
+ /* Nested TVMs are not supported yet */
+ tvm_aia->guests_per_hart = 0;
+
+ ret = sbi_covi_tvm_aia_init(tvmc->tvm_guest_id, tvm_aia);
+ if (ret)
+ kvm_err("TVM AIA init failed with rc %d\n", ret);
+
+ return ret;
+}
+
void kvm_riscv_cove_vcpu_load(struct kvm_vcpu *vcpu)
{
kvm_riscv_vcpu_timer_restore(vcpu);
@@ -283,6 +525,7 @@ void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_
struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
void *nshmem;
struct kvm_guest_timer *gt = &kvm->arch.timer;
+ struct kvm_cove_tvm_vcpu_context *tvcpuc = vcpu->arch.tc;
if (!kvm->arch.tvmc)
return;
@@ -301,6 +544,19 @@ void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_
tvmc->finalized_done = true;
}
+	/*
+	 * Bind the vsfile here instead of during the new vsfile allocation
+	 * because the COVH bind call requires the TVM to be in the finalized state.
+	 */
+ if (tvcpuc->imsic.bind_required) {
+ tvcpuc->imsic.bind_required = false;
+ rc = kvm_riscv_cove_vcpu_imsic_bind(vcpu, BIT(tvcpuc->imsic.vsfile_hgei));
+ if (rc) {
+ kvm_err("bind failed with rc %d\n", rc);
+ return;
+ }
+ }
+
rc = sbi_covh_run_tvm_vcpu(tvmc->tvm_guest_id, vcpu->vcpu_idx);
if (rc) {
trap->scause = EXC_CUSTOM_KVM_COVE_RUN_FAIL;
--
2.25.1
For TVMs, the host must not support AIA CSR emulation. In addition to
that, the CSR updates during vcpu load/put are unnecessary, as the CSR
state must not be visible to the host.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/aia.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kvm/aia.c b/arch/riscv/kvm/aia.c
index 71216e1..e3da661 100644
--- a/arch/riscv/kvm/aia.c
+++ b/arch/riscv/kvm/aia.c
@@ -15,6 +15,7 @@
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <asm/hwcap.h>
+#include <asm/kvm_cove.h>
struct aia_hgei_control {
raw_spinlock_t lock;
@@ -134,7 +135,7 @@ void kvm_riscv_vcpu_aia_load(struct kvm_vcpu *vcpu, int cpu)
{
struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr;
- if (!kvm_riscv_aia_available())
+ if (!kvm_riscv_aia_available() || is_cove_vcpu(vcpu))
return;
csr_write(CSR_VSISELECT, csr->vsiselect);
@@ -152,7 +153,7 @@ void kvm_riscv_vcpu_aia_put(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr;
- if (!kvm_riscv_aia_available())
+ if (!kvm_riscv_aia_available() || is_cove_vcpu(vcpu))
return;
csr->vsiselect = csr_read(CSR_VSISELECT);
@@ -370,6 +371,10 @@ int kvm_riscv_vcpu_aia_rmw_ireg(struct kvm_vcpu *vcpu, unsigned int csr_num,
if (!kvm_riscv_aia_available())
return KVM_INSN_ILLEGAL_TRAP;
+ /* TVMs do not support AIA emulation */
+ if (is_cove_vcpu(vcpu))
+ return KVM_INSN_EXIT_TO_USER_SPACE;
+
/* First try to emulate in kernel space */
isel = csr_read(CSR_VSISELECT) & ISELECT_MASK;
if (isel >= ISELECT_IPRIO0 && isel <= ISELECT_IPRIO15)
@@ -529,6 +534,9 @@ void kvm_riscv_aia_enable(void)
if (!kvm_riscv_aia_available())
return;
+ if (unlikely(kvm_riscv_cove_enabled()))
+ goto enable_gext;
+
aia_set_hvictl(false);
csr_write(CSR_HVIPRIO1, 0x0);
csr_write(CSR_HVIPRIO2, 0x0);
@@ -539,6 +547,7 @@ void kvm_riscv_aia_enable(void)
csr_write(CSR_HVIPRIO2H, 0x0);
#endif
+enable_gext:
/* Enable per-CPU SGEI interrupt */
enable_percpu_irq(hgei_parent_irq,
irq_get_trigger_type(hgei_parent_irq));
@@ -559,7 +568,9 @@ void kvm_riscv_aia_disable(void)
csr_clear(CSR_HIE, BIT(IRQ_S_GEXT));
disable_percpu_irq(hgei_parent_irq);
- aia_set_hvictl(false);
+	/* The host is not allowed to modify hvictl for TVMs */
+ if (!unlikely(kvm_riscv_cove_enabled()))
+ aia_set_hvictl(false);
raw_spin_lock_irqsave(&hgctrl->lock, flags);
--
2.25.1
The hardware enable/disable path only needs to perform AIA/NACL enable/disable
for TVMs. All other operations, i.e. interrupt/exception delegation and
counter access, must be provided by the TSM as the host doesn't have control
of these operations for a TVM.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/main.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 45ee62d..842b78d 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -13,6 +13,7 @@
#include <asm/hwcap.h>
#include <asm/kvm_nacl.h>
#include <asm/sbi.h>
+#include <asm/kvm_cove.h>
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
@@ -29,6 +30,15 @@ int kvm_arch_hardware_enable(void)
if (rc)
return rc;
+	/*
+	 * We just need to invoke AIA enable for CoVE if the host is in VS mode.
+	 * However, if the host is running in HS mode, we need to initialize
+	 * other CSRs as well for legacy VMs.
+	 * TODO: Handle the host in HS mode use case.
+	 */
+ if (unlikely(kvm_riscv_cove_enabled()))
+ goto enable_aia;
+
hedeleg = 0;
hedeleg |= (1UL << EXC_INST_MISALIGNED);
hedeleg |= (1UL << EXC_BREAKPOINT);
@@ -49,6 +59,7 @@ int kvm_arch_hardware_enable(void)
csr_write(CSR_HVIP, 0);
+enable_aia:
kvm_riscv_aia_enable();
return 0;
@@ -58,6 +69,8 @@ void kvm_arch_hardware_disable(void)
{
kvm_riscv_aia_disable();
+ if (unlikely(kvm_riscv_cove_enabled()))
+ goto disable_nacl;
/*
* After clearing the hideleg CSR, the host kernel will receive
* spurious interrupts if hvip CSR has pending interrupts and the
@@ -69,6 +82,7 @@ void kvm_arch_hardware_disable(void)
csr_write(CSR_HEDELEG, 0);
csr_write(CSR_HIDELEG, 0);
+disable_nacl:
kvm_riscv_nacl_disable();
}
--
2.25.1
The KVM_INTERRUPT ioctl is used with a userspace emulated IRQCHIP.
The TEE use case does not support that yet. Return an appropriate error
in case any VMM tries to invoke that operation.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 20d4800..65f87e1 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -716,6 +716,9 @@ long kvm_arch_vcpu_async_ioctl(struct file *filp,
if (ioctl == KVM_INTERRUPT) {
struct kvm_interrupt irq;
+ /* We do not support user space emulated IRQCHIP for TVMs yet */
+ if (is_cove_vcpu(vcpu))
+ return -ENXIO;
if (copy_from_user(&irq, argp, sizeof(irq)))
return -EFAULT;
--
2.25.1
The AIA support for TVMs is split between the host and the TSM.
While the host allocates the vsfile, the TSM controls the gstage mapping
and any updates to it. The host must not be able to inject interrupts
into a TVM. Thus, interrupt injection has to happen via the TSM, and only
for the interrupts allowed by the guest.
The swfile maintained by the host is not useful for TVMs either, as
TVMs only work in HW_ACCEL mode. The TSM does maintain a swfile
for the vcpu internally. The swfile allocation in the host is kept
as is to avoid further bifurcation of the code.
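To make the resulting split concrete, the host-side injection path added in
aia_imsic.c below ends up looking roughly like this (illustrative excerpt,
not a complete function):

	/* Interrupt injection while a VS-file is active for this vcpu */
	if (is_cove_vcpu(vcpu)) {
		/* The host cannot write the TVM's IMSIC; ask the TSM instead */
		ret = kvm_riscv_cove_vcpu_inject_interrupt(vcpu, iid);
	} else {
		writel(iid, imsic->vsfile_va + IMSIC_MMIO_SETIPNUM_LE);
	}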
Co-developed-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 6 +-
arch/riscv/kvm/aia.c | 84 +++++++++++++++++---
arch/riscv/kvm/aia_device.c | 41 +++++++---
arch/riscv/kvm/aia_imsic.c | 127 +++++++++++++++++++++---------
4 files changed, 195 insertions(+), 63 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index 74bad2f..4367281 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -61,7 +61,7 @@ struct kvm_riscv_cove_page {
unsigned long gpa;
};
-struct imsic_tee_state {
+struct imsic_cove_state {
bool bind_required;
bool bound;
int vsfile_hgei;
@@ -73,7 +73,7 @@ struct kvm_cove_tvm_vcpu_context {
struct kvm_riscv_cove_page vcpu_state;
/* Per VCPU imsic state */
- struct imsic_tee_state imsic;
+ struct imsic_cove_state imsic;
};
struct kvm_cove_tvm_context {
@@ -181,7 +181,7 @@ static inline int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm,
}
static inline int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu,
gpa_t gpa, unsigned long hva) {return -1; }
-/* AIA related TEE functions */
+/* TVM interrupt management via AIA functions */
static inline int kvm_riscv_cove_aia_init(struct kvm *kvm) { return -1; }
static inline int kvm_riscv_cove_vcpu_inject_interrupt(struct kvm_vcpu *vcpu,
unsigned long iid) { return -1; }
diff --git a/arch/riscv/kvm/aia.c b/arch/riscv/kvm/aia.c
index e3da661..88b91b5 100644
--- a/arch/riscv/kvm/aia.c
+++ b/arch/riscv/kvm/aia.c
@@ -20,6 +20,8 @@
struct aia_hgei_control {
raw_spinlock_t lock;
unsigned long free_bitmap;
+ /* Tracks if a hgei is converted to confidential mode */
+ unsigned long nconf_bitmap;
struct kvm_vcpu *owners[BITS_PER_LONG];
};
static DEFINE_PER_CPU(struct aia_hgei_control, aia_hgei);
@@ -391,34 +393,96 @@ int kvm_riscv_vcpu_aia_rmw_ireg(struct kvm_vcpu *vcpu, unsigned int csr_num,
int kvm_riscv_aia_alloc_hgei(int cpu, struct kvm_vcpu *owner,
void __iomem **hgei_va, phys_addr_t *hgei_pa)
{
- int ret = -ENOENT;
- unsigned long flags;
+ int ret = -ENOENT, rc;
+ bool reclaim_needed = false;
+ unsigned long flags, tmp_bitmap;
const struct imsic_local_config *lc;
struct aia_hgei_control *hgctrl = per_cpu_ptr(&aia_hgei, cpu);
+	phys_addr_t imsic_hgei_pa = 0;
if (!kvm_riscv_aia_available())
return -ENODEV;
if (!hgctrl)
return -ENODEV;
+ lc = imsic_get_local_config(cpu);
raw_spin_lock_irqsave(&hgctrl->lock, flags);
- if (hgctrl->free_bitmap) {
- ret = __ffs(hgctrl->free_bitmap);
- hgctrl->free_bitmap &= ~BIT(ret);
- hgctrl->owners[ret] = owner;
+ if (!hgctrl->free_bitmap) {
+ raw_spin_unlock_irqrestore(&hgctrl->lock, flags);
+ goto done;
+ }
+
+ if (!is_cove_vcpu(owner)) {
+ /* Find a free one that is not converted */
+ tmp_bitmap = hgctrl->free_bitmap & hgctrl->nconf_bitmap;
+ if (tmp_bitmap > 0)
+ ret = __ffs(tmp_bitmap);
+ else {
+ /* All free ones have been converted in the past. Reclaim one now */
+ ret = __ffs(hgctrl->free_bitmap);
+ reclaim_needed = true;
+ }
+ } else {
+ /* First try to find a free one that is already converted */
+		tmp_bitmap = hgctrl->free_bitmap & ~hgctrl->nconf_bitmap;
+ if (tmp_bitmap > 0)
+ ret = __ffs(tmp_bitmap);
+ else
+ ret = __ffs(hgctrl->free_bitmap);
}
+ hgctrl->free_bitmap &= ~BIT(ret);
+ hgctrl->owners[ret] = owner;
raw_spin_unlock_irqrestore(&hgctrl->lock, flags);
- lc = imsic_get_local_config(cpu);
if (lc && ret > 0) {
if (hgei_va)
*hgei_va = lc->msi_va + (ret * IMSIC_MMIO_PAGE_SZ);
- if (hgei_pa)
- *hgei_pa = lc->msi_pa + (ret * IMSIC_MMIO_PAGE_SZ);
+ imsic_hgei_pa = lc->msi_pa + (ret * IMSIC_MMIO_PAGE_SZ);
+
+ if (reclaim_needed) {
+ rc = kvm_riscv_cove_aia_claim_imsic(owner, imsic_hgei_pa);
+ if (rc) {
+ kvm_err("Reclaim of imsic pa %pa failed for vcpu %d pcpu %d ret %d\n",
+ &imsic_hgei_pa, owner->vcpu_idx, smp_processor_id(), ret);
+ kvm_riscv_aia_free_hgei(cpu, ret);
+ return rc;
+ }
+ }
+
+		/*
+		 * Update the nconf_bitmap here instead of above because the
+		 * vsfile should only be marked as non-confidential again after
+		 * the reclaim has succeeded.
+		 */
+ raw_spin_lock_irqsave(&hgctrl->lock, flags);
+ if (reclaim_needed)
+ set_bit(ret, &hgctrl->nconf_bitmap);
+ raw_spin_unlock_irqrestore(&hgctrl->lock, flags);
+
+ if (is_cove_vcpu(owner) && test_bit(ret, &hgctrl->nconf_bitmap)) {
+			/*
+			 * Convert the address to confidential mode.
+			 * This may need to send IPIs to issue a global fence, so
+			 * it must run outside the lock with interrupts enabled.
+			 */
+ rc = kvm_riscv_cove_aia_convert_imsic(owner, imsic_hgei_pa);
+
+ if (rc) {
+ kvm_riscv_aia_free_hgei(cpu, ret);
+ ret = rc;
+ } else {
+ raw_spin_lock_irqsave(&hgctrl->lock, flags);
+ clear_bit(ret, &hgctrl->nconf_bitmap);
+ raw_spin_unlock_irqrestore(&hgctrl->lock, flags);
+ }
+ }
}
+ if (hgei_pa)
+ *hgei_pa = imsic_hgei_pa;
+done:
return ret;
}
@@ -495,6 +559,8 @@ static int aia_hgei_init(void)
hgctrl->free_bitmap &= ~BIT(0);
} else
hgctrl->free_bitmap = 0;
+ /* By default all vsfiles are to be used for non-confidential mode */
+ hgctrl->nconf_bitmap = hgctrl->free_bitmap;
}
/* Find INTC irq domain */
diff --git a/arch/riscv/kvm/aia_device.c b/arch/riscv/kvm/aia_device.c
index 3556e82..ecf6734 100644
--- a/arch/riscv/kvm/aia_device.c
+++ b/arch/riscv/kvm/aia_device.c
@@ -11,6 +11,7 @@
#include <linux/irqchip/riscv-imsic.h>
#include <linux/kvm_host.h>
#include <linux/uaccess.h>
+#include <asm/kvm_cove.h>
static void unlock_vcpus(struct kvm *kvm, int vcpu_lock_idx)
{
@@ -103,6 +104,10 @@ static int aia_config(struct kvm *kvm, unsigned long type,
default:
return -EINVAL;
};
+ /* TVM must have a physical vs file */
+ if (is_cove_vm(kvm) && *nr != KVM_DEV_RISCV_AIA_MODE_HWACCEL)
+ return -EINVAL;
+
aia->mode = *nr;
} else
*nr = aia->mode;
@@ -264,18 +269,24 @@ static int aia_init(struct kvm *kvm)
if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
return -EBUSY;
- /* Number of sources should be less than or equals number of IDs */
- if (aia->nr_ids < aia->nr_sources)
- return -EINVAL;
+ if (!is_cove_vm(kvm)) {
+		/* Number of sources should be less than or equal to the number of IDs */
+ if (aia->nr_ids < aia->nr_sources)
+ return -EINVAL;
+		/* APLIC base is required for a non-zero number of sources only for non-TVMs */
+ if (aia->nr_sources && aia->aplic_addr == KVM_RISCV_AIA_UNDEF_ADDR)
+ return -EINVAL;
- /* APLIC base is required for non-zero number of sources */
- if (aia->nr_sources && aia->aplic_addr == KVM_RISCV_AIA_UNDEF_ADDR)
- return -EINVAL;
+ /* Initialize APLIC */
+ ret = kvm_riscv_aia_aplic_init(kvm);
+ if (ret)
+ return ret;
- /* Initialze APLIC */
- ret = kvm_riscv_aia_aplic_init(kvm);
- if (ret)
- return ret;
+ } else {
+ ret = kvm_riscv_cove_aia_init(kvm);
+ if (ret)
+ return ret;
+ }
/* Iterate over each VCPU */
kvm_for_each_vcpu(idx, vcpu, kvm) {
@@ -650,8 +661,14 @@ void kvm_riscv_aia_init_vm(struct kvm *kvm)
*/
/* Initialize default values in AIA global context */
- aia->mode = (kvm_riscv_aia_nr_hgei) ?
- KVM_DEV_RISCV_AIA_MODE_AUTO : KVM_DEV_RISCV_AIA_MODE_EMUL;
+ if (is_cove_vm(kvm)) {
+ if (!kvm_riscv_aia_nr_hgei)
+ return;
+ aia->mode = KVM_DEV_RISCV_AIA_MODE_HWACCEL;
+ } else {
+ aia->mode = (kvm_riscv_aia_nr_hgei) ?
+ KVM_DEV_RISCV_AIA_MODE_AUTO : KVM_DEV_RISCV_AIA_MODE_EMUL;
+ }
aia->nr_ids = kvm_riscv_aia_max_ids - 1;
aia->nr_sources = 0;
aia->nr_group_bits = 0;
diff --git a/arch/riscv/kvm/aia_imsic.c b/arch/riscv/kvm/aia_imsic.c
index 419c98d..8db1e29 100644
--- a/arch/riscv/kvm/aia_imsic.c
+++ b/arch/riscv/kvm/aia_imsic.c
@@ -15,6 +15,7 @@
#include <linux/swab.h>
#include <kvm/iodev.h>
#include <asm/csr.h>
+#include <asm/kvm_cove.h>
#define IMSIC_MAX_EIX (IMSIC_MAX_ID / BITS_PER_TYPE(u64))
@@ -583,7 +584,7 @@ static void imsic_vsfile_local_update(int vsfile_hgei, u32 nr_eix,
csr_write(CSR_VSISELECT, old_vsiselect);
}
-static void imsic_vsfile_cleanup(struct imsic *imsic)
+static void imsic_vsfile_cleanup(struct kvm_vcpu *vcpu, struct imsic *imsic)
{
int old_vsfile_hgei, old_vsfile_cpu;
unsigned long flags;
@@ -604,8 +605,12 @@ static void imsic_vsfile_cleanup(struct imsic *imsic)
memset(imsic->swfile, 0, sizeof(*imsic->swfile));
- if (old_vsfile_cpu >= 0)
+ if (old_vsfile_cpu >= 0) {
+ if (is_cove_vcpu(vcpu))
+ kvm_riscv_cove_vcpu_imsic_unbind(vcpu, old_vsfile_cpu);
+
kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei);
+ }
}
static void imsic_swfile_extirq_update(struct kvm_vcpu *vcpu)
@@ -688,27 +693,30 @@ void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu)
* the old IMSIC VS-file so we first re-direct all interrupt
* producers.
*/
+ if (!is_cove_vcpu(vcpu)) {
+ /* Purge the G-stage mapping */
+ kvm_riscv_gstage_iounmap(vcpu->kvm,
+ vcpu->arch.aia_context.imsic_addr,
+ IMSIC_MMIO_PAGE_SZ);
- /* Purge the G-stage mapping */
- kvm_riscv_gstage_iounmap(vcpu->kvm,
- vcpu->arch.aia_context.imsic_addr,
- IMSIC_MMIO_PAGE_SZ);
-
- /* TODO: Purge the IOMMU mapping ??? */
+ /* TODO: Purge the IOMMU mapping ??? */
- /*
- * At this point, all interrupt producers have been re-directed
- * to somewhere else so we move register state from the old IMSIC
- * VS-file to the IMSIC SW-file.
- */
+ /*
+ * At this point, all interrupt producers have been re-directed
+ * to somewhere else so we move register state from the old IMSIC
+ * VS-file to the IMSIC SW-file.
+ */
- /* Read and clear register state from old IMSIC VS-file */
- memset(&tmrif, 0, sizeof(tmrif));
- imsic_vsfile_read(old_vsfile_hgei, old_vsfile_cpu, imsic->nr_hw_eix,
- true, &tmrif);
+ /* Read and clear register state from old IMSIC VS-file */
+ memset(&tmrif, 0, sizeof(tmrif));
+ imsic_vsfile_read(old_vsfile_hgei, old_vsfile_cpu, imsic->nr_hw_eix,
+ true, &tmrif);
- /* Update register state in IMSIC SW-file */
- imsic_swfile_update(vcpu, &tmrif);
+ /* Update register state in IMSIC SW-file */
+ imsic_swfile_update(vcpu, &tmrif);
+ } else {
+ kvm_riscv_cove_vcpu_imsic_unbind(vcpu, old_vsfile_cpu);
+ }
/* Free-up old IMSIC VS-file */
kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei);
@@ -747,7 +755,7 @@ int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
/* For HW acceleration mode, we can't continue */
if (kvm->arch.aia.mode == KVM_DEV_RISCV_AIA_MODE_HWACCEL) {
run->fail_entry.hardware_entry_failure_reason =
- CSR_HSTATUS;
+ KVM_EXIT_FAIL_ENTRY_IMSIC_FILE_UNAVAILABLE;
run->fail_entry.cpu = vcpu->cpu;
run->exit_reason = KVM_EXIT_FAIL_ENTRY;
return 0;
@@ -762,22 +770,24 @@ int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
}
new_vsfile_hgei = ret;
- /*
- * At this point, all interrupt producers are still using
- * to the old IMSIC VS-file so we first move all interrupt
- * producers to the new IMSIC VS-file.
- */
-
- /* Zero-out new IMSIC VS-file */
- imsic_vsfile_local_clear(new_vsfile_hgei, imsic->nr_hw_eix);
-
- /* Update G-stage mapping for the new IMSIC VS-file */
- ret = kvm_riscv_gstage_ioremap(kvm, vcpu->arch.aia_context.imsic_addr,
- new_vsfile_pa, IMSIC_MMIO_PAGE_SZ,
- true, true);
- if (ret)
- goto fail_free_vsfile_hgei;
-
+ /* TSM only maintains the gstage mapping. Skip vsfile updates & ioremap */
+ if (!is_cove_vcpu(vcpu)) {
+ /*
+ * At this point, all interrupt producers are still using
+ * to the old IMSIC VS-file so we first move all interrupt
+ * producers to the new IMSIC VS-file.
+ */
+
+ /* Zero-out new IMSIC VS-file */
+ imsic_vsfile_local_clear(new_vsfile_hgei, imsic->nr_hw_eix);
+
+ /* Update G-stage mapping for the new IMSIC VS-file */
+ ret = kvm_riscv_gstage_ioremap(kvm, vcpu->arch.aia_context.imsic_addr,
+ new_vsfile_pa, IMSIC_MMIO_PAGE_SZ,
+ true, true);
+ if (ret)
+ goto fail_free_vsfile_hgei;
+ }
/* TODO: Update the IOMMU mapping ??? */
/* Update new IMSIC VS-file details in IMSIC context */
@@ -788,12 +798,32 @@ int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
imsic->vsfile_pa = new_vsfile_pa;
write_unlock_irqrestore(&imsic->vsfile_lock, flags);
+ /* Now bind the new vsfile for the TVMs */
+ if (is_cove_vcpu(vcpu) && vcpu->arch.tc) {
+ vcpu->arch.tc->imsic.vsfile_hgei = new_vsfile_hgei;
+ if (old_vsfile_cpu >= 0) {
+ if (vcpu->arch.tc->imsic.bound) {
+ ret = kvm_riscv_cove_vcpu_imsic_rebind(vcpu, old_vsfile_cpu);
+ if (ret) {
+ kvm_err("imsic rebind failed for vcpu %d ret %d\n",
+ vcpu->vcpu_idx, ret);
+ goto fail_free_vsfile_hgei;
+ }
+ }
+ kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei);
+ } else {
+ /* Bind if it is not a migration case */
+ vcpu->arch.tc->imsic.bind_required = true;
+ }
+ /* Skip the oldvsfile and swfile update process as it is managed by TSM */
+ goto done;
+ }
+
/*
* At this point, all interrupt producers have been moved
* to the new IMSIC VS-file so we move register state from
* the old IMSIC VS/SW-file to the new IMSIC VS-file.
*/
-
memset(&tmrif, 0, sizeof(tmrif));
if (old_vsfile_cpu >= 0) {
/* Read and clear register state from old IMSIC VS-file */
@@ -946,6 +976,7 @@ int kvm_riscv_vcpu_aia_imsic_inject(struct kvm_vcpu *vcpu,
unsigned long flags;
struct imsic_mrif_eix *eix;
struct imsic *imsic = vcpu->arch.aia_context.imsic_state;
+ int ret;
/* We only emulate one IMSIC MMIO page for each Guest VCPU */
if (!imsic || !iid || guest_index ||
@@ -960,7 +991,14 @@ int kvm_riscv_vcpu_aia_imsic_inject(struct kvm_vcpu *vcpu,
read_lock_irqsave(&imsic->vsfile_lock, flags);
if (imsic->vsfile_cpu >= 0) {
- writel(iid, imsic->vsfile_va + IMSIC_MMIO_SETIPNUM_LE);
+ /* TSM can only inject the external interrupt if it is allowed by the guest */
+ if (is_cove_vcpu(vcpu)) {
+ ret = kvm_riscv_cove_vcpu_inject_interrupt(vcpu, iid);
+ if (ret)
+ kvm_err("External interrupt %d injection failed\n", iid);
+ } else {
+ writel(iid, imsic->vsfile_va + IMSIC_MMIO_SETIPNUM_LE);
+ }
kvm_vcpu_kick(vcpu);
} else {
eix = &imsic->swfile->eix[iid / BITS_PER_TYPE(u64)];
@@ -1039,6 +1077,17 @@ int kvm_riscv_vcpu_aia_imsic_init(struct kvm_vcpu *vcpu)
imsic->swfile = page_to_virt(swfile_page);
imsic->swfile_pa = page_to_phys(swfile_page);
+	/* No need to setup iodev ops for TVMs. The swfile will also not be used
+	 * for TVMs. However, allocate it for now to avoid a different path
+	 * during free.
+	 */
+ if (is_cove_vcpu(vcpu)) {
+ ret = kvm_riscv_cove_vcpu_imsic_addr(vcpu);
+ if (ret)
+ goto fail_free_swfile;
+ return 0;
+ }
+
/* Setup IO device */
kvm_iodevice_init(&imsic->iodev, &imsic_iodoev_ops);
mutex_lock(&kvm->slots_lock);
@@ -1069,7 +1118,7 @@ void kvm_riscv_vcpu_aia_imsic_cleanup(struct kvm_vcpu *vcpu)
if (!imsic)
return;
- imsic_vsfile_cleanup(imsic);
+ imsic_vsfile_cleanup(vcpu, imsic);
mutex_lock(&kvm->slots_lock);
kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &imsic->iodev);
--
2.25.1
The TVM VCPU create function requires the vcpu id, which is only generated
after arch_create_vcpu returns. Thus, TVM vcpu init can not be
invoked from arch_create_vcpu. Invoke it in post create for now.
However, post create doesn't return any error, which is problematic
as vcpu creation can fail on the TSM side.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 65f87e1..005c7c9 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -218,6 +218,17 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
+ int rc;
+	/*
+	 * TODO: Ideally this should be invoked in vcpu_create, but vcpu_idx is
+	 * only allocated after create_vcpu returns. Find a better place to do it.
+	 */
+ if (unlikely(is_cove_vcpu(vcpu))) {
+ rc = kvm_riscv_cove_vcpu_init(vcpu);
+ if (rc)
+ pr_err("%s: cove vcpu init failed %d\n", __func__, rc);
+ }
+
/**
* vcpu with id 0 is the designated boot cpu.
* Keep all vcpus with non-zero id in power-off state so that
@@ -237,6 +248,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_riscv_vcpu_pmu_deinit(vcpu);
+ if (unlikely(is_cove_vcpu(vcpu)))
+ kvm_riscv_cove_vcpu_destroy(vcpu);
+
/* Free unused pages pre-allocated for G-stage page table mappings */
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
}
--
2.25.1
CoVE initialization depends on a few underlying conditions that differ
from normal VMs.
1. The RFENCE extension is no longer mandatory as the TEEH (now COVH) APIs
have their own set of fence APIs.
2. SBI NACL is mandatory for TEE VMs to share memory between the host
and the TSM.
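With this change the probe order in riscv_kvm_init() roughly becomes the
following (sketch for orientation only; see the diff below for the exact code):

	kvm_riscv_nacl_init();          /* NACL shmem; mandatory when CoVE is used */
	kvm_riscv_cove_init();          /* probe COVH/COVI and record availability */
	...
	/* RFENCE is only required when CoVE is not enabled */
	if (!kvm_riscv_cove_enabled() && sbi_probe_extension(SBI_EXT_RFENCE) <= 0)
		return -ENODEV;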
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/main.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 842b78d..a059414 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -102,15 +102,12 @@ static int __init riscv_kvm_init(void)
return -ENODEV;
}
- if (sbi_probe_extension(SBI_EXT_RFENCE) <= 0) {
- kvm_info("require SBI RFENCE extension\n");
- return -ENODEV;
- }
-
rc = kvm_riscv_nacl_init();
if (rc && rc != -ENODEV)
return rc;
+ kvm_riscv_cove_init();
+
kvm_riscv_gstage_mode_detect();
kvm_riscv_gstage_vmid_detect();
@@ -121,6 +118,15 @@ static int __init riscv_kvm_init(void)
return rc;
}
+	/* TVMs don't need the RFENCE extension as hardware IMSIC support is mandatory for TVMs.
+	 * TODO: This check should happen later if HW_ACCEL mode is not set, as RFENCE
+	 * should only be mandatory in that case.
+	 */
+ if (!kvm_riscv_cove_enabled() && sbi_probe_extension(SBI_EXT_RFENCE) <= 0) {
+ kvm_info("require SBI RFENCE extension\n");
+ return -ENODEV;
+ }
+
kvm_info("hypervisor extension available\n");
if (kvm_riscv_nacl_available()) {
--
2.25.1
A TVM can only be created upon explicit request from the VMM via
the VM type, and only if the CoVE SBI extensions are supported by the TSM.
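From the VMM side this looks roughly like the following (illustrative
userspace sketch; KVM_VM_TYPE_RISCV_COVE is the machine type used by this
series, the rest is ordinary KVM usage):

	int kvm_fd = open("/dev/kvm", O_RDWR);
	/* Explicitly request a confidential (CoVE) VM via the machine type */
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_VM_TYPE_RISCV_COVE);
	if (vm_fd < 0)
		/* Fails (EPERM) when CoVE is not enabled on the host */
		err(1, "KVM_CREATE_VM");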
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vm.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 1b59a8f..8a1460d 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -42,6 +42,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return r;
}
+ if (unlikely(type == KVM_VM_TYPE_RISCV_COVE)) {
+ if (!kvm_riscv_cove_enabled()) {
+ kvm_err("Unable to init CoVE VM because cove is not enabled\n");
+ return -EPERM;
+ }
+
+ r = kvm_riscv_cove_vm_init(kvm);
+ if (r)
+ return r;
+ kvm->arch.vm_type = type;
+ kvm_info("Trusted VM instance init successful\n");
+ }
+
kvm_riscv_aia_init_vm(kvm);
kvm_riscv_guest_timer_init(kvm);
@@ -54,6 +67,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_destroy_vcpus(kvm);
kvm_riscv_aia_destroy_vm(kvm);
+
+ if (unlikely(is_cove_vm(kvm)))
+ kvm_riscv_cove_vm_destroy(kvm);
}
int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irql,
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
The CoVE specification defines a separate SBI extension known as COVG
for the guest side interface. Add the definitions for that extension.
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/sbi.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index bbea922..e02ee75 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -34,6 +34,7 @@ enum sbi_ext_id {
SBI_EXT_NACL = 0x4E41434C,
SBI_EXT_COVH = 0x434F5648,
SBI_EXT_COVI = 0x434F5649,
+ SBI_EXT_COVG = 0x434F5647,
/* Experimentals extensions must lie within this range */
SBI_EXT_EXPERIMENTAL_START = 0x08000000,
@@ -439,6 +440,16 @@ struct sbi_cove_tvm_aia_params {
uint32_t guests_per_hart;
};
+/* SBI COVG extension data structures */
+enum sbi_ext_covg_fid {
+ SBI_EXT_COVG_ADD_MMIO_REGION,
+ SBI_EXT_COVG_REMOVE_MMIO_REGION,
+ SBI_EXT_COVG_SHARE_MEMORY,
+ SBI_EXT_COVG_UNSHARE_MEMORY,
+ SBI_EXT_COVG_ALLOW_EXT_INTERRUPT,
+ SBI_EXT_COVG_DENY_EXT_INTERRUPT,
+};
+
#define SBI_SPEC_VERSION_DEFAULT 0x1
#define SBI_SPEC_VERSION_MAJOR_SHIFT 24
#define SBI_SPEC_VERSION_MAJOR_MASK 0x7f
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
For TVM vcpus, the TSM uses shared memory to expose the GPRs of the trusted
VCPU. This change makes sure we use the shmem when doing MMIO emulation
for trusted VMs.
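For reference, the GPR accesses below replace direct guest_context reads and
writes with NACL shared-memory accesses. The layout used is one xlen-sized
slot per GPR starting at KVM_ARCH_GUEST_ZERO, so e.g. the store value (rs2)
of a trapped instruction is read as (illustrative excerpt of the pattern
used below):

	nshmem = nacl_shmem();
	data = nacl_shmem_gpr_read_cove(nshmem,
					REG_INDEX(insn, SH_RS2) * 8 + KVM_ARCH_GUEST_ZERO);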
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kvm/vcpu_insn.c | 98 +++++++++++++++++++++++++++++++++-----
1 file changed, 85 insertions(+), 13 deletions(-)
diff --git a/arch/riscv/kvm/vcpu_insn.c b/arch/riscv/kvm/vcpu_insn.c
index 331489f..56eeb86 100644
--- a/arch/riscv/kvm/vcpu_insn.c
+++ b/arch/riscv/kvm/vcpu_insn.c
@@ -7,6 +7,9 @@
#include <linux/bitops.h>
#include <linux/kvm_host.h>
#include <asm/kvm_cove.h>
+#include <asm/kvm_nacl.h>
+#include <asm/kvm_cove_sbi.h>
+#include <asm/asm-offsets.h>
#define INSN_OPCODE_MASK 0x007c
#define INSN_OPCODE_SHIFT 2
@@ -116,6 +119,10 @@
#define REG_OFFSET(insn, pos) \
(SHIFT_RIGHT((insn), (pos) - LOG_REGBYTES) & REG_MASK)
+#define REG_INDEX(insn, pos) \
+ ((SHIFT_RIGHT((insn), (pos)-LOG_REGBYTES) & REG_MASK) / \
+ (__riscv_xlen / 8))
+
#define REG_PTR(insn, pos, regs) \
((ulong *)((ulong)(regs) + REG_OFFSET(insn, pos)))
@@ -600,6 +607,7 @@ int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run,
int len = 0, insn_len = 0;
struct kvm_cpu_trap utrap = { 0 };
struct kvm_cpu_context *ct = &vcpu->arch.guest_context;
+ void *nshmem;
/* Determine trapped instruction */
if (htinst & 0x1) {
@@ -627,7 +635,15 @@ int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run,
insn_len = INSN_LEN(insn);
}
- data = GET_RS2(insn, &vcpu->arch.guest_context);
+ if (is_cove_vcpu(vcpu)) {
+ nshmem = nacl_shmem();
+ data = nacl_shmem_gpr_read_cove(nshmem,
+ REG_INDEX(insn, SH_RS2) * 8 +
+ KVM_ARCH_GUEST_ZERO);
+ } else {
+ data = GET_RS2(insn, &vcpu->arch.guest_context);
+ }
+
data8 = data16 = data32 = data64 = data;
if ((insn & INSN_MASK_SW) == INSN_MATCH_SW) {
@@ -643,19 +659,43 @@ int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run,
#ifdef CONFIG_64BIT
} else if ((insn & INSN_MASK_C_SD) == INSN_MATCH_C_SD) {
len = 8;
- data64 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ if (is_cove_vcpu(vcpu)) {
+ data64 = nacl_shmem_gpr_read_cove(
+ nshmem,
+ RVC_RS2S(insn) * 8 + KVM_ARCH_GUEST_ZERO);
+ } else {
+ data64 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ }
} else if ((insn & INSN_MASK_C_SDSP) == INSN_MATCH_C_SDSP &&
((insn >> SH_RD) & 0x1f)) {
len = 8;
- data64 = GET_RS2C(insn, &vcpu->arch.guest_context);
+ if (is_cove_vcpu(vcpu)) {
+ data64 = nacl_shmem_gpr_read_cove(
+ nshmem, REG_INDEX(insn, SH_RS2C) * 8 +
+ KVM_ARCH_GUEST_ZERO);
+ } else {
+ data64 = GET_RS2C(insn, &vcpu->arch.guest_context);
+ }
#endif
} else if ((insn & INSN_MASK_C_SW) == INSN_MATCH_C_SW) {
len = 4;
- data32 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ if (is_cove_vcpu(vcpu)) {
+ data32 = nacl_shmem_gpr_read_cove(
+ nshmem,
+ RVC_RS2S(insn) * 8 + KVM_ARCH_GUEST_ZERO);
+ } else {
+ data32 = GET_RS2S(insn, &vcpu->arch.guest_context);
+ }
} else if ((insn & INSN_MASK_C_SWSP) == INSN_MATCH_C_SWSP &&
((insn >> SH_RD) & 0x1f)) {
len = 4;
- data32 = GET_RS2C(insn, &vcpu->arch.guest_context);
+ if (is_cove_vcpu(vcpu)) {
+ data32 = nacl_shmem_gpr_read_cove(
+ nshmem, REG_INDEX(insn, SH_RS2C) * 8 +
+ KVM_ARCH_GUEST_ZERO);
+ } else {
+ data32 = GET_RS2C(insn, &vcpu->arch.guest_context);
+ }
} else {
return -EOPNOTSUPP;
}
@@ -725,6 +765,7 @@ int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
u64 data64;
ulong insn;
int len, shift;
+ void *nshmem;
if (vcpu->arch.mmio_decode.return_handled)
return 0;
@@ -738,26 +779,57 @@ int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
len = vcpu->arch.mmio_decode.len;
shift = vcpu->arch.mmio_decode.shift;
+ if (is_cove_vcpu(vcpu))
+ nshmem = nacl_shmem();
+
switch (len) {
case 1:
data8 = *((u8 *)run->mmio.data);
- SET_RD(insn, &vcpu->arch.guest_context,
- (ulong)data8 << shift >> shift);
+ if (is_cove_vcpu(vcpu)) {
+ nacl_shmem_gpr_write_cove(nshmem,
+ REG_INDEX(insn, SH_RD) * 8 +
+ KVM_ARCH_GUEST_ZERO,
+ (unsigned long)data8);
+ } else {
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data8 << shift >> shift);
+ }
break;
case 2:
data16 = *((u16 *)run->mmio.data);
- SET_RD(insn, &vcpu->arch.guest_context,
- (ulong)data16 << shift >> shift);
+ if (is_cove_vcpu(vcpu)) {
+ nacl_shmem_gpr_write_cove(nshmem,
+ REG_INDEX(insn, SH_RD) * 8 +
+ KVM_ARCH_GUEST_ZERO,
+ (unsigned long)data16);
+ } else {
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data16 << shift >> shift);
+ }
break;
case 4:
data32 = *((u32 *)run->mmio.data);
- SET_RD(insn, &vcpu->arch.guest_context,
- (ulong)data32 << shift >> shift);
+ if (is_cove_vcpu(vcpu)) {
+ nacl_shmem_gpr_write_cove(nshmem,
+ REG_INDEX(insn, SH_RD) * 8 +
+ KVM_ARCH_GUEST_ZERO,
+ (unsigned long)data32);
+ } else {
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data32 << shift >> shift);
+ }
break;
case 8:
data64 = *((u64 *)run->mmio.data);
- SET_RD(insn, &vcpu->arch.guest_context,
- (ulong)data64 << shift >> shift);
+ if (is_cove_vcpu(vcpu)) {
+ nacl_shmem_gpr_write_cove(nshmem,
+ REG_INDEX(insn, SH_RD) * 8 +
+ KVM_ARCH_GUEST_ZERO,
+ (unsigned long)data64);
+ } else {
+ SET_RD(insn, &vcpu->arch.guest_context,
+ (ulong)data64 << shift >> shift);
+ }
break;
default:
return -EOPNOTSUPP;
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
All the confidential computing solutions use the arch-specific
cc_platform_has() function to enable memory encryption/decryption.
Implement the same for RISC-V to support that as well.
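With this in place, generic code can gate CoVE-guest behaviour with the usual
confidential-computing query. A minimal consumer, matching the
force_dma_unencrypted() helper added elsewhere in this series (shown only for
illustration):

	/* DMA must go through shared (unencrypted) buffers for a CoVE guest */
	bool force_dma_unencrypted(struct device *dev)
	{
		return cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
	}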
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/Kconfig | 1 +
arch/riscv/cove/core.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 414cee1..2ca9e01 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -522,6 +522,7 @@ config RISCV_COVE_GUEST
default n
select SWIOTLB
select RISCV_MEM_ENCRYPT
+ select ARCH_HAS_CC_PLATFORM
help
Enables support for running TVMs on platforms supporting CoVE.
diff --git a/arch/riscv/cove/core.c b/arch/riscv/cove/core.c
index 7218fe7..582feb1c 100644
--- a/arch/riscv/cove/core.c
+++ b/arch/riscv/cove/core.c
@@ -21,6 +21,18 @@ bool is_cove_guest(void)
}
EXPORT_SYMBOL_GPL(is_cove_guest);
+bool cc_platform_has(enum cc_attr attr)
+{
+ switch (attr) {
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_MEM_ENCRYPT:
+ return is_cove_guest();
+ default:
+ return false;
+ }
+}
+EXPORT_SYMBOL_GPL(cc_platform_has);
+
void riscv_cove_sbi_init(void)
{
if (sbi_probe_extension(SBI_EXT_COVG) > 0)
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
Add host side support to allow memory sharing/unsharing.
The host needs to check if the page has already been assigned
(converted) to a TVM or not. If yes, that page needs to be
reclaimed before being shared.
For the remaining ECALLs the host doesn't really need to do anything,
so we just return in those cases.
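For clarity, the share path added in vcpu_sbi_covg.c below boils down to
roughly the following flow (pseudo-code summary; helper names other than the
sbi_covh_* calls are descriptive placeholders, not real functions):

	/* guest issued SBI_EXT_COVG_SHARE_MEMORY for gpa */
	if (page_already_converted(gpa)) {              /* found in tvmc->zero_pages */
		sbi_covh_tvm_invalidate_pages(...);     /* block TVM access */
		kvm_riscv_cove_tvm_fence(vcpu);         /* fence all TVM vcpus */
		sbi_covh_tvm_remove_pages(...);         /* unmap from the TVM */
		sbi_covh_tsm_reclaim_page(...);         /* give the page back to the host */
		/* move the page to tvmc->shared_pages */
	} else {
		pin_user_pages(hva, ...);               /* pin the backing page */
		/* add the page to tvmc->shared_pages; mapped lazily on the next fault */
	}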
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/include/asm/kvm_cove.h | 11 +-
arch/riscv/include/asm/kvm_cove_sbi.h | 4 +
arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kvm/Makefile | 2 +-
arch/riscv/kvm/cove.c | 48 +++++-
arch/riscv/kvm/cove_sbi.c | 18 ++
arch/riscv/kvm/vcpu_exit.c | 2 +-
arch/riscv/kvm/vcpu_sbi.c | 14 ++
arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++++++++++++++++++++++
10 files changed, 328 insertions(+), 7 deletions(-)
create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
diff --git a/arch/riscv/include/asm/kvm_cove.h b/arch/riscv/include/asm/kvm_cove.h
index 4367281..afaea7c 100644
--- a/arch/riscv/include/asm/kvm_cove.h
+++ b/arch/riscv/include/asm/kvm_cove.h
@@ -31,6 +31,9 @@
#define get_order_num_pages(n) (get_order(n << PAGE_SHIFT))
+#define get_gpr_index(goffset) \
+ ((goffset - KVM_ARCH_GUEST_ZERO) / (__riscv_xlen / 8))
+
/* Describe a confidential or shared memory region */
struct kvm_riscv_cove_mem_region {
unsigned long hva;
@@ -139,7 +142,8 @@ int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr);
int kvm_riscv_cove_vm_add_memreg(struct kvm *kvm, unsigned long gpa, unsigned long size);
-int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva);
+int kvm_riscv_cove_handle_pagefault(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long hva);
/* Fence related function */
int kvm_riscv_cove_tvm_fence(struct kvm_vcpu *vcpu);
@@ -179,8 +183,9 @@ static inline int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm,
{
return -1;
}
-static inline int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu,
- gpa_t gpa, unsigned long hva) {return -1; }
+static inline int kvm_riscv_cove_handle_pagefault(struct kvm_vcpu *vcpu,
+ gpa_t gpa, unsigned long hva) { return -1; }
+
/* TVM interrupt management via AIA functions */
static inline int kvm_riscv_cove_aia_init(struct kvm *kvm) { return -1; }
static inline int kvm_riscv_cove_vcpu_inject_interrupt(struct kvm_vcpu *vcpu,
diff --git a/arch/riscv/include/asm/kvm_cove_sbi.h b/arch/riscv/include/asm/kvm_cove_sbi.h
index b554a8d..c930265 100644
--- a/arch/riscv/include/asm/kvm_cove_sbi.h
+++ b/arch/riscv/include/asm/kvm_cove_sbi.h
@@ -59,6 +59,10 @@ int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid,
int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid);
+int sbi_covh_add_shared_pages(unsigned long tvmid, unsigned long page_addr_phys,
+ enum sbi_cove_page_type ptype,
+ unsigned long npages,
+ unsigned long tvm_base_page_addr);
int sbi_covh_tvm_invalidate_pages(unsigned long tvmid,
unsigned long tvm_base_page_addr,
unsigned long len);
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index b10c896..5b37a12 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -66,5 +66,8 @@ extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn;
extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;
+#ifdef CONFIG_RISCV_COVE_HOST
+extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_covg;
+#endif
#endif /* __RISCV_KVM_VCPU_SBI_H__ */
diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h
index ac3def0..2a24341 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -148,6 +148,7 @@ enum KVM_RISCV_SBI_EXT_ID {
KVM_RISCV_SBI_EXT_EXPERIMENTAL,
KVM_RISCV_SBI_EXT_VENDOR,
KVM_RISCV_SBI_EXT_DBCN,
+ KVM_RISCV_SBI_EXT_COVG,
KVM_RISCV_SBI_EXT_MAX,
};
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 8c91551..31f4dbd 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -31,4 +31,4 @@ kvm-y += aia.o
kvm-y += aia_device.o
kvm-y += aia_aplic.o
kvm-y += aia_imsic.o
-kvm-$(CONFIG_RISCV_COVE_HOST) += cove_sbi.o cove.o
+kvm-$(CONFIG_RISCV_COVE_HOST) += cove_sbi.o cove.o vcpu_sbi_covg.o
diff --git a/arch/riscv/kvm/cove.c b/arch/riscv/kvm/cove.c
index 154b01a..ba596b7 100644
--- a/arch/riscv/kvm/cove.c
+++ b/arch/riscv/kvm/cove.c
@@ -44,6 +44,18 @@ static void kvm_cove_local_fence(void *info)
kvm_err("local fence for TSM failed %d on cpu %d\n", rc, smp_processor_id());
}
+static void cove_delete_shared_pinned_page_list(struct kvm *kvm,
+ struct list_head *tpages)
+{
+ struct kvm_riscv_cove_page *tpage, *temp;
+
+ list_for_each_entry_safe(tpage, temp, tpages, link) {
+ unpin_user_pages_dirty_lock(&tpage->page, 1, true);
+ list_del(&tpage->link);
+ kfree(tpage);
+ }
+}
+
static void cove_delete_page_list(struct kvm *kvm, struct list_head *tpages, bool unpin)
{
struct kvm_riscv_cove_page *tpage, *temp;
@@ -425,7 +437,8 @@ int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
sbi_ext = kvm_vcpu_sbi_find_ext(vcpu, cp->a7);
if ((sbi_ext && sbi_ext->handler) && ((cp->a7 == SBI_EXT_DBCN) ||
- (cp->a7 == SBI_EXT_HSM) || (cp->a7 == SBI_EXT_SRST) || ext_is_01)) {
+ (cp->a7 == SBI_EXT_HSM) || (cp->a7 == SBI_EXT_SRST) ||
+ (cp->a7 == SBI_EXT_COVG) || ext_is_01)) {
ret = sbi_ext->handler(vcpu, run, &sbi_ret);
} else {
kvm_err("%s: SBI EXT %lx not supported for TVM\n", __func__, cp->a7);
@@ -451,7 +464,8 @@ int kvm_riscv_cove_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
return ret;
}
-int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hva)
+static int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long hva)
{
struct kvm_riscv_cove_page *tpage;
struct mm_struct *mm = current->mm;
@@ -517,6 +531,35 @@ int kvm_riscv_cove_gstage_map(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long hv
return rc;
}
+int kvm_riscv_cove_handle_pagefault(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long hva)
+{
+ struct kvm_cove_tvm_context *tvmc = vcpu->kvm->arch.tvmc;
+ struct kvm_riscv_cove_page *tpage, *next;
+ bool shared = false;
+
+ /* TODO: Implement a better approach to track regions to avoid
+ * traversing the whole list on each fault.
+ */
+ spin_lock(&vcpu->kvm->mmu_lock);
+ list_for_each_entry_safe(tpage, next, &tvmc->shared_pages, link) {
+ if (tpage->gpa == (gpa & PAGE_MASK)) {
+ shared = true;
+ break;
+ }
+ }
+ spin_unlock(&vcpu->kvm->mmu_lock);
+
+ if (shared) {
+ return sbi_covh_add_shared_pages(tvmc->tvm_guest_id,
+ page_to_phys(tpage->page),
+ SBI_COVE_PAGE_4K, 1,
+ gpa & PAGE_MASK);
+ }
+
+ return kvm_riscv_cove_gstage_map(vcpu, gpa, hva);
+}
+
void noinstr kvm_riscv_cove_vcpu_switchto(struct kvm_vcpu *vcpu, struct kvm_cpu_trap *trap)
{
int rc;
@@ -804,6 +847,7 @@ void kvm_riscv_cove_vm_destroy(struct kvm *kvm)
cove_delete_page_list(kvm, &tvmc->reclaim_pending_pages, false);
cove_delete_page_list(kvm, &tvmc->measured_pages, false);
cove_delete_page_list(kvm, &tvmc->zero_pages, true);
+ cove_delete_shared_pinned_page_list(kvm, &tvmc->shared_pages);
/* Reclaim and Free the pages for tvm state management */
rc = sbi_covh_tsm_reclaim_pages(page_to_phys(tvmc->tvm_state.page), tvmc->tvm_state.npages);
diff --git a/arch/riscv/kvm/cove_sbi.c b/arch/riscv/kvm/cove_sbi.c
index 01dc260..4759b49 100644
--- a/arch/riscv/kvm/cove_sbi.c
+++ b/arch/riscv/kvm/cove_sbi.c
@@ -380,6 +380,24 @@ int sbi_covh_add_zero_pages(unsigned long tvmid, unsigned long page_addr_phys,
return 0;
}
+int sbi_covh_add_shared_pages(unsigned long tvmid, unsigned long page_addr_phys,
+ enum sbi_cove_page_type ptype,
+ unsigned long npages,
+ unsigned long tvm_base_page_addr)
+{
+ struct sbiret ret;
+
+ if (!PAGE_ALIGNED(page_addr_phys))
+ return -EINVAL;
+
+ ret = sbi_ecall(SBI_EXT_COVH, SBI_EXT_COVH_TVM_ADD_SHARED_PAGES, tvmid,
+ page_addr_phys, ptype, npages, tvm_base_page_addr, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long vcpuid,
unsigned long vcpu_state_paddr)
{
diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
index c46e7f2..51eb434 100644
--- a/arch/riscv/kvm/vcpu_exit.c
+++ b/arch/riscv/kvm/vcpu_exit.c
@@ -43,7 +43,7 @@ static int gstage_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
if (is_cove_vcpu(vcpu)) {
/* CoVE doesn't care about PTE prots now. No need to compute the prots */
- ret = kvm_riscv_cove_gstage_map(vcpu, fault_addr, hva);
+ ret = kvm_riscv_cove_handle_pagefault(vcpu, fault_addr, hva);
} else {
ret = kvm_riscv_gstage_map(vcpu, memslot, fault_addr, hva,
(trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index d2f43bc..8bc7d73 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -13,6 +13,8 @@
#include <asm/kvm_nacl.h>
#include <asm/kvm_cove_sbi.h>
#include <asm/kvm_vcpu_sbi.h>
+#include <asm/asm-offsets.h>
+#include <asm/kvm_cove.h>
#ifndef CONFIG_RISCV_SBI_V01
static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_v01 = {
@@ -32,6 +34,14 @@ static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_pmu = {
};
#endif
+#ifndef CONFIG_RISCV_COVE_HOST
+static const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_covg = {
+ .extid_start = -1UL,
+ .extid_end = -1UL,
+ .handler = NULL,
+};
+#endif
+
struct kvm_riscv_sbi_extension_entry {
enum KVM_RISCV_SBI_EXT_ID dis_idx;
const struct kvm_vcpu_sbi_extension *ext_ptr;
@@ -82,6 +92,10 @@ static const struct kvm_riscv_sbi_extension_entry sbi_ext[] = {
.dis_idx = KVM_RISCV_SBI_EXT_VENDOR,
.ext_ptr = &vcpu_sbi_ext_vendor,
},
+ {
+ .dis_idx = KVM_RISCV_SBI_EXT_COVG,
+ .ext_ptr = &vcpu_sbi_ext_covg,
+ },
};
void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run)
diff --git a/arch/riscv/kvm/vcpu_sbi_covg.c b/arch/riscv/kvm/vcpu_sbi_covg.c
new file mode 100644
index 0000000..44a3b06
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_sbi_covg.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/kvm_host.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/spinlock.h>
+#include <asm/csr.h>
+#include <asm/sbi.h>
+#include <asm/kvm_vcpu_sbi.h>
+#include <asm/kvm_cove.h>
+#include <asm/kvm_cove_sbi.h>
+
+static int cove_share_converted_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ struct kvm_riscv_cove_page *tpage)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+ int rc;
+
+ rc = sbi_covh_tvm_invalidate_pages(tvmc->tvm_guest_id, gpa, PAGE_SIZE);
+ if (rc)
+ return rc;
+
+ rc = kvm_riscv_cove_tvm_fence(vcpu);
+ if (rc)
+ goto err;
+
+ rc = sbi_covh_tvm_remove_pages(tvmc->tvm_guest_id, gpa, PAGE_SIZE);
+ if (rc)
+ goto err;
+
+ rc = sbi_covh_tsm_reclaim_page(page_to_phys(tpage->page));
+ if (rc)
+ return rc;
+
+ spin_lock(&kvm->mmu_lock);
+ list_del(&tpage->link);
+ list_add(&tpage->link, &tvmc->shared_pages);
+ spin_unlock(&kvm->mmu_lock);
+
+ return 0;
+
+err:
+ sbi_covh_tvm_validate_pages(tvmc->tvm_guest_id, gpa, PAGE_SIZE);
+
+ return rc;
+}
+
+static int cove_share_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long *sbi_err)
+{
+ unsigned long hva = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
+ struct kvm_cove_tvm_context *tvmc = vcpu->kvm->arch.tvmc;
+ struct mm_struct *mm = current->mm;
+ struct kvm_riscv_cove_page *tpage;
+ struct page *page;
+ int rc;
+
+ if (kvm_is_error_hva(hva)) {
+ /* Address is out of the guest ram memory region. */
+ *sbi_err = SBI_ERR_INVALID_PARAM;
+ return 0;
+ }
+
+ tpage = kmalloc(sizeof(*tpage), GFP_KERNEL_ACCOUNT);
+ if (!tpage)
+ return -ENOMEM;
+
+ mmap_read_lock(mm);
+ rc = pin_user_pages(hva, 1, FOLL_LONGTERM | FOLL_WRITE, &page, NULL);
+ mmap_read_unlock(mm);
+
+ if (rc != 1) {
+ rc = -EINVAL;
+ goto free_tpage;
+ } else if (!PageSwapBacked(page)) {
+ rc = -EIO;
+ goto free_tpage;
+ }
+
+ tpage->page = page;
+ tpage->gpa = gpa;
+ tpage->hva = hva;
+ INIT_LIST_HEAD(&tpage->link);
+
+ spin_lock(&vcpu->kvm->mmu_lock);
+ list_add(&tpage->link, &tvmc->shared_pages);
+ spin_unlock(&vcpu->kvm->mmu_lock);
+
+ return 0;
+
+free_tpage:
+ kfree(tpage);
+
+ return rc;
+}
+
+static int kvm_riscv_cove_share_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long *sbi_err)
+{
+ struct kvm_cove_tvm_context *tvmc = vcpu->kvm->arch.tvmc;
+ struct kvm_riscv_cove_page *tpage, *next;
+ bool converted = false;
+
+ /*
+ * Check if the shared memory is part of the pages already assigned
+ * to the TVM.
+ *
+ * TODO: Implement a better approach to track regions to avoid
+ * traversing the whole list.
+ */
+ spin_lock(&vcpu->kvm->mmu_lock);
+ list_for_each_entry_safe(tpage, next, &tvmc->zero_pages, link) {
+ if (tpage->gpa == gpa) {
+ converted = true;
+ break;
+ }
+ }
+ spin_unlock(&vcpu->kvm->mmu_lock);
+
+ if (converted)
+ return cove_share_converted_page(vcpu, gpa, tpage);
+
+ return cove_share_page(vcpu, gpa, sbi_err);
+}
+
+static int kvm_riscv_cove_unshare_page(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ struct kvm_riscv_cove_page *tpage, *next;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
+ struct page *page = NULL;
+ int rc;
+
+ spin_lock(&kvm->mmu_lock);
+ list_for_each_entry_safe(tpage, next, &tvmc->shared_pages, link) {
+ if (tpage->gpa == gpa) {
+ page = tpage->page;
+ break;
+ }
+ }
+ spin_unlock(&kvm->mmu_lock);
+
+ if (unlikely(!page))
+ return -EINVAL;
+
+ rc = sbi_covh_tvm_invalidate_pages(tvmc->tvm_guest_id, gpa, PAGE_SIZE);
+ if (rc)
+ return rc;
+
+ rc = kvm_riscv_cove_tvm_fence(vcpu);
+ if (rc)
+ return rc;
+
+ rc = sbi_covh_tvm_remove_pages(tvmc->tvm_guest_id, gpa, PAGE_SIZE);
+ if (rc)
+ return rc;
+
+ unpin_user_pages_dirty_lock(&page, 1, true);
+
+ spin_lock(&kvm->mmu_lock);
+ list_del(&tpage->link);
+ spin_unlock(&kvm->mmu_lock);
+
+ kfree(tpage);
+
+ return 0;
+}
+
+static int kvm_sbi_ext_covg_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+ struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
+ uint32_t num_pages = cp->a1 / PAGE_SIZE;
+ unsigned long funcid = cp->a6;
+ unsigned long *err_val = &retdata->err_val;
+ uint32_t i;
+ int ret;
+
+ switch (funcid) {
+ case SBI_EXT_COVG_SHARE_MEMORY:
+ for (i = 0; i < num_pages; i++) {
+ ret = kvm_riscv_cove_share_page(
+ vcpu, cp->a0 + i * PAGE_SIZE, err_val);
+ if (ret || *err_val != SBI_SUCCESS)
+ return ret;
+ }
+ return 0;
+
+ case SBI_EXT_COVG_UNSHARE_MEMORY:
+ for (i = 0; i < num_pages; i++) {
+ ret = kvm_riscv_cove_unshare_page(
+ vcpu, cp->a0 + i * PAGE_SIZE);
+ if (ret)
+ return ret;
+ }
+ return 0;
+
+ case SBI_EXT_COVG_ADD_MMIO_REGION:
+ case SBI_EXT_COVG_REMOVE_MMIO_REGION:
+ case SBI_EXT_COVG_ALLOW_EXT_INTERRUPT:
+ case SBI_EXT_COVG_DENY_EXT_INTERRUPT:
+ /* We don't really need to do anything here for now. */
+ return 0;
+
+ default:
+ kvm_err("%s: Unsupported guest SBI %ld.\n", __func__, funcid);
+ retdata->err_val = SBI_ERR_NOT_SUPPORTED;
+ return -EOPNOTSUPP;
+ }
+}
+
+unsigned long kvm_sbi_ext_covg_probe(struct kvm_vcpu *vcpu)
+{
+ /* KVM COVG SBI handler is only meant for handling calls from TSM */
+ return 0;
+}
+
+const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_covg = {
+ .extid_start = SBI_EXT_COVG,
+ .extid_end = SBI_EXT_COVG,
+ .handler = kvm_sbi_ext_covg_handler,
+ .probe = kvm_sbi_ext_covg_probe,
+};
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
Devices like virtio use shared memory buffers to transfer
data. These buffers are part of the guest memory region.
For a CoVE guest this is not possible as the host can not access
guest memory.
This is solved by the VIRTIO_F_ACCESS_PLATFORM feature and SWIOTLB
bounce buffers. The guest only allows devices with the
VIRTIO_F_ACCESS_PLATFORM feature, which leads to the guest using the
DMA API and from there moving to the SWIOTLB bounce buffer, due to the
SWIOTLB_FORCE flag set for a TEE VM.
set_memory_encrypted and set_memory_decrypted sit in this allocation
path. If a buffer is being decrypted we mark it shared, and if it is
being encrypted we mark it unshared, using hypercalls.
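The resulting call chain for a DMA buffer in a CoVE guest is roughly the
following (illustrative; the names match the helpers added below):

	dma_direct_alloc()
	  -> force_dma_unencrypted(dev)              /* true for a CoVE guest */
	  -> set_memory_decrypted(vaddr, numpages)
	       -> sbi_covg_share_memory(__pa(vaddr), numpages * PAGE_SIZE)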
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/Kconfig | 7 ++++
arch/riscv/include/asm/mem_encrypt.h | 26 +++++++++++++
arch/riscv/mm/Makefile | 2 +
arch/riscv/mm/init.c | 17 ++++++++-
arch/riscv/mm/mem_encrypt.c | 57 ++++++++++++++++++++++++++++
5 files changed, 108 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/include/asm/mem_encrypt.h
create mode 100644 arch/riscv/mm/mem_encrypt.c
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 49c3006..414cee1 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -163,6 +163,11 @@ config ARCH_MMAP_RND_BITS_MAX
config ARCH_MMAP_RND_COMPAT_BITS_MAX
default 17
+config RISCV_MEM_ENCRYPT
+ select ARCH_HAS_MEM_ENCRYPT
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ def_bool n
+
# set if we run in machine mode, cleared if we run in supervisor mode
config RISCV_M_MODE
bool
@@ -515,6 +520,8 @@ config RISCV_COVE_HOST
config RISCV_COVE_GUEST
bool "Guest Support for Confidential VM Extension(CoVE)"
default n
+ select SWIOTLB
+ select RISCV_MEM_ENCRYPT
help
Enables support for running TVMs on platforms supporting CoVE.
diff --git a/arch/riscv/include/asm/mem_encrypt.h b/arch/riscv/include/asm/mem_encrypt.h
new file mode 100644
index 0000000..0dc3fe8
--- /dev/null
+++ b/arch/riscv/include/asm/mem_encrypt.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RISCV Memory Encryption Support.
+ *
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#ifndef __RISCV_MEM_ENCRYPT_H__
+#define __RISCV_MEM_ENCRYPT_H__
+
+#include <linux/init.h>
+
+struct device;
+
+bool force_dma_unencrypted(struct device *dev);
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
+int set_memory_encrypted(unsigned long addr, int numpages);
+int set_memory_decrypted(unsigned long addr, int numpages);
+
+#endif /* __RISCV_MEM_ENCRYPT_H__ */
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 2ac177c..1fd9b60 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -33,3 +33,5 @@ endif
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_RISCV_DMA_NONCOHERENT) += dma-noncoherent.o
+
+obj-$(CONFIG_RISCV_MEM_ENCRYPT) += mem_encrypt.o
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 478d676..b5edd8e 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -21,6 +21,7 @@
#include <linux/crash_dump.h>
#include <linux/hugetlb.h>
+#include <asm/cove.h>
#include <asm/fixmap.h>
#include <asm/tlbflush.h>
#include <asm/sections.h>
@@ -156,11 +157,25 @@ static void print_vm_layout(void) { }
void __init mem_init(void)
{
+ unsigned int flags = SWIOTLB_VERBOSE;
+ bool swiotlb_en;
+
+ if (is_cove_guest()) {
+ /* Since the guest memory is inaccessible to the host, devices
+ * always need to use the SWIOTLB buffer for DMA even if
+ * dma_capable() says otherwise.
+ */
+ flags |= SWIOTLB_FORCE;
+ swiotlb_en = true;
+ } else {
+ swiotlb_en = !!(max_pfn > PFN_DOWN(dma32_phys_limit));
+ }
+
#ifdef CONFIG_FLATMEM
BUG_ON(!mem_map);
#endif /* CONFIG_FLATMEM */
- swiotlb_init(max_pfn > PFN_DOWN(dma32_phys_limit), SWIOTLB_VERBOSE);
+ swiotlb_init(swiotlb_en, flags);
memblock_free_all();
print_vm_layout();
diff --git a/arch/riscv/mm/mem_encrypt.c b/arch/riscv/mm/mem_encrypt.c
new file mode 100644
index 0000000..8207a5c
--- /dev/null
+++ b/arch/riscv/mm/mem_encrypt.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#include <linux/dma-direct.h>
+#include <linux/swiotlb.h>
+#include <linux/cc_platform.h>
+#include <linux/mem_encrypt.h>
+#include <asm/covg_sbi.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ /*
+ * For authorized devices in trusted guest, all DMA must be to/from
+ * unencrypted addresses.
+ */
+ return cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
+}
+
+int set_memory_encrypted(unsigned long addr, int numpages)
+{
+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return 0;
+
+ if (!PAGE_ALIGNED(addr))
+ return -EINVAL;
+
+ return sbi_covg_unshare_memory(__pa(addr), numpages * PAGE_SIZE);
+}
+EXPORT_SYMBOL_GPL(set_memory_encrypted);
+
+int set_memory_decrypted(unsigned long addr, int numpages)
+{
+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return 0;
+
+ if (!PAGE_ALIGNED(addr))
+ return -EINVAL;
+
+ return sbi_covg_share_memory(__pa(addr), numpages * PAGE_SIZE);
+}
+EXPORT_SYMBOL_GPL(set_memory_decrypted);
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return;
+
+ /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+ swiotlb_update_mem_attributes();
+}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
The COVG extension defines the guest-side interface for a guest running
in CoVE. These functions allow a CoVE guest to share/unshare memory, ask
the host to trap and emulate MMIO regions, and allow/deny the injection
of interrupts from the host.
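As a usage illustration (a hedged sketch, not part of the patch): sharing
a statically allocated, page-aligned buffer with the host through the new
interface, roughly as the DBCN earlycon patch later in this series does.
The buffer and init function names are made up.

#include <linux/mm.h>
#include <asm/covg_sbi.h>

static char example_buf[PAGE_SIZE] __aligned(PAGE_SIZE);

static int __init example_share_buf(void)
{
        /* Ask the TSM to make this page accessible to the host. */
        return sbi_covg_share_memory(__pa(example_buf), PAGE_SIZE);
}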
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/cove/Makefile | 2 +-
arch/riscv/cove/cove_guest_sbi.c | 109 ++++++++++++++++++++++++++++++
arch/riscv/include/asm/covg_sbi.h | 38 +++++++++++
3 files changed, 148 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/cove/cove_guest_sbi.c
create mode 100644 arch/riscv/include/asm/covg_sbi.h
diff --git a/arch/riscv/cove/Makefile b/arch/riscv/cove/Makefile
index 03a0cac..a95043b 100644
--- a/arch/riscv/cove/Makefile
+++ b/arch/riscv/cove/Makefile
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_RISCV_COVE_GUEST) += core.o
+obj-$(CONFIG_RISCV_COVE_GUEST) += core.o cove_guest_sbi.o
diff --git a/arch/riscv/cove/cove_guest_sbi.c b/arch/riscv/cove/cove_guest_sbi.c
new file mode 100644
index 0000000..af22d5e
--- /dev/null
+++ b/arch/riscv/cove/cove_guest_sbi.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * COVG SBI extensions related helper functions.
+ *
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#include <linux/errno.h>
+#include <asm/sbi.h>
+#include <asm/covg_sbi.h>
+
+int sbi_covg_add_mmio_region(unsigned long addr, unsigned long len)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_ADD_MMIO_REGION, addr, len,
+ 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_remove_mmio_region(unsigned long addr, unsigned long len)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_REMOVE_MMIO_REGION, addr,
+ len, 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_share_memory(unsigned long addr, unsigned long len)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_SHARE_MEMORY, addr, len, 0,
+ 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_unshare_memory(unsigned long addr, unsigned long len)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_UNSHARE_MEMORY, addr, len, 0,
+ 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_allow_external_interrupt(unsigned long id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_ALLOW_EXT_INTERRUPT, id, 0,
+ 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_allow_all_external_interrupt(void)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_ALLOW_EXT_INTERRUPT, -1, 0,
+ 0, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_deny_external_interrupt(unsigned long id)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_DENY_EXT_INTERRUPT, id, 0, 0,
+ 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covg_deny_all_external_interrupt(void)
+{
+ struct sbiret ret;
+
+ ret = sbi_ecall(SBI_EXT_COVG, SBI_EXT_COVG_DENY_EXT_INTERRUPT, -1, 0, 0,
+ 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
diff --git a/arch/riscv/include/asm/covg_sbi.h b/arch/riscv/include/asm/covg_sbi.h
new file mode 100644
index 0000000..31283de
--- /dev/null
+++ b/arch/riscv/include/asm/covg_sbi.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * COVG SBI extension related header file.
+ *
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#ifndef __RISCV_COVG_SBI_H__
+#define __RISCV_COVG_SBI_H__
+
+#ifdef CONFIG_RISCV_COVE_GUEST
+
+int sbi_covg_add_mmio_region(unsigned long addr, unsigned long len);
+int sbi_covg_remove_mmio_region(unsigned long addr, unsigned long len);
+int sbi_covg_share_memory(unsigned long addr, unsigned long len);
+int sbi_covg_unshare_memory(unsigned long addr, unsigned long len);
+int sbi_covg_allow_external_interrupt(unsigned long id);
+int sbi_covg_allow_all_external_interrupt(void);
+int sbi_covg_deny_external_interrupt(unsigned long id);
+int sbi_covg_deny_all_external_interrupt(void);
+
+#else
+
+static inline int sbi_covg_add_mmio_region(unsigned long addr, unsigned long len) { return 0; }
+static inline int sbi_covg_remove_mmio_region(unsigned long addr, unsigned long len) { return 0; }
+static inline int sbi_covg_share_memory(unsigned long addr, unsigned long len) { return 0; }
+static inline int sbi_covg_unshare_memory(unsigned long addr, unsigned long len) { return 0; }
+static inline int sbi_covg_allow_external_interrupt(unsigned long id) { return 0; }
+static inline int sbi_covg_allow_all_external_interrupt(void) { return 0; }
+static inline int sbi_covg_deny_external_interrupt(unsigned long id) { return 0; }
+static inline int sbi_covg_deny_all_external_interrupt(void) { return 0; }
+
+#endif
+
+#endif /* __RISCV_COVG_SBI_H__ */
--
2.25.1
If two consoles of the same type are specified on the command line, the
kernel picks the first registered one instead of the preferred one.
A fix was proposed and NACK'ed due to a possible regression for other
users.
https://lore.kernel.org/all/Y+tziG0Uo5ey+Ocy@alley/
The HVC SBI console makes it impossible to use the virtio console, which
is preferred anyway. We could have disabled the HVC console for TVMs, but
the same kernel image must work on both the host and the guest, and there
are genuine reasons for requiring the HVC SBI console on the host.
Do not initialize the HVC console for TVMs so that the virtio console
can be used.
Signed-off-by: Atish Patra <[email protected]>
---
drivers/tty/hvc/hvc_riscv_sbi.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/tty/hvc/hvc_riscv_sbi.c b/drivers/tty/hvc/hvc_riscv_sbi.c
index 83cfe00..dee96c5 100644
--- a/drivers/tty/hvc/hvc_riscv_sbi.c
+++ b/drivers/tty/hvc/hvc_riscv_sbi.c
@@ -11,6 +11,7 @@
#include <linux/moduleparam.h>
#include <linux/types.h>
+#include <asm/cove.h>
#include <asm/sbi.h>
#include "hvc_console.h"
@@ -103,6 +104,10 @@ static int __init hvc_sbi_init(void)
{
int err;
+ /* Prefer virtio console as hvc console for guests */
+ if (is_cove_guest())
+ return 0;
+
if ((sbi_spec_version >= sbi_mk_version(1, 0)) &&
(sbi_probe_extension(SBI_EXT_DBCN) > 0)) {
err = PTR_ERR_OR_ZERO(hvc_alloc(0, 0, &hvc_sbi_dbcn_ops, 16));
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
SBI_EXT_COVH_TVM_INVALIDATE_PAGES: Invalidates the pages in the specified
range of guest physical address space. The host can then operate on the
range without the TVM's involvement. Any access from the TVM to the range
will result in a page fault, which is reported to the host.
SBI_EXT_COVH_TVM_VALIDATE_PAGES: Marks the invalidated pages in the
specified range of guest physical address space as present, so the TVM
can access the pages again.
SBI_EXT_COVH_TVM_PROMOTE_PAGES: Promotes a set of contiguous mappings
to the requested page size. This mainly exists to support huge pages.
SBI_EXT_COVH_TVM_DEMOTE_PAGES: Demotes a huge page mapping to a set
of contiguous mappings at the target size.
SBI_EXT_COVH_TVM_REMOVE_PAGES: Removes mappings from a TVM. The range
to be unmapped must already have been invalidated and fenced, and must
lie within a removable region of guest physical address space.
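A hedged host-side sketch (not part of the patch) of how these calls might
be combined to pull a range back from a TVM, following the ordering
described above. It assumes a wrapper such as sbi_covh_tvm_initiate_fence()
exists for SBI_EXT_COVH_TVM_INITIATE_FENCE; the helper name is made up.

#include <asm/kvm_cove_sbi.h>

static int cove_reclaim_range(unsigned long tvmid, unsigned long gpa,
                              unsigned long len)
{
        int rc;

        /* Make the range inaccessible to the TVM. */
        rc = sbi_covh_tvm_invalidate_pages(tvmid, gpa, len);
        if (rc)
                return rc;

        /*
         * Fence so that no vCPU still holds stale translations
         * (assumed wrapper for SBI_EXT_COVH_TVM_INITIATE_FENCE).
         */
        rc = sbi_covh_tvm_initiate_fence(tvmid);
        if (rc)
                return rc;

        /* The range is now invalidated and fenced, so it can be removed. */
        return sbi_covh_tvm_remove_pages(tvmid, gpa, len);
}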
Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Rajnesh Kanwal <[email protected]>
---
arch/riscv/include/asm/kvm_cove_sbi.h | 16 +++++++
arch/riscv/include/asm/sbi.h | 5 +++
arch/riscv/kvm/cove_sbi.c | 65 +++++++++++++++++++++++++++
3 files changed, 86 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_cove_sbi.h b/arch/riscv/include/asm/kvm_cove_sbi.h
index 0759f70..b554a8d 100644
--- a/arch/riscv/include/asm/kvm_cove_sbi.h
+++ b/arch/riscv/include/asm/kvm_cove_sbi.h
@@ -59,6 +59,22 @@ int sbi_covh_create_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid,
int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long tvm_vcpuid);
+int sbi_covh_tvm_invalidate_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len);
+int sbi_covh_tvm_validate_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len);
+int sbi_covh_tvm_promote_page(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ enum sbi_cove_page_type ptype);
+int sbi_covh_tvm_demote_page(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ enum sbi_cove_page_type ptype);
+int sbi_covh_tvm_remove_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len);
+
/* Functions related to CoVE Interrupt Management(COVI) Extension */
int sbi_covi_tvm_aia_init(unsigned long tvm_gid, struct sbi_cove_tvm_aia_params *tvm_aia_params);
int sbi_covi_set_vcpu_imsic_addr(unsigned long tvm_gid, unsigned long vcpu_id,
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index e02ee75..03b0cc8 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -369,6 +369,11 @@ enum sbi_ext_covh_fid {
SBI_EXT_COVH_TVM_CREATE_VCPU,
SBI_EXT_COVH_TVM_VCPU_RUN,
SBI_EXT_COVH_TVM_INITIATE_FENCE,
+ SBI_EXT_COVH_TVM_INVALIDATE_PAGES,
+ SBI_EXT_COVH_TVM_VALIDATE_PAGES,
+ SBI_EXT_COVH_TVM_PROMOTE_PAGE,
+ SBI_EXT_COVH_TVM_DEMOTE_PAGE,
+ SBI_EXT_COVH_TVM_REMOVE_PAGES,
};
enum sbi_ext_covi_fid {
diff --git a/arch/riscv/kvm/cove_sbi.c b/arch/riscv/kvm/cove_sbi.c
index a8901ac..01dc260 100644
--- a/arch/riscv/kvm/cove_sbi.c
+++ b/arch/riscv/kvm/cove_sbi.c
@@ -405,3 +405,68 @@ int sbi_covh_run_tvm_vcpu(unsigned long tvmid, unsigned long vcpuid)
return 0;
}
+
+int sbi_covh_tvm_invalidate_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len)
+{
+ struct sbiret ret = sbi_ecall(SBI_EXT_COVH,
+ SBI_EXT_COVH_TVM_INVALIDATE_PAGES, tvmid,
+ tvm_base_page_addr, len, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tvm_validate_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len)
+{
+ struct sbiret ret = sbi_ecall(SBI_EXT_COVH,
+ SBI_EXT_COVH_TVM_VALIDATE_PAGES, tvmid,
+ tvm_base_page_addr, len, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tvm_promote_page(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ enum sbi_cove_page_type ptype)
+{
+ struct sbiret ret = sbi_ecall(SBI_EXT_COVH,
+ SBI_EXT_COVH_TVM_PROMOTE_PAGE, tvmid,
+ tvm_base_page_addr, ptype, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tvm_demote_page(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ enum sbi_cove_page_type ptype)
+{
+ struct sbiret ret = sbi_ecall(SBI_EXT_COVH,
+ SBI_EXT_COVH_TVM_DEMOTE_PAGE, tvmid,
+ tvm_base_page_addr, ptype, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
+
+int sbi_covh_tvm_remove_pages(unsigned long tvmid,
+ unsigned long tvm_base_page_addr,
+ unsigned long len)
+{
+ struct sbiret ret = sbi_ecall(SBI_EXT_COVH,
+ SBI_EXT_COVH_TVM_REMOVE_PAGES, tvmid,
+ tvm_base_page_addr, len, 0, 0, 0);
+ if (ret.error)
+ return sbi_err_map_linux_errno(ret.error);
+
+ return 0;
+}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
Introduce a separate config option for the guest running in CoVE so that
it can be enabled separately if required. However, the default config
will enable both the CoVE host & guest options so that a single image
works as both host & guest. Also introduce a helper function to detect at
run time whether a guest is a TVM. The TSM only enables the CoVE guest
SBI extension (COVG) for TVMs.
Signed-off-by: Rajnesh Kanwal <[email protected]>
Co-developed-by: Atish Patra <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/Kbuild | 2 ++
arch/riscv/Kconfig | 6 ++++++
arch/riscv/cove/Makefile | 2 ++
arch/riscv/cove/core.c | 28 ++++++++++++++++++++++++++++
arch/riscv/include/asm/cove.h | 27 +++++++++++++++++++++++++++
arch/riscv/kernel/setup.c | 2 ++
6 files changed, 67 insertions(+)
create mode 100644 arch/riscv/cove/Makefile
create mode 100644 arch/riscv/cove/core.c
create mode 100644 arch/riscv/include/asm/cove.h
diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
index afa83e3..ecd661e 100644
--- a/arch/riscv/Kbuild
+++ b/arch/riscv/Kbuild
@@ -1,5 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_RISCV_COVE_GUEST) += cove/
+
obj-y += kernel/ mm/ net/
obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
obj-y += errata/
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8462941..49c3006 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -512,6 +512,12 @@ config RISCV_COVE_HOST
That means the platform should be capable of running TEE VM (TVM)
using KVM and TEE Security Manager (TSM).
+config RISCV_COVE_GUEST
+ bool "Guest Support for Confidential VM Extension(CoVE)"
+ default n
+ help
+ Enables support for running TVMs on platforms supporting CoVE.
+
endmenu # "Confidential VM Extension(CoVE) Support"
endmenu # "Platform type"
diff --git a/arch/riscv/cove/Makefile b/arch/riscv/cove/Makefile
new file mode 100644
index 0000000..03a0cac
--- /dev/null
+++ b/arch/riscv/cove/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_RISCV_COVE_GUEST) += core.o
diff --git a/arch/riscv/cove/core.c b/arch/riscv/cove/core.c
new file mode 100644
index 0000000..7218fe7
--- /dev/null
+++ b/arch/riscv/cove/core.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Confidential Computing Platform Capability checks
+ *
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#include <linux/export.h>
+#include <linux/cc_platform.h>
+#include <asm/sbi.h>
+#include <asm/cove.h>
+
+static bool is_tvm;
+
+bool is_cove_guest(void)
+{
+ return is_tvm;
+}
+EXPORT_SYMBOL_GPL(is_cove_guest);
+
+void riscv_cove_sbi_init(void)
+{
+ if (sbi_probe_extension(SBI_EXT_COVG) > 0)
+ is_tvm = true;
+}
diff --git a/arch/riscv/include/asm/cove.h b/arch/riscv/include/asm/cove.h
new file mode 100644
index 0000000..c4d609d
--- /dev/null
+++ b/arch/riscv/include/asm/cove.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * TVM helper functions
+ *
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#ifndef __RISCV_COVE_H__
+#define __RISCV_COVE_H__
+
+#ifdef CONFIG_RISCV_COVE_GUEST
+void riscv_cove_sbi_init(void);
+bool is_cove_guest(void);
+#else /* CONFIG_RISCV_COVE_GUEST */
+static inline bool is_cove_guest(void)
+{
+ return false;
+}
+static inline void riscv_cove_sbi_init(void)
+{
+}
+#endif /* CONFIG_RISCV_COVE_GUEST */
+
+#endif /* __RISCV_COVE_H__ */
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index 7b2b065..20b0280 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -35,6 +35,7 @@
#include <asm/thread_info.h>
#include <asm/kasan.h>
#include <asm/efi.h>
+#include <asm/cove.h>
#include "head.h"
@@ -272,6 +273,7 @@ void __init setup_arch(char **cmdline_p)
early_ioremap_setup();
sbi_init();
+ riscv_cove_sbi_init();
jump_label_init();
parse_early_param();
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
Ideally, the host must not inject any external interrupt until the guest
has explicitly allowed it. This should be done per interrupt ID, but for
now an allow-all call is added in init_IRQ(). In the future, this will be
modified to allow only specific interrupts; see the sketch below.
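A hedged sketch of what the per-interrupt policy could eventually look
like, using the COVG helpers from earlier in this series. The list of
interrupt IDs and the function name are purely illustrative.

#include <linux/kernel.h>
#include <asm/covg_sbi.h>

static int __init cove_allow_known_irqs(void)
{
        /* Hypothetical set of external interrupt IDs the guest trusts. */
        static const unsigned long allowed_ids[] = { 1, 5, 10 };
        int i, ret;

        /* Start from a deny-all state. */
        ret = sbi_covg_deny_all_external_interrupt();
        if (ret)
                return ret;

        for (i = 0; i < ARRAY_SIZE(allowed_ids); i++) {
                ret = sbi_covg_allow_external_interrupt(allowed_ids[i]);
                if (ret)
                        return ret;
        }

        return 0;
}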
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/kernel/irq.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/riscv/kernel/irq.c b/arch/riscv/kernel/irq.c
index eb9a68a..b5e0fd8 100644
--- a/arch/riscv/kernel/irq.c
+++ b/arch/riscv/kernel/irq.c
@@ -11,6 +11,8 @@
#include <linux/module.h>
#include <linux/seq_file.h>
#include <asm/sbi.h>
+#include <asm/covg_sbi.h>
+#include <asm/cove.h>
static struct fwnode_handle *(*__get_intc_node)(void);
@@ -36,8 +38,18 @@ int arch_show_interrupts(struct seq_file *p, int prec)
void __init init_IRQ(void)
{
+ int ret;
+
irqchip_init();
if (!handle_arch_irq)
panic("No interrupt controller found.");
sbi_ipi_init();
+
+ if (is_cove_guest()) {
+ /* FIXME: For now just allow all interrupts. */
+ ret = sbi_covg_allow_all_external_interrupt();
+
+ if (ret)
+ pr_err("Failed to allow external interrupts.\n");
+ }
}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
A guest running in CoVE must notify the host about its MMIO regions so
that the host can enable MMIO emulation.
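For illustration (hedged, not part of the patch), this is the flow a TVM
driver would see when mapping device registers through the hooks added
below and in the vmalloc hook patch; the physical address is made up.

#include <linux/io.h>

static void __iomem *example_map_device(void)
{
        /*
         * ioremap() -> ioremap_page_range()
         *           -> ioremap_phys_range_hook()
         *           -> sbi_covg_add_mmio_region(page-aligned addr, size)
         *
         * Accesses through the returned mapping then trap and are
         * forwarded to the host for MMIO emulation.  iounmap() removes
         * the region again via sbi_covg_remove_mmio_region().
         */
        return ioremap(0x10008000, 0x1000);
}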
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/mm/Makefile | 1 +
arch/riscv/mm/ioremap.c | 45 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)
create mode 100644 arch/riscv/mm/ioremap.c
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 1fd9b60..721b557 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -15,6 +15,7 @@ obj-y += cacheflush.o
obj-y += context.o
obj-y += pgtable.o
obj-y += pmem.o
+obj-y += ioremap.o
ifeq ($(CONFIG_MMU),y)
obj-$(CONFIG_SMP) += tlbflush.o
diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
new file mode 100644
index 0000000..0d4e026
--- /dev/null
+++ b/arch/riscv/mm/ioremap.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ * Rajnesh Kanwal <[email protected]>
+ */
+
+#include <linux/export.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/io.h>
+#include <asm/covg_sbi.h>
+#include <asm/cove.h>
+#include <asm-generic/io.h>
+
+void ioremap_phys_range_hook(phys_addr_t addr, size_t size, pgprot_t prot)
+{
+ unsigned long offset;
+
+ if (!is_cove_guest())
+ return;
+
+ /* Page-align address and size. */
+ offset = addr & (~PAGE_MASK);
+ addr -= offset;
+ size = PAGE_ALIGN(size + offset);
+
+ sbi_covg_add_mmio_region(addr, size);
+}
+
+void iounmap_phys_range_hook(phys_addr_t addr, size_t size)
+{
+ unsigned long offset;
+
+ if (!is_cove_guest())
+ return;
+
+ /* Page-align address and size. */
+ offset = addr & (~PAGE_MASK);
+ addr -= offset;
+ size = PAGE_ALIGN(size + offset);
+
+ sbi_covg_remove_mmio_region(addr, size);
+}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
A CoVE guest requires virtio devices to use the DMA API so that the
hypervisor can access guest memory as needed.
The VIRTIO_F_VERSION_1 and VIRTIO_F_ACCESS_PLATFORM features tell virtio
to use the DMA API. Enforce a check for these features and fail the
device probe if they have not been set when running as a TEE guest.
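A hedged sketch of the effect (simplified and approximated from the
generic virtio feature validation; not part of the patch): once the
restricted-memory-access callback is registered, the core rejects devices
lacking the required features at probe time, roughly like this.

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_anchor.h>

static int example_features_ok(struct virtio_device *vdev)
{
        /* Only enforced when restricted memory access is requested. */
        if (!virtio_check_mem_acc_cb(vdev))
                return 0;

        if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
            !virtio_has_feature(vdev, VIRTIO_F_ACCESS_PLATFORM))
                return -ENODEV; /* device probe fails for the TEE guest */

        return 0;
}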
Signed-off-by: Rajnesh Kanwal <[email protected]>
---
arch/riscv/mm/mem_encrypt.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/riscv/mm/mem_encrypt.c b/arch/riscv/mm/mem_encrypt.c
index 8207a5c..8523c50 100644
--- a/arch/riscv/mm/mem_encrypt.c
+++ b/arch/riscv/mm/mem_encrypt.c
@@ -10,6 +10,7 @@
#include <linux/swiotlb.h>
#include <linux/cc_platform.h>
#include <linux/mem_encrypt.h>
+#include <linux/virtio_anchor.h>
#include <asm/covg_sbi.h>
/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
@@ -54,4 +55,7 @@ void __init mem_encrypt_init(void)
/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
swiotlb_update_mem_attributes();
+
+ /* Set restricted memory access for virtio. */
+ virtio_set_mem_acc_cb(virtio_require_restricted_mem_acc);
}
--
2.25.1
From: Rajnesh Kanwal <[email protected]>
The early console buffer needs to be shared with the host for a CoVE guest.
Signed-off-by: Rajnesh Kanwal <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
drivers/tty/serial/earlycon-riscv-sbi.c | 51 ++++++++++++++++++++++++-
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/earlycon-riscv-sbi.c b/drivers/tty/serial/earlycon-riscv-sbi.c
index 311a4f8..9033cca 100644
--- a/drivers/tty/serial/earlycon-riscv-sbi.c
+++ b/drivers/tty/serial/earlycon-riscv-sbi.c
@@ -9,6 +9,14 @@
#include <linux/init.h>
#include <linux/serial_core.h>
#include <asm/sbi.h>
+#include <asm/cove.h>
+#include <asm/covg_sbi.h>
+#include <linux/memblock.h>
+
+#ifdef CONFIG_RISCV_COVE_GUEST
+#define DBCN_BOUNCE_BUF_SIZE (PAGE_SIZE)
+static char dbcn_buf[DBCN_BOUNCE_BUF_SIZE] __aligned(PAGE_SIZE);
+#endif
#ifdef CONFIG_RISCV_SBI_V01
static void sbi_putc(struct uart_port *port, unsigned char c)
@@ -24,6 +32,33 @@ static void sbi_0_1_console_write(struct console *con,
}
#endif
+#ifdef CONFIG_RISCV_COVE_GUEST
+static void sbi_dbcn_console_write_cove(struct console *con, const char *s,
+ unsigned int n)
+{
+ phys_addr_t pa = __pa(dbcn_buf);
+ unsigned int off = 0;
+
+ while (off < n) {
+ const unsigned int rem = n - off;
+ const unsigned int size =
+ rem > DBCN_BOUNCE_BUF_SIZE ? DBCN_BOUNCE_BUF_SIZE : rem;
+
+ memcpy(dbcn_buf, &s[off], size);
+
+ sbi_ecall(SBI_EXT_DBCN, SBI_EXT_DBCN_CONSOLE_WRITE,
+#ifdef CONFIG_32BIT
+ size, pa, (u64)pa >> 32,
+#else
+ size, pa, 0,
+#endif
+ 0, 0, 0);
+
+ off += size;
+ }
+}
+#endif
+
static void sbi_dbcn_console_write(struct console *con,
const char *s, unsigned n)
{
@@ -45,14 +80,26 @@ static int __init early_sbi_setup(struct earlycon_device *device,
/* TODO: Check for SBI debug console (DBCN) extension */
if ((sbi_spec_version >= sbi_mk_version(1, 0)) &&
- (sbi_probe_extension(SBI_EXT_DBCN) > 0))
+ (sbi_probe_extension(SBI_EXT_DBCN) > 0)) {
+#ifdef CONFIG_RISCV_COVE_GUEST
+ if (is_cove_guest()) {
+ ret = sbi_covg_share_memory(__pa(dbcn_buf),
+ DBCN_BOUNCE_BUF_SIZE);
+ if (ret)
+ return ret;
+
+ device->con->write = sbi_dbcn_console_write_cove;
+ return 0;
+ }
+#endif
device->con->write = sbi_dbcn_console_write;
- else
+ } else {
#ifdef CONFIG_RISCV_SBI_V01
device->con->write = sbi_0_1_console_write;
#else
ret = -ENODEV;
#endif
+ }
return ret;
}
--
2.25.1
On Thu, Apr 20, 2023 at 3:47 AM Atish Patra <[email protected]> wrote:
>
> This patch series adds the RISC-V Confidential VM Extension (CoVE) support to
> Linux kernel. The RISC-V CoVE specification introduces non-ISA, SBI APIs. These
> APIs enable a confidential environment in which a guest VM's data can be isolated
> from the host while the host retains control of guest VM management and platform
> resources(memory, CPU, I/O).
>
> This is a very early WIP work. We want to share this with the community to get any
> feedback on overall architecture and direction. Any other feedback is welcome too.
>
> The detailed CoVE architecture document can be found here [0]. It used to be
> called AP-TEE and renamed to CoVE recently to avoid overloading term of TEE in
> general. The specification is in the draft stages and is subjected to change based
> on the feedback from the community.
>
> The CoVE specification introduces 3 new SBI extensions.
> COVH - CoVE Host side interface
> COVG - CoVE Guest side interface
> COVI - CoVE Secure Interrupt management extension
>
> Some key acronyms introduced:
>
> TSM - TEE Security Manager
> TVM - TEE VM (aka Confidential VM)
>
> CoVE Architecture:
> ====================
> The CoVE APIs are designed to be implementation and architecture agnostic,
> allowing for different deployment models while retaining common host and guest
> kernel code. Two examples are shown in Figure 1 and Figure 2.
> As shown in both figures, the architecture introduces a new software component
> called the "TEE Security Manager" (TSM) that runs in HS mode. The TSM has minimal
> hw attested footprint on TCB as it is a passive component that doesn't support
> scheduling or timer interrupts. Both example deployment models provide memory
> isolation between the host and the TEE VM (TVM).
>
>
> Non secure world | Secure world |
> | |
> Non | |
> Virtualized | Virtualized | Virtualized Virtualized |
> Env | Env | Env Env |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> | (VMM) | | | | | | | | | |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | +----------+ | +----------+ +----------+ |
> | | | | | | | | | |
> | | | | | | TVM | | TVM | |
> | | | Guest | | | Guest | | Guest | | VS-Mode
> Syscalls | +----------+ | +----------+ +----------+ |
> | | | | |
> | SBI | SBI(COVG + COVI) |
> | | | | |
> +--------------------------+ | +---------------------------+ --------------
> | Host (Linux) | | | TSM (Salus) |
> +--------------------------+ | +---------------------------+
> | | | HS-Mode
> SBI (COVH + COVI) | SBI (COVH + COVI)
> | | |
> +-----------------------------------------------------------+ --------------
> | Firmware(OpenSBI) + TSM Driver | M-Mode
> +-----------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 1: Host in HS model
>
>
> The deployment model shown in Figure 1 runs the host in HS mode where it is peer
> to the TSM which also runs in HS mode. It requires another component known as TSM
> Driver running in higher privilege mode than host/TSM. It is responsible for switching
> the context between the host and the TSM. TSM driver also manages the platform
> specific hardware solution via confidential domain bit as described in the specification[0]
> to provide the required memory isolation.
>
>
> Non secure world | Secure world
> |
> Virtualized Env | Virtualized Virtualized |
> Env Env |
> +-------------------------+ | +----------+ +----------+ | ------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> +----------+ | +----------+ | +----------+ +----------+ | ------------
> | | | | |
> Syscalls SBI | | | |
> | | | | |
> +--------------------------+ | +-----------+ +-----------+ |
> | Host (Linux) | | | TVM Guest| | TVM Guest| | VS-Mode
> +--------------------------+ | +-----------+ +-----------+ |
> | | | | |
> SBI (COVH + COVI) | SBI SBI |
> | | (COVG + COVI) (COVG + COVI)|
> | | | | |
> +-----------------------------------------------------------+ --------------
> | TSM(Salus) | HS-Mode
> +-----------------------------------------------------------+ --------------
> |
> SBI
> |
> +---------------------------------------------------------+ --------------
> | Firmware(OpenSBI) | M-Mode
> +---------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 2: Host in VS model
>
>
> The deployment model shown in Figure 2 simplifies the context switch and memory isolation
> by running the host in VS mode as a guest of TSM. Thus, the memory isolation is
> achieved by gstage mapping by the TSM. We don't need any additional hardware confidential
> domain bit to provide memory isolation. The downside of this model the host has to run the
> non-confidential VMs in nested environment which may have lower performance (yet to be measured).
> The current implementation Salus(TSM) doesn't support full nested virtualization yet.
>
> The platform must have a RoT to provide attestation in either model.
> This patch series implements the APIs defined by CoVE. The current TSM implementation
> allows the host to run TVMs as shown in figure 2. We are working on deployment
> model 1 in parallel. We do not expect any significant changes in either host/guest side
> ABI due to that.
>
> Shared memory between the host & TSM:
> =====================================
> To accelerate the H-mode CSR/GPR access, CoVE also reuses the Nested Acceleration (NACL)
> SBI extension[1]. NACL defines a per physical cpu shared memory area that is allocated
> at the boot. It allows the host running in VS mode to access H-mode CSR/GPR easily
> without trapping into the TSM. The CoVE specification clearly defines the exact
> state of the shared memory with r/w permissions at every call.
>
> Secure Interrupt management:
> ===========================
> The CoVE specification relies on the MSI based interrupt scheme defined in Advanced Interrupt
> Architecture specification[2]. The COVI SBI extension adds functions to bind
> a guest interrupt file to a TVMs. After that, only TCB components (TSM, TVM, TSM driver)
> can modify that. The host can inject an interrupt via TSM only.
> The TVMs are also in complete control of which interrupts it can receive. By default,
> all interrupts are denied. In this proof-of-concept implementation, all the interrupts
> are allowed by the guest at boot time to keep it simple.
>
> Device I/O:
> ===========
> In order to support paravirt I/O devices, SWIOTLB bounce buffer must be used by the
> guest. As the host can not access confidential memory, this buffer memory
> must be shared with the host via share/unshare functions defined in COVG SBI extension.
> RISC-V implementation achieves this generalizing mem_encrypt_init() similar to TDX/SEV/CCA.
> That's why, the CoVE Guest is only allowed to use virtio devices with VIRTIO_F_ACCESS_PLATFORM
> and VIRTIO_F_VERSION_1 as they force virtio drivers to use the DMA API.
>
> MMIO emulation:
> ======================
> TVM can register regions of address space as MMIO regions to be emulated by
> the host. TSM provides explicit SBI functions i.e. SBI_EXT_COVG_[ADD/REMOVE]_MMIO_REGION
> to request/remove MMIO regions. Any reads or writes to those MMIO region after
> SBI_EXT_COVG_ADD_MMIO_REGION call are forwarded to the host for emulation.
>
> This series allows any ioremapped memory to be emulated as MMIO region with
> above APIs via arch hookups inspired from pKVM work. We are aware that this model
> doesn't address all the threat vectors. We have also implemented the device
> filtering/authorization approach adopted by TDX[4]. However, those patches are not
> part of this series as the base TDX patches are still under active development.
> RISC-V CoVE will also adapt the revamped device filtering work once it is accepted
> by the Linux community in the future.
>
> The direct assignment of devices are a work in progress and will be added in the future[4].
>
> VMM support:
> ============
> This series is only tested with kvmtool support. Other VMM support (qemu-kvm, crossvm/rust-vmm)
> will be added later.
>
> Test cases:
> ===========
> We are working on kvm selftest for CoVE. We will post them as soon as they are ready.
> We haven't started any work on kvm unit-tests as RISC-V doesn't have basic infrastructure
> to support that. Once the kvm uni-test infrastructure is in place, we will add
> support for CoVE as well.
>
> Open design questions:
> ======================
>
> 1. The current implementation has two separate configs for guest(CONFIG_RISCV_COVE_GUEST)
> and the host (RISCV_COVE_HOST). The default defconfig will enable both so that
> same unified image works as both host & guest. Most likely distro prefer this way
> to minimize the maintenance burden but some may want a minimal CoVE guest image
> that has only hardened drivers. In addition to that, Android runs a microdroid instance
> in the confidential guests. A separate config will help in those case. Please let us
> know if there is any concern with two configs.
>
> 2. Lazy gstage page allocation vs upfront allocation with page pool.
> Currently, all gstage mappings happen at runtime during the fault. This is expensive
> as we need to convert that page to confidential memory as well. A page pool framework
> may be a better choice which can hold all the confidential pages which can be
> pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
>
> 3. In order to allow both confidential VM and non-confidential VM, the series
> uses regular branching instead of static branches for CoVE VM specific cases through
> out KVM. That may cause a few more branch penalties while running regular VMs.
> The alternate option is to use function pointers for any function that needs to
> take a different path. As per my understanding, that would be worse than branches.
>
> Patch organization:
> ===================
> This series depends on quite a few RISC-V patches that are not upstream yet.
> Here are the dependencies.
>
> 1. RISC-V IPI improvement series
> 2. RISC-V AIA support series.
> 3. RISC-V NACL support series
>
> In this series, PATCH [0-5] are generic improvement and cleanup patches which
> can be merged independently.
>
> PATCH [6-26, 34-37] adds host side for CoVE.
> PATCH [27-33] adds the interrupt related changes.
> PATCH [34-49] Adds the guest side changes for CoVE.
>
> The TSM project is written in rust and can be found here:
> https://github.com/rivosinc/salus
>
> Running the stack
> ====================
>
> To run/test the stack, you would need the following components :
>
> 1) Qemu
> 2) Common Host & Guest Kernel
> 3) kvmtool
> 4) Host RootFS with KVMTOOL and Guest Kernel
> 5) Salus
>
> The detailed steps are available at[6]
>
> The Linux kernel patches are also available at [7] and the kvmtool patches
> are available at [8].
>
> TODOs
> =======
> As this is a very early work, the todo list is quite long :).
> Here are some of them (not in any specific order)
>
> 1. Support fd based private memory interface proposed in
> https://lkml.org/lkml/2022/1/18/395
> 2. Align with updated guest runtime device filtering approach.
> 3. IOMMU integration
> 4. Dedicated device assignment via TDSIP & SPDM[4]
> 5. Support huge pages
> 6. Page pool allocator to avoid convert/reclaim at every fault
> 7. Other VMM support (qemu-kvm, crossvm)
> 8. Complete the PoC for the deployment model 1 where host runs in HS mode
> 9. Attestation integration
> 10. Harden the interrupt allowed list
> 11. kvm self-tests support for CoVE
> 11. kvm unit-tests support for CoVE
> 12. Guest hardening
> 13. Port pKVM on RISC-V using CoVE
> 14. Any other ?
>
> Links
> ============
> [0] CoVE architecture Specification.
> https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-aptee-spec.pdf
I just noticed that this link is broken due to a recent PR merge. Here
is the updated link
https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-cove.pdf
Sorry for the noise.
> [1] https://lists.riscv.org/g/sig-hypervisors/message/260
> [2] https://github.com/riscv/riscv-aia/releases/download/1.0-RC2/riscv-interrupts-1.0-RC2.pdf
> [3] https://github.com/rivosinc/linux/tree/cove_integration_device_filtering1
> [4] https://github.com/intel/tdx/commits/guest-filter-upstream
> [5] https://lists.riscv.org/g/tech-ap-tee/message/83
> [6] https://github.com/rivosinc/cove/wiki/CoVE-KVM-RISCV64-on-QEMU
> [7] https://github.com/rivosinc/linux/commits/cove-integration
> [8] https://github.com/rivosinc/kvmtool/tree/cove-integration-03072023
>
> Atish Patra (33):
> RISC-V: KVM: Improve KVM error reporting to the user space
> RISC-V: KVM: Invoke aia_update with preempt disabled/irq enabled
> RISC-V: KVM: Add a helper function to get pgd size
> RISC-V: Add COVH SBI extensions definitions
> RISC-V: KVM: Implement COVH SBI extension
> RISC-V: KVM: Add a barebone CoVE implementation
> RISC-V: KVM: Add UABI to support static memory region attestation
> RISC-V: KVM: Add CoVE related nacl helpers
> RISC-V: KVM: Implement static memory region measurement
> RISC-V: KVM: Use the new VM IOCTL for measuring pages
> RISC-V: KVM: Exit to the user space for trap redirection
> RISC-V: KVM: Return early for gstage modifications
> RISC-V: KVM: Skip dirty logging updates for TVM
> RISC-V: KVM: Add a helper function to trigger fence ops
> RISC-V: KVM: Skip most VCPU requests for TVMs
> RISC-V : KVM: Skip vmid/hgatp management for TVMs
> RISC-V: KVM: Skip TLB management for TVMs
> RISC-V: KVM: Register memory regions as confidential for TVMs
> RISC-V: KVM: Add gstage mapping for TVMs
> RISC-V: KVM: Handle SBI call forward from the TSM
> RISC-V: KVM: Implement vcpu load/put functions for CoVE guests
> RISC-V: KVM: Wireup TVM world switch
> RISC-V: KVM: Skip HVIP update for TVMs
> RISC-V: KVM: Implement COVI SBI extension
> RISC-V: KVM: Add interrupt management functions for TVM
> RISC-V: KVM: Skip AIA CSR updates for TVMs
> RISC-V: KVM: Perform limited operations in hardware enable/disable
> RISC-V: KVM: Indicate no support user space emulated IRQCHIP
> RISC-V: KVM: Add AIA support for TVMs
> RISC-V: KVM: Hookup TVM VCPU init/destroy
> RISC-V: KVM: Initialize CoVE
> RISC-V: KVM: Add TVM init/destroy calls
> drivers/hvc: sbi: Disable HVC console for TVMs
>
> Rajnesh Kanwal (15):
> mm/vmalloc: Introduce arch hooks to notify ioremap/unmap changes
> RISC-V: KVM: Update timer functionality for TVMs.
> RISC-V: Add COVI extension definitions
> RISC-V: KVM: Read/write gprs from/to shmem in case of TVM VCPU.
> RISC-V: Add COVG SBI extension definitions
> RISC-V: Add CoVE guest config and helper functions
> RISC-V: Implement COVG SBI extension
> RISC-V: COVE: Add COVH invalidate, validate, promote, demote and
> remove APIs.
> RISC-V: KVM: Add host side support to handle COVG SBI calls.
> RISC-V: Allow host to inject any ext interrupt id to a CoVE guest.
> RISC-V: Add base memory encryption functions.
> RISC-V: Add cc_platform_has() for RISC-V for CoVE
> RISC-V: ioremap: Implement for arch specific ioremap hooks
> riscv/virtio: Have CoVE guests enforce restricted virtio memory
> access.
> RISC-V: Add shared bounce buffer to support DBCN for CoVE Guest.
>
> arch/riscv/Kbuild | 2 +
> arch/riscv/Kconfig | 27 +
> arch/riscv/cove/Makefile | 2 +
> arch/riscv/cove/core.c | 40 +
> arch/riscv/cove/cove_guest_sbi.c | 109 +++
> arch/riscv/include/asm/cove.h | 27 +
> arch/riscv/include/asm/covg_sbi.h | 38 +
> arch/riscv/include/asm/csr.h | 2 +
> arch/riscv/include/asm/kvm_cove.h | 206 +++++
> arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> arch/riscv/include/asm/kvm_host.h | 10 +-
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> arch/riscv/include/asm/mem_encrypt.h | 26 +
> arch/riscv/include/asm/sbi.h | 107 +++
> arch/riscv/include/uapi/asm/kvm.h | 17 +
> arch/riscv/kernel/irq.c | 12 +
> arch/riscv/kernel/setup.c | 2 +
> arch/riscv/kvm/Makefile | 1 +
> arch/riscv/kvm/aia.c | 101 ++-
> arch/riscv/kvm/aia_device.c | 41 +-
> arch/riscv/kvm/aia_imsic.c | 127 ++-
> arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> arch/riscv/kvm/main.c | 30 +-
> arch/riscv/kvm/mmu.c | 45 +-
> arch/riscv/kvm/tlb.c | 11 +-
> arch/riscv/kvm/vcpu.c | 69 +-
> arch/riscv/kvm/vcpu_exit.c | 34 +-
> arch/riscv/kvm/vcpu_insn.c | 115 ++-
> arch/riscv/kvm/vcpu_sbi.c | 16 +
> arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> arch/riscv/kvm/vcpu_timer.c | 26 +-
> arch/riscv/kvm/vm.c | 34 +-
> arch/riscv/kvm/vmid.c | 17 +-
> arch/riscv/mm/Makefile | 3 +
> arch/riscv/mm/init.c | 17 +-
> arch/riscv/mm/ioremap.c | 45 +
> arch/riscv/mm/mem_encrypt.c | 61 ++
> drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> include/uapi/linux/kvm.h | 8 +
> mm/vmalloc.c | 16 +
> 42 files changed, 3222 insertions(+), 109 deletions(-)
> create mode 100644 arch/riscv/cove/Makefile
> create mode 100644 arch/riscv/cove/core.c
> create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> create mode 100644 arch/riscv/include/asm/cove.h
> create mode 100644 arch/riscv/include/asm/covg_sbi.h
> create mode 100644 arch/riscv/include/asm/kvm_cove.h
> create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> create mode 100644 arch/riscv/kvm/cove.c
> create mode 100644 arch/riscv/kvm/cove_sbi.c
> create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> create mode 100644 arch/riscv/mm/ioremap.c
> create mode 100644 arch/riscv/mm/mem_encrypt.c
>
> --
> 2.25.1
>
--
Regards,
Atish
On Wed, Apr 19, 2023, Atish Patra wrote:
> +int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr)
> +{
> + struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
> + int rc = 0, idx, num_pages;
> + struct kvm_riscv_cove_mem_region *conf;
> + struct page *pinned_page, *conf_page;
> + struct kvm_riscv_cove_page *cpage;
> +
> + if (!tvmc)
> + return -EFAULT;
> +
> + if (tvmc->finalized_done) {
> + kvm_err("measured_mr pages can not be added after finalize\n");
> + return -EINVAL;
> + }
> +
> + num_pages = bytes_to_pages(mr->size);
> + conf = &tvmc->confidential_region;
> +
> + if (!IS_ALIGNED(mr->userspace_addr, PAGE_SIZE) ||
> + !IS_ALIGNED(mr->gpa, PAGE_SIZE) || !mr->size ||
> + !cove_is_within_region(conf->gpa, conf->npages << PAGE_SHIFT, mr->gpa, mr->size))
> + return -EINVAL;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + /*TODO: Iterate one page at a time as pinning multiple pages fail with unmapped panic
> + * with a virtual address range belonging to vmalloc region for some reason.
I've no idea what code you had, but I suspect the fact that vmalloc'd memory isn't
guaranteed to be physically contiguous is relevant to the panic.
> + */
> + while (num_pages) {
> + if (signal_pending(current)) {
> + rc = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();
> +
> + rc = get_user_pages_fast(mr->userspace_addr, 1, 0, &pinned_page);
> + if (rc < 0) {
> + kvm_err("Pinning the userpsace addr %lx failed\n", mr->userspace_addr);
> + break;
> + }
> +
> + /* Enough pages are not available to be pinned */
> + if (rc != 1) {
> + rc = -ENOMEM;
> + break;
> + }
> + conf_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!conf_page) {
> + rc = -ENOMEM;
> + break;
> + }
> +
> + rc = cove_convert_pages(page_to_phys(conf_page), 1, true);
> + if (rc)
> + break;
> +
> + /*TODO: Support other pages sizes */
> + rc = sbi_covh_add_measured_pages(tvmc->tvm_guest_id, page_to_phys(pinned_page),
> + page_to_phys(conf_page), SBI_COVE_PAGE_4K,
> + 1, mr->gpa);
> + if (rc)
> + break;
> +
> + /* Unpin the page now */
> + put_page(pinned_page);
> +
> + cpage = kmalloc(sizeof(*cpage), GFP_KERNEL_ACCOUNT);
> + if (!cpage) {
> + rc = -ENOMEM;
> + break;
> + }
> +
> + cpage->page = conf_page;
> + cpage->npages = 1;
> + cpage->gpa = mr->gpa;
> + cpage->hva = mr->userspace_addr;
Snapshotting the userspace address for the _source_ page can't possibly be useful.
> + cpage->is_mapped = true;
> + INIT_LIST_HEAD(&cpage->link);
> + list_add(&cpage->link, &tvmc->measured_pages);
> +
> + mr->userspace_addr += PAGE_SIZE;
> + mr->gpa += PAGE_SIZE;
> + num_pages--;
> + conf_page = NULL;
> +
> + continue;
> + }
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + if (rc < 0) {
> + /* We don't to need unpin pages as it is allocated by the hypervisor itself */
This comment makes no sense. The above code is doing all of the allocation and
pinning, which strongly suggests that KVM is the hypervisor. But this comment
implies that KVM is not the hypervisor.
And "pinned_page" is unpinned in the loop after the page is added+measured,
which looks to be the same model as TDX where "pinned_page" is the source and
"conf_page" is gifted to the hypervisor. But on failure, e.g. when allocating
"conf_page", that reference is not put.
> + cove_delete_page_list(kvm, &tvmc->measured_pages, false);
> + /* Free the last allocated page for which conversion/measurement failed */
> + kfree(conf_page);
Assuming my guesses about how the architecture works are correct, this is broken
if sbi_covh_add_measured_pages() fails. The page has already been gifted to the
TSM by cove_convert_pages(), but there is no call to sbi_covh_tsm_reclaim_pages(),
which I'm guessing is necessary to transition the page back to a state where it can
be safely used by the host.
On Wed, Apr 19, 2023, Atish Patra wrote:
> 2. Lazy gstage page allocation vs upfront allocation with page pool.
> Currently, all gstage mappings happen at runtime during the fault. This is expensive
> as we need to convert that page to confidential memory as well. A page pool framework
> may be a better choice which can hold all the confidential pages which can be
> pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
I'm sorry, what? Do y'all really not pay any attention to what is happening
outside of the RISC-V world?
We, where "we" is KVM x86 and ARM, with folks contributing from 5+ companines,
have been working on this problem for going on three *years*. And that's just
from the first public posting[1], there have been discussions about how to approach
this for even longer. There have been multiple related presentations at KVM Forum,
something like 4 or 5 just at KVM Forum 2022 alone.
Patch 1 says "This patch is based on pkvm patches", so clearly you are at least
aware that there is other work going on in this space.
At a very quick glance, this series is suffers from all of the same flaws that SNP,
TDX, and pKVM have encountered. E.g. assuming guest memory is backed by struct page
memory, relying on pinning to solve all problems (hint, it doesn't), and so on and
so forth.
And to make things worse, this series is riddled with bugs. E.g. patch 19 alone
manages to squeeze in multiple fatal bugs in five new lines of code: deadlock due
to not releasing mmap_lock on failure, failure to correctly handle MOVE, failure to
handle DELETE at all, failure to honor (or reject) READONLY, and probably several
others.
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 4b0f09e..63889d9 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -499,6 +499,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
+ if (is_cove_vm(kvm)) {
+ ret = kvm_riscv_cove_vm_add_memreg(kvm, base_gpa, size);
+ if (ret)
+ return ret;
+ }
/*
* A memory region could potentially cover multiple VMAs, and
* any holes between them, so iterate over all of them to find
I get that this is an RFC, but for a series of this size, operating in an area that
is under heavy development by multiple other architectures, to have a diffstat that
shows _zero_ changes to common KVM is simply unacceptable.
Please, go look at restrictedmem[2] and work on building CoVE support on top of
that. If the current proposal doesn't fit CoVE's needs, then we need to know _before_
all of that code gets merged.
[1] https://lore.kernel.org/linux-mm/[email protected]
[2] https://lkml.kernel.org/r/20221202061347.1070246-1-chao.p.peng%40linux.intel.com
> arch/riscv/Kbuild | 2 +
> arch/riscv/Kconfig | 27 +
> arch/riscv/cove/Makefile | 2 +
> arch/riscv/cove/core.c | 40 +
> arch/riscv/cove/cove_guest_sbi.c | 109 +++
> arch/riscv/include/asm/cove.h | 27 +
> arch/riscv/include/asm/covg_sbi.h | 38 +
> arch/riscv/include/asm/csr.h | 2 +
> arch/riscv/include/asm/kvm_cove.h | 206 +++++
> arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> arch/riscv/include/asm/kvm_host.h | 10 +-
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> arch/riscv/include/asm/mem_encrypt.h | 26 +
> arch/riscv/include/asm/sbi.h | 107 +++
> arch/riscv/include/uapi/asm/kvm.h | 17 +
> arch/riscv/kernel/irq.c | 12 +
> arch/riscv/kernel/setup.c | 2 +
> arch/riscv/kvm/Makefile | 1 +
> arch/riscv/kvm/aia.c | 101 ++-
> arch/riscv/kvm/aia_device.c | 41 +-
> arch/riscv/kvm/aia_imsic.c | 127 ++-
> arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> arch/riscv/kvm/main.c | 30 +-
> arch/riscv/kvm/mmu.c | 45 +-
> arch/riscv/kvm/tlb.c | 11 +-
> arch/riscv/kvm/vcpu.c | 69 +-
> arch/riscv/kvm/vcpu_exit.c | 34 +-
> arch/riscv/kvm/vcpu_insn.c | 115 ++-
> arch/riscv/kvm/vcpu_sbi.c | 16 +
> arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> arch/riscv/kvm/vcpu_timer.c | 26 +-
> arch/riscv/kvm/vm.c | 34 +-
> arch/riscv/kvm/vmid.c | 17 +-
> arch/riscv/mm/Makefile | 3 +
> arch/riscv/mm/init.c | 17 +-
> arch/riscv/mm/ioremap.c | 45 +
> arch/riscv/mm/mem_encrypt.c | 61 ++
> drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> include/uapi/linux/kvm.h | 8 +
> mm/vmalloc.c | 16 +
> 42 files changed, 3222 insertions(+), 109 deletions(-)
> create mode 100644 arch/riscv/cove/Makefile
> create mode 100644 arch/riscv/cove/core.c
> create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> create mode 100644 arch/riscv/include/asm/cove.h
> create mode 100644 arch/riscv/include/asm/covg_sbi.h
> create mode 100644 arch/riscv/include/asm/kvm_cove.h
> create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> create mode 100644 arch/riscv/kvm/cove.c
> create mode 100644 arch/riscv/kvm/cove_sbi.c
> create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> create mode 100644 arch/riscv/mm/ioremap.c
> create mode 100644 arch/riscv/mm/mem_encrypt.c
>
> --
> 2.25.1
>
On Thu, Apr 20, 2023 at 10:00 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 19, 2023, Atish Patra wrote:
> > 2. Lazy gstage page allocation vs upfront allocation with page pool.
> > Currently, all gstage mappings happen at runtime during the fault. This is expensive
> > as we need to convert that page to confidential memory as well. A page pool framework
> > may be a better choice which can hold all the confidential pages which can be
> > pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
>
> I'm sorry, what? Do y'all really not pay any attention to what is happening
> outside of the RISC-V world?
>
> We, where "we" is KVM x86 and ARM, with folks contributing from 5+ companines,
> have been working on this problem for going on three *years*. And that's just
> from the first public posting[1], there have been discussions about how to approach
> this for even longer. There have been multiple related presentations at KVM Forum,
> something like 4 or 5 just at KVM Forum 2022 alone.
>
Yes. We are following the restrictedmem effort and were reviewing v10 this
week. I did mention that in the first item of the TODO list. We are
planning to use the restrictedmem feature once it is closer to upstream
(which seems to be the case looking at v10).
Another reason is that this initial series is based on kvmtool only. We
are working on qemu-kvm right now but have some RISC-V specific
dependencies (interrupt controller stuff) which are not there yet.
As the restrictedmem patches are already available in qemu-kvm too, our
plan was to support CoVE in qemu-kvm first and work on restrictedmem
after that.
This item was just based on this RFC implementation, which uses lazy
gstage page allocation. The idea was to check if there is any interest
at all in this approach. I should have mentioned the restrictedmem plan
in this section as well. Sorry for the confusion.
Thanks for your suggestion. It seems we should just move directly to
restrictedmem asap.
> Patch 1 says "This patch is based on pkvm patches", so clearly you are at least
> aware that there is other work going on in this space.
>
Yes. We have been following pkvm, tdx & CCA patches. The MMIO section
has more details
on TDX/pkvm related aspects.
> At a very quick glance, this series is suffers from all of the same flaws that SNP,
> TDX, and pKVM have encountered. E.g. assuming guest memory is backed by struct page
> memory, relying on pinning to solve all problems (hint, it doesn't), and so on and
> so forth.
>
> And to make things worse, this series is riddled with bugs. E.g. patch 19 alone
> manages to squeeze in multiple fatal bugs in five new lines of code: deadlock due
> to not releasing mmap_lock on failure, failure to correcty handle MOVE, failure to
That's an oversight. Apologies for that. Thanks for pointing it out.
> handle DELETE at all, failure to honor (or reject) READONLY, and probably several
> others.
>
READONLY should be rejected, as our APIs don't have any permission flags
yet. I think we should add those so that the CoVE APIs can support it as
well?
The same goes for DELETE, as we don't have an API to delete a confidential
memory region yet. I was not very sure about the use case for MOVE, though
(migration, possibly?).
kvm_riscv_cove_vm_add_memreg() should only have been invoked for CREATE,
with the other operations rejected for now.
I will revise the patch accordingly and leave a TODO comment about future
API updates.
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 4b0f09e..63889d9 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -499,6 +499,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>
> mmap_read_lock(current->mm);
>
> + if (is_cove_vm(kvm)) {
> + ret = kvm_riscv_cove_vm_add_memreg(kvm, base_gpa, size);
> + if (ret)
> + return ret;
> + }
> /*
> * A memory region could potentially cover multiple VMAs, and
> * any holes between them, so iterate over all of them to find
>
> I get that this is an RFC, but for a series of this size, operating in an area that
> is under heavy development by multiple other architectures, to have a diffstat that
> shows _zero_ changes to common KVM is simply unacceptable.
>
Thanks for the valuable feedback. This is pretty much pre-RFC as the spec
is very much in a draft state. We wanted to share it with the larger Linux
community to gather feedback sooner rather than later so that we can
incorporate any feedback into the spec.
> Please, go look at restrictedmem[2] and work on building CoVE support on top of
> that. If the current proposal doesn't fit CoVE's needs, then we need to know _before_
> all of that code gets merged.
>
Absolutely. That has always been the plan.
> [1] https://lore.kernel.org/linux-mm/[email protected]
> [2] https://lkml.kernel.org/r/20221202061347.1070246-1-chao.p.peng%40linux.intel.com
>
> > arch/riscv/Kbuild | 2 +
> > arch/riscv/Kconfig | 27 +
> > arch/riscv/cove/Makefile | 2 +
> > arch/riscv/cove/core.c | 40 +
> > arch/riscv/cove/cove_guest_sbi.c | 109 +++
> > arch/riscv/include/asm/cove.h | 27 +
> > arch/riscv/include/asm/covg_sbi.h | 38 +
> > arch/riscv/include/asm/csr.h | 2 +
> > arch/riscv/include/asm/kvm_cove.h | 206 +++++
> > arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> > arch/riscv/include/asm/kvm_host.h | 10 +-
> > arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> > arch/riscv/include/asm/mem_encrypt.h | 26 +
> > arch/riscv/include/asm/sbi.h | 107 +++
> > arch/riscv/include/uapi/asm/kvm.h | 17 +
> > arch/riscv/kernel/irq.c | 12 +
> > arch/riscv/kernel/setup.c | 2 +
> > arch/riscv/kvm/Makefile | 1 +
> > arch/riscv/kvm/aia.c | 101 ++-
> > arch/riscv/kvm/aia_device.c | 41 +-
> > arch/riscv/kvm/aia_imsic.c | 127 ++-
> > arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> > arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> > arch/riscv/kvm/main.c | 30 +-
> > arch/riscv/kvm/mmu.c | 45 +-
> > arch/riscv/kvm/tlb.c | 11 +-
> > arch/riscv/kvm/vcpu.c | 69 +-
> > arch/riscv/kvm/vcpu_exit.c | 34 +-
> > arch/riscv/kvm/vcpu_insn.c | 115 ++-
> > arch/riscv/kvm/vcpu_sbi.c | 16 +
> > arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> > arch/riscv/kvm/vcpu_timer.c | 26 +-
> > arch/riscv/kvm/vm.c | 34 +-
> > arch/riscv/kvm/vmid.c | 17 +-
> > arch/riscv/mm/Makefile | 3 +
> > arch/riscv/mm/init.c | 17 +-
> > arch/riscv/mm/ioremap.c | 45 +
> > arch/riscv/mm/mem_encrypt.c | 61 ++
> > drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> > drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> > include/uapi/linux/kvm.h | 8 +
> > mm/vmalloc.c | 16 +
> > 42 files changed, 3222 insertions(+), 109 deletions(-)
> > create mode 100644 arch/riscv/cove/Makefile
> > create mode 100644 arch/riscv/cove/core.c
> > create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> > create mode 100644 arch/riscv/include/asm/cove.h
> > create mode 100644 arch/riscv/include/asm/covg_sbi.h
> > create mode 100644 arch/riscv/include/asm/kvm_cove.h
> > create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> > create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> > create mode 100644 arch/riscv/kvm/cove.c
> > create mode 100644 arch/riscv/kvm/cove_sbi.c
> > create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> > create mode 100644 arch/riscv/mm/ioremap.c
> > create mode 100644 arch/riscv/mm/mem_encrypt.c
> >
> > --
> > 2.25.1
> >
I'm a vmalloc reviewer now too; on -next/mm-unstable, get_maintainer.pl should
say so. That's forgivable, because perhaps you ran against another tree, but
FYI, in the future I'd appreciate a cc :)
On Wed, Apr 19, 2023 at 03:16:29PM -0700, Atish Patra wrote:
> From: Rajnesh Kanwal <[email protected]>
>
> In virtualization, the guest may need notify the host about the ioremap
> regions. This is a common usecase in confidential computing where the
> host only provides MMIO emulation for the regions specified by the guest.
>
> Add a pair if arch specific callbacks to track the ioremapped regions.
Nit: typo if -> of.
>
> This patch is based on pkvm patches. A generic arch config can be added
> similar to pkvm if this is going to be the final solution. The device
> authorization/filtering approach is very different from this and we
> may prefer that one as it provides more flexibility in terms of which
> devices are allowed for the confidential guests.
So it's an RFC that assumes existing patches are already applied or do you mean
something else here? What do I need to do to get to a vmalloc.c with your patch
applied?
I guess this is pretty nitty since your changes are small here but be good to
know!
>
> Signed-off-by: Rajnesh Kanwal <[email protected]>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> mm/vmalloc.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index bef6cf2..023630e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -304,6 +304,14 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
> return err;
> }
>
> +__weak void ioremap_phys_range_hook(phys_addr_t phys_addr, size_t size, pgprot_t prot)
> +{
> +}
> +
> +__weak void iounmap_phys_range_hook(phys_addr_t phys_addr, size_t size)
> +{
> +}
> +
I'm not sure whether using a weak reference here is for efficiency; perhaps
this is a nit, but I'd prefer an arch_*() hook declared in a header somewhere,
as __weak definitions hide the call paths quite effectively.
> int ioremap_page_range(unsigned long addr, unsigned long end,
> phys_addr_t phys_addr, pgprot_t prot)
> {
> @@ -315,6 +323,10 @@ int ioremap_page_range(unsigned long addr, unsigned long end,
> if (!err)
> kmsan_ioremap_page_range(addr, end, phys_addr, prot,
> ioremap_max_page_shift);
> +
> + if (!err)
> + ioremap_phys_range_hook(phys_addr, end - addr, prot);
> +
> return err;
> }
>
> @@ -2772,6 +2784,10 @@ void vunmap(const void *addr)
> addr);
> return;
> }
> +
> + if (vm->flags & VM_IOREMAP)
> + iounmap_phys_range_hook(vm->phys_addr, get_vm_area_size(vm));
> +
There are places other than ioremap_page_range() that can set VM_IOREMAP,
e.g. vmap_pfn(), so this may trigger with addresses other than those specified
in the original hook. Is this intended?
> kfree(vm);
> }
> EXPORT_SYMBOL(vunmap);
> --
> 2.25.1
>
On Fri, Apr 21, 2023, Atish Kumar Patra wrote:
> On Thu, Apr 20, 2023 at 10:00 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Apr 19, 2023, Atish Patra wrote:
> > > 2. Lazy gstage page allocation vs upfront allocation with page pool.
> > > Currently, all gstage mappings happen at runtime during the fault. This is expensive
> > > as we need to convert that page to confidential memory as well. A page pool framework
> > > may be a better choice which can hold all the confidential pages which can be
> > > pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
> >
> > I'm sorry, what? Do y'all really not pay any attention to what is happening
> > outside of the RISC-V world?
> >
> > We, where "we" is KVM x86 and ARM, with folks contributing from 5+ companies,
> > have been working on this problem for going on three *years*. And that's just
> > from the first public posting[1], there have been discussions about how to approach
> > this for even longer. There have been multiple related presentations at KVM Forum,
> > something like 4 or 5 just at KVM Forum 2022 alone.
> >
>
> I did mention that in the 1st item of the TODO list.
My apologies, I completely missed the todo list.
> Thanks for your suggestion. It seems we should just directly move to
> restrictedmem asap.
Yes please, for the sake of everyone involved. It will likely save you from
running into the same pitfalls that x86 and ARM already encountered, and the more
eyeballs and use cases on whatever restrictedmem ends up being called, the better.
Thanks!
On Fri, Apr 21, 2023 at 1:12 AM Lorenzo Stoakes <[email protected]> wrote:
>
> I'm a vmalloc reviewer now too; on -next/mm-unstable, get_maintainer.pl should
> say so. That's forgivable, because perhaps you ran against another tree, but
> FYI, in the future I'd appreciate a cc :)
>
Ahh. Thanks for pointing that out. I see the patch for that now:
https://lkml.org/lkml/2023/3/21/1084
This series is based on 6.3-rc4. That's probably why get_maintainer.pl
did not pick it up. I will make sure to cc you on future revisions.
> On Wed, Apr 19, 2023 at 03:16:29PM -0700, Atish Patra wrote:
> > From: Rajnesh Kanwal <[email protected]>
> >
> > In virtualization, the guest may need notify the host about the ioremap
> > regions. This is a common usecase in confidential computing where the
> > host only provides MMIO emulation for the regions specified by the guest.
> >
> > Add a pair if arch specific callbacks to track the ioremapped regions.
>
> Nit: typo if -> of.
>
Fixed. Thanks.
> >
> > This patch is based on pkvm patches. A generic arch config can be added
> > similar to pkvm if this is going to be the final solution. The device
> > authorization/filtering approach is very different from this and we
> > may prefer that one as it provides more flexibility in terms of which
> > devices are allowed for the confidential guests.
>
> So it's an RFC that assumes existing patches are already applied or do you mean
> something else here? What do I need to do to get to a vmalloc.c with your patch
> applied?
>
> I guess this is pretty nitty since your changes are small here but be good to
> know!
>
Here is a bit more context: this patch is inspired by Marc's pkvm patch[1].
We haven't seen a revised version of that series, so we are not sure whether
this is what the final solution for pKVM will be.

The alternative is the guest device filtering approach. We are also tracking
that; it introduces a new set of functions (ioremap_hardened)[2] for devices
authorized for the guest. That series doesn't require any changes to
vmalloc.c, in which case this patch can be dropped.

As the TDX implementation is not ready yet, we chose to go this way to get
the ball rolling for implementing confidential computing in RISC-V. Our plan
is to align with whatever solution the upstream community finally agrees upon.
[1] https://lore.kernel.org/kvm/20211007143852.pyae42sbovi4vk23@gator/t/#mc3480e2a1d69f91999aae11004941dbdfbbdd600
[2] https://github.com/intel/tdx/commit/d8bb168e10d1ba534cb83260d9a8a3c5b269eb50
> >
> > Signed-off-by: Rajnesh Kanwal <[email protected]>
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > mm/vmalloc.c | 16 ++++++++++++++++
> > 1 file changed, 16 insertions(+)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index bef6cf2..023630e 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -304,6 +304,14 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
> > return err;
> > }
> >
> > +__weak void ioremap_phys_range_hook(phys_addr_t phys_addr, size_t size, pgprot_t prot)
> > +{
> > +}
> > +
> > +__weak void iounmap_phys_range_hook(phys_addr_t phys_addr, size_t size)
> > +{
> > +}
> > +
>
> I'm not sure whether using a weak reference here is for efficiency; perhaps
> this is a nit, but I'd prefer an arch_*() hook declared in a header somewhere,
> as __weak definitions hide the call paths quite effectively.
>
Sure. Will do that.
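Something along these lines, perhaps (a minimal sketch of the usual
arch-override pattern; the names are illustrative, not necessarily what the
next revision will use):

	/* In a generic header, e.g. include/linux/io.h (illustrative) */
	#ifndef arch_ioremap_phys_range_notify
	static inline void arch_ioremap_phys_range_notify(phys_addr_t phys_addr,
							  size_t size, pgprot_t prot)
	{
	}
	#endif

	#ifndef arch_iounmap_phys_range_notify
	static inline void arch_iounmap_phys_range_notify(phys_addr_t phys_addr,
							  size_t size)
	{
	}
	#endif

An architecture that wants the notifications then provides real definitions
in its asm/io.h and defines the corresponding macros, so the call sites in
mm/vmalloc.c stay visible and greppable.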
> > int ioremap_page_range(unsigned long addr, unsigned long end,
> > phys_addr_t phys_addr, pgprot_t prot)
> > {
> > @@ -315,6 +323,10 @@ int ioremap_page_range(unsigned long addr, unsigned long end,
> > if (!err)
> > kmsan_ioremap_page_range(addr, end, phys_addr, prot,
> > ioremap_max_page_shift);
> > +
> > + if (!err)
> > + ioremap_phys_range_hook(phys_addr, end - addr, prot);
> > +
> > return err;
> > }
> >
> > @@ -2772,6 +2784,10 @@ void vunmap(const void *addr)
> > addr);
> > return;
> > }
> > +
> > + if (vm->flags & VM_IOREMAP)
> > + iounmap_phys_range_hook(vm->phys_addr, get_vm_area_size(vm));
> > +
>
> There are places other than ioremap_page_range() that can set VM_IOREMAP,
> e.g. vmap_pfn(), so this may trigger with addresses other than those specified
> in the original hook. Is this intended?
>
Thanks for pointing that out. Yeah. That is not intentional.
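One possible direction (purely illustrative, and assuming the generic
CONFIG_GENERIC_IOREMAP implementation of iounmap() in mm/ioremap.c) would be
to issue the unmap notification from iounmap() instead of vunmap(), so that
areas created by vmap_pfn() never reach the hook:

	void iounmap(volatile void __iomem *addr)
	{
		void *vaddr = (void __force *)addr;
		struct vm_struct *vm = find_vm_area(vaddr);

		/* Only ioremapped areas are torn down via iounmap(), so
		 * vmap_pfn() mappings (freed through vunmap()) are unaffected.
		 */
		if (vm && (vm->flags & VM_IOREMAP))
			iounmap_phys_range_hook(vm->phys_addr,
						get_vm_area_size(vm));

		vunmap(vaddr);
	}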
> > kfree(vm);
> > }
> > EXPORT_SYMBOL(vunmap);
> > --
> > 2.25.1
> >
On 4/19/23 15:17, Atish Patra wrote:
> The guests running in CoVE must notify the host about its mmio regions
> so that host can enable mmio emulation.
This one doesn't make a lot of sense to me.
The guest and host must agree about the guest's physical layout up
front. In general, the host gets to dictate that layout. It tells the
guest, up front, what is present in the guest physical address space.
This callback appears to say to the host:
Hey, I (the guest) am treating this guest physical area as MMIO.
But the host and guest have to agree _somewhere_ what the MMIO is used
for, not just that it is being used as MMIO.
On Thu, Apr 20, 2023 at 09:30:29AM -0700, Sean Christopherson wrote:
> Please, go look at restrictedmem[2] and work on building CoVE support on top of
> that. If the current proposal doesn't fit CoVE's needs, then we need to know _before_
> all of that code gets merged.
I agree it's preferable to know beforehand, to avoid potential maintainability
quagmires when bringing additional architectures onboard, and it probably makes
sense here to get that early input. But as a general statement, it's not
necessarily a *requirement*.

I worry that if we commit to such a policy, then by the time restrictedmem gets
close to merging, yet another architecture/use-case will come along and delay
things further for architectures that already have hardware in the field.
Not saying that's the case here, but just in general I think it's worth
keeping the option open on iterating on a partial solution vs. trying to
address everything on the first shot, depending on how the timing works
out.
Thanks,
Mike
>
> [1] https://lore.kernel.org/linux-mm/[email protected]
> [2] https://lkml.kernel.org/r/20221202061347.1070246-1-chao.p.peng%40linux.intel.com
>
> > arch/riscv/Kbuild | 2 +
> > arch/riscv/Kconfig | 27 +
> > arch/riscv/cove/Makefile | 2 +
> > arch/riscv/cove/core.c | 40 +
> > arch/riscv/cove/cove_guest_sbi.c | 109 +++
> > arch/riscv/include/asm/cove.h | 27 +
> > arch/riscv/include/asm/covg_sbi.h | 38 +
> > arch/riscv/include/asm/csr.h | 2 +
> > arch/riscv/include/asm/kvm_cove.h | 206 +++++
> > arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> > arch/riscv/include/asm/kvm_host.h | 10 +-
> > arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> > arch/riscv/include/asm/mem_encrypt.h | 26 +
> > arch/riscv/include/asm/sbi.h | 107 +++
> > arch/riscv/include/uapi/asm/kvm.h | 17 +
> > arch/riscv/kernel/irq.c | 12 +
> > arch/riscv/kernel/setup.c | 2 +
> > arch/riscv/kvm/Makefile | 1 +
> > arch/riscv/kvm/aia.c | 101 ++-
> > arch/riscv/kvm/aia_device.c | 41 +-
> > arch/riscv/kvm/aia_imsic.c | 127 ++-
> > arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> > arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> > arch/riscv/kvm/main.c | 30 +-
> > arch/riscv/kvm/mmu.c | 45 +-
> > arch/riscv/kvm/tlb.c | 11 +-
> > arch/riscv/kvm/vcpu.c | 69 +-
> > arch/riscv/kvm/vcpu_exit.c | 34 +-
> > arch/riscv/kvm/vcpu_insn.c | 115 ++-
> > arch/riscv/kvm/vcpu_sbi.c | 16 +
> > arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> > arch/riscv/kvm/vcpu_timer.c | 26 +-
> > arch/riscv/kvm/vm.c | 34 +-
> > arch/riscv/kvm/vmid.c | 17 +-
> > arch/riscv/mm/Makefile | 3 +
> > arch/riscv/mm/init.c | 17 +-
> > arch/riscv/mm/ioremap.c | 45 +
> > arch/riscv/mm/mem_encrypt.c | 61 ++
> > drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> > drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> > include/uapi/linux/kvm.h | 8 +
> > mm/vmalloc.c | 16 +
> > 42 files changed, 3222 insertions(+), 109 deletions(-)
> > create mode 100644 arch/riscv/cove/Makefile
> > create mode 100644 arch/riscv/cove/core.c
> > create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> > create mode 100644 arch/riscv/include/asm/cove.h
> > create mode 100644 arch/riscv/include/asm/covg_sbi.h
> > create mode 100644 arch/riscv/include/asm/kvm_cove.h
> > create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> > create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> > create mode 100644 arch/riscv/kvm/cove.c
> > create mode 100644 arch/riscv/kvm/cove_sbi.c
> > create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> > create mode 100644 arch/riscv/mm/ioremap.c
> > create mode 100644 arch/riscv/mm/mem_encrypt.c
> >
> > --
> > 2.25.1
> >
>
On Thu, Apr 20, 2023 at 8:47 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 19, 2023, Atish Patra wrote:
> > +int kvm_riscv_cove_vm_measure_pages(struct kvm *kvm, struct kvm_riscv_cove_measure_region *mr)
> > +{
> > + struct kvm_cove_tvm_context *tvmc = kvm->arch.tvmc;
> > + int rc = 0, idx, num_pages;
> > + struct kvm_riscv_cove_mem_region *conf;
> > + struct page *pinned_page, *conf_page;
> > + struct kvm_riscv_cove_page *cpage;
> > +
> > + if (!tvmc)
> > + return -EFAULT;
> > +
> > + if (tvmc->finalized_done) {
> > + kvm_err("measured_mr pages can not be added after finalize\n");
> > + return -EINVAL;
> > + }
> > +
> > + num_pages = bytes_to_pages(mr->size);
> > + conf = &tvmc->confidential_region;
> > +
> > + if (!IS_ALIGNED(mr->userspace_addr, PAGE_SIZE) ||
> > + !IS_ALIGNED(mr->gpa, PAGE_SIZE) || !mr->size ||
> > + !cove_is_within_region(conf->gpa, conf->npages << PAGE_SHIFT, mr->gpa, mr->size))
> > + return -EINVAL;
> > +
> > + idx = srcu_read_lock(&kvm->srcu);
> > +
> > + /*TODO: Iterate one page at a time as pinning multiple pages fail with unmapped panic
> > + * with a virtual address range belonging to vmalloc region for some reason.
>
> I've no idea what code you had, but I suspect the fact that vmalloc'd memory isn't
> guaranteed to be physically contiguous is relevant to the panic.
>
Ahh. That makes sense. Thanks.
> > + */
> > + while (num_pages) {
> > + if (signal_pending(current)) {
> > + rc = -ERESTARTSYS;
> > + break;
> > + }
> > +
> > + if (need_resched())
> > + cond_resched();
> > +
> > + rc = get_user_pages_fast(mr->userspace_addr, 1, 0, &pinned_page);
> > + if (rc < 0) {
> > + kvm_err("Pinning the userpsace addr %lx failed\n", mr->userspace_addr);
> > + break;
> > + }
> > +
> > + /* Enough pages are not available to be pinned */
> > + if (rc != 1) {
> > + rc = -ENOMEM;
> > + break;
> > + }
> > + conf_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!conf_page) {
> > + rc = -ENOMEM;
> > + break;
> > + }
> > +
> > + rc = cove_convert_pages(page_to_phys(conf_page), 1, true);
> > + if (rc)
> > + break;
> > +
> > + /*TODO: Support other pages sizes */
> > + rc = sbi_covh_add_measured_pages(tvmc->tvm_guest_id, page_to_phys(pinned_page),
> > + page_to_phys(conf_page), SBI_COVE_PAGE_4K,
> > + 1, mr->gpa);
> > + if (rc)
> > + break;
> > +
> > + /* Unpin the page now */
> > + put_page(pinned_page);
> > +
> > + cpage = kmalloc(sizeof(*cpage), GFP_KERNEL_ACCOUNT);
> > + if (!cpage) {
> > + rc = -ENOMEM;
> > + break;
> > + }
> > +
> > + cpage->page = conf_page;
> > + cpage->npages = 1;
> > + cpage->gpa = mr->gpa;
> > + cpage->hva = mr->userspace_addr;
>
> Snapshotting the userspace address for the _source_ page can't possibly be useful.
>
Yeah. Currently, the hva in the kvm_riscv_cove_page is not used
anywhere in the code. We can remove it.
> > + cpage->is_mapped = true;
> > + INIT_LIST_HEAD(&cpage->link);
> > + list_add(&cpage->link, &tvmc->measured_pages);
> > +
> > + mr->userspace_addr += PAGE_SIZE;
> > + mr->gpa += PAGE_SIZE;
> > + num_pages--;
> > + conf_page = NULL;
> > +
> > + continue;
> > + }
> > + srcu_read_unlock(&kvm->srcu, idx);
> > +
> > + if (rc < 0) {
> > + /* We don't to need unpin pages as it is allocated by the hypervisor itself */
>
> This comment makes no sense. The above code is doing all of the allocation and
> pinning, which strongly suggests that KVM is the hypervisor. But this comment
> implies that KVM is not the hypervisor.
>
What I meant is that conf_page is allocated in the kernel using alloc_page(),
so no pinning/unpinning is required. The comment is misleading and confusing
at best; I will remove it.
> And "pinned_page" is cleared unpinned in the loop after the page is added+measured,
> which looks to be the same model as TDX where "pinned_page" is the source and
> "conf_page" is gifted to the hypervisor. But on failure, e.g. when allocating
> "conf_page", that reference is not put.
>
Thanks. Will fix it.
> > + cove_delete_page_list(kvm, &tvmc->measured_pages, false);
> > + /* Free the last allocated page for which conversion/measurement failed */
> > + kfree(conf_page);
>
> Assuming my guesses about how the architecture works are correct, this is broken
> if sbi_covh_add_measured_pages() fails. The page has already been gifted to the
> TSM by cove_convert_pages(), but there is no call to sbi_covh_tsm_reclaim_pages(),
> which I'm guessing is necessary to transition the page back to a state where it can
> be safely used by the host.

Yeah. The last conf_page should be reclaimed as well if adding measured pages
fails at any point in the loop. All other allocated pages would be reclaimed as
part of cove_delete_page_list().
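For reference, a rough sketch of the reworked error handling inside the loop
(illustrative only: the reclaim call and its arguments are an assumption about
the COVH interface rather than code from the series, and since conf_page comes
from alloc_page(), __free_page() rather than kfree() looks like the right way
to release it):

	conf_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!conf_page) {
		put_page(pinned_page);		/* drop the pin taken above */
		rc = -ENOMEM;
		break;
	}

	rc = cove_convert_pages(page_to_phys(conf_page), 1, true);
	if (rc) {
		put_page(pinned_page);
		__free_page(conf_page);
		break;
	}

	/* TODO: Support other page sizes */
	rc = sbi_covh_add_measured_pages(tvmc->tvm_guest_id,
					 page_to_phys(pinned_page),
					 page_to_phys(conf_page),
					 SBI_COVE_PAGE_4K, 1, mr->gpa);
	if (rc) {
		put_page(pinned_page);
		/* Assumed reclaim helper: transition the page back from the
		 * TSM before handing it to the host allocator.
		 */
		sbi_covh_tsm_reclaim_pages(page_to_phys(conf_page), 1);
		__free_page(conf_page);
		break;
	}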
On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <[email protected]> wrote:
>
> On 4/19/23 15:17, Atish Patra wrote:
> > The guests running in CoVE must notify the host about its mmio regions
> > so that host can enable mmio emulation.
>
> This one doesn't make a lot of sense to me.
>
> The guest and host must agree about the guest's physical layout up
> front. In general, the host gets to dictate that layout. It tells the
> guest, up front, what is present in the guest physical address space.
>
That is passed through DT/ACPI (which will be measured) to the guest.
> This callback appears to say to the host:
>
> Hey, I (the guest) am treating this guest physical area as MMIO.
>
> But the host and guest have to agree _somewhere_ what the MMIO is used
> for, not just that it is being used as MMIO.
>
Yes. The TSM (TEE Security Manager), which is the equivalent of the TDX
module, also needs to be aware of the MMIO regions so that it can forward the
faults accordingly. Most of the MMIO is emulated in the host (in userspace,
or in the kernel where emulation exists there). The host is outside the trust
boundary of the guest, so the guest needs to make sure the host only emulates
the designated MMIO regions. Otherwise, it opens an attack surface for a
malicious host.

All other confidential computing solutions depend on guest-initiated MMIO as
well. AFAIK, TDX & SEV rely on #VE-like exceptions to invoke it, while this
patch is similar to what pKVM does. This approach lets the enlightened guest
control which MMIO regions it wants the host to emulate; that can be a subset
of the host-provided layout. The guest device filtering solution is based on
this idea as well [1].
[1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
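For reference, the guest side of this boils down to roughly the following
(a simplified sketch; the SBI wrapper and the guest-detection helper names are
assumptions based on the COVG naming, not the exact functions in the series):

	/* Guest: register an ioremapped range as MMIO with the TSM so that
	 * faults in it are decoded and forwarded to the host for emulation.
	 */
	void ioremap_phys_range_hook(phys_addr_t addr, size_t size, pgprot_t prot)
	{
		if (!is_cove_guest())
			return;

		sbi_covg_add_mmio_region(addr, size);
	}

	void iounmap_phys_range_hook(phys_addr_t addr, size_t size)
	{
		if (!is_cove_guest())
			return;

		sbi_covg_remove_mmio_region(addr, size);
	}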
>
On 2023-04-19 at 15:16 -07, Atish Patra <[email protected]> wrote...
> This patch series adds the RISC-V Confidential VM Extension (CoVE) support to
> Linux kernel. The RISC-V CoVE specification introduces non-ISA, SBI APIs. These
> APIs enable a confidential environment in which a guest VM's data can be isolated
> from the host while the host retains control of guest VM management and platform
> resources(memory, CPU, I/O).
>
> This is a very early WIP work. We want to share this with the community to get any
> feedback on overall architecture and direction. Any other feedback is welcome too.
>
> The detailed CoVE architecture document can be found here [0]. It used to be
> called AP-TEE and renamed to CoVE recently to avoid overloading term of TEE in
> general. The specification is in the draft stages and is subjected to change based
> on the feedback from the community.
>
> The CoVE specification introduces 3 new SBI extensions.
> COVH - CoVE Host side interface
> COVG - CoVE Guest side interface
> COVI - CoVE Secure Interrupt management extension
>
> Some key acronyms introduced:
>
> TSM - TEE Security Manager
> TVM - TEE VM (aka Confidential VM)
>
> CoVE Architecture:
> ====================
> The CoVE APIs are designed to be implementation and architecture agnostic,
> allowing for different deployment models while retaining common host and guest
> kernel code. Two examples are shown in Figure 1 and Figure 2.
> As shown in both figures, the architecture introduces a new software component
> called the "TEE Security Manager" (TSM) that runs in HS mode. The TSM has minimal
> hw attested footprint on TCB as it is a passive component that doesn't support
> scheduling or timer interrupts. Both example deployment models provide memory
> isolation between the host and the TEE VM (TVM).
>
>
> Non secure world | Secure world |
> | |
> Non | |
> Virtualized | Virtualized | Virtualized Virtualized |
> Env | Env | Env Env |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> | (VMM) | | | | | | | | | |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | +----------+ | +----------+ +----------+ |
> | | | | | | | | | |
> | | | | | | TVM | | TVM | |
> | | | Guest | | | Guest | | Guest | | VS-Mode
> Syscalls | +----------+ | +----------+ +----------+ |
> | | | | |
> | SBI | SBI(COVG + COVI) |
> | | | | |
> +--------------------------+ | +---------------------------+ --------------
> | Host (Linux) | | | TSM (Salus) |
> +--------------------------+ | +---------------------------+
> | | | HS-Mode
> SBI (COVH + COVI) | SBI (COVH + COVI)
> | | |
> +-----------------------------------------------------------+ --------------
> | Firmware(OpenSBI) + TSM Driver | M-Mode
> +-----------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 1: Host in HS model
>
>
> The deployment model shown in Figure 1 runs the host in HS mode where it is peer
> to the TSM which also runs in HS mode. It requires another component known as TSM
> Driver running in higher privilege mode than host/TSM. It is responsible for switching
> the context between the host and the TSM. TSM driver also manages the platform
> specific hardware solution via confidential domain bit as described in the specification[0]
> to provide the required memory isolation.
>
>
> Non secure world | Secure world
> |
> Virtualized Env | Virtualized Virtualized |
> Env Env |
> +-------------------------+ | +----------+ +----------+ | ------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> +----------+ | +----------+ | +----------+ +----------+ | ------------
> | | | | |
> Syscalls SBI | | | |
> | | | | |
> +--------------------------+ | +-----------+ +-----------+ |
> | Host (Linux) | | | TVM Guest| | TVM Guest| | VS-Mode
> +--------------------------+ | +-----------+ +-----------+ |
> | | | | |
> SBI (COVH + COVI) | SBI SBI |
> | | (COVG + COVI) (COVG + COVI)|
> | | | | |
> +-----------------------------------------------------------+ --------------
> | TSM(Salus) | HS-Mode
> +-----------------------------------------------------------+ --------------
> |
> SBI
> |
> +---------------------------------------------------------+ --------------
> | Firmware(OpenSBI) | M-Mode
> +---------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 2: Host in VS model
>
>
> The deployment model shown in Figure 2 simplifies the context switch and memory isolation
> by running the host in VS mode as a guest of TSM. Thus, the memory isolation is
> achieved by gstage mapping by the TSM. We don't need any additional hardware confidential
> domain bit to provide memory isolation. The downside of this model the host has to run the
> non-confidential VMs in nested environment which may have lower performance (yet to be measured).
> The current implementation Salus(TSM) doesn't support full nested virtualization yet.
>
> The platform must have a RoT to provide attestation in either model.
> This patch series implements the APIs defined by CoVE. The current TSM implementation
> allows the host to run TVMs as shown in figure 2. We are working on deployment
> model 1 in parallel. We do not expect any significant changes in either host/guest side
> ABI due to that.
>
> Shared memory between the host & TSM:
> =====================================
> To accelerate the H-mode CSR/GPR access, CoVE also reuses the Nested Acceleration (NACL)
> SBI extension[1]. NACL defines a per physical cpu shared memory area that is allocated
> at the boot. It allows the host running in VS mode to access H-mode CSR/GPR easily
> without trapping into the TSM. The CoVE specification clearly defines the exact
> state of the shared memory with r/w permissions at every call.
>
> Secure Interrupt management:
> ===========================
> The CoVE specification relies on the MSI based interrupt scheme defined in Advanced Interrupt
> Architecture specification[2]. The COVI SBI extension adds functions to bind
> a guest interrupt file to a TVMs. After that, only TCB components (TSM, TVM, TSM driver)
> can modify that. The host can inject an interrupt via TSM only.
> The TVMs are also in complete control of which interrupts it can receive. By default,
> all interrupts are denied. In this proof-of-concept implementation, all the interrupts
> are allowed by the guest at boot time to keep it simple.
>
> Device I/O:
> ===========
> In order to support paravirt I/O devices, SWIOTLB bounce buffer must be used by the
> guest. As the host can not access confidential memory, this buffer memory
> must be shared with the host via share/unshare functions defined in COVG SBI extension.
> RISC-V implementation achieves this generalizing mem_encrypt_init() similar to TDX/SEV/CCA.
> That's why, the CoVE Guest is only allowed to use virtio devices with VIRTIO_F_ACCESS_PLATFORM
> and VIRTIO_F_VERSION_1 as they force virtio drivers to use the DMA API.
>
> MMIO emulation:
> ======================
> TVM can register regions of address space as MMIO regions to be emulated by
> the host. TSM provides explicit SBI functions i.e. SBI_EXT_COVG_[ADD/REMOVE]_MMIO_REGION
> to request/remove MMIO regions. Any reads or writes to those MMIO region after
> SBI_EXT_COVG_ADD_MMIO_REGION call are forwarded to the host for emulation.
>
> This series allows any ioremapped memory to be emulated as MMIO region with
> above APIs via arch hookups inspired from pKVM work. We are aware that this model
> doesn't address all the threat vectors. We have also implemented the device
> filtering/authorization approach adopted by TDX[4]. However, those patches are not
> part of this series as the base TDX patches are still under active development.
> RISC-V CoVE will also adapt the revamped device filtering work once it is accepted
> by the Linux community in the future.
>
> The direct assignment of devices are a work in progress and will be added in the future[4].
>
> VMM support:
> ============
> This series is only tested with kvmtool support. Other VMM support (qemu-kvm, crossvm/rust-vmm)
> will be added later.
>
> Test cases:
> ===========
> We are working on kvm selftest for CoVE. We will post them as soon as they are ready.
> We haven't started any work on kvm unit-tests as RISC-V doesn't have basic infrastructure
> to support that. Once the kvm uni-test infrastructure is in place, we will add
> support for CoVE as well.
>
> Open design questions:
> ======================
>
> 1. The current implementation has two separate configs for guest(CONFIG_RISCV_COVE_GUEST)
> and the host (RISCV_COVE_HOST). The default defconfig will enable both so that
> same unified image works as both host & guest. Most likely distro prefer this way
> to minimize the maintenance burden but some may want a minimal CoVE guest image
> that has only hardened drivers. In addition to that, Android runs a microdroid instance
> in the confidential guests. A separate config will help in those case. Please let us
> know if there is any concern with two configs.
>
> 2. Lazy gstage page allocation vs upfront allocation with page pool.
> Currently, all gstage mappings happen at runtime during the fault. This is expensive
> as we need to convert that page to confidential memory as well. A page pool framework
> may be a better choice which can hold all the confidential pages which can be
> pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
>
> 3. In order to allow both confidential VM and non-confidential VM, the series
> uses regular branching instead of static branches for CoVE VM specific cases through
> out KVM. That may cause a few more branch penalties while running regular VMs.
> The alternate option is to use function pointers for any function that needs to
> take a different path. As per my understanding, that would be worse than branches.
>
> Patch organization:
> ===================
> This series depends on quite a few RISC-V patches that are not upstream yet.
> Here are the dependencies.
>
> 1. RISC-V IPI improvement series
> 2. RISC-V AIA support series.
> 3. RISC-V NACL support series
>
> In this series, PATCH [0-5] are generic improvement and cleanup patches which
> can be merged independently.
>
> PATCH [6-26, 34-37] adds host side for CoVE.
> PATCH [27-33] adds the interrupt related changes.
> PATCH [34-49] Adds the guest side changes for CoVE.
>
> The TSM project is written in rust and can be found here:
> https://github.com/rivosinc/salus
>
> Running the stack
> ====================
>
> To run/test the stack, you would need the following components :
>
> 1) Qemu
> 2) Common Host & Guest Kernel
> 3) kvmtool
> 4) Host RootFS with KVMTOOL and Guest Kernel
> 5) Salus
>
> The detailed steps are available at[6]
>
> The Linux kernel patches are also available at [7] and the kvmtool patches
> are available at [8].
>
> TODOs
> =======
> As this is a very early work, the todo list is quite long :).
> Here are some of them (not in any specific order)
>
> 1. Support fd based private memory interface proposed in
> https://lkml.org/lkml/2022/1/18/395
> 2. Align with updated guest runtime device filtering approach.
> 3. IOMMU integration
> 4. Dedicated device assignment via TDSIP & SPDM[4]
> 5. Support huge pages
> 6. Page pool allocator to avoid convert/reclaim at every fault
> 7. Other VMM support (qemu-kvm, crossvm)
> 8. Complete the PoC for the deployment model 1 where host runs in HS mode
> 9. Attestation integration
> 10. Harden the interrupt allowed list
> 11. kvm self-tests support for CoVE
> 11. kvm unit-tests support for CoVE
> 12. Guest hardening
> 13. Port pKVM on RISC-V using CoVE
> 14. Any other ?
>
> Links
> ============
> [0] CoVE architecture Specification.
> https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-aptee-spec.pdf
URL does not work for me.
I found this:
https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-cove.pdf
> [1] https://lists.riscv.org/g/sig-hypervisors/message/260
> [2] https://github.com/riscv/riscv-aia/releases/download/1.0-RC2/riscv-interrupts-1.0-RC2.pdf
> [3] https://github.com/rivosinc/linux/tree/cove_integration_device_filtering1
> [4] https://github.com/intel/tdx/commits/guest-filter-upstream
> [5] https://lists.riscv.org/g/tech-ap-tee/message/83
> [6] https://github.com/rivosinc/cove/wiki/CoVE-KVM-RISCV64-on-QEMU
> [7] https://github.com/rivosinc/linux/commits/cove-integration
> [8] https://github.com/rivosinc/kvmtool/tree/cove-integration-03072023
>
> Atish Patra (33):
> RISC-V: KVM: Improve KVM error reporting to the user space
> RISC-V: KVM: Invoke aia_update with preempt disabled/irq enabled
> RISC-V: KVM: Add a helper function to get pgd size
> RISC-V: Add COVH SBI extensions definitions
> RISC-V: KVM: Implement COVH SBI extension
> RISC-V: KVM: Add a barebone CoVE implementation
> RISC-V: KVM: Add UABI to support static memory region attestation
> RISC-V: KVM: Add CoVE related nacl helpers
> RISC-V: KVM: Implement static memory region measurement
> RISC-V: KVM: Use the new VM IOCTL for measuring pages
> RISC-V: KVM: Exit to the user space for trap redirection
> RISC-V: KVM: Return early for gstage modifications
> RISC-V: KVM: Skip dirty logging updates for TVM
> RISC-V: KVM: Add a helper function to trigger fence ops
> RISC-V: KVM: Skip most VCPU requests for TVMs
> RISC-V : KVM: Skip vmid/hgatp management for TVMs
> RISC-V: KVM: Skip TLB management for TVMs
> RISC-V: KVM: Register memory regions as confidential for TVMs
> RISC-V: KVM: Add gstage mapping for TVMs
> RISC-V: KVM: Handle SBI call forward from the TSM
> RISC-V: KVM: Implement vcpu load/put functions for CoVE guests
> RISC-V: KVM: Wireup TVM world switch
> RISC-V: KVM: Skip HVIP update for TVMs
> RISC-V: KVM: Implement COVI SBI extension
> RISC-V: KVM: Add interrupt management functions for TVM
> RISC-V: KVM: Skip AIA CSR updates for TVMs
> RISC-V: KVM: Perform limited operations in hardware enable/disable
> RISC-V: KVM: Indicate no support user space emulated IRQCHIP
> RISC-V: KVM: Add AIA support for TVMs
> RISC-V: KVM: Hookup TVM VCPU init/destroy
> RISC-V: KVM: Initialize CoVE
> RISC-V: KVM: Add TVM init/destroy calls
> drivers/hvc: sbi: Disable HVC console for TVMs
>
> Rajnesh Kanwal (15):
> mm/vmalloc: Introduce arch hooks to notify ioremap/unmap changes
> RISC-V: KVM: Update timer functionality for TVMs.
> RISC-V: Add COVI extension definitions
> RISC-V: KVM: Read/write gprs from/to shmem in case of TVM VCPU.
> RISC-V: Add COVG SBI extension definitions
> RISC-V: Add CoVE guest config and helper functions
> RISC-V: Implement COVG SBI extension
> RISC-V: COVE: Add COVH invalidate, validate, promote, demote and
> remove APIs.
> RISC-V: KVM: Add host side support to handle COVG SBI calls.
> RISC-V: Allow host to inject any ext interrupt id to a CoVE guest.
> RISC-V: Add base memory encryption functions.
> RISC-V: Add cc_platform_has() for RISC-V for CoVE
> RISC-V: ioremap: Implement for arch specific ioremap hooks
> riscv/virtio: Have CoVE guests enforce restricted virtio memory
> access.
> RISC-V: Add shared bounce buffer to support DBCN for CoVE Guest.
>
> arch/riscv/Kbuild | 2 +
> arch/riscv/Kconfig | 27 +
> arch/riscv/cove/Makefile | 2 +
> arch/riscv/cove/core.c | 40 +
> arch/riscv/cove/cove_guest_sbi.c | 109 +++
> arch/riscv/include/asm/cove.h | 27 +
> arch/riscv/include/asm/covg_sbi.h | 38 +
> arch/riscv/include/asm/csr.h | 2 +
> arch/riscv/include/asm/kvm_cove.h | 206 +++++
> arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> arch/riscv/include/asm/kvm_host.h | 10 +-
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> arch/riscv/include/asm/mem_encrypt.h | 26 +
> arch/riscv/include/asm/sbi.h | 107 +++
> arch/riscv/include/uapi/asm/kvm.h | 17 +
> arch/riscv/kernel/irq.c | 12 +
> arch/riscv/kernel/setup.c | 2 +
> arch/riscv/kvm/Makefile | 1 +
> arch/riscv/kvm/aia.c | 101 ++-
> arch/riscv/kvm/aia_device.c | 41 +-
> arch/riscv/kvm/aia_imsic.c | 127 ++-
> arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> arch/riscv/kvm/main.c | 30 +-
> arch/riscv/kvm/mmu.c | 45 +-
> arch/riscv/kvm/tlb.c | 11 +-
> arch/riscv/kvm/vcpu.c | 69 +-
> arch/riscv/kvm/vcpu_exit.c | 34 +-
> arch/riscv/kvm/vcpu_insn.c | 115 ++-
> arch/riscv/kvm/vcpu_sbi.c | 16 +
> arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> arch/riscv/kvm/vcpu_timer.c | 26 +-
> arch/riscv/kvm/vm.c | 34 +-
> arch/riscv/kvm/vmid.c | 17 +-
> arch/riscv/mm/Makefile | 3 +
> arch/riscv/mm/init.c | 17 +-
> arch/riscv/mm/ioremap.c | 45 +
> arch/riscv/mm/mem_encrypt.c | 61 ++
> drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> include/uapi/linux/kvm.h | 8 +
> mm/vmalloc.c | 16 +
> 42 files changed, 3222 insertions(+), 109 deletions(-)
> create mode 100644 arch/riscv/cove/Makefile
> create mode 100644 arch/riscv/cove/core.c
> create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> create mode 100644 arch/riscv/include/asm/cove.h
> create mode 100644 arch/riscv/include/asm/covg_sbi.h
> create mode 100644 arch/riscv/include/asm/kvm_cove.h
> create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> create mode 100644 arch/riscv/kvm/cove.c
> create mode 100644 arch/riscv/kvm/cove_sbi.c
> create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> create mode 100644 arch/riscv/mm/ioremap.c
> create mode 100644 arch/riscv/mm/mem_encrypt.c
--
Cheers,
Christophe de Dinechin (https://c3d.github.io)
Freedom Covenant (https://github.com/c3d/freedom-covenant)
Theory of Incomplete Measurements (https://c3d.github.io/TIM)
On 4/21/23 12:24, Atish Kumar Patra wrote:
> On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <[email protected]> wrote:>> This callback appears to say to the host:
>>
>> Hey, I (the guest) am treating this guest physical area as MMIO.
>>
>> But the host and guest have to agree _somewhere_ what the MMIO is used
>> for, not just that it is being used as MMIO.
>
> Yes. The TSM (TEE Security Manager) which is equivalent to TDX also
> needs to be aware of the MMIO regions so that it can forward the
> faults accordingly. Most of the MMIO is emulated in the host
> (userspace or kernel emulation if present). The host is outside the
> trust boundary of the guest. Thus, guest needs to make sure the host
> only emulates the designated MMIO region. Otherwise, it opens an
> attack surface from a malicious host.
How does this mechanism stop the host from emulating something outside
the designated region?
On TDX, for instance, the guest page table have a shared/private bit.
Private pages get TDX protections to (among other things) keep the page
contents confidential from the host. Shared pages can be used for MMIO
and don't have those protections.
If the host goes and tries to flip a page from private->shared, TDX
protections will kick in and prevent it.
None of this requires the guest to tell the host where it expects MMIO
to be located.
> All other confidential computing solutions also depend on guest
> initiated MMIO as well. AFAIK, the TDX & SEV relies on #VE like
> exceptions to invoke that while this patch is similar to what pkvm
> does. This approach lets the enlightened guest control which MMIO
> regions it wants the host to emulate.
I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
don't require an ioremap hook like this. So, even if they *are* "guest
initiated", the question still remains how they work without this patch,
or what they are missing without it.
> It can be a subset of the region's host provided the layout. The
> guest device filtering solution is based on this idea as well [1].
>
> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
I don't really see the connection. Even if that series was going
forward (I'm not sure it is) there is no ioremap hook there. There's
also no guest->host communication in that series. The guest doesn't
_tell_ the host where the MMIO is, it just declines to run code for
devices that it didn't expect to see.
I'm still rather confused here.
On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <[email protected]> wrote:
>
> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <[email protected]> wrote:>> This callback appears to say to the host:
> >>
> >> Hey, I (the guest) am treating this guest physical area as MMIO.
> >>
> >> But the host and guest have to agree _somewhere_ what the MMIO is used
> >> for, not just that it is being used as MMIO.
> >
> > Yes. The TSM (TEE Security Manager) which is equivalent to TDX also
> > needs to be aware of the MMIO regions so that it can forward the
> > faults accordingly. Most of the MMIO is emulated in the host
> > (userspace or kernel emulation if present). The host is outside the
> > trust boundary of the guest. Thus, guest needs to make sure the host
> > only emulates the designated MMIO region. Otherwise, it opens an
> > attack surface from a malicious host.
> How does this mechanism stop the host from emulating something outside
> the designated region?
>
> On TDX, for instance, the guest page table have a shared/private bit.
> Private pages get TDX protections to (among other things) keep the page
> contents confidential from the host. Shared pages can be used for MMIO
> and don't have those protections.
>
> If the host goes and tries to flip a page from private->shared, TDX
> protections will kick in and prevent it.
>
> None of this requires the guest to tell the host where it expects MMIO
> to be located.
>
> > All other confidential computing solutions also depend on guest
> > initiated MMIO as well. AFAIK, the TDX & SEV relies on #VE like
> > exceptions to invoke that while this patch is similar to what pkvm
> > does. This approach lets the enlightened guest control which MMIO
> > regions it wants the host to emulate.
>
> I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
> don't require an ioremap hook like this. So, even if they *are* "guest
> initiated", the question still remains how they work without this patch,
> or what they are missing without it.
>
Maybe I misunderstood your question earlier. Are you concerned about guests
invoking MMIO-region-specific calls in the ioremap path, or about passing that
information to the host? Earlier I assumed the former, but it seems you are
concerned about the latter as well. Sorry for the confusion in that case.

The guest initiation is necessary, while the host notification can be made
optional. "Guest initiated" means the guest tells the TSM (the equivalent of
the TDX module in RISC-V) the MMIO region details. The TSM keeps track of
these, and any page faults that happen in those regions are forwarded to the
host by the TSM after instruction decoding. Thus the TSM can make sure that
only ioremapped regions are treated as MMIO regions. Otherwise, all memory
outside the guest physical region would be considered MMIO.

In the current CoVE implementation, the MMIO region information is also passed
to the host to provide additional flexibility. The host may choose to do an
additional sanity check and bail out, without going to userspace, if the fault
address does not belong to a requested MMIO region. This is purely an
optimization and may not be mandatory.
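As an illustration of that optimization (a sketch only; the helper that checks
registered regions and the surrounding fault-path code are hypothetical, not
taken from the series):

	/* Hypothetical host-side check in the KVM MMIO fault path: refuse to
	 * emulate guest physical addresses the TVM never registered as MMIO.
	 */
	if (is_cove_vm(vcpu->kvm) &&
	    !kvm_riscv_cove_gpa_is_mmio(vcpu->kvm, fault_gpa))
		return -EINVAL;	/* bail instead of exiting to userspace */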
> > It can be a subset of the region's host provided the layout. The
> > guest device filtering solution is based on this idea as well [1].
> >
> > [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
>
> I don't really see the connection. Even if that series was going
> forward (I'm not sure it is) there is no ioremap hook there. There's
> also no guest->host communication in that series. The guest doesn't
> _tell_ the host where the MMIO is, it just declines to run code for
> devices that it didn't expect to see.
>
This is a recent version of the above series from the TDX github tree. It is
WIP as well and has not been posted to the mailing list, so it may still be
undergoing revisions. As per my understanding, the above ioremap changes for
TDX mark the ioremapped pages as shared. The guest->host communication happens
in the #VE exception handler, where the guest converts the fault into a
hypercall by invoking TDG.VP.VMCALL with an EPT violation set. The host
emulates an MMIO access if it gets a VMCALL with an EPT violation. Please
correct me if I am wrong.

As I said above, the objective here is to notify the TSM where the MMIO is.
Notifying the host is just an optimization that we chose to add. In fact, in
this series the KVM code doesn't do anything with that information. The commit
text can probably be improved to clarify that.
> I'm still rather confused here.
On 4/25/23 01:00, Atish Kumar Patra wrote:
> On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <[email protected]> wrote:
>> On 4/21/23 12:24, Atish Kumar Patra wrote:
>> I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
>> don't require an ioremap hook like this. So, even if they *are* "guest
>> initiated", the question still remains how they work without this patch,
>> or what they are missing without it.
>
> Maybe I misunderstood your question earlier. Are you concerned about guests
> invoking any MMIO region specific calls in the ioremap path or passing
> that information to the host ?
My concern is that I don't know why this patch is here. There should be
a very simple answer to the question: Why does RISC-V need this patch
but x86 does not?
> Earlier, I assumed the former but it seems you are also concerned
> about the latter as well. Sorry for the confusion in that case.
> The guest initiation is necessary while the host notification can be
> made optional.
> The "guest initiated" means the guest tells the TSM (equivalent of TDX
> module in RISC-V) the MMIO region details.
> The TSM keeps a track of this and any page faults that happen in that
> region are forwarded
> to the host by the TSM after the instruction decoding. Thus TSM can
> make sure that only ioremapped regions are
> considered MMIO regions. Otherwise, all memory outside the guest
> physical region will be considered as the MMIO region.
Ahh, OK, that's a familiar problem. I see the connection to device
filtering now.
Is this functionality in the current set? I went looking for it and all
I found was the host notification side.
Is this the only mechanism by which the guest tells the TSM which parts
of the guest physical address space can be exposed to the host?
For TDX and SEV, that information is inferred from a bit in the page
tables. Essentially, there are dedicated guest physical addresses that
tell the hardware how to treat the mappings: should the secure page
tables or the host's EPT/NPT be consulted?
If that mechanism is different for RISC-V, it would go a long way to
explaining why RISC-V needs this patch.
> In the current CoVE implementation, that MMIO region information is also
> passed to the host to provide additional flexibility. The host may
> choose to do additional
> sanity check and bail if the fault address does not belong to
> requested MMIO regions without
> going to the userspace. This is purely an optimization and may not be mandatory.
Makes sense, thanks for the explanation.
>>> It can be a subset of the region's host provided the layout. The
>>> guest device filtering solution is based on this idea as well [1].
>>>
>>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
>>
>> I don't really see the connection. Even if that series was going
>> forward (I'm not sure it is) there is no ioremap hook there. There's
>> also no guest->host communication in that series. The guest doesn't
>> _tell_ the host where the MMIO is, it just declines to run code for
>> devices that it didn't expect to see.
>
> This is a recent version of the above series from tdx github. This is
> a WIP as well and has not been posted to
> the mailing list. Thus, it may be going under revisions as well.
> As per my understanding the above ioremap changes for TDX mark the
> ioremapped pages as shared.
> The guest->host communication happen in the #VE exception handler
> where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> with an EPT violation set. The host would emulate an MMIO address if
> it gets an VMCALL with EPT violation.
> Please correct me if I am wrong.
Yeah, TDX does:
1. Guest MMIO access
2. Guest #VE handler (if the access faults)
3. Guest hypercall->host
4. Host fixes the fault
5. Hypercall returns, guest returns from #VE via IRET
6. Guest retries MMIO instruction
From what you said, RISC-V appears to do:
1. Guest MMIO access
2. Host MMIO handler
3. Host handles the fault, returns
4. Guest retries MMIO instruction
In other words, this mechanism does the same thing but short-circuits
the trip through #VE and the hypercall.
What happens if this ioremap() hook is not in place? Does the hardware
(or TSM) generate an exception like TDX gets? If so, it's probably
possible to move this "notify the TSM" code to that exception handler
instead of needing an ioremap() hook.
I'm not saying that it's _better_ to do that, but it would allow you to
get rid of this patch for now and get me to shut up. :)
> As I said above, the objective here is to notify the TSM where the
> MMIO is. Notifying the host is just an optimization that we choose to
> add. In fact, in this series the KVM code doesn't do anything with
> that information. The commit text probably can be improved to clarify
> that.
Just to close the loop here, please go take a look at
pgprot_decrypted(). That's where the x86 guest page table bit gets to
tell the hardware that the mapping might cause a #VE and is under the
control of the host. That's the extent of what x86 does at ioremap() time.
So, to summarize, we have:
x86:
1. Guest page table bit to mark shared (host) vs. private (guest)
control
2. #VE if there is a fault on a shared mapping to call into the host
RISC-V:
1. Guest->TSM call to mark MMIO vs. private
2. Faults in the MMIO area are then transparent to the guest
That design difference would, indeed, help explain why this patch is
here. I'm still not 100% convinced that the patch is *required*, but I
at least understand how we arrived here.
On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <[email protected]> wrote:
>
> On 4/25/23 01:00, Atish Kumar Patra wrote:
> > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <[email protected]> wrote:
> >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> >> I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
> >> don't require an ioremap hook like this. So, even if they *are* "guest
> >> initiated", the question still remains how they work without this patch,
> >> or what they are missing without it.
> >
> > Maybe I misunderstood your question earlier. Are you concerned about guests
> > invoking any MMIO region specific calls in the ioremap path or passing
> > that information to the host ?
>
> My concern is that I don't know why this patch is here. There should be
> a very simple answer to the question: Why does RISC-V need this patch
> but x86 does not?
>
> > Earlier, I assumed the former but it seems you are also concerned
> > about the latter as well. Sorry for the confusion in that case.
> > The guest initiation is necessary while the host notification can be
> > made optional.
> > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > module in RISC-V) the MMIO region details.
> > The TSM keeps a track of this and any page faults that happen in that
> > region are forwarded
> > to the host by the TSM after the instruction decoding. Thus TSM can
> > make sure that only ioremapped regions are
> > considered MMIO regions. Otherwise, all memory outside the guest
> > physical region will be considered as the MMIO region.
>
> Ahh, OK, that's a familiar problem. I see the connection to device
> filtering now.
>
> Is this functionality in the current set? I went looking for it and all
> I found was the host notification side.
>
The current series doesn't have the guest filtering feature enabled. However,
we have implemented guest filtering, and it is maintained in a separate tree:
https://github.com/rivosinc/linux/tree/cove-integration-device-filtering
We did not include those patches in this series because the TDX patches are
still undergoing development. We are planning to rebase our changes once the
revised patches are available.
> Is this the only mechanism by which the guest tells the TSM which parts
> of the guest physical address space can be exposed to the host?
>
This is the current approach defined in the CoVE spec. The guest informs the
TSM about both shared memory & MMIO regions via dedicated SBI calls (e.g.
sbi_covg_[add/remove]_mmio_region and sbi_covg_[share/unshare]_memory_region).
> For TDX and SEV, that information is inferred from a bit in the page
> tables. Essentially, there are dedicated guest physical addresses that
> tell the hardware how to treat the mappings: should the secure page
> tables or the host's EPT/NPT be consulted?
>
Yes. We don't have such a mechanism defined in CoVE yet. Having said that,
there is nothing in the ISA to prevent it, and it is doable. Specific bits in
the PTE can be used to encode shared & MMIO physical memory addresses. The TSM
implementation will probably need to implement a software page walker in that
case.

Are there any performance advantages to either approach? As per my
understanding, the PTE-bit approach saves some boot-time privilege transitions
and needs fewer ABIs, but adds the cost of a software walk at runtime faults.
> If that mechanism is different for RISC-V, it would go a long way to
> explaining why RISC-V needs this patch.
>
> > In the current CoVE implementation, that MMIO region information is also
> > passed to the host to provide additional flexibility. The host may
> > choose to do additional
> > sanity check and bail if the fault address does not belong to
> > requested MMIO regions without
> > going to the userspace. This is purely an optimization and may not be mandatory.
>
> Makes sense, thanks for the explanation.
>
> >>> It can be a subset of the region's host provided the layout. The
> >>> guest device filtering solution is based on this idea as well [1].
> >>>
> >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> >>
> >> I don't really see the connection. Even if that series was going
> >> forward (I'm not sure it is) there is no ioremap hook there. There's
> >> also no guest->host communication in that series. The guest doesn't
> >> _tell_ the host where the MMIO is, it just declines to run code for
> >> devices that it didn't expect to see.
> >
> > This is a recent version of the above series from tdx github. This is
> > a WIP as well and has not been posted to
> > the mailing list. Thus, it may be going under revisions as well.
> > As per my understanding the above ioremap changes for TDX mark the
> > ioremapped pages as shared.
> > The guest->host communication happen in the #VE exception handler
> > where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> > with an EPT violation set. The host would emulate an MMIO address if
> > it gets an VMCALL with EPT violation.
> > Please correct me if I am wrong.
>
> Yeah, TDX does:
>
> 1. Guest MMIO access
> 2. Guest #VE handler (if the access faults)
> 3. Guest hypercall->host
> 4. Host fixes the fault
> 5. Hypercall returns, guest returns from #VE via IRET
> 6. Guest retries MMIO instruction
>
> From what you said, RISC-V appears to do:
>
> 1. Guest MMIO access
> 2. Host MMIO handler
> 3. Host handles the fault, returns
> 4. Guest retries MMIO instruction
>
> In other words, this mechanism does the same thing but short-circuits
> the trip through #VE and the hypercall.
>
Yes. Thanks for summarizing the TDX approach.
> What happens if this ioremap() hook is not in place? Does the hardware
> (or TSM) generate an exception like TDX gets? If so, it's probably
> possible to move this "notify the TSM" code to that exception handler
> instead of needing an ioremap() hook.
>
We don't have a #VE-like exception mechanism in RISC-V.
> I'm not saying that it's _better_ to do that, but it would allow you to
> get rid of this patch for now and get me to shut up. :)
>
> > As I said above, the objective here is to notify the TSM where the
> > MMIO is. Notifying the host is just an optimization that we choose to
> > add. In fact, in this series the KVM code doesn't do anything with
> > that information. The commit text probably can be improved to clarify
> > that.
>
> Just to close the loop here, please go take a look at
> pgprot_decrypted(). That's where the x86 guest page table bit gets to
> tell the hardware that the mapping might cause a #VE and is under the
> control of the host. That's the extent of what x86 does at ioremap() time.
>
> So, to summarize, we have:
>
> x86:
> 1. Guest page table bit to mark shared (host) vs. private (guest)
> control
> 2. #VE if there is a fault on a shared mapping to call into the host
>
> RISC-V:
> 1. Guest->TSM call to mark MMIO vs. private
> 2. Faults in the MMIO area are then transparent to the guest
>
Yup. This discussion raised a very valid design question about the CoVE spec.
To summarize, we need to investigate whether using PTE bits instead of
additional ABI for managing shared/confidential/ioremapped pages makes more
sense.
Thanks for bearing with my answers, and for the feedback :).
> That design difference would, indeed, help explain why this patch is
> here. I'm still not 100% convinced that the patch is *required*, but I
> at least understand how we arrived here.
On Wed, Apr 26, 2023 at 1:32 PM Atish Kumar Patra <[email protected]> wrote:
>
> On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <[email protected]> wrote:
> >
> > On 4/25/23 01:00, Atish Kumar Patra wrote:
> > > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <[email protected]> wrote:
> > >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > >> I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
> > >> don't require an ioremap hook like this. So, even if they *are* "guest
> > >> initiated", the question still remains how they work without this patch,
> > >> or what they are missing without it.
> > >
> > > Maybe I misunderstood your question earlier. Are you concerned about guests
> > > invoking any MMIO region specific calls in the ioremap path or passing
> > > that information to the host ?
> >
> > My concern is that I don't know why this patch is here. There should be
> > a very simple answer to the question: Why does RISC-V need this patch
> > but x86 does not?
> >
> > > Earlier, I assumed the former but it seems you are also concerned
> > > about the latter as well. Sorry for the confusion in that case.
> > > The guest initiation is necessary while the host notification can be
> > > made optional.
> > > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > > module in RISC-V) the MMIO region details.
> > > The TSM keeps a track of this and any page faults that happen in that
> > > region are forwarded
> > > to the host by the TSM after the instruction decoding. Thus TSM can
> > > make sure that only ioremapped regions are
> > > considered MMIO regions. Otherwise, all memory outside the guest
> > > physical region will be considered as the MMIO region.
> >
> > Ahh, OK, that's a familiar problem. I see the connection to device
> > filtering now.
> >
> > Is this functionality in the current set? I went looking for it and all
> > I found was the host notification side.
> >
>
> The current series doesn't have the guest filtering feature enabled.
> However, we implemented guest filtering and is maintained in a separate tree
>
> https://github.com/rivosinc/linux/tree/cove-integration-device-filtering
>
> We did not include those in this series because the tdx patches are
> still undergoing
> development. We are planning to rebase our changes once the revised
> patches are available.
>
> > Is this the only mechanism by which the guest tells the TSM which parts
> > of the guest physical address space can be exposed to the host?
> >
>
> This is the current approach defined in CoVE spec. Guest informs about both
> shared memory & mmio regions via dedicated SBI calls (
> e.g sbi_covg_[add/remove]_mmio_region and
> sbi_covg_[share/unshare]_memory_region)
>
> > For TDX and SEV, that information is inferred from a bit in the page
> > tables. Essentially, there are dedicated guest physical addresses that
> > tell the hardware how to treat the mappings: should the secure page
> > tables or the host's EPT/NPT be consulted?
> >
>
> Yes. We don't have such a mechanism defined in CoVE yet.
> Having said that, there is nothing in ISA to prevent that and it is doable.
> Some specific bits in the PTE entry can also be used to encode for
> shared & mmio physical memory addresses.
> The TSM implementation will probably need to implement a software page
> walker in that case.
We can certainly use the PTE bits defined by the Svpbmt extension to
differentiate between I/O and memory. Also, we can use the PTE RSW
(reserved-for-software) bits to differentiate between shared and non-shared
memory.
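For reference, those bits would be interpreted roughly as below. The PBMT
field is PTE bits 62:61 with Svpbmt and the RSW field is bits 9:8; picking
RSW bit 8 to mean "shared" is only an example, not something the spec defines:

    #include <linux/types.h>

    #define PTE_PBMT_SHIFT          61
    #define PTE_PBMT_MASK           (3UL << PTE_PBMT_SHIFT)
    #define PBMT_PMA                0UL     /* normal memory */
    #define PBMT_NC                 1UL     /* non-cacheable */
    #define PBMT_IO                 2UL     /* I/O */

    #define PTE_RSW_SHARED          (1UL << 8)  /* example: RSW bit 8 = shared */

    static bool pte_is_io(unsigned long pte)
    {
            return ((pte & PTE_PBMT_MASK) >> PTE_PBMT_SHIFT) == PBMT_IO;
    }

    static bool pte_is_shared(unsigned long pte)
    {
            return pte & PTE_RSW_SHARED;
    }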
>
> Are there any performance advantages between the two approaches ?
> As per my understanding, we are saving some boot time privilege
> transitions & less ABIs but
> adds the cost of software walk at runtime faults.
Performance-wise both approaches will be the same because, in the case of the
PTE-based approach, the TSM can map the shared memory on demand and do a
software walk upon first access.
>
> > If that mechanism is different for RISC-V, it would go a long way to
> > explaining why RISC-V needs this patch.
> >
> > > In the current CoVE implementation, that MMIO region information is also
> > > passed to the host to provide additional flexibility. The host may
> > > choose to do additional
> > > sanity check and bail if the fault address does not belong to
> > > requested MMIO regions without
> > > going to the userspace. This is purely an optimization and may not be mandatory.
> >
> > Makes sense, thanks for the explanation.
> >
> > >>> It can be a subset of the region's host provided the layout. The
> > >>> guest device filtering solution is based on this idea as well [1].
> > >>>
> > >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> > >>
> > >> I don't really see the connection. Even if that series was going
> > >> forward (I'm not sure it is) there is no ioremap hook there. There's
> > >> also no guest->host communication in that series. The guest doesn't
> > >> _tell_ the host where the MMIO is, it just declines to run code for
> > >> devices that it didn't expect to see.
> > >
> > > This is a recent version of the above series from tdx github. This is
> > > a WIP as well and has not been posted to
> > > the mailing list. Thus, it may be going under revisions as well.
> > > As per my understanding the above ioremap changes for TDX mark the
> > > ioremapped pages as shared.
> > > The guest->host communication happen in the #VE exception handler
> > > where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> > > with an EPT violation set. The host would emulate an MMIO address if
> > > it gets an VMCALL with EPT violation.
> > > Please correct me if I am wrong.
> >
> > Yeah, TDX does:
> >
> > 1. Guest MMIO access
> > 2. Guest #VE handler (if the access faults)
> > 3. Guest hypercall->host
> > 4. Host fixes the fault
> > 5. Hypercall returns, guest returns from #VE via IRET
> > 6. Guest retries MMIO instruction
> >
> > From what you said, RISC-V appears to do:
> >
> > 1. Guest MMIO access
> > 2. Host MMIO handler
> > 3. Host handles the fault, returns
> > 4. Guest retries MMIO instruction
> >
> > In other words, this mechanism does the same thing but short-circuits
> > the trip through #VE and the hypercall.
> >
>
> Yes. Thanks for summarizing the tdx approach.
>
> > What happens if this ioremap() hook is not in place? Does the hardware
> > (or TSM) generate an exception like TDX gets? If so, it's probably
> > possible to move this "notify the TSM" code to that exception handler
> > instead of needing an ioremap() hook.
> >
>
> We don't have a #VE like exception mechanism in RISC-V.
>
> > I'm not saying that it's _better_ to do that, but it would allow you to
> > get rid of this patch for now and get me to shut up. :)
> >
> > > As I said above, the objective here is to notify the TSM where the
> > > MMIO is. Notifying the host is just an optimization that we choose to
> > > add. In fact, in this series the KVM code doesn't do anything with
> > > that information. The commit text probably can be improved to clarify
> > > that.
> >
> > Just to close the loop here, please go take a look at
> > pgprot_decrypted(). That's where the x86 guest page table bit gets to
> > tell the hardware that the mapping might cause a #VE and is under the
> > control of the host. That's the extent of what x86 does at ioremap() time.
> >
> > So, to summarize, we have:
> >
> > x86:
> > 1. Guest page table bit to mark shared (host) vs. private (guest)
> > control
> > 2. #VE if there is a fault on a shared mapping to call into the host
> >
> > RISC-V:
> > 1. Guest->TSM call to mark MMIO vs. private
> > 2. Faults in the MMIO area are then transparent to the guest
> >
>
> Yup. This discussion raised a very valid design aspect of the CoVE spec.
> To summarize, we need to investigate whether using PTE bits instead of
> additional ABI
> for managing shared/confidential/ioremapped pages makes more sense.
>
> Thanks for putting up with my answers and the feedback :).
I think we should re-evaluate the PTE (or software-walk) based approach
for the CoVE spec. It is better to keep the CoVE spec as minimal as possible
and define SBI calls only if absolutely required.
>
> > That design difference would, indeed, help explain why this patch is
> > here. I'm still not 100% convinced that the patch is *required*, but I
> > at least understand how we arrived here.
Regards,
Anup
On Wed, Apr 26, 2023 at 6:30 AM Anup Patel <[email protected]> wrote:
>
> On Wed, Apr 26, 2023 at 1:32 PM Atish Kumar Patra <[email protected]> wrote:
> >
> > On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <[email protected]> wrote:
> > >
> > > On 4/25/23 01:00, Atish Kumar Patra wrote:
> > > > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <[email protected]> wrote:
> > > >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > > >> I'm not _quite_ sure what "guest initiated" means. But SEV and TDX
> > > >> don't require an ioremap hook like this. So, even if they *are* "guest
> > > >> initiated", the question still remains how they work without this patch,
> > > >> or what they are missing without it.
> > > >
> > > > Maybe I misunderstood your question earlier. Are you concerned about guests
> > > > invoking any MMIO region specific calls in the ioremap path or passing
> > > > that information to the host ?
> > >
> > > My concern is that I don't know why this patch is here. There should be
> > > a very simple answer to the question: Why does RISC-V need this patch
> > > but x86 does not?
> > >
> > > > Earlier, I assumed the former but it seems you are also concerned
> > > > about the latter as well. Sorry for the confusion in that case.
> > > > The guest initiation is necessary while the host notification can be
> > > > made optional.
> > > > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > > > module in RISC-V) the MMIO region details.
> > > > The TSM keeps a track of this and any page faults that happen in that
> > > > region are forwarded
> > > > to the host by the TSM after the instruction decoding. Thus TSM can
> > > > make sure that only ioremapped regions are
> > > > considered MMIO regions. Otherwise, all memory outside the guest
> > > > physical region will be considered as the MMIO region.
> > >
> > > Ahh, OK, that's a familiar problem. I see the connection to device
> > > filtering now.
> > >
> > > Is this functionality in the current set? I went looking for it and all
> > > I found was the host notification side.
> > >
> >
> > The current series doesn't have the guest filtering feature enabled.
> > However, we implemented guest filtering and is maintained in a separate tree
> >
> > https://github.com/rivosinc/linux/tree/cove-integration-device-filtering
> >
> > We did not include those in this series because the tdx patches are
> > still undergoing
> > development. We are planning to rebase our changes once the revised
> > patches are available.
> >
> > > Is this the only mechanism by which the guest tells the TSM which parts
> > > of the guest physical address space can be exposed to the host?
> > >
> >
> > This is the current approach defined in CoVE spec. Guest informs about both
> > shared memory & mmio regions via dedicated SBI calls (
> > e.g sbi_covg_[add/remove]_mmio_region and
> > sbi_covg_[share/unshare]_memory_region)
> >
> > > For TDX and SEV, that information is inferred from a bit in the page
> > > tables. Essentially, there are dedicated guest physical addresses that
> > > tell the hardware how to treat the mappings: should the secure page
> > > tables or the host's EPT/NPT be consulted?
> > >
> >
> > Yes. We don't have such a mechanism defined in CoVE yet.
> > Having said that, there is nothing in ISA to prevent that and it is doable.
> > Some specific bits in the PTE entry can also be used to encode for
> > shared & mmio physical memory addresses.
> > The TSM implementation will probably need to implement a software page
> > walker in that case.
>
> We can certainly use PTE bits defined by the Svpbmt extension to
> differentiate between IO and memory. Also, we can use the PTE
> SW bits to differentiate between shared and non-shared memory.
>
> >
> > Are there any performance advantages between the two approaches ?
> > As per my understanding, we are saving some boot time privilege
> > transitions & less ABIs but
> > adds the cost of software walk at runtime faults.
>
> Performance wise both approaches will be the same because in
> case of PTE based approach, the TSM can on-demand map the
> shared memory and do software walk upon first access.
For MMIO, sure, we can use Svpbmt or an RSW bit in the VS-stage PTE.
Performance-wise the difference is a few fetches from guest memory by
the TSM vs. a lookup by the TSM in an internal data structure.
It's a little more complicated for shared <-> private conversion,
though. If we were to emulate what TDX does with separate Shared vs
Secure EPTs, we could use the MSB of the GPA to divide GPA space in
half between private vs shared. But then we need to enable the host to
reclaim the private pages on a private -> shared conversion: either
the TSM must track which parts of GPA space have been converted (which
gets complicated in the presence of hugepages), or we let the host
remove whatever private pages it wants. For the latter we'd then need
an "accept" flow -- we don't have a #VE equivalent on RISC-V, but I
suppose we could use access fault exceptions for this purpose.
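For illustration, splitting the GPA space on its top bit would look roughly
like the helpers below; bit 40 is just an example position, since the real
split would depend on the implemented guest physical address width:

    #include <linux/types.h>

    /* Example only: use bit 40 as the shared/private selector in the GPA. */
    #define COVE_GPA_SHARED_BIT     (1UL << 40)

    static inline bool cove_gpa_is_shared(unsigned long gpa)
    {
            return gpa & COVE_GPA_SHARED_BIT;
    }

    static inline unsigned long cove_gpa_to_shared(unsigned long gpa)
    {
            return gpa | COVE_GPA_SHARED_BIT;
    }

    static inline unsigned long cove_gpa_to_private(unsigned long gpa)
    {
            return gpa & ~COVE_GPA_SHARED_BIT;
    }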
-Andrew
>
> >
> > > If that mechanism is different for RISC-V, it would go a long way to
> > > explaining why RISC-V needs this patch.
> > >
> > > > In the current CoVE implementation, that MMIO region information is also
> > > > passed to the host to provide additional flexibility. The host may
> > > > choose to do additional
> > > > sanity check and bail if the fault address does not belong to
> > > > requested MMIO regions without
> > > > going to the userspace. This is purely an optimization and may not be mandatory.
> > >
> > > Makes sense, thanks for the explanation.
> > >
> > > >>> It can be a subset of the region's host provided the layout. The
> > > >>> guest device filtering solution is based on this idea as well [1].
> > > >>>
> > > >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> > > >>
> > > >> I don't really see the connection. Even if that series was going
> > > >> forward (I'm not sure it is) there is no ioremap hook there. There's
> > > >> also no guest->host communication in that series. The guest doesn't
> > > >> _tell_ the host where the MMIO is, it just declines to run code for
> > > >> devices that it didn't expect to see.
> > > >
> > > > This is a recent version of the above series from tdx github. This is
> > > > a WIP as well and has not been posted to
> > > > the mailing list. Thus, it may be going under revisions as well.
> > > > As per my understanding the above ioremap changes for TDX mark the
> > > > ioremapped pages as shared.
> > > > The guest->host communication happen in the #VE exception handler
> > > > where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> > > > with an EPT violation set. The host would emulate an MMIO address if
> > > > it gets an VMCALL with EPT violation.
> > > > Please correct me if I am wrong.
> > >
> > > Yeah, TDX does:
> > >
> > > 1. Guest MMIO access
> > > 2. Guest #VE handler (if the access faults)
> > > 3. Guest hypercall->host
> > > 4. Host fixes the fault
> > > 5. Hypercall returns, guest returns from #VE via IRET
> > > 6. Guest retries MMIO instruction
> > >
> > > From what you said, RISC-V appears to do:
> > >
> > > 1. Guest MMIO access
> > > 2. Host MMIO handler
> > > 3. Host handles the fault, returns
> > > 4. Guest retries MMIO instruction
> > >
> > > In other words, this mechanism does the same thing but short-circuits
> > > the trip through #VE and the hypercall.
> > >
> >
> > Yes. Thanks for summarizing the tdx approach.
> >
> > > What happens if this ioremap() hook is not in place? Does the hardware
> > > (or TSM) generate an exception like TDX gets? If so, it's probably
> > > possible to move this "notify the TSM" code to that exception handler
> > > instead of needing an ioremap() hook.
> > >
> >
> > We don't have a #VE like exception mechanism in RISC-V.
> >
> > > I'm not saying that it's _better_ to do that, but it would allow you to
> > > get rid of this patch for now and get me to shut up. :)
> > >
> > > > As I said above, the objective here is to notify the TSM where the
> > > > MMIO is. Notifying the host is just an optimization that we choose to
> > > > add. In fact, in this series the KVM code doesn't do anything with
> > > > that information. The commit text probably can be improved to clarify
> > > > that.
> > >
> > > Just to close the loop here, please go take a look at
> > > pgprot_decrypted(). That's where the x86 guest page table bit gets to
> > > tell the hardware that the mapping might cause a #VE and is under the
> > > control of the host. That's the extent of what x86 does at ioremap() time.
> > >
> > > So, to summarize, we have:
> > >
> > > x86:
> > > 1. Guest page table bit to mark shared (host) vs. private (guest)
> > > control
> > > 2. #VE if there is a fault on a shared mapping to call into the host
> > >
> > > RISC-V:
> > > 1. Guest->TSM call to mark MMIO vs. private
> > > 2. Faults in the MMIO area are then transparent to the guest
> > >
> >
> > Yup. This discussion raised a very valid design aspect of the CoVE spec.
> > To summarize, we need to investigate whether using PTE bits instead of
> > additional ABI
> > for managing shared/confidential/ioremapped pages makes more sense.
> >
> > Thanks for putting up with my answers and the feedback :).
>
> I think we should re-evaluate the PTE (or software walk) based approach
> for CoVE spec. It is better to keep the CoVE spec as minimal as possible
> and define SBI calls only if absolutely required.
>
> >
> > > That design difference would, indeed, help explain why this patch is
> > > here. I'm still not 100% convinced that the patch is *required*, but I
> > > at least understand how we arrived here.
>
> Regards,
> Anup