Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration. It is also required by TDX
before filling in the initial guest memory with measured contents. I am not
yet sure whether userspace will use this ioctl directly for that; if not,
the code will be reused by a TDX-specific ioctl to pre-populate the SEPT
before invoking TDH.MEM.PAGE.ADD or TDH.MR.EXTEND.
Compared to Isaku's v2, I have reduced the scope as much as possible:
- no vendor-specific hooks
- just fail if pre-population is invoked while nested virtualization is in use
- just populate page tables for the SMM address space if invoked while
SMM is active
There are however other changes that affect the userspace API:
- struct name changed to `kvm_map_memory`
- added support for KVM_CHECK_EXTENSION(KVM_CAP_MAP_MEMORY) on the VM
  file descriptor, which makes it possible to support this ioctl only
  on a subset of VM types
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
- **important**: if EINTR or EAGAIN happens on
the first page, it is returned. Otherwise, the ioctl *succeeds*
but mapping->size is left nonzero. While this drops the detail
as to why the system call was interrupted, it is consistent with
other interruptible system calls such as read(). (A sketch of the
resulting userspace loop follows below.)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
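To illustrate the resulting calling convention, a userspace loop would look
roughly like this (only a sketch; the prepopulate() helper, the vcpu_fd
plumbing and the lack of a retry bound are hypothetical):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Pre-fault [gpa, gpa + len) through one vCPU fd, retrying on EINTR/EAGAIN
 * and continuing on partial success (size left nonzero).  Returns 0 or
 * -errno.
 */
static int prepopulate(int vcpu_fd, uint64_t gpa, uint64_t len)
{
	struct kvm_map_memory m = {
		.base_address = gpa,
		.size = len,
	};

	while (m.size) {
		if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &m) == 0)
			continue;	/* progress was made, keep going */
		if (errno == EINTR || errno == EAGAIN)
			continue;	/* nothing mapped this time, retry */
		return -errno;		/* e.g. EINVAL, EFAULT, EOPNOTSUPP */
	}
	return 0;
}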
Implementation changes:
- the test is not x86-specific anymore (though for now it is only compiled
  on x86, because no other architecture supports the feature)
- instead of using __weak symbols, the code is conditional on a new
Kconfig CONFIG_KVM_GENERIC_MAP_MEMORY.
This patch series depends on other pieces that have already been applied to
the kvm-coco-queue branch (and the series itself is present on that branch).
Paolo
Isaku Yamahata (7):
KVM: Document KVM_MAP_MEMORY ioctl
KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()
KVM: x86/mmu: Make __kvm_mmu_do_page_fault() return mapped level
KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory
KVM: x86: Implement kvm_arch_vcpu_map_memory()
KVM: selftests: x86: Add test for KVM_MAP_MEMORY
Documentation/virt/kvm/api.rst | 54 +++++++
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 32 +++++
arch/x86/kvm/mmu/mmu_internal.h | 42 ++++--
arch/x86/kvm/x86.c | 43 ++++++
include/linux/kvm_host.h | 5 +
include/uapi/linux/kvm.h | 10 ++
tools/include/uapi/linux/kvm.h | 8 ++
tools/testing/selftests/kvm/Makefile | 1 +
tools/testing/selftests/kvm/map_memory_test.c | 135 ++++++++++++++++++
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 61 ++++++++
13 files changed, 384 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/kvm/map_memory_test.c
--
2.43.0
From: Isaku Yamahata <[email protected]>
Add a new ioctl KVM_MAP_MEMORY in the KVM common code. It iterates on the
memory range and calls the arch-specific function. Add stub arch function
as a weak symbol.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 5 ++++
include/uapi/linux/kvm.h | 10 +++++++
virt/kvm/Kconfig | 3 ++
virt/kvm/kvm_main.c | 61 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 79 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8dea11701ab2..2b0f0240a64c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2478,4 +2478,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
#endif
+#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_map_memory *mapping);
+#endif
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..4d233f44c613 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_MAP_MEMORY 236
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1548,4 +1549,13 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};
+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_map_memory)
+
+struct kvm_map_memory {
+ __u64 base_address;
+ __u64 size;
+ __u64 flags;
+ __u64 padding[5];
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 754c6c923427..1b94126622e8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -67,6 +67,9 @@ config HAVE_KVM_INVALID_WAKEUPS
config KVM_GENERIC_DIRTYLOG_READ_PROTECT
bool
+config KVM_GENERIC_MAP_MEMORY
+ bool
+
config KVM_COMPAT
def_bool y
depends on KVM && COMPAT && !(S390 || ARM64 || RISCV)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..350ead98e9a6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4379,6 +4379,47 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}
+#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_map_memory *mapping)
+{
+ int idx, r;
+ u64 full_size;
+
+ if (mapping->flags)
+ return -EINVAL;
+
+ if (!PAGE_ALIGNED(mapping->base_address) ||
+ !PAGE_ALIGNED(mapping->size) ||
+ mapping->base_address + mapping->size <= mapping->base_address)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ r = 0;
+ full_size = mapping->size;
+ while (mapping->size) {
+ if (signal_pending(current)) {
+ r = -EINTR;
+ break;
+ }
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ cond_resched();
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ /* Return success if at least one page was mapped successfully. */
+ return full_size == mapping->size ? r : 0;
+}
+#endif
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4580,6 +4621,20 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
break;
}
+#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
+ case KVM_MAP_MEMORY: {
+ struct kvm_map_memory mapping;
+
+ r = -EFAULT;
+ if (copy_from_user(&mapping, argp, sizeof(mapping)))
+ break;
+ r = kvm_vcpu_map_memory(vcpu, &mapping);
+ /* Pass back leftover range. */
+ if (copy_to_user(argp, &mapping, sizeof(mapping)))
+ r = -EFAULT;
+ break;
+ }
+#endif
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
@@ -4863,6 +4918,12 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_KVM_PRIVATE_MEM
case KVM_CAP_GUEST_MEMFD:
return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
+ case KVM_CAP_MAP_MEMORY:
+ if (!kvm)
+ return 1;
+ /* Leave per-VM implementation to kvm_vm_ioctl_check_extension(). */
#endif
default:
break;
--
2.43.0
From: Isaku Yamahata <[email protected]>
Introduce a helper function to call the KVM fault handler. It allows a new
ioctl to invoke the KVM fault handler to populate guest memory without
exposing RET_PF_* enums or other definitions that are internal to the x86
KVM MMU. The implementation is restricted to two-dimensional paging for
simplicity; shadow paging uses GVAs for faulting instead of L1 GPAs, which
would make the API difficult to use.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e8b620a85627..51ff4f67e115 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 *level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7fbcfc97edcc..fb2149d16f8d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4646,6 +4646,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 *level)
+{
+ int r;
+
+ /* Restrict to TDP page fault. */
+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+ return -EOPNOTSUPP;
+
+ r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+ if (r < 0)
+ return r;
+
+ switch (r) {
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ return 0;
+
+ case RET_PF_EMULATE:
+ return -EINVAL;
+
+ case RET_PF_CONTINUE:
+ case RET_PF_INVALID:
+ default:
+ WARN_ON_ONCE(r);
+ return -EIO;
+ }
+}
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.43.0
From: Isaku Yamahata <[email protected]>
Wire the KVM_MAP_MEMORY ioctl to kvm_tdp_map_page() to populate guest
memory. When KVM_CREATE_VCPU creates a vCPU, it initializes the x86 KVM
MMU with kvm_mmu_create() and kvm_init_mmu(), so the vCPU is ready to
invoke the KVM page fault handler.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <7138a3bc00ea8d3cbe0e59df15f8c22027005b59.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/x86.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7632fe6e4db9..e58360d368ec 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
select KVM_VFIO
select HAVE_KVM_PM_NOTIFIER if PM
select KVM_GENERIC_HARDWARE_ENABLING
+ select KVM_GENERIC_MAP_MEMORY
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83b8260443a3..f84c75c2a47f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4715,6 +4715,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_MEMORY_FAULT_INFO:
r = 1;
break;
+ case KVM_CAP_MAP_MEMORY:
+ r = tdp_enabled;
+ break;
case KVM_CAP_EXIT_HYPERCALL:
r = KVM_EXIT_HYPERCALL_VALID_MASK;
break;
@@ -5867,6 +5870,46 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_map_memory *mapping)
+{
+ u64 mapped, end, error_code = 0;
+ u8 level = PG_LEVEL_4K;
+ int r;
+
+ /*
+ * Shadow paging uses GVA for kvm page fault. The first implementation
+ * supports GPA only to avoid confusion.
+ */
+ if (!tdp_enabled)
+ return -EOPNOTSUPP;
+
+ /*
+ * reload is efficient when called repeatedly, so we can do it on
+ * every iteration.
+ */
+ kvm_mmu_reload(vcpu);
+
+ if (kvm_arch_has_private_mem(vcpu->kvm) &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
+ r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
+ if (r)
+ return r;
+
+ /*
+ * level can be more than the alignment of mapping->base_address if
+ * the mapping can use a huge page.
+ */
+ end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
+ KVM_HPAGE_SIZE(level);
+ mapped = min(mapping->size, end - mapping->base_address);
+ mapping->size -= mapped;
+ mapping->base_address += mapped;
+ return r;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
--
2.43.0
From: Isaku Yamahata <[email protected]>
Add a test case to exercise KVM_MAP_MEMORY and run the guest to access the
pre-populated area. It tests the KVM_MAP_MEMORY ioctl for both
KVM_X86_DEFAULT_VM and KVM_X86_SW_PROTECTED_VM.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <32427791ef42e5efaafb05d2ac37fa4372715f47.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
tools/include/uapi/linux/kvm.h | 8 ++
tools/testing/selftests/kvm/Makefile | 1 +
tools/testing/selftests/kvm/map_memory_test.c | 135 ++++++++++++++++++
3 files changed, 144 insertions(+)
create mode 100644 tools/testing/selftests/kvm/map_memory_test.c
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index c3308536482b..69a43c5ae33c 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -2227,4 +2227,12 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};
+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_map_memory)
+
+struct kvm_map_memory {
+ __u64 base_address;
+ __u64 size;
+ __u64 flags;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 871e2de3eb05..41def90f7dfb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -144,6 +144,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test
TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += map_memory_test
# Compiled outputs used by test targets
TEST_GEN_PROGS_EXTENDED_x86_64 += x86_64/nx_huge_pages_test
diff --git a/tools/testing/selftests/kvm/map_memory_test.c b/tools/testing/selftests/kvm/map_memory_test.c
new file mode 100644
index 000000000000..e52e33145f01
--- /dev/null
+++ b/tools/testing/selftests/kvm/map_memory_test.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024, Intel, Inc
+ *
+ * Author:
+ * Isaku Yamahata <isaku.yamahata at gmail.com>
+ */
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+/* Arbitrarily chosen values */
+#define TEST_SIZE (SZ_2M + PAGE_SIZE)
+#define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
+#define TEST_SLOT 10
+
+static void guest_code(uint64_t base_gpa)
+{
+ volatile uint64_t val __used;
+ int i;
+
+ for (i = 0; i < TEST_NPAGES; i++) {
+ uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+
+ val = *src;
+ }
+
+ GUEST_DONE();
+}
+
+static void map_memory(struct kvm_vcpu *vcpu, u64 base_address, u64 size,
+ bool should_succeed)
+{
+ struct kvm_map_memory mapping = {
+ .base_address = base_address,
+ .size = size,
+ .flags = 0,
+ };
+ int ret;
+
+ do {
+ ret = __vcpu_ioctl(vcpu, KVM_MAP_MEMORY, &mapping);
+ } while (ret >= 0 && mapping.size);
+
+ if (should_succeed)
+ __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_MAP_MEMORY", ret, vcpu->vm);
+ else
+ /* No memory slot causes RET_PF_EMULATE. it results in -EINVAL. */
+ __TEST_ASSERT_VM_VCPU_IOCTL(ret && errno == EINVAL,
+ "KVM_MAP_MEMORY", ret, vcpu->vm);
+}
+
+static void __test_map_memory(unsigned long vm_type, bool private)
+{
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = vm_type,
+ };
+ struct kvm_vcpu *vcpu;
+ struct kvm_run *run;
+ struct kvm_vm *vm;
+ struct ucall uc;
+
+ uint64_t guest_test_phys_mem;
+ uint64_t guest_test_virt_mem;
+ uint64_t alignment, guest_page_size;
+
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
+
+ alignment = guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size;
+ guest_test_phys_mem = (vm->max_gfn - TEST_NPAGES) * guest_page_size;
+#ifdef __s390x__
+ alignment = max(0x100000UL, guest_page_size);
+#else
+ alignment = SZ_2M;
+#endif
+ guest_test_phys_mem = align_down(guest_test_phys_mem, alignment);
+ guest_test_virt_mem = guest_test_phys_mem;
+
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ guest_test_phys_mem, TEST_SLOT, TEST_NPAGES,
+ private ? KVM_MEM_GUEST_MEMFD : 0);
+ virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, TEST_NPAGES);
+
+ if (private)
+ vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
+ map_memory(vcpu, guest_test_phys_mem, SZ_2M, true);
+ map_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE, true);
+ map_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, false);
+
+ vcpu_args_set(vcpu, 1, guest_test_virt_mem);
+ vcpu_run(vcpu);
+
+ run = vcpu->run;
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ case UCALL_DONE:
+ break;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ break;
+ }
+
+ kvm_vm_free(vm);
+}
+
+static void test_map_memory(unsigned long vm_type, bool private)
+{
+ if (vm_type && !(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type))) {
+ pr_info("Skipping tests for vm_type 0x%lx\n", vm_type);
+ return;
+ }
+
+ __test_map_memory(vm_type, private);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_MAP_MEMORY));
+
+ test_map_memory(0, false);
+#ifdef __x86_64__
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, false);
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, true);
+#endif
+ return 0;
+}
--
2.43.0
From: Isaku Yamahata <[email protected]>
The guest memory population logic will need to know what page size or level
(4K, 2M, ...) is mapped.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 9baae6c223ee..b0a10f5a40dd 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -288,7 +288,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
}
static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u64 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch,
+ int *emulation_type, u8 *level)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -330,6 +331,8 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ if (level)
+ *level = fault.goal_level;
return r;
}
@@ -347,7 +350,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;
- r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch,
+ emulation_type, NULL);
/*
* Similar to above, prefetch faults aren't truly spurious, and the
--
2.43.0
From: Isaku Yamahata <[email protected]>
Extract __kvm_mmu_do_page_fault() out of kvm_mmu_do_page_fault(). The
inner function initializes struct kvm_page_fault and calls the fault
handler, while the outer function updates stats and converts the return
code. KVM_MAP_MEMORY will call the KVM page fault handler.
This patch makes emulation_type always set, regardless of the return
code. kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),
and it references the value only when RET_PF_EMULATE is returned.
Therefore, this adjustment doesn't affect functionality.
No functional change intended.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <ddf1d98420f562707b11e12c416cce8fdb986bb1.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 38 +++++++++++++++++++++------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e68a60974cf4..9baae6c223ee 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -287,8 +287,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
fault->is_private);
}
-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u64 err, bool prefetch, int *emulation_type)
+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -318,6 +318,27 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}
+ if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
+ r = kvm_tdp_page_fault(vcpu, &fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+
+ if (r == RET_PF_EMULATE && fault.is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
+ r = -EFAULT;
+ }
+
+ if (fault.write_fault_to_shadow_pgtable && emulation_type)
+ *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+
+ return r;
+}
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 err, bool prefetch, int *emulation_type)
+{
+ int r;
+
/*
* Async #PF "faults", a.k.a. prefetch faults, are not faults from the
* guest perspective and have already been counted at the time of the
@@ -326,18 +347,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;
- if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
- r = kvm_tdp_page_fault(vcpu, &fault);
- else
- r = vcpu->arch.mmu->page_fault(vcpu, &fault);
-
- if (r == RET_PF_EMULATE && fault.is_private) {
- kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
- return -EFAULT;
- }
-
- if (fault.write_fault_to_shadow_pgtable && emulation_type)
- *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);
/*
* Similar to above, prefetch faults aren't truly spurious, and the
--
2.43.0
From: Isaku Yamahata <[email protected]>
Add documentation for the KVM_MAP_MEMORY ioctl. [1]
It populates guest memory without performing any extra operations for the
underlying technology-specific initialization [2]. For example,
CoCo-related operations won't be performed; concretely for TDX, this API
won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific APIs
are required for such operations.
The key point is to adopt a vCPU ioctl instead of a VM ioctl. First,
populating guest memory requires a vCPU; if it were a VM ioctl, one vCPU
would have to be picked somehow. Second, a vCPU ioctl allows each vCPU to
invoke the ioctl in parallel, which helps scale with guest memory size,
e.g., hundreds of GB.
[1] https://lore.kernel.org/kvm/[email protected]/
[2] https://lore.kernel.org/kvm/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-ID: <9a060293c9ad9a78f1d8994cfe1311e818e99257.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
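For illustration, the intended usage from userspace is roughly the sketch
below (the names, the thread-per-vCPU structure and the fd plumbing are
hypothetical): probe the VM file descriptor for KVM_CAP_MAP_MEMORY, then
let each vCPU thread pre-fault its own slice of guest memory, retrying on
EINTR/EAGAIN and on partial success.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Support may depend on the VM type, so probe the VM file descriptor. */
static bool can_prefault(int vm_fd)
{
	return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAP_MEMORY) > 0;
}

struct prefault_arg {
	int vcpu_fd;		/* one vCPU fd per worker thread */
	uint64_t gpa;		/* start of this thread's slice */
	uint64_t size;		/* length of this thread's slice */
};

/* Worker: pass to pthread_create(), one thread per vCPU. */
static void *prefault_slice(void *opaque)
{
	struct prefault_arg *arg = opaque;
	struct kvm_map_memory m = {
		.base_address = arg->gpa,
		.size = arg->size,
	};

	/* Keep going on partial success (size > 0) and on EINTR/EAGAIN. */
	while (m.size) {
		if (ioctl(arg->vcpu_fd, KVM_MAP_MEMORY, &m) &&
		    errno != EINTR && errno != EAGAIN)
			break;
	}
	return NULL;
}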
Documentation/virt/kvm/api.rst | 54 ++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..c16906a42db1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,60 @@ a single guest_memfd file, but the bound ranges must not overlap).
See KVM_SET_USER_MEMORY_REGION2 for additional details.
+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_map_memory (in/out)
+:Returns: 0 on success, < 0 on error
+
+Errors:
+
+ ========== ===============================================================
+ EINVAL The specified `base_address` and `size` were invalid (e.g. not
+ page aligned or outside the defined memory slots).
+ EAGAIN The ioctl should be invoked again and no page was processed.
+ EINTR An unmasked signal is pending and no page was processed.
+ EFAULT The parameter address was invalid.
+ EOPNOTSUPP The architecture does not support this operation, or the
+ guest state does not allow it.
+ ========== ===============================================================
+
+::
+
+ struct kvm_map_memory {
+ /* in/out */
+ __u64 base_address;
+ __u64 size;
+ /* in */
+ __u64 flags;
+ __u64 padding[5];
+ };
+
+KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
+When the ioctl returns, the input values are updated to point to the
+remaining range. If `size` > 0 on return, the caller can just issue
+the ioctl again with the same `struct kvm_map_memory` argument.
+
+In some cases, multiple vCPUs might share the page tables. In this
+case, if this ioctl is called in parallel for multiple vCPUs the
+ioctl might return with `size` > 0.
+
+The ioctl may not be supported for all VMs, and may just return
+an `EOPNOTSUPP` error if a VM does not support it. You may use
+`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
+supported.
+
+Also, shadow page tables cannot support this ioctl because they
+are indexed by virtual address or nested guest physical address.
+Calling this ioctl when the guest is using shadow page tables (for
+example because it is running a nested guest) will also fail.
+
+`flags` must currently be zero.
+
+
5. The kvm_run structure
========================
--
2.43.0
On Wed, Apr 17, 2024 at 11:34:49AM -0400,
Paolo Bonzini <[email protected]> wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire KVM_MAP_MEMORY ioctl to kvm_mmu_map_tdp_page() to populate guest
> memory. When KVM_CREATE_VCPU creates vCPU, it initializes the x86
> KVM MMU part by kvm_mmu_create() and kvm_init_mmu(). vCPU is ready to
> invoke the KVM page fault handler.
As a record of the past discussion, and to address Rick's comment at
https://lore.kernel.org/all/[email protected]/:
the current implementation supports TDP only, because population with a GVA
is moot based on the thread [1]. If necessary, this restriction can be
relaxed in the future.
[1] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-ID: <7138a3bc00ea8d3cbe0e59df15f8c22027005b59.1712785629.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/x86.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 44 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 7632fe6e4db9..e58360d368ec 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -44,6 +44,7 @@ config KVM
> select KVM_VFIO
> select HAVE_KVM_PM_NOTIFIER if PM
> select KVM_GENERIC_HARDWARE_ENABLING
> + select KVM_GENERIC_MAP_MEMORY
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 83b8260443a3..f84c75c2a47f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4715,6 +4715,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> + case KVM_CAP_MAP_MEMORY:
> + r = tdp_enabled;
> + break;
> case KVM_CAP_EXIT_HYPERCALL:
> r = KVM_EXIT_HYPERCALL_VALID_MASK;
> break;
> @@ -5867,6 +5870,46 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_map_memory *mapping)
> +{
> + u64 mapped, end, error_code = 0;
> + u8 level = PG_LEVEL_4K;
> + int r;
> +
> + /*
> + * Shadow paging uses GVA for kvm page fault. The first implementation
> + * supports GPA only to avoid confusion.
> + */
> + if (!tdp_enabled)
> + return -EOPNOTSUPP;
> +
> + /*
> + * reload is efficient when called repeatedly, so we can do it on
> + * every iteration.
> + */
> + kvm_mmu_reload(vcpu);
> +
> + if (kvm_arch_has_private_mem(vcpu->kvm) &&
> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
> + error_code |= PFERR_PRIVATE_ACCESS;
> +
> + r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
> + if (r)
> + return r;
> +
> + /*
> + * level can be more than the alignment of mapping->base_address if
> + * the mapping can use a huge page.
> + */
> + end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
> + KVM_HPAGE_SIZE(level);
end = ALIGN(mapping->base_address, KVM_HPAGE_SIZE(level));
ALIGN() simplifies this as Chao pointed out.
https://lore.kernel.org/all/Zh94V8ochIXEkO17@chao-email/
> + mapped = min(mapping->size, end - mapping->base_address);
> + mapping->size -= mapped;
> + mapping->base_address += mapped;
> + return r;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> --
> 2.43.0
>
>
>
--
Isaku Yamahata <[email protected]>
On Wed, Apr 17, 2024 at 11:34:45AM -0400,
Paolo Bonzini <[email protected]> wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a new ioctl KVM_MAP_MEMORY in the KVM common code. It iterates on the
> memory range and calls the arch-specific function. Add stub arch function
> as a weak symbol.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Rick Edgecombe <[email protected]>
> Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> include/linux/kvm_host.h | 5 ++++
> include/uapi/linux/kvm.h | 10 +++++++
> virt/kvm/Kconfig | 3 ++
> virt/kvm/kvm_main.c | 61 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 79 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8dea11701ab2..2b0f0240a64c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2478,4 +2478,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
> void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> #endif
>
> +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_map_memory *mapping);
> +#endif
> +
> #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..4d233f44c613 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -917,6 +917,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_MEMORY_ATTRIBUTES 233
> #define KVM_CAP_GUEST_MEMFD 234
> #define KVM_CAP_VM_TYPES 235
> +#define KVM_CAP_MAP_MEMORY 236
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1548,4 +1549,13 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_map_memory)
> +
> +struct kvm_map_memory {
> + __u64 base_address;
> + __u64 size;
> + __u64 flags;
> + __u64 padding[5];
> +};
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 754c6c923427..1b94126622e8 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -67,6 +67,9 @@ config HAVE_KVM_INVALID_WAKEUPS
> config KVM_GENERIC_DIRTYLOG_READ_PROTECT
> bool
>
> +config KVM_GENERIC_MAP_MEMORY
> + bool
> +
> config KVM_COMPAT
> def_bool y
> depends on KVM && COMPAT && !(S390 || ARM64 || RISCV)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 38b498669ef9..350ead98e9a6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4379,6 +4379,47 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> return fd;
> }
>
> +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_map_memory *mapping)
> +{
> + int idx, r;
> + u64 full_size;
> +
> + if (mapping->flags)
> + return -EINVAL;
> +
> + if (!PAGE_ALIGNED(mapping->base_address) ||
> + !PAGE_ALIGNED(mapping->size) ||
> + mapping->base_address + mapping->size <= mapping->base_address)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> + r = 0;
> + full_size = mapping->size;
> + while (mapping->size) {
> + if (signal_pending(current)) {
> + r = -EINTR;
> + break;
> + }
> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + cond_resched();
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + /* Return success if at least one page was mapped successfully. */
> + return full_size == mapping->size ? r : 0;
> +}
> +#endif
> +
> static long kvm_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4580,6 +4621,20 @@ static long kvm_vcpu_ioctl(struct file *filp,
> r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
> break;
> }
> +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> + case KVM_MAP_MEMORY: {
> + struct kvm_map_memory mapping;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mapping, argp, sizeof(mapping)))
> + break;
> + r = kvm_vcpu_map_memory(vcpu, &mapping);
> + /* Pass back leftover range. */
> + if (copy_to_user(argp, &mapping, sizeof(mapping)))
> + r = -EFAULT;
> + break;
> + }
> +#endif
> default:
> r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> }
> @@ -4863,6 +4918,12 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #ifdef CONFIG_KVM_PRIVATE_MEM
> case KVM_CAP_GUEST_MEMFD:
> return !kvm || kvm_arch_has_private_mem(kvm);
> +#endif
> +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> + case KVM_CAP_MAP_MEMORY:
> + if (!kvm)
> + return 1;
> + /* Leave per-VM implementation to kvm_vm_ioctl_check_extension(). */
nitpick:
fallthrough;
> #endif
> default:
> break;
> --
> 2.43.0
>
>
>
--
Isaku Yamahata <[email protected]>
On Wed, Apr 17, 2024 at 11:34:46AM -0400,
Paolo Bonzini <[email protected]> wrote:
> From: Isaku Yamahata <[email protected]>
>
> Extract out __kvm_mmu_do_page_fault() from kvm_mmu_do_page_fault(). The
> inner function is to initialize struct kvm_page_fault and to call the fault
> handler, and the outer function handles updating stats and converting
> return code. KVM_MAP_MEMORY will call the KVM page fault handler.
To clarify that vcpu.stat is not updated, let me update the last sentence:
KVM_MAP_MEMORY will call the KVM page fault handler without updating vCPU
stats, which doesn't make sense for pre-population because pre-population
(outside TDX) has the purpose of avoiding page faults.
> This patch makes the emulation_type always set irrelevant to the return
> code. kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),
> and references the value only when PF_RET_EMULATE is returned. Therefore,
> this adjustment doesn't affect functionality.
For technical correctness, let me mention the NULL emulation_type case.
I added "," and "with non-NULL emulation_type" to the second sentence.
https://lore.kernel.org/all/[email protected]/
This patch makes the emulation_type always set, irrelevant to the return
code. kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault() with
non-NULL emulation_type and references the value only when RET_PF_EMULATE is
returned. Therefore, this adjustment doesn't affect functionality.
--
Isaku Yamahata <[email protected]>
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_map_memory (in/out)
> +:Returns: 0 on success, < 0 on error
> +
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `base_address` and `size` were invalid (e.g. not
> + page aligned or outside the defined memory slots).
"outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
that can _never_ succeed.
> + EAGAIN The ioctl should be invoked again and no page was processed.
> + EINTR An unmasked signal is pending and no page was processed.
I'm guessing we'll want to handle large ranges, at which point we'll likely end
up with EAGAIN and/or EINTR after processing at least one page.
> + EFAULT The parameter address was invalid.
> + EOPNOTSUPP The architecture does not support this operation, or the
> + guest state does not allow it.
I would phrase this as something like:
Mapping memory given for a GPA is unsupported by the
architecture, and/or for the current vCPU state/mode.
It's not that the guest state doesn't "allow" it, it's that it's explicitly
unsupported because it's nonsensical without a GVA (or L2 GPA).
> + ========== ===============================================================
> +
> +::
> +
> + struct kvm_map_memory {
> + /* in/out */
> + __u64 base_address;
I think we should commit to this being limited to gpa mappings, e.g. go with
"gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
> + __u64 size;
> + /* in */
> + __u64 flags;
> + __u64 padding[5];
> + };
> +
> +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
I think we should word this very carefully and explicitly so that KVM doesn't
commit to behavior that can't be guaranteed. We might even want to use a name
that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
could be interpreted as the _guest's_ page tables.
Something like:
KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
for the current vCPU state. KVM maps memory as if the vCPU generated a
stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
> +When the ioctl returns, the input values are updated to point to the
> +remaining range. If `size` > 0 on return, the caller can just issue
> +the ioctl again with the same `struct kvm_map_memory` argument.
This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
a pedantic reading of this would suggest that userspace can retry and it should
eventually succeed.
> +In some cases, multiple vCPUs might share the page tables. In this
> +case, if this ioctl is called in parallel for multiple vCPUs the
> +ioctl might return with `size` > 0.
Why? If there's already a valid mapping, mission accomplished. I don't see any
reason to return an error. If x86's page fault path returns RET_PF_RETRY, then I
think it makes sense to retry in KVM, not punt this to userspace.
> +The ioctl may not be supported for all VMs, and may just return
> +an `EOPNOTSUPP` error if a VM does not support it. You may use
> +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> +supported.
Why per-VM? I don't think there's any per-VM state that would change the behavior.
The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
problems are per-vCPU, not per-VM.
Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
itself.
> +Also, shadow page tables cannot support this ioctl because they
> +are indexed by virtual address or nested guest physical address.
> +Calling this ioctl when the guest is using shadow page tables (for
> +example because it is running a nested guest) will also fail.
Running a nested guest using TDP.
> +
> +`flags` must currently be zero.
> +
> +
> 5. The kvm_run structure
> ========================
>
> --
> 2.43.0
>
>
On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_map_memory (in/out)
> > +:Returns: 0 on success, < 0 on error
> > +
> > +Errors:
> > +
> > + ========== ===============================================================
> > + EINVAL The specified `base_address` and `size` were invalid (e.g not
> > + page aligned or outside the defined memory slots).
>
> "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> that can _never_ succeed.
>
> > + EAGAIN The ioctl should be invoked again and no page was processed.
> > + EINTR An unmasked signal is pending and no page was processed.
>
> I'm guessing we'll want to handle large ranges, at which point we'll likely end
> up with EAGAIN and/or EINTR after processing at least one page.
Yes, in that case you get a success (return value of 0), just like read().
> > + EFAULT The parameter address was invalid.
> > + EOPNOTSUPP The architecture does not support this operation, or the
> > + guest state does not allow it.
>
> I would phrase this as something like:
>
> Mapping memory given for a GPA is unsupported by the
> architecture, and/or for the current vCPU state/mode.
Better.
> > + struct kvm_map_memory {
> > + /* in/out */
> > + __u64 base_address;
>
> I think we should commit to this being limited to gpa mappings, e.g. go with
> "gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
>
> > + __u64 size;
> > + /* in */
> > + __u64 flags;
> > + __u64 padding[5];
> > + };
> > +
> > +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
>
> I think we should word this very carefully and explicitly so that KVM doesn't
> commit to behavior that can't be guaranteed. We might even want to use a name
> that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
>
> Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
> could be interpreted as the _guest's_ page tables.
>
> Something like:
>
> KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
> for the current vCPU state. KVM maps memory as if the vCPU generated a
> stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
> CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
>
> > +When the ioctl returns, the input values are updated to point to the
> > +remaining range. If `size` > 0 on return, the caller can just issue
> > +the ioctl again with the same `struct kvm_map_memory` argument.
>
> This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
> a pedantic reading of this would suggest that userspace can retry and it should
> eventually succeed.
Gotcha... KVM explicitly zeros size on every success, but never zeros
size on a failure.
> > +In some cases, multiple vCPUs might share the page tables. In this
> > +case, if this ioctl is called in parallel for multiple vCPUs the
> > +ioctl might return with `size` > 0.
>
> Why? If there's already a valid mapping, mission accomplished. I don't see any
> reason to return an error. If x86's page fault path returns RET_PF_RETRY, then I
> think it makes sense to retry in KVM, not punt this to userspace.
Considering that vcpu_mutex critical sections are killable I think I
tend to agree.
> > +The ioctl may not be supported for all VMs, and may just return
> > +an `EOPNOTSUPP` error if a VM does not support it. You may use
> > +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> > +supported.
>
> Why per-VM? I don't think there's any per-VM state that would change the behavior.
Perhaps it may depend on the VM type? I'm trying to avoid having to
invent a different API later. But yeah, I can drop this sentence and
the related code.
> The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
> problems are per-vCPU, not per-VM.
>
> Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
> don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
> itself.
Yes, I agree.
Paolo
On Wed, Apr 17, 2024, Isaku Yamahata wrote:
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +
> > + r = 0;
> > + full_size = mapping->size;
> > + while (mapping->size) {
Maybe pre-check !mapping->size? E.g. there's no reason to load the vCPU and
acquire SRCU just to do nothing. Then this can be a do-while loop and doesn't
need to explicitly initialize 'r'.
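I.e., roughly the following shape (just a sketch, not the code in this
series; note the overflow check earlier in the function already rejects a
zero size, so the pre-check mostly documents that the body runs at least
once):

	/* ...after the existing flags/alignment/overflow checks... */
	if (!mapping->size)
		return 0;

	vcpu_load(vcpu);
	idx = srcu_read_lock(&vcpu->kvm->srcu);

	full_size = mapping->size;
	do {
		if (signal_pending(current)) {
			r = -EINTR;
			break;
		}

		r = kvm_arch_vcpu_map_memory(vcpu, mapping);
		if (r)
			break;

		cond_resched();
	} while (mapping->size);

	srcu_read_unlock(&vcpu->kvm->srcu, idx);
	vcpu_put(vcpu);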
> > + if (signal_pending(current)) {
> > + r = -EINTR;
> > + break;
> > + }
> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
Requiring arch code to address @mapping is cumbersome. If the arch call returns
a long, then can't we do?
	if (r < 0)
		break;

	mapping->size -= r;
	mapping->gpa += r;
> > + if (r)
> > + break;
> > +
> > + cond_resched();
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + /* Return success if at least one page was mapped successfully. */
> > + return full_size == mapping->size ? r : 0;
I just saw Paolo's update that this is intentional, but this strikes me as odd,
as it requires userspace to redo the ioctl() to figure out why the last one failed.
Not a sticking point, just odd to my eyes.
On Wed, Apr 17, 2024 at 11:07 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 17, 2024, Isaku Yamahata wrote:
> > > + vcpu_load(vcpu);
> > > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +
> > > + r = 0;
> > > + full_size = mapping->size;
> > > + while (mapping->size) {
>
> Maybe pre-check !mapping->size? E.g. there's no reason to load the vCPU and
> acquire SRCU just to do nothing. Then this can be a do-while loop and doesn't
> need to explicitly initialize 'r'.
It's unlikely to make any difference but okay---easy enough.
> > > + if (signal_pending(current)) {
> > > + r = -EINTR;
> > > + break;
> > > + }
> > > +
> > > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
>
> Requiring arch code to address @mapping is cumbersome. If the arch call returns
> a long, then can't we do?
>
> if (r < 0)
> break;
>
> mapping->size -= r;
> mapping->gpa += r;
Ok, I thought the same for the return value. I didn't expand the
arguments to arch code in case in the future we have flags or other
expansions of the struct.
> > > + if (r)
> > > + break;
> > > +
> > > + cond_resched();
> > > + }
> > > +
> > > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > + vcpu_put(vcpu);
> > > +
> > > + /* Return success if at least one page was mapped successfully. */
> > > + return full_size == mapping->size ? r : 0;
>
> I just saw Paolo's update that this is intentional, but this strikes me as odd,
> as it requires userspace to redo the ioctl() to figure out why the last one failed.
Yeah, the same is true of read() but I don't think it's a problem. If
we get an EINTR, then (unlike KVM_RUN which can change the signal
mask) the signal will be delivered right after the ioctl() returns and
you can just proceed. For EAGAIN you can just proceed in general.
And of course, if RET_PF_RETRY is handled in the kernel then EAGAIN
goes away and the only cause of premature exit can be EINTR.
Paolo
> Not a sticking point, just odd to my eyes.
>
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> Introduce a helper function to call the KVM fault handler. It allows a new
> ioctl to invoke the KVM fault handler to populate without seeing RET_PF_*
> enums or other KVM MMU internal definitions because RET_PF_* are internal
> to x86 KVM MMU. The implementation is restricted to two-dimensional paging
> for simplicity. The shadow paging uses GVA for faulting instead of L1 GPA.
> It makes the API difficult to use.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index e8b620a85627..51ff4f67e115 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> }
>
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level);
> +
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7fbcfc97edcc..fb2149d16f8d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4646,6 +4646,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level)
If the return is an overloaded "long", then there's no need for @level, i.e. do
the level=>size conversion in this helper.
> +{
> + int r;
> +
> + /* Restrict to TDP page fault. */
Do we want to restrict this to the TDP MMU? Not for any particular reason, mostly
just to keep moving towards officially deprecating/removing TDP support from the
shadow MMU.
> + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> + return -EOPNOTSUPP;
> +
> + r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> + if (r < 0)
> + return r;
> +
> + switch (r) {
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> + case RET_PF_FIXED:
> + case RET_PF_SPURIOUS:
> + return 0;
Going with the "long" idea, this becomes:
end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
return min(size, end - gpa);
though I would vote for a:
break;
so that the happy path is nicely isolated at the end of the function.
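Roughly (a sketch of that shape only, assuming @size is passed in so the
helper can clamp the result; not the code in this series):

long kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 size,
		      u64 error_code)
{
	u8 level = PG_LEVEL_4K;
	gpa_t end;
	int r;

	/* Restrict to TDP page fault. */
	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
		return -EOPNOTSUPP;

	r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, &level);
	if (r < 0)
		return r;

	switch (r) {
	case RET_PF_RETRY:
		return -EAGAIN;
	case RET_PF_EMULATE:
		return -EINVAL;
	case RET_PF_FIXED:
	case RET_PF_SPURIOUS:
		break;
	case RET_PF_CONTINUE:
	case RET_PF_INVALID:
	default:
		WARN_ON_ONCE(r);
		return -EIO;
	}

	/* Happy path: report how many bytes the installed mapping covers. */
	end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
	return min(size, end - gpa);
}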
> +
> + case RET_PF_EMULATE:
> + return -EINVAL;
> +
> + case RET_PF_CONTINUE:
> + case RET_PF_INVALID:
> + default:
> + WARN_ON_ONCE(r);
> + return -EIO;
> + }
> +}
> +
> static void nonpaging_init_context(struct kvm_mmu *context)
> {
> context->page_fault = nonpaging_page_fault;
> --
> 2.43.0
>
>
On Wed, Apr 17, 2024 at 11:24 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Introduce a helper function to call the KVM fault handler. It allows a new
> > ioctl to invoke the KVM fault handler to populate without seeing RET_PF_*
> > enums or other KVM MMU internal definitions because RET_PF_* are internal
> > to x86 KVM MMU. The implementation is restricted to two-dimensional paging
> > for simplicity. The shadow paging uses GVA for faulting instead of L1 GPA.
> > It makes the API difficult to use.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
> > 2 files changed, 35 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index e8b620a85627..51ff4f67e115 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> > }
> >
> > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > + u8 *level);
> > +
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S bits of a
> > * page fault error code pfec) causes a permission fault with the given PTE
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 7fbcfc97edcc..fb2149d16f8d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4646,6 +4646,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > + u8 *level)
>
> If the return is an overloaded "long", then there's no need for @level, i.e. do
> the level=>size conversion in this helper.
>
> > +{
> > + int r;
> > +
> > + /* Restrict to TDP page fault. */
>
> Do we want to restrict this to the TDP MMU? Not for any particular reason, mostly
> just to keep moving towards officially deprecating/removing TDP support from the
> shadow MMU.
Heh, yet another thing I briefly thought about while reviewing Isaku's
work. In the end I decided that, with the implementation being just a
regular prefault, there's not much to save from keeping the shadow MMU
away from this.
The real ugly part is that if the memslots are zapped the
pre-population effect basically goes away (damn
kvm_arch_flush_shadow_memslot). This is the reason why I initially
thought of KVM_CHECK_EXTENSION for the VM file descriptor, to only
allow this for TDX VMs.
The real solution for this is to not "graduate" this ioctl too soon to
kvm/next. Let's keep it in kvm-coco-queue until TDX is ready and then
make a final decision.
Paolo
> > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > + return -EOPNOTSUPP;
> > +
> > + r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> > + if (r < 0)
> > + return r;
> > +
> > + switch (r) {
> > + case RET_PF_RETRY:
> > + return -EAGAIN;
> > +
> > + case RET_PF_FIXED:
> > + case RET_PF_SPURIOUS:
> > + return 0;
>
> Going with the "long" idea, this becomes:
>
> end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
> return min(size, end - gpa);
>
> though I would vote for a:
>
> break;
>
> so that the happy path is nicely isolated at the end of the function.
>
> > +
> > + case RET_PF_EMULATE:
> > + return -EINVAL;
> > +
> > + case RET_PF_CONTINUE:
> > + case RET_PF_INVALID:
> > + default:
> > + WARN_ON_ONCE(r);
> > + return -EIO;
> > + }
> > +}
> > +
> > static void nonpaging_init_context(struct kvm_mmu *context)
> > {
> > context->page_fault = nonpaging_page_fault;
> > --
> > 2.43.0
> >
> >
>
On Wed, Apr 17, 2024, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > + case RET_PF_EMULATE:
> > + return -EINVAL;
Almost forgot. EINVAL on emulation is weird. I don't know that any return code
is going to be "good", but I think just about anything is better than EINVAL,
e.g. arguably this could be -EBUSY since retrying after creating a memslot would
succeed.
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire KVM_MAP_MEMORY ioctl to kvm_mmu_map_tdp_page() to populate guest
> memory. When KVM_CREATE_VCPU creates vCPU, it initializes the x86
> KVM MMU part by kvm_mmu_create() and kvm_init_mmu(). vCPU is ready to
> invoke the KVM page fault handler.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-ID: <7138a3bc00ea8d3cbe0e59df15f8c22027005b59.1712785629.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/x86.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 44 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 7632fe6e4db9..e58360d368ec 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -44,6 +44,7 @@ config KVM
> select KVM_VFIO
> select HAVE_KVM_PM_NOTIFIER if PM
> select KVM_GENERIC_HARDWARE_ENABLING
> + select KVM_GENERIC_MAP_MEMORY
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 83b8260443a3..f84c75c2a47f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4715,6 +4715,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> + case KVM_CAP_MAP_MEMORY:
> + r = tdp_enabled;
> + break;
> case KVM_CAP_EXIT_HYPERCALL:
> r = KVM_EXIT_HYPERCALL_VALID_MASK;
> break;
> @@ -5867,6 +5870,46 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_map_memory *mapping)
> +{
> + u64 mapped, end, error_code = 0;
Maybe add PFERR_GUEST_FINAL_MASK to the error code? KVM doesn't currently consume
that except in svm_check_emulate_instruction(), which isn't reachable, but it
seems logical?
> + u8 level = PG_LEVEL_4K;
> + int r;
> +
> + /*
> + * Shadow paging uses GVA for kvm page fault. The first implementation
> + * supports GPA only to avoid confusion.
> + */
> + if (!tdp_enabled)
Eh, I'd omit this explicit check since kvm_tdp_map_page() has a more complete
check.
Actually, why is this a separate function and a separate patch? Just implement
kvm_arch_vcpu_map_memory() in mmu.c, in a single patch, e.g.
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
			     struct kvm_map_memory *mapping)
{
	u64 mapped, end, error_code = 0;
	u8 level = PG_LEVEL_4K;
	int r;

	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
		return -EOPNOTSUPP;

	kvm_mmu_reload(vcpu);

	if (kvm_arch_has_private_mem(vcpu->kvm) &&
	    kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
		error_code |= PFERR_PRIVATE_ACCESS;

	r = __kvm_mmu_do_page_fault(vcpu, mapping->gpa, error_code, true, NULL, &level);
	if (r < 0)
		return r;

	switch (r) {
	case RET_PF_RETRY:
		return -EAGAIN;
	case RET_PF_FIXED:
	case RET_PF_SPURIOUS:
		break;
	case RET_PF_EMULATE:
		return -EBUSY;
	case RET_PF_CONTINUE:
	case RET_PF_INVALID:
	default:
		WARN_ON_ONCE(r);
		return -EIO;
	}

	/*
	 * Adjust the GPA down when accounting for the page size, as KVM could
	 * have created a hugepage that covers @gpa, but doesn't start at @gpa.
	 */
	end = (mapping->gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
	return min(mapping->size, end - mapping->gpa);
}
On Wed, Apr 17, 2024 at 11:34 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 17, 2024, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > > + case RET_PF_EMULATE:
> > > + return -EINVAL;
>
> Almost forgot. EINVAL on emulation is weird. I don't know that any return code
> is going to be "good", but I think just about anything is better than EINVAL,
> e.g. arguably this could be -EBUSY since retrying after creating a memslot would
> succeed.
Then I guess -ENOENT?
Paolo
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 11:24 PM Sean Christopherson <[email protected]> wrote:
> > Do we want to restrict this to the TDP MMU? Not for any particular reason,
> > mostly just to keep moving towards officially deprecating/removing TDP
> > support from the shadow MMU.
>
> Heh, yet another thing I briefly thought about while reviewing Isaku's
> work. In the end I decided that, with the implementation being just a
> regular prefault, there's not much to gain from keeping the shadow MMU
> away from this.
Yeah.
> The real ugly part is that if the memslots are zapped the
> pre-population effect basically goes away (damn
> kvm_arch_flush_shadow_memslot).
Ah, the eternal thorn in my side.
> This is the reason why I initially thought of KVM_CHECK_EXTENSION for the VM
> file descriptor, to only allow this for TDX VMs.
I'm fairly certain memslot deletion is mostly a QEMU-specific problem. Allegedly
(I haven't verified), our userspace+firmware doesn't delete any memslots during
boot.
And it might even be solvable for QEMU, at least for some configurations. E.g.
during boot, my QEMU+OVMF setup creates and deletes the SMRAM memslot (despite my
KVM build not supporting SMM), and deletes the lower RAM memslot when relocating
BIOS. The SMRAM is definitely solvable, and the BIOS relocation stuff seems like
it's solvable too.
On Wed, 2024-04-17 at 11:34 -0400, Paolo Bonzini wrote:
>
> Compared to Isaku's v2, I have reduced the scope as much as possible:
>
> - no vendor-specific hooks
The TDX patches build on this, with the vendor callback looking like:
"
int tdx_pre_mmu_map_page(struct kvm_vcpu *vcpu,
			 struct kvm_map_memory *mapping,
			 u64 *error_code)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
	struct kvm *kvm = vcpu->kvm;

	if (!to_tdx(vcpu)->initialized)
		return -EINVAL;

	/* Shared-EPT case */
	if (!(kvm_is_private_gpa(kvm, mapping->base_address)))
		return 0;

	/* Once TD is finalized, the initial guest memory is fixed. */
	if (is_td_finalized(kvm_tdx))
		return -EINVAL;

	*error_code = TDX_SEPT_PFERR;
	return 0;
}
"
kvm_is_private_gpa() check is already handled in this series.
The initialized and finalized checks are nice guard rails for userspace, but
shouldn't be strictly required.
The TDX_SEPT_PFERR is (PFERR_WRITE_MASK | PFERR_PRIVATE_ACCESS). The
PFERR_WRITE_MASK doesn't seem to be required. Isaku, what was the intention?
But, I think maybe we should add a hook back in the TDX series for the guard
rails.
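For comparison, the generic code in this series already sets the private
bit on its own, along the lines of (copied from the sketch earlier in the
thread):

	if (kvm_arch_has_private_mem(vcpu->kvm) &&
	    kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
		error_code |= PFERR_PRIVATE_ACCESS;

So if the write bit really isn't needed, the TDX hook would be left with
just the guard rails.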
On 18 April 2024 02:01:03 CEST, "Edgecombe, Rick P" <[email protected]> wrote:
>On Wed, 2024-04-17 at 11:34 -0400, Paolo Bonzini wrote:
>>
>> Compared to Isaku's v2, I have reduced the scope as much as possible:
>>
>> - no vendor-specific hooks
>
>The TDX patches build on this, with the vendor callback looking like:
>
>"
>int tdx_pre_mmu_map_page(struct kvm_vcpu *vcpu,
> struct kvm_map_memory *mapping,
> u64 *error_code)
>{
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> struct kvm *kvm = vcpu->kvm;
>
> if (!to_tdx(vcpu)->initialized)
> return -EINVAL;
>
> /* Shared-EPT case */
> if (!(kvm_is_private_gpa(kvm, mapping->base_address)))
> return 0;
>
> /* Once TD is finalized, the initial guest memory is fixed. */
> if (is_td_finalized(kvm_tdx))
> return -EINVAL;
This is wrong, KVM_MAP_MEMORY should be idempotent. But anyway, you can post what you have on top of kvm-coco-queue (i.e., adding the hook in your patches) and we will sort it out a piece at a time.
Paolo
>
> *error_code = TDX_SEPT_PFERR;
> return 0;
>}
>"
>
>kvm_is_private_gpa() check is already handled in this series.
>
>The initialized and finalized checks are nice guard rails for userspace, but
>shouldn't be strictly required.
>
>The TDX_SEPT_PFERR is (PFERR_WRITE_MASK | PFERR_PRIVATE_ACCESS). The
>PFERR_WRITE_MASK doesn't seem to be required. Isaku, what was the intention?
>
>But, I think maybe we should add a hook back in the TDX series for the guard
>rails.
Paolo
On Thu, 2024-04-18 at 02:31 +0200, Paolo Bonzini wrote:
> > The TDX patches build on this, with the vendor callback looking like:
> >
> > "
> > int tdx_pre_mmu_map_page(struct kvm_vcpu *vcpu,
> > struct kvm_map_memory *mapping,
> > u64 *error_code)
> > {
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > struct kvm *kvm = vcpu->kvm;
> >
> > if (!to_tdx(vcpu)->initialized)
> > return -EINVAL;
> >
> > /* Shared-EPT case */
> > if (!(kvm_is_private_gpa(kvm, mapping->base_address)))
> > return 0;
> >
> > /* Once TD is finalized, the initial guest memory is fixed. */
> > if (is_td_finalized(kvm_tdx))
> > return -EINVAL;
>
> This is wrong, KVM_MAP_MEMORY should be idempotent. But anyway, you can post
> what you have on top of kvm-coco-queue (i.e., adding the hook in your patches)
> and we will sort it out a piece at a time.
Hmm, I see your point.
On Wed, Apr 17, 2024 at 10:37:00PM +0200, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > > +4.143 KVM_MAP_MEMORY
> > > +------------------------
> > > +
> > > +:Capability: KVM_CAP_MAP_MEMORY
> > > +:Architectures: none
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_map_memory (in/out)
> > > +:Returns: 0 on success, < 0 on error
The definition of *success* here doesn't align with the comments below.
Maybe replace "success" with a clearer definition, e.g. 0 when all or
part of the pages were processed, and < 0 on error, in which case no
page was processed.
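Something along these lines, wording only (feel free to ignore):

  :Returns: 0 if all or part of the pages were processed; < 0 on error,
            in which case no page was processed.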
> > > +
> > > +Errors:
> > > +
> > > +   ========== ===============================================================
> > > +   EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > > +              page aligned or outside the defined memory slots).
> >
> > "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> > that can _never_ succeed.
> >
> > > +   EAGAIN     The ioctl should be invoked again and no page was processed.
> > > +   EINTR      An unmasked signal is pending and no page was processed.
> >
> > I'm guessing we'll want to handle large ranges, at which point we'll likely end
> > up with EAGAIN and/or EINTR after processing at least one page.
>
> Yes, in that case you get a success (return value of 0), just like read().
[...]
> >
> > > +When the ioctl returns, the input values are updated to point to the
> > > +remaining range. If `size` > 0 on return, the caller can just issue
> > > +the ioctl again with the same `struct kvm_map_memory` argument.
> >
> > This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
> > a pedantic reading of this would suggest that userspace can retry and it should
> > eventually succeed.
>
> Gotcha... KVM explicitly zeros size on every success, but never zeros
> size on a failure.
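Just to make sure I understand the intended usage, a userspace retry loop
would then look roughly like this (only a sketch, assuming the
base_address/size fields from the documentation patch):

	struct kvm_map_memory mapping = {
		.base_address = gpa,
		.size = size,
	};

	while (mapping.size) {
		int r = ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping);

		if (r == 0)
			continue;	/* partial progress, fields were advanced */
		if (errno == EINTR || errno == EAGAIN)
			continue;	/* nothing was processed, safe to retry */
		break;			/* real error, size is left untouched */
	}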
Thanks,
Yilun
On Fri, Apr 19, 2024, Xu Yilun wrote:
> > > +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> > > + case KVM_CAP_MAP_MEMORY:
> > > + if (!kvm)
> > > + return 1;
> > > + /* Leave per-VM implementation to kvm_vm_ioctl_check_extension(). */
> >
> > nitpick:
> > fallthrough;
>
> A little tricky. 'break;' should be more straightforward.
+1, though it's a moot point as Paolo dropped this code in v4.
> > +#ifdef CONFIG_KVM_GENERIC_MAP_MEMORY
> > + case KVM_CAP_MAP_MEMORY:
> > + if (!kvm)
> > + return 1;
> > + /* Leave per-VM implementation to kvm_vm_ioctl_check_extension(). */
>
> nitpick:
> fallthrough;
A little tricky. 'break;' should be more straightforward.
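i.e. just:

	case KVM_CAP_MAP_MEMORY:
		if (!kvm)
			return 1;
		/* Leave per-VM implementation to kvm_vm_ioctl_check_extension(). */
		break;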
Thanks,
Yilun
>
> > #endif
> > default:
> > break;
> > --
> > 2.43.0
> >
> >
> >
>
> --
> Isaku Yamahata <[email protected]>
>
On Wed, Apr 17, 2024 at 11:34:45AM -0400, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a new ioctl KVM_MAP_MEMORY in the KVM common code. It iterates on the
> memory range and calls the arch-specific function. Add stub arch function
> as a weak symbol.
The weak symbol is gone in this version, so the last sentence of the
commit message is stale.
Thanks,
Yilun