2024-03-01 17:29:52

by Isaku Yamahata

Subject: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

From: Isaku Yamahata <[email protected]>

The objective of this RFC patch series is to develop a uAPI aimed at
(pre)populating guest memory for various use cases and underlying VM
technologies.

- Pre-populate guest memory to mitigate excessive KVM page faults during guest
boot [1], a need not limited to any specific technology.

- Pre-populate guest memory (including encryption and measurement) for
confidential guests [2]: SEV-SNP, TDX, and SW_PROTECTED VMs, and potentially
pKVM and other technologies.

The patches are organized as follows.
- 1: documentation on the uAPI KVM_MAP_MEMORY.
- 2: architecture-independent implementation part.
- 3-4: refactoring of the x86 KVM MMU as preparation.
- 5: x86 helper function to map a guest page.
- 6: x86 KVM arch implementation.
- 7: add x86-ops necessary for TDX and SEV-SNP.
- 8: selftest for validation.

Discussion points:

uAPI design:
- access flags
Access flags are needed for guest memory population. We have options for
how to expose them in the uAPI.
- option 1. Introduce access flags, possibly with the addition of a private
access flag.
- option 2. Omit access flags from the uAPI.
Allow the kernel to deduce the necessary flags based on the memory slot and
its memory attributes.
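
For illustration, option 2's kernel-side deduction could look roughly like
the following sketch (not part of this series; slot/gfn/vcpu stand for the
slot and gfn being populated, and it assumes the existing memslot and
memory-attribute helpers plus the in-flight PFERR_PRIVATE_ACCESS bit):

	/* Populate with write access unless the slot is read-only. */
	u32 error_code = PFERR_WRITE_MASK;

	if (slot->flags & KVM_MEM_READONLY)
		error_code &= ~PFERR_WRITE_MASK;
	/* Deduce private access from the gfn's memory attributes. */
	if (kvm_slot_can_be_private(slot) &&
	    kvm_mem_is_private(vcpu->kvm, gfn))
		error_code |= PFERR_PRIVATE_ACCESS;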

- SEV-SNP and byte vs. page granularity
The SEV counterpart is KVM_SEV_LAUNCH_UPDATE_DATA, which requires memory
regions to be 16-byte aligned rather than page aligned. Should we define
struct kvm_memory_mapping in bytes rather than pages?

struct kvm_sev_launch_update_data {
__u64 uaddr;
__u32 len;
};
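
A byte-granularity variant of the proposed structure might then look like
the following (purely illustrative; the field names are made up):

struct kvm_memory_mapping {
	__u64 base_address;	/* GPA in bytes instead of base_gfn */
	__u64 size;		/* length in bytes instead of nr_pages */
	__u64 flags;
	__u64 source;
};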

- TDX and measurement
The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
extends the TD measurement with the page contents.
Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
TDH.MR.EXTEND (see the sketch below).
Option 2. Don't handle extend here. Let the TDX vendor-specific API
KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
KVM_TDX_EXTEND_MEMORY.
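
With option 1, the uAPI delta would be a single new flag bit, for example
(hypothetical; the next free bit after the four flags in patch 1):

#define KVM_MEMORY_MAPPING_FLAG_EXTEND	_BITULL(4)

TDX would then issue TDH.MR.EXTEND for the mapped pages when the flag is
set.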

- TDX and struct kvm_memory_mapping::source
While the current patch series doesn't utilize the
kvm_memory_mapping::source member, TDX needs it to specify the source of
memory contents.
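
For reference, the expected userspace flow for the proposed ioctl would look
roughly like the following sketch (vcpu_fd, gpa, and size are placeholders;
error handling abbreviated):

struct kvm_memory_mapping mapping = {
	.base_gfn = gpa >> 12,		/* PAGE_SHIFT */
	.nr_pages = size >> 12,
	.flags = KVM_MEMORY_MAPPING_FLAG_WRITE,
	.source = 0,			/* no initial contents */
};
int ret;

do {
	/*
	 * On -EAGAIN, KVM has advanced base_gfn/nr_pages (and source,
	 * if non-zero) to the remaining range; simply retry.
	 */
	ret = ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping);
} while (ret < 0 && errno == EAGAIN);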

Implementation:
- x86 KVM MMU
In the x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(), which is not
confined to the KVM TDP MMU. We could restrict it to the KVM TDP MMU and
introduce an optimized version.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com

Thanks,

Isaku Yamahata (8):
KVM: Document KVM_MAP_MEMORY ioctl
KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
KVM: x86/mmu: Introduce initializer macro for struct kvm_page_fault
KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
memory
KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
KVM: selftests: x86: Add test for KVM_MAP_MEMORY

Documentation/virt/kvm/api.rst | 36 +++++
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 6 +
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 30 ++++
arch/x86/kvm/mmu/mmu_internal.h | 70 +++++----
arch/x86/kvm/x86.c | 83 +++++++++++
include/linux/kvm_host.h | 4 +
include/uapi/linux/kvm.h | 15 ++
tools/include/uapi/linux/kvm.h | 14 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
virt/kvm/kvm_main.c | 74 ++++++++++
13 files changed, 448 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c


base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad
--
2.25.1



2024-03-01 17:31:33

by Isaku Yamahata

Subject: [RFC PATCH 4/8] KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()

From: Isaku Yamahata <[email protected]>

In preparation for an ioctl to pre-populate guest memory, factor
kvm_mmu_do_page_fault() into three parts: initialization of struct
kvm_page_fault, the call to the fault handler, and the surrounding logic
for error checking and stats updates. This makes it possible to implement
a wrapper that calls only the fault handler.

No functional change intended.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72ef09fc9322..aac52f0fdf54 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -302,6 +302,24 @@ enum {
.pfn = KVM_PFN_ERR_FAULT, \
.hva = KVM_HVA_ERR_BAD, }

+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int r;
+
+ if (vcpu->arch.mmu->root_role.direct) {
+ fault->gfn = fault->addr >> PAGE_SHIFT;
+ fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
+ }
+
+ if (IS_ENABLED(CONFIG_RETPOLINE) && fault->is_tdp)
+ r = kvm_tdp_page_fault(vcpu, fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, fault);
+
+ return r;
+}
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch, int *emulation_type)
{
@@ -310,11 +328,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
KVM_MAX_HUGEPAGE_LEVEL);
int r;

- if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
- fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
- }
-
/*
* Async #PF "faults", a.k.a. prefetch faults, are not faults from the
* guest perspective and have already been counted at the time of the
@@ -323,10 +336,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;

- if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
- r = kvm_tdp_page_fault(vcpu, &fault);
- else
- r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+ r = __kvm_mmu_do_page_fault(vcpu, &fault);

if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
--
2.25.1


2024-03-01 17:36:38

by Isaku Yamahata

Subject: [RFC PATCH 7/8] KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

In the case of TDX, the memory contents need to be provided for encryption
when populating guest memory before running the guest. Add hooks in
kvm_arch_vcpu_map_memory() for KVM_MAP_MEMORY before/after calling
kvm_mmu_map_page(). TDX KVM will use these hooks.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 3942b74c1b75..fc4e11d40733 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -137,6 +137,8 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
+KVM_X86_OP_OPTIONAL(pre_mmu_map_page);
+KVM_X86_OP_OPTIONAL(post_mmu_map_page);

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e7b1a00e265..301fedd6b156 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1805,6 +1805,12 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
+
+ int (*pre_mmu_map_page)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u32 *error_code, u8 *max_level);
+ void (*post_mmu_map_page)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6025c0e12d89..ba8bf35f1c9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5811,6 +5811,36 @@ int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
return kvm_mmu_reload(vcpu);
}

+static int kvm_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u32 error_code, u8 *max_level)
+{
+ int r = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM ||
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM) {
+ if (mapping->source)
+ r = -EINVAL;
+ } else if (kvm_x86_ops.pre_mmu_map_page)
+ r = static_call(kvm_x86_pre_mmu_map_page)(vcpu, mapping,
+ &error_code,
+ max_level);
+ else
+ r = -EOPNOTSUPP;
+
+ return r;
+}
+
+static void kvm_post_mmu_map_page(struct kvm_vcpu *vcpu, struct kvm_memory_mapping *mapping)
+{
+ if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM ||
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
+ return;
+
+ if (kvm_x86_ops.post_mmu_map_page)
+ static_call(kvm_x86_post_mmu_map_page)(vcpu, mapping);
+}
+
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
struct kvm_memory_mapping *mapping)
{
@@ -5842,8 +5872,12 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
else
max_level = PG_LEVEL_4K;

+ r = kvm_pre_mmu_map_page(vcpu, mapping, error_code, &max_level);
+ if (r)
+ return r;
r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
max_level, &goal_level);
+ kvm_post_mmu_map_page(vcpu, mapping);
if (r)
return r;

--
2.25.1


2024-03-01 17:36:56

by Isaku Yamahata

Subject: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

From: Isaku Yamahata <[email protected]>

Add documentation for the KVM_MAP_MEMORY ioctl.

It pre-populates guest memory, and can potentially initialize the memory
contents with encryption and measurement, depending on the underlying
technology.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..33d2b63f7dbf 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_memory_mapping(in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_MAP_MEMORY populates guest memory without running vcpu.
+
+::
+
+ struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+ };
+
+ /* For kvm_memory_mapping:: flags */
+ #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+ #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+ #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+ #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
+not zero and it's supported (depending on underlying technology), the guest
+memory content is populated with the source. The flags field supports three
+flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
+KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page
+fault to populate guest memory. write fault, fetch fault and user fault.
+When it returned, the input is updated. If nr_pages is large, it may
+return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
+to point the remaining range.
+
5. The kvm_run structure
========================

--
2.25.1


2024-03-01 17:43:18

by Isaku Yamahata

Subject: [RFC PATCH 8/8] KVM: selftests: x86: Add test for KVM_MAP_MEMORY

From: Isaku Yamahata <[email protected]>

Add a test case to exercise KVM_MAP_MEMORY and run the guest to access the
pre-populated area. It tests the KVM_MAP_MEMORY ioctl for KVM_X86_DEFAULT_VM
and KVM_X86_SW_PROTECTED_VM. The other VM types are just placeholders for
the future.

Signed-off-by: Isaku Yamahata <[email protected]>
---
tools/include/uapi/linux/kvm.h | 14 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
3 files changed, 151 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index c3308536482b..ea8d3cf840ab 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -2227,4 +2227,18 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+#define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+#define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+#define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+#define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index da20e6bb43ed..baef461ed38a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -142,6 +142,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test
TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += x86_64/map_memory_test

# Compiled outputs used by test targets
TEST_GEN_PROGS_EXTENDED_x86_64 += x86_64/nx_huge_pages_test
diff --git a/tools/testing/selftests/kvm/x86_64/map_memory_test.c b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
new file mode 100644
index 000000000000..9480c6c89226
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024, Intel, Inc
+ *
+ * Author:
+ * Isaku Yamahata <isaku.yamahata at gmail.com>
+ */
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+/* Arbitrarily chosen value. Pick 3G */
+#define TEST_GVA 0xc0000000
+#define TEST_GPA TEST_GVA
+#define TEST_SIZE (SZ_2M + PAGE_SIZE)
+#define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
+#define TEST_SLOT 10
+
+static void guest_code(uint64_t base_gpa)
+{
+ volatile uint64_t val __used;
+ int i;
+
+ for (i = 0; i < TEST_NPAGES; i++) {
+ uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+
+ val = *src;
+ }
+
+ GUEST_DONE();
+}
+
+static void map_memory(struct kvm_vcpu *vcpu, u64 base_gfn, u64 nr_pages,
+ u64 source, bool should_success)
+{
+ struct kvm_memory_mapping mapping = {
+ .base_gfn = base_gfn,
+ .nr_pages = nr_pages,
+ .flags = KVM_MEMORY_MAPPING_FLAG_WRITE,
+ .source = source,
+ };
+ int ret;
+
+ do {
+ ret = __vcpu_ioctl(vcpu, KVM_MAP_MEMORY, &mapping);
+ } while (ret && errno == EAGAIN);
+
+ if (should_success) {
+ __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_MAP_MEMORY", ret, vcpu->vm);
+ } else {
+ __TEST_ASSERT_VM_VCPU_IOCTL(ret && errno == EFAULT,
+ "KVM_MAP_MEMORY", ret, vcpu->vm);
+ }
+}
+
+static void __test_map_memory(unsigned long vm_type, bool private, bool use_source)
+{
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = vm_type,
+ };
+ struct kvm_vcpu *vcpu;
+ struct kvm_run *run;
+ struct kvm_vm *vm;
+ struct ucall uc;
+ u64 source;
+
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ TEST_GPA, TEST_SLOT, TEST_NPAGES,
+ private ? KVM_MEM_GUEST_MEMFD : 0);
+ virt_map(vm, TEST_GVA, TEST_GPA, TEST_NPAGES);
+
+ if (private)
+ vm_mem_set_private(vm, TEST_GPA, TEST_SIZE);
+
+ source = use_source ? TEST_GVA: 0;
+ map_memory(vcpu, TEST_GPA / PAGE_SIZE, SZ_2M / PAGE_SIZE, source, true);
+ source = use_source ? TEST_GVA + SZ_2M: 0;
+ map_memory(vcpu, (TEST_GPA + SZ_2M) / PAGE_SIZE, 1, source, true);
+
+ source = use_source ? TEST_GVA + TEST_SIZE : 0;
+ map_memory(vcpu, (TEST_GPA + TEST_SIZE) / PAGE_SIZE, 1, source, false);
+
+ vcpu_args_set(vcpu, 1, TEST_GVA);
+ vcpu_run(vcpu);
+
+ run = vcpu->run;
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ case UCALL_DONE:
+ break;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ break;
+ }
+
+ kvm_vm_free(vm);
+}
+
+static void test_map_memory(unsigned long vm_type, bool use_source)
+{
+ if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type))) {
+ pr_info("Skipping tests for vm_type 0x%lx\n", vm_type);
+ return;
+ }
+
+ __test_map_memory(vm_type, false, use_source);
+ __test_map_memory(vm_type, true, use_source);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_MAP_MEMORY));
+
+ __test_map_memory(KVM_X86_DEFAULT_VM, false, false);
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, false);
+#ifdef KVM_X86_SEV_VM
+ test_map_memory(KVM_X86_SEV_VM, false);
+#endif
+#ifdef KVM_X86_SEV_ES_VM
+ test_map_memory(KVM_X86_SEV_ES_VM, false);
+#endif
+#ifdef KVM_X86_TDX_VM
+ test_map_memory(KVM_X86_TDX_VM, true);
+#endif
+ return 0;
+}
--
2.25.1


2024-03-01 17:44:03

by Isaku Yamahata

Subject: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
memory.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b8cb69b04fa..6025c0e12d89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_MAP_MEMORY:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
@@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
+{
+ return kvm_mmu_reload(vcpu);
+}
+
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ u8 max_level, goal_level = PG_LEVEL_4K;
+ u32 error_code;
+ int r;
+
+ error_code = 0;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
+ error_code |= PFERR_WRITE_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
+ error_code |= PFERR_FETCH_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
+ error_code |= PFERR_USER_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
+#ifdef PFERR_PRIVATE_ACCESS
+ error_code |= PFERR_PRIVATE_ACCESS;
+#else
+ return -OPNOTSUPP;
+#endif
+ }
+
+ if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
+ mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
+ max_level = PG_LEVEL_1G;
+ else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
+ mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
+ max_level = PG_LEVEL_2M;
+ else
+ max_level = PG_LEVEL_4K;
+
+ r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
+ max_level, &goal_level);
+ if (r)
+ return r;
+
+ if (mapping->source)
+ mapping->source += KVM_HPAGE_SIZE(goal_level);
+ mapping->base_gfn += KVM_PAGES_PER_HPAGE(goal_level);
+ mapping->nr_pages -= KVM_PAGES_PER_HPAGE(goal_level);
+ return r;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
--
2.25.1


2024-03-01 17:52:50

by Isaku Yamahata

Subject: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

From: Isaku Yamahata <[email protected]>

Add a new ioctl, KVM_MAP_MEMORY, in the KVM common code. It iterates over
the memory range and calls an arch-specific function. Add stub functions
as weak symbols.

[1] https://lore.kernel.org/kvm/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 4 +++
include/uapi/linux/kvm.h | 15 ++++++++
virt/kvm/kvm_main.c | 74 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 93 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9807ea98b568..afbed288d625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2445,4 +2445,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..f5d6b481244f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_MAP_MEMORY 236

struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1548,4 +1549,18 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+#define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+#define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+#define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+#define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..d77c9b79d76b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
+{
+ return -EOPNOTSUPP;
+}
+
+__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ bool added = false;
+ int idx, r = 0;
+
+ if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
+ KVM_MEMORY_MAPPING_FLAG_EXEC |
+ KVM_MEMORY_MAPPING_FLAG_USER |
+ KVM_MEMORY_MAPPING_FLAG_PRIVATE))
+ return -EINVAL;
+ if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
+ !kvm_arch_has_private_mem(vcpu->kvm))
+ return -EINVAL;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
+ !mapping->nr_pages ||
+ mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ r = kvm_arch_vcpu_pre_map_memory(vcpu);
+ if (r)
+ return r;
+
+ while (mapping->nr_pages) {
+ if (signal_pending(current)) {
+ r = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ added = true;
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ if (added && mapping->nr_pages > 0)
+ r = -EAGAIN;
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4620,6 +4683,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
break;
}
+ case KVM_MAP_MEMORY: {
+ struct kvm_memory_mapping mapping;
+
+ r = -EFAULT;
+ if (copy_from_user(&mapping, argp, sizeof(mapping)))
+ break;
+ r = kvm_vcpu_map_memory(vcpu, &mapping);
+ if (copy_to_user(argp, &mapping, sizeof(mapping)))
+ r = -EFAULT;
+ break;
+ }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
--
2.25.1


2024-03-01 17:53:00

by Isaku Yamahata

Subject: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initializer macro for struct kvm_page_fault

From: Isaku Yamahata <[email protected]>

Another function will initialize struct kvm_page_fault. Add an initializer
macro to unify initialization of the big struct.

No functional change intended.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..72ef09fc9322 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -279,27 +279,35 @@ enum {
RET_PF_SPURIOUS,
};

+#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
+ .addr = (_cr2_or_gpa), \
+ .error_code = (_err), \
+ .exec = (_err) & PFERR_FETCH_MASK, \
+ .write = (_err) & PFERR_WRITE_MASK, \
+ .present = (_err) & PFERR_PRESENT_MASK, \
+ .rsvd = (_err) & PFERR_RSVD_MASK, \
+ .user = (_err) & PFERR_USER_MASK, \
+ .prefetch = (_prefetch), \
+ .is_tdp = \
+ likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
+ .nx_huge_page_workaround_enabled = \
+ is_nx_huge_page_enabled((_vcpu)->kvm), \
+ \
+ .max_level = (_max_level), \
+ .req_level = PG_LEVEL_4K, \
+ .goal_level = PG_LEVEL_4K, \
+ .is_private = \
+ kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
+ \
+ .pfn = KVM_PFN_ERR_FAULT, \
+ .hva = KVM_HVA_ERR_BAD, }
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch, int *emulation_type)
{
- struct kvm_page_fault fault = {
- .addr = cr2_or_gpa,
- .error_code = err,
- .exec = err & PFERR_FETCH_MASK,
- .write = err & PFERR_WRITE_MASK,
- .present = err & PFERR_PRESENT_MASK,
- .rsvd = err & PFERR_RSVD_MASK,
- .user = err & PFERR_USER_MASK,
- .prefetch = prefetch,
- .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
- .nx_huge_page_workaround_enabled =
- is_nx_huge_page_enabled(vcpu->kvm),
-
- .max_level = KVM_MAX_HUGEPAGE_LEVEL,
- .req_level = PG_LEVEL_4K,
- .goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
- };
+ struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, cr2_or_gpa, err,
+ prefetch,
+ KVM_MAX_HUGEPAGE_LEVEL);
int r;

if (vcpu->arch.mmu->root_role.direct) {
--
2.25.1


2024-03-01 17:53:25

by Isaku Yamahata

Subject: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

From: Isaku Yamahata <[email protected]>

Introduce a helper function to call the KVM fault handler. This allows a
new ioctl to invoke the fault handler to populate guest memory without
exposing RET_PF_* enums or other KVM MMU internal definitions.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..48870c5e08ec 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}

+int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ u8 max_level, u8 *goal_level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..7d5e80d17977 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}

+int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ u8 max_level, u8 *goal_level)
+{
+ struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
+ false, max_level);
+ int r;
+
+ r = __kvm_mmu_do_page_fault(vcpu, &fault);
+
+ if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
+ return -EFAULT;
+
+ switch (r) {
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ *goal_level = fault.goal_level;
+ return 0;
+
+ case RET_PF_CONTINUE:
+ case RET_PF_EMULATE:
+ case RET_PF_INVALID:
+ default:
+ return -EIO;
+ }
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1


2024-03-07 00:31:14

by David Matlack

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
> memory.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 49 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3b8cb69b04fa..6025c0e12d89 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_MEMORY_FAULT_INFO:
> + case KVM_CAP_MAP_MEMORY:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
> @@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> +{
> + return kvm_mmu_reload(vcpu);
> +}

Why is this here and not in kvm_arch_vcpu_map_memory()?

> +
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + u8 max_level, goal_level = PG_LEVEL_4K;
> + u32 error_code;
> + int r;
> +
> + error_code = 0;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
> + error_code |= PFERR_WRITE_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
> + error_code |= PFERR_FETCH_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
> + error_code |= PFERR_USER_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
> +#ifdef PFERR_PRIVATE_ACCESS
> + error_code |= PFERR_PRIVATE_ACCESS;
> +#else
> + return -OPNOTSUPP;

-EOPNOTSUPP

> +#endif
> + }
> +
> + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> + max_level = PG_LEVEL_1G;
> + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> + max_level = PG_LEVEL_2M;
> + else
> + max_level = PG_LEVEL_4K;

Is there a requirement that KVM must not map memory outside of the
requested region?

> +
> + r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
> + max_level, &goal_level);
> + if (r)
> + return r;
> +
> + if (mapping->source)
> + mapping->source += KVM_HPAGE_SIZE(goal_level);
> + mapping->base_gfn += KVM_PAGES_PER_HPAGE(goal_level);
> + mapping->nr_pages -= KVM_PAGES_PER_HPAGE(goal_level);
> + return r;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> --
> 2.25.1
>

2024-03-07 00:37:07

by David Matlack

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
>
> On 2024-03-01 09:28 AM, [email protected] wrote:
> >
> > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > + max_level = PG_LEVEL_1G;
> > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > + max_level = PG_LEVEL_2M;
> > + else
> > + max_level = PG_LEVEL_4K;
>
> Is there a requirement that KVM must not map memory outside of the
> requested region?

And if so, what if the requested region is already mapped with a larger page?

2024-03-07 00:39:10

by David Matlack

Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On 2024-03-01 09:28 AM, [email protected] wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..7d5e80d17977 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level)
> +{
> + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> + false, max_level);
> + int r;
> +
> + r = __kvm_mmu_do_page_fault(vcpu, &fault);

If TDP is disabled, __kvm_mmu_do_page_fault() will interpret @gpa as a
GVA, no? And if the vCPU is in guest-mode, __kvm_mmu_do_page_fault() will
interpret @gpa as an nGPA, right?

2024-03-07 00:44:10

by David Matlack

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
> +
> + struct kvm_memory_mapping {
> + __u64 base_gfn;
> + __u64 nr_pages;
> + __u64 flags;
> + __u64 source;
> + };
> +
> + /* For kvm_memory_mapping:: flags */
> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source.

What does "populated with the source" mean?

> The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER.

There are 4 flags.

2024-03-07 00:59:12

by David Matlack

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..d77c9b79d76b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;
> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */
> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||
> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> + if (r)
> + return r;
> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())

nit: Is need_resched() superfluous when calling cond_resched()?

> + cond_resched();
> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

This can overwrite the return value from kvm_arch_vcpu_map_memory().

2024-03-07 01:03:06

by David Matlack

Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implementation:
> - x86 KVM MMU
> In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> version.

Restricting to TDP MMU seems like a good idea. But I'm not quite sure
how to reliably do that from a vCPU context. Checking for TDP being
enabled is easy, but what if the vCPU is in guest-mode?

Perhaps we can just return an error out to userspace if the vCPU is in
guest-mode or TDP is disabled, and make it userspace's problem to do
memory mapping before loading any vCPU state.

2024-03-07 01:29:49

by Isaku Yamahata

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Wed, Mar 06, 2024 at 04:43:51PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> > +
> > + struct kvm_memory_mapping {
> > + __u64 base_gfn;
> > + __u64 nr_pages;
> > + __u64 flags;
> > + __u64 source;
> > + };
> > +
> > + /* For kvm_memory_mapping:: flags */
> > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> > +
> > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> > +not zero and it's supported (depending on underlying technology), the guest
> > +memory content is populated with the source.
>
> What does "populated with the source" mean?

source is a user pointer, and the memory contents of source are copied
into base_gfn (and will be encrypted).


> > The flags field supports three
> > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> > +KVM_MEMORY_MAPPING_FLAG_USER.
>
> There are 4 flags.

Oops. Let me update it.


KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it.

The `flags` field supports four flags: KVM_MEMORY_MAPPING_FLAG_WRITE,
KVM_MEMORY_MAPPING_FLAG_EXEC, KVM_MEMORY_MAPPING_FLAG_USER, and
KVM_MEMORY_MAPPING_FLAG_PRIVATE. The first three correspond to the fault code
for the KVM page fault used to populate guest memory: write fault, fetch fault,
and user fault. KVM_MEMORY_MAPPING_FLAG_PRIVATE is applicable only for guest
memory with guest_memfd, and populates guest memory that has the
KVM_MEMORY_ATTRIBUTE_PRIVATE memory attribute set.

When the ioctl returns, the input values are updated. If `nr_pages` is large,
it may return -EAGAIN and update the values (`base_gfn` and `nr_pages`, and
`source` if not zero) to point to the remaining range.

--
Isaku Yamahata <[email protected]>

2024-03-07 01:35:05

by Isaku Yamahata

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 04:30:58PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
> > memory.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 49 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3b8cb69b04fa..6025c0e12d89 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> > case KVM_CAP_IRQFD_RESAMPLE:
> > case KVM_CAP_MEMORY_FAULT_INFO:
> > + case KVM_CAP_MAP_MEMORY:
> > r = 1;
> > break;
> > case KVM_CAP_EXIT_HYPERCALL:
> > @@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> > }
> > }
> >
> > +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> > +{
> > + return kvm_mmu_reload(vcpu);
> > +}
>
> Why is the here and not kvm_arch_vcpu_map_memory()?

We can push kvm_mmu_reload() down into kvm_arch_vcpu_map_memory(), inside
the gpa loop. Probably the inefficiency won't matter.

kvm_mmu_reload()
loop on gpa
    kvm_arch_vcpu_map_memory()

=>

loop on gpa
    kvm_arch_vcpu_map_memory()
        kvm_mmu_reload()
--
Isaku Yamahata <[email protected]>

2024-03-07 01:52:09

by Isaku Yamahata

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 04:36:25PM -0800,
David Matlack <[email protected]> wrote:

> On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
> >
> > On 2024-03-01 09:28 AM, [email protected] wrote:
> > >
> > > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > > + max_level = PG_LEVEL_1G;
> > > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > > + max_level = PG_LEVEL_2M;
> > > + else
> > > + max_level = PG_LEVEL_4K;
> >
> > Is there a requirement that KVM must not map memory outside of the
> > requested region?
>
> And if so, what if the requested region is already mapped with a larger page?

Yes. We'd like to map the exact gpa range for the SNP or TDX case. We don't
want to map zeros into the surrounding range. For SNP or TDX, mapping a page
to a GPA is a one-time operation, and it updates the measurement.

Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
with zeros by a 2M page, the following mapping of GPA2 fails. Even if the
mapping of GPA2 succeeds, the measurement may already have been updated when
mapping GPA1.
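
As a concrete illustration (addresses made up):

GPA1 = 0x1000000	/* 2M-aligned; KVM may install a 2M page covering
			   [0x1000000, 0x1200000) */
GPA2 = 0x1100000	/* falls inside that same 2M page */

Mapping GPA1 first can thus implicitly map GPA2 with zeros, so the later
KVM_MAP_MEMORY of GPA2 either fails or finds its contents and measurement
already fixed by the GPA1 operation.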

It's the userspace VMM's responsibility to map a GPA range at most once for
SNP or TDX. Is this too strict a requirement for the default VM use case of
mitigating KVM page faults at guest boot? If so, what about a flag like
EXACT_MAPPING or something?
--
Isaku Yamahata <[email protected]>

2024-03-07 02:10:09

by Isaku Yamahata

Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Mar 06, 2024 at 04:53:41PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Implementation:
> > - x86 KVM MMU
> > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > version.
>
> Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> how to reliably do that from a vCPU context. Checking for TDP being
> enabled is easy, but what if the vCPU is in guest-mode?

As you pointed out in the other mail, legacy KVM MMU support or guest-mode
will be troublesome. The use case I assumed is pre-population before the
guest runs, so guest-mode wouldn't matter. I didn't add an explicit check
for it, though.

Is there any use case while vcpus are running?


> Perhaps we can just return an error out to userspace if the vCPU is in
> guest-mode or TDP is disabled, and make it userspace's problem to do
> memory mapping before loading any vCPU state.

If the use case for the default VM or the SW_PROTECTED VM is to avoid
excessive KVM page faults at guest boot, erroring out on guest-mode or
disabled TDP wouldn't matter.
--
Isaku Yamahata <[email protected]>

2024-03-07 02:53:07

by Isaku Yamahata

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Wed, Mar 06, 2024 at 04:49:19PM -0800,
David Matlack <[email protected]> wrote:

> > + cond_resched();
> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> This can overwrite the return value from kvm_arch_vcpu_map_memory().

This is to convert ERESTARTSYS into EAGAIN. I'll drop this check
and document EINTR so that the caller is aware of partial population.
--
Isaku Yamahata <[email protected]>

2024-03-07 12:30:24

by Huang, Kai

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Fri, 2024-03-01 at 09:28 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Adds documentation of KVM_MAP_MEMORY ioctl.
>
> It pre-populates guest memory. And potentially do initialized memory
> contents with encryption and measurement depending on underlying
> technology.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..33d2b63f7dbf 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl

I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.

What happens in that case?

> +:Parameters: struct kvm_memory_mapping(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +KVM_MAP_MEMORY populates guest memory without running vcpu.
> +
> +::
> +
> + struct kvm_memory_mapping {
> + __u64 base_gfn;
> + __u64 nr_pages;
> + __u64 flags;
> + __u64 source;
> + };
> +
> + /* For kvm_memory_mapping:: flags */
> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)

I am not sure what's the good of having "FLAG_USER"?

This ioctl is called from userspace, thus I think we can just treat this always
as user-fault?

> + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source. The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page
> +fault to populate guest memory. write fault, fetch fault and user fault.
> +When it returned, the input is updated. If nr_pages is large, it may
> +return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
> +to point the remaining range.
> +
> 5. The kvm_run structure
> ========================
>

2024-03-07 12:45:34

by Huang, Kai

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory


>
> +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);

No explanation of why this is needed, and why it only takes @vcpu as input w/o
having the @mapping.

> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping);
> +
>

[...]

> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;
> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */
> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||
> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> + if (r)
> + return r;

Returning w/o unloading the vcpu and releasing the SRCU.

> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();

need_resched() is not needed.

And normally I think we just put it at the end of the loop.

> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

Why do we need @added?

I assume kvm_arch_vcpu_map_memory() can internally update
mapping->nr_pages but still return -E<WHATEVER>. So when that happens in
the first call of kvm_arch_vcpu_map_memory(), @added won't get a chance to
turn true.

> +
> + return r;
> +}
> +
> static long kvm_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4620,6 +4683,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
> r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
> break;
> }
> + case KVM_MAP_MEMORY: {
> + struct kvm_memory_mapping mapping;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mapping, argp, sizeof(mapping)))
> + break;
> + r = kvm_vcpu_map_memory(vcpu, &mapping);
> + if (copy_to_user(argp, &mapping, sizeof(mapping)))
> + r = -EFAULT;
> + break;
> + }
> default:
> r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> }

2024-03-07 20:42:18

by Isaku Yamahata

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Thu, Mar 07, 2024 at 12:45:16PM +0000,
"Huang, Kai" <[email protected]> wrote:

>
> >
> > +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);
>
> No explanation of why this is needed, and why it only takes @vcpu as input w/o
> having the @mapping.
>
> > +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping);
> > +
> >
>
> [...]
>
> > +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + bool added = false;
> > + int idx, r = 0;
> > +
> > + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> > + KVM_MEMORY_MAPPING_FLAG_EXEC |
> > + KVM_MEMORY_MAPPING_FLAG_USER |
> > + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> > + return -EINVAL;
> > + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> > + !kvm_arch_has_private_mem(vcpu->kvm))
> > + return -EINVAL;
> > +
> > + /* Sanity check */
> > + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> > + !mapping->nr_pages ||
> > + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> > + return -EINVAL;
> > +
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> > + if (r)
> > + return r;
>
> Returning w/o unloading the vcpu and releasing the SRCU.

Oops, will fix.


> > +
> > + while (mapping->nr_pages) {
> > + if (signal_pending(current)) {
> > + r = -ERESTARTSYS;
> > + break;
> > + }
> > +
> > + if (need_resched())
> > + cond_resched();
>
> need_resched() is not needed.
>
> And normally I think we just put it at the end of the loop.

Ok, will move it.


> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> Why do we need @added?
>
> I assume the kvm_arch_vcpu_map_memory() can internally update the mapping-
> >nr_pages but still return -E<WHATEVER>. So when that happens in the first call
> of kvm_arch_vcpu_map_memory(), @added won't get chance to turn to true.

I intended to tell the caller whether the range was partially processed.
Anyway, this seems moot. Let's drop this if clause. Then it's the caller's
responsibility to check for errors and partial population, and to
optionally loop over the remaining region.
--
Isaku Yamahata <[email protected]>

2024-03-07 20:54:13

by Isaku Yamahata

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 12:30:04PM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Fri, 2024-03-01 at 09:28 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Adds documentation of KVM_MAP_MEMORY ioctl.
> >
> > It pre-populates guest memory. And potentially do initialized memory
> > contents with encryption and measurement depending on underlying
> > technology.
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 36 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..33d2b63f7dbf 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >
> > See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
>
> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
>
> What happens in that case?

Each vcpu can handle the ioctl simultaneously. If we assume the TDP MMU,
each vcpu calls the KVM fault handler simultaneously while holding the
mmu_lock for read. If gfn ranges overlap, a vcpu will get 0 (success) or
EAGAIN.


> > +:Parameters: struct kvm_memory_mapping(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > +
> > +::
> > +
> > + struct kvm_memory_mapping {
> > + __u64 base_gfn;
> > + __u64 nr_pages;
> > + __u64 flags;
> > + __u64 source;
> > + };
> > +
> > + /* For kvm_memory_mapping:: flags */
> > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
>
> I am not sure what's the good of having "FLAG_USER"?
>
> This ioctl is called from userspace, thus I think we can just treat this always
> as user-fault?

The point is how to emulate a KVM page fault as if the vcpu caused it, not
that we call the ioctl from user context.
--
Isaku Yamahata <[email protected]>

2024-03-08 00:22:34

by Huang, Kai

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl


>>>
>>> +4.143 KVM_MAP_MEMORY
>>> +------------------------
>>> +
>>> +:Capability: KVM_CAP_MAP_MEMORY
>>> +:Architectures: none
>>> +:Type: vcpu ioctl
>>
>> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
>>
>> What happens in that case?
>
> Each vcpu can handle the ioctl simultaneously.

Not sure whether it is implied, but should we document that it can be
called simultaneously?

Also, I believe this is only supposed to be called before the VM starts to
run? I think we should document that too.

This is userspace ABI, we need to be explicit on how it is supposed to
be called from userspace.

Btw, I believe there should be some justification in the changelog for why
this should be a vcpu ioctl().

[...]

>>> +:Parameters: struct kvm_memory_mapping(in/out)
>>> +:Returns: 0 on success, <0 on error
>>> +
>>> +KVM_MAP_MEMORY populates guest memory without running vcpu.
>>> +
>>> +::
>>> +
>>> + struct kvm_memory_mapping {
>>> + __u64 base_gfn;
>>> + __u64 nr_pages;
>>> + __u64 flags;
>>> + __u64 source;
>>> + };
>>> +
>>> + /* For kvm_memory_mapping:: flags */
>>> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
>>> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
>>> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
>>
>> I am not sure what's the good of having "FLAG_USER"?
>>
>> This ioctl is called from userspace, thus I think we can just treat this always
>> as user-fault?
>
> The point is how to emulate a KVM page fault as if the vcpu caused it, not
> that we call the ioctl from user context.

Sorry, I don't quite follow. What's wrong if KVM just appends the #PF
USER error bit before it calls into the fault handler?

My question is, since this is ABI, you have to specify how userspace is
supposed to use this. Maybe I am missing something, but I don't see how
USER should be used here.

2024-03-08 00:57:20

by David Matlack

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > +:Returns: 0 on success, <0 on error
> > > > +
> > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > +
> > > > +::
> > > > +
> > > > + struct kvm_memory_mapping {
> > > > + __u64 base_gfn;
> > > > + __u64 nr_pages;
> > > > + __u64 flags;
> > > > + __u64 source;
> > > > + };
> > > > +
> > > > + /* For kvm_memory_mapping:: flags */
> > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > >
> > > I am not sure what's the good of having "FLAG_USER"?
> > >
> > > This ioctl is called from userspace, thus I think we can just treat this always
> > > as user-fault?
> >
> > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > treat the ioctl as an access from user context.
>
> Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> error bit before it calls into the fault handler?
>
> My question is, since this is ABI, you have to tell how userspace is
> supposed to use this. Maybe I am missing something, but I don't see how
> USER should be used here.

If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
populated with writes (which avoids just faulting in the zero-page for
anon or tmpfs backed memslots), while also allowing populating read-only
memslots.

I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.

2024-03-08 01:28:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024, David Matlack wrote:
> On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > +:Returns: 0 on success, <0 on error
> > > > > +
> > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > +
> > > > > +::
> > > > > +
> > > > > + struct kvm_memory_mapping {
> > > > > + __u64 base_gfn;
> > > > > + __u64 nr_pages;
> > > > > + __u64 flags;
> > > > > + __u64 source;
> > > > > + };
> > > > > +
> > > > > + /* For kvm_memory_mapping:: flags */
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > >
> > > > I am not sure what's the good of having "FLAG_USER"?
> > > >
> > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > as user-fault?
> > >
> > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > treat the ioctl as an access from user context.
> >
> > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > error bit before it calls into the fault handler?
> >
> > My question is, since this is ABI, you have to tell how userspace is
> > supposed to use this. Maybe I am missing something, but I don't see how
> > USER should be used here.
>
> If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

+1

> KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> populated with writes (which avoids just faulting in the zero-page for
> anon or tmpfs backed memslots), while also allowing populating read-only
> memslots.
>
> I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.

It would be mildly interesting for something like the NX hugepage mitigation.

For the initial implementation, I don't think the ioctl() should specify
protections, period.

VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
And for guest_memfd, finer grained control in general, and long term compatibility
with other features that are in-flight or proposed, I would rather userspace specify
RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
pain too.

KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
and add W if all underlying protections allow it.

That way we can defer dealing with things like XO and RW *if* KVM ever does gain
support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
for XO memory.

And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
regions.

We can always expand the uAPI, but taking away functionality is much harder, if
not impossible.
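
To make "behave like a read fault" concrete, here is a minimal sketch in
terms of the series' own kvm_mmu_map_page() helper (illustrative only; the
exact plumbing is an assumption, not part of the posted patches):

    /*
     * Sketch: populate one mapping as a read access.  With neither
     * PFERR_WRITE_MASK nor PFERR_FETCH_MASK set in the error code, the
     * fault handler installs an RX mapping and adds W only where the
     * memslot and backing store allow it.
     */
    u8 level;
    int r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn),
                             0 /* error_code: plain read */,
                             KVM_MAX_HUGEPAGE_LEVEL, &level);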

2024-03-08 02:20:05

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 05:28:20PM -0800,
Sean Christopherson <[email protected]> wrote:

> On Thu, Mar 07, 2024, David Matlack wrote:
> > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > +:Returns: 0 on success, <0 on error
> > > > > > +
> > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > + struct kvm_memory_mapping {
> > > > > > + __u64 base_gfn;
> > > > > > + __u64 nr_pages;
> > > > > > + __u64 flags;
> > > > > > + __u64 source;
> > > > > > + };
> > > > > > +
> > > > > > + /* For kvm_memory_mapping:: flags */
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > > >
> > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > >
> > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > as user-fault?
> > > >
> > > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > > treat the ioctl as an access from user context.
> > >
> > > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > > error bit before it calls into the fault handler?
> > >
> > > My question is, since this is ABI, you have to tell how userspace is
> > > supposed to use this. Maybe I am missing something, but I don't see how
> > > USER should be used here.
> >
> > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
>
> +1
>
> > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > populated with writes (which avoids just faulting in the zero-page for
> > anon or tmpfs backed memslots), while also allowing populating read-only
> > memslots.
> >
> > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
>
> It would be mildly interesting for something like the NX hugepage mitigation.
>
> For the initial implementation, I don't think the ioctl() should specify
> protections, period.
>
> VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> And for guest_memfd, finer grained control in general, and long term compatibility
> with other features that are in-flight or proposed, I would rather userspace specify
> RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
> pain too.
>
> KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
> and add W if all underlying protections allow it.
>
> That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> for XO memory.
>
> And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> regions.
>
> We can always expand the uAPI, but taking away functionality is much harder, if
> not impossible.

Ok, let me drop all the flags. Here is the updated one.

4.143 KVM_MAP_MEMORY
------------------------

:Capability: KVM_CAP_MAP_MEMORY
:Architectures: none
:Type: vcpu ioctl
:Parameters: struct kvm_memory_mapping(in/out)
:Returns: 0 on success, < 0 on error

Errors:

====== =============================================================
EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
Currently, this ioctl is restricted to TDP MMU.
EAGAIN The region is only processed partially. The caller should
issue the ioctl with the updated parameters.
EINTR An unmasked signal is pending. The region may be processed
partially. If `nr_pages` > 0, the caller should issue the
ioctl with the updated parameters.
====== =============================================================

KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple
vcpus can call this ioctl simultaneously. It may return EAGAIN due to race
conditions.

::

struct kvm_memory_mapping {
__u64 base_gfn;
__u64 nr_pages;
__u64 flags;
__u64 source;
};

KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it. `flags` must be zero. It's reserved for future use.

When the ioctl returns, the input values are updated. If `nr_pages` is large,
the ioctl may return EAGAIN (or EINTR for a pending signal) and update the
values (`base_gfn`, `nr_pages`, and `source` if not zero) to point to the
remaining range.
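
A caller-side sketch of the retry contract described above (illustrative;
it assumes the proposed struct kvm_memory_mapping and KVM_MAP_MEMORY
definitions are visible via <linux/kvm.h>):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Map [gfn, gfn + npages) before the first KVM_RUN. */
    static int prepopulate(int vcpu_fd, __u64 gfn, __u64 npages)
    {
            struct kvm_memory_mapping m = {
                    .base_gfn = gfn,
                    .nr_pages = npages,
            };

            while (m.nr_pages) {
                    if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &m) == 0)
                            continue;       /* nr_pages is now 0 */
                    /*
                     * On EAGAIN/EINTR the kernel has advanced base_gfn,
                     * nr_pages (and source) to the remaining range, so
                     * the struct can simply be re-submitted.
                     */
                    if (errno != EAGAIN && errno != EINTR)
                            return -1;      /* fatal error */
            }
            return 0;
    }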

--
Isaku Yamahata <[email protected]>

2024-03-10 23:13:03

by Michael Roth

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 05:28:20PM -0800,
> Sean Christopherson <[email protected]> wrote:
>
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > > +:Returns: 0 on success, <0 on error
> > > > > > > +
> > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > > +
> > > > > > > +::
> > > > > > > +
> > > > > > > + struct kvm_memory_mapping {
> > > > > > > + __u64 base_gfn;
> > > > > > > + __u64 nr_pages;
> > > > > > > + __u64 flags;
> > > > > > > + __u64 source;
> > > > > > > + };
> > > > > > > +
> > > > > > > + /* For kvm_memory_mapping:: flags */
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > > > >
> > > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > >
> > > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > > as user-fault?
> > > > >
> > > > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > > > treat the ioctl as an access from user context.
> > > >
> > > > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > > > error bit before it calls into the fault handler?
> > > >
> > > > My question is, since this is ABI, you have to tell how userspace is
> > > > supposed to use this. Maybe I am missing something, but I don't see how
> > > > USER should be used here.
> > >
> > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> >
> > +1
> >
> > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > > populated with writes (which avoids just faulting in the zero-page for
> > > anon or tmpfs backed memslots), while also allowing populating read-only
> > > memslots.
> > >
> > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> >
> > It would be mildly interesting for something like the NX hugepage mitigation.
> >
> > For the initial implementation, I don't think the ioctl() should specify
> > protections, period.
> >
> > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> > And for guest_memfd, finer grained control in general, and long term compatibility
> > with other features that are in-flight or proposed, I would rather userspace specify
> > RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
> > pain too.
> >
> > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> > we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
> > and add W if all underlying protections allow it.
> >
> > That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> > for XO memory.
> >
> > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> > regions.
> >
> > We can always expand the uAPI, but taking away functionality is much harder, if
> > not impossible.
>
> Ok, let me drop all the flags. Here is the updated one.
>
> 4.143 KVM_MAP_MEMORY
> ------------------------
>
> :Capability: KVM_CAP_MAP_MEMORY
> :Architectures: none
> :Type: vcpu ioctl
> :Parameters: struct kvm_memory_mapping(in/out)
> :Returns: 0 on success, < 0 on error
>
> Errors:
>
> ====== =============================================================
> EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
> Currently, this ioctl is restricted to TDP MMU.
> EAGAIN The region is only processed partially. The caller should
> issue the ioctl with the updated parameters.
> EINTR An unmasked signal is pending. The region may be processed
> partially. If `nr_pages` > 0, the caller should issue the
> ioctl with the updated parameters.
> ====== =============================================================
>
> KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple
> vcpus can call this ioctl simultaneously. It may return EAGAIN due to race
> conditions.
>
> ::
>
> struct kvm_memory_mapping {
> __u64 base_gfn;
> __u64 nr_pages;
> __u64 flags;
> __u64 source;
> };
>
> KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
> `nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
> `source` is not NULL and the underlying technology supports it, the memory
> contents of `source` are copied into the guest memory. The backend may encrypt
> it. `flags` must be zero. It's reserved for future use.
>
> When the ioctl returns, the input values are updated. If `nr_pages` is large,
> the ioctl may return EAGAIN (or EINTR for a pending signal) and update the
> values (`base_gfn`, `nr_pages`, and `source` if not zero) to point to the
> remaining range.

If this intended to replace SNP_LAUNCH_UPDATE, then to be useable for SNP
guests userspace also needs to pass along the type of page being added,
which are currently defined as:

#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6

So I guess the main question is, do we bite the bullet now and introduce
infrastructure for vendor-specific parameters, or should we attempt to define
these as cross-vendor/cross-architecture types and hide the vendor-specific
stuff from userspace?
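
One possible strawman for carrying such vendor-specific parameters (none
of these fields exist in the posted series; purely illustrative):

    struct kvm_memory_mapping {
            __u64 base_gfn;
            __u64 nr_pages;
            __u64 flags;
            __u64 source;
            union {                         /* selected by VM type */
                    struct {
                            __u8 page_type; /* KVM_SEV_SNP_PAGE_TYPE_* */
                            __u8 pad[7];
                    } snp;
                    __u64 reserved[2];
            };
    };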

There are a couple other bits of vendor-specific information that would be
needed for a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think
we could do without these:

sev_fd: handle for /dev/sev which is used to issue SEV firmware calls
as-needed for various KVM ioctls
- can likely bind this to SNP context during SNP_LAUNCH_UPDATE and
avoid needing to pass it in for subsequent calls
error code: return parameter which passes SEV firmware error codes to
userspace for informational purposes
- can probably live without this

-Mike

>
> --
> Isaku Yamahata <[email protected]>

2024-03-11 01:06:19

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl


>
> struct kvm_memory_mapping {
> __u64 base_gfn;
> __u64 nr_pages;
> __u64 flags;
> __u64 source;
> };

From your next patch the @source must be 4K aligned. If it is intended
please document this too.

[...]

> The backend may encrypt
> it.

To me the "backend" is a little bit confusing. It doesn't mean the
@source, correct? Maybe just the "underlying physical pages may be
encrypted" etc. But it may be only my issue.

2024-03-11 01:08:27

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

> Maybe just the "underlying physical pages may be
> encrypted" etc.

Sorry, perhaps "encrypted" -> "crypto-protected"?

2024-03-11 03:21:59

by Michael Roth

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The objective of this RFC patch series is to develop a uAPI aimed at
> (pre)populating guest memory for various use cases and underlying VM
> technologies.
>
> - Pre-populate guest memory to mitigate excessive KVM page faults during guest
> boot [1], a need not limited to any specific technology.
>
> - Pre-populating guest memory (including encryption and measurement) for
> confidential guests [2]. SEV-SNP, TDX, and SW-PROTECTED VM. Potentially
> other technologies and pKVM.
>
> The patches are organized as follows.
> - 1: documentation on uAPI KVM_MAP_MEMORY.
> - 2: architecture-independent implementation part.
> - 3-4: refactoring of x86 KVM MMU as preparation.
> - 5: x86 Helper function to map guest page.
> - 6: x86 KVM arch implementation.
> - 7: Add x86-ops necessary for TDX and SEV-SNP.
> - 8: selftest for validation.
>
> Discussion point:
>
> uAPI design:
> - access flags
> Access flags are needed for the guest memory population. We have options for
> their exposure to uAPI.
> - option 1. Introduce access flags, possibly with the addition of a private
> access flag.
> - option 2. Omit access flags from UAPI.
> Allow the kernel to deduce the necessary flag based on the memory slot and
> its memory attributes.
>
> - SEV-SNP and byte vs. page size
> The SEV correspondence is SEV_LAUNCH_UPDATE_DATA, which dictates memory
> regions to be 16-byte aligned, not page sized. Should we define struct
> kvm_memory_mapping in bytes rather than page size?

For SNP it's multiples of page size only, and I'm not sure it's worth
trying to work in legacy SEV support, since that pulls in other
requirements, like how the backing memory also needs to be pre-pinned via
a separate KVM ioctl, which would need to be documented as an SEV-specific
requirement for this interface... it just doesn't seem worth it. And SEV
would still benefit from the more basic functionality of pre-mapping
pages, just like any other guest.

That said, it would be a shame if we needed to clutter up the API as
soon as some user came along that required non-page-sized values. That
seems unlikely for the pre-mapping use case, but for CoCo maybe it's
worth checking whether pKVM would have any requirements like that? Or just
going with byte-sized parameters to play it safe?

>
> struct kvm_sev_launch_update_data {
> __u64 uaddr;
> __u32 len;
> };
>
> - TDX and measurement
> The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> extends its measurement by the page contents.
> Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> TDH.MEM.EXTEND
> Option 2. Don't handle extend. Let TDX vendor specific API
> KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> KVM_TDX_EXTEND_MEMORY.

For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
additional measurements via SNP_LAUNCH_FINISH, and down the road when live
migration support is added that flow will be a bit different. So
personally I think it's better to leave it separate for now. Maybe down the
road some common measurement/finalize interface can deprecate some of
the other MEMORY_ENCRYPT ioctls.

Even with this more narrowly-defined purpose it's really unfortunate we
have to bake any CoCo stuff into this interface at all. It would be great
if all it did was simply pre-map entries into nested page tables to boost
boot performance, and we had some separate CoCo-specific API to handle
initial loading/encryption of guest data. I understand with Secure EPT
considerations why we sort of need it here for TDX, but it already ends
up being awkward for the SNP_LAUNCH_UPDATE use-case because there's not
really any good reason for that handling to live inside KVM MMU hooks
like with TDX. So we'll probably end up putting it in a pre/post hook
where all the handling is completely separate from the TDX flow, and in
the process complicating the userspace API with the additional parameters
needed for that, even though other guest types are likely never to need
them.

(Alternatively we can handle SNP_LAUNCH_UPDATE via KVM MMU hooks like with
tdx_mem_page_add(), but then we'll need to plumb additional parameters
down into the KVM MMU code for stuff like the SNP page type. But maybe
it's worth it to have some level of commonality for x86 at least?)

But I'd be hesitant to bake more requirements into this pre-mapping
interface, it feels like we're already overloading it as is.

>
> - TDX and struct kvm_memory_mapping:source
> While the current patch series doesn't utilize
> kvm_memory_mapping::source member. TDX needs it to specify the source of
> memory contents.

SNP will need it too FWIW.

-Mike

>
> Implementation:
> - x86 KVM MMU
> In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> version.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com
>
> Thanks,
>
> Isaku Yamahata (8):
> KVM: Document KVM_MAP_MEMORY ioctl
> KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
> KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault
> KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
> KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
> memory
> KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
> KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
> KVM: selftests: x86: Add test for KVM_MAP_MEMORY
>
> Documentation/virt/kvm/api.rst | 36 +++++
> arch/x86/include/asm/kvm-x86-ops.h | 2 +
> arch/x86/include/asm/kvm_host.h | 6 +
> arch/x86/kvm/mmu.h | 3 +
> arch/x86/kvm/mmu/mmu.c | 30 ++++
> arch/x86/kvm/mmu/mmu_internal.h | 70 +++++----
> arch/x86/kvm/x86.c | 83 +++++++++++
> include/linux/kvm_host.h | 4 +
> include/uapi/linux/kvm.h | 15 ++
> tools/include/uapi/linux/kvm.h | 14 ++
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
> virt/kvm/kvm_main.c | 74 ++++++++++
> 13 files changed, 448 insertions(+), 26 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c
>
>
> base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad
> --
> 2.25.1
>
>

2024-03-11 17:23:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Fri, Mar 01, 2024, [email protected] wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..d77c9b79d76b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> return fd;
> }
>
> +__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;

Pointless initialization of 'r'.

> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */

Pointless comment.

> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||

> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);

This hook is unnecessary; x86's kvm_mmu_reload() is optimized for the happy path
where the MMU is already loaded. Just make the call from kvm_arch_vcpu_map_memory().

> + if (r)
> + return r;

Which is a good thing, because this leaks the SRCU lock.

> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;

Why -ERESTARTSYS instead of -EINTR? The latter is KVM's typical response to a
pending signal.

> + break;
> + }
> +
> + if (need_resched())

No need to manually check need_resched(), the below is a _conditional_ resched.
The reason KVM explicitly checks need_resched() in MMU flows is because KVM needs
to drop mmu_lock before rescheduling, i.e. calling cond_resched() directly would
try to schedule() while holding a spinlock.

> + cond_resched();
> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

No, this clobbers 'r', which might hold a fatal error code. I don't see any
reason for common code to ever force -EAGAIN, it can't possibly know if trying
again is reasonable.

> +
> + return r;
> +}

2024-03-11 17:25:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault

On Fri, Mar 01, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Another function will initialize struct kvm_page_fault. Add initializer
> macro to unify the big struct initialization.
>
> No functional change intended.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0669a8a668ca..72ef09fc9322 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -279,27 +279,35 @@ enum {
> RET_PF_SPURIOUS,
> };
>
> +#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
> + .addr = (_cr2_or_gpa), \
> + .error_code = (_err), \
> + .exec = (_err) & PFERR_FETCH_MASK, \
> + .write = (_err) & PFERR_WRITE_MASK, \
> + .present = (_err) & PFERR_PRESENT_MASK, \
> + .rsvd = (_err) & PFERR_RSVD_MASK, \
> + .user = (_err) & PFERR_USER_MASK, \
> + .prefetch = (_prefetch), \
> + .is_tdp = \
> + likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
> + .nx_huge_page_workaround_enabled = \
> + is_nx_huge_page_enabled((_vcpu)->kvm), \
> + \
> + .max_level = (_max_level), \
> + .req_level = PG_LEVEL_4K, \
> + .goal_level = PG_LEVEL_4K, \
> + .is_private = \
> + kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
> + \
> + .pfn = KVM_PFN_ERR_FAULT, \
> + .hva = KVM_HVA_ERR_BAD, }
> +

Oof, no. I would much rather refactor kvm_mmu_do_page_fault() as needed than
have to maintain a macro like this.

2024-03-11 17:29:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Fri, Mar 01, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Introduce a helper function to call kvm fault handler. This allows
> a new ioctl to invoke kvm fault handler to populate without seeing
> RET_PF_* enums or other KVM MMU internal definitions.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..48870c5e08ec 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level);
> +
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..7d5e80d17977 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level)
> +{
> + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> + false, max_level);
> + int r;
> +
> + r = __kvm_mmu_do_page_fault(vcpu, &fault);
> +
> + if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
> + return -EFAULT;

This clobbers a non-zero 'r'. And KVM returns -EIO if the VM is bugged/dead, not
-EFAULT. I also don't see why KVM needs to explicitly check is_error_noslot_pfn(),
that should be funneled to RET_PF_EMULATE.

> +
> + switch (r) {
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> + case RET_PF_FIXED:
> + case RET_PF_SPURIOUS:
> + *goal_level = fault.goal_level;
> + return 0;
> +
> + case RET_PF_CONTINUE:
> + case RET_PF_EMULATE:

-EINVAL would be more appropriate for RET_PF_EMULATE.

> + case RET_PF_INVALID:

CONTINUE and INVALID should be WARN conditions.

> + default:
> + return -EIO;
> + }
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_map_page);

Unnecessary export.

> +
> static void nonpaging_init_context(struct kvm_mmu *context)
> {
> context->page_fault = nonpaging_page_fault;
> --
> 2.25.1
>

2024-03-11 22:26:53

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Mon, Mar 11, 2024 at 10:23:28AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d1fd9cb5d037..d77c9b79d76b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> > return fd;
> > }
> >
> > +__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + bool added = false;
> > + int idx, r = 0;
>
> Pointless initialization of 'r'.
>
> > +
> > + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> > + KVM_MEMORY_MAPPING_FLAG_EXEC |
> > + KVM_MEMORY_MAPPING_FLAG_USER |
> > + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> > + return -EINVAL;
> > + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> > + !kvm_arch_has_private_mem(vcpu->kvm))
> > + return -EINVAL;
> > +
> > + /* Sanity check */
>
> Pointless comment.
>
> > + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> > + !mapping->nr_pages ||
>
> > + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> > + return -EINVAL;
> > +
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > + r = kvm_arch_vcpu_pre_map_memory(vcpu);
>
> This hook is unnecessary; x86's kvm_mmu_reload() is optimized for the happy path
> where the MMU is already loaded. Just make the call from kvm_arch_vcpu_map_memory().
>
> > + if (r)
> > + return r;
>
> Which is a good thing, because this leaks the SRCU lock.
>
> > +
> > + while (mapping->nr_pages) {
> > + if (signal_pending(current)) {
> > + r = -ERESTARTSYS;
>
> Why -ERESTARTSYS instead of -EINTR? The latter is KVM's typical response to a
> pending signal.
>
> > + break;
> > + }
> > +
> > + if (need_resched())
>
> No need to manually check need_resched(), the below is a _conditional_ resched.
> The reason KVM explicitly checks need_resched() in MMU flows is because KVM needs
> to drop mmu_lock before rescheduling, i.e. calling cond_resched() directly would
> try to schedule() while holding a spinlock.
>
> > + cond_resched();
> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> No, this clobbers 'r', which might hold a fatal error code. I don't see any
> reason for common code to ever force -EAGAIN, it can't possibly know if trying
> again is reasonable.

Thanks for the review. With those changes included, the hunk is as follows.


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..342269ef9f13 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4419,6 +4419,47 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ int idx, r;
+
+ if (mapping->flags)
+ return -EINVAL;
+
+ if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
+ mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ r = 0;
+ while (mapping->nr_pages) {
+ if (signal_pending(current)) {
+ r = -EINTR;
+ break;
+ }
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ cond_resched();
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
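
For the x86 side, folding in the kvm_mmu_reload() point from above, the
arch hook could then look roughly like the following (a sketch under those
assumptions, not the posted patch; alignment and partial-range handling
are deliberately simplified):

    int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
                                 struct kvm_memory_mapping *mapping)
    {
            u8 level;
            int r;

            /* Cheap no-op on the happy path where the MMU is loaded. */
            r = kvm_mmu_reload(vcpu);
            if (r)
                    return r;

            /* Read-like fault: no WRITE/FETCH bits in the error code. */
            r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), 0,
                                 KVM_MAX_HUGEPAGE_LEVEL, &level);
            if (r)
                    return r;

            /* Advance the range by what was actually mapped. */
            mapping->base_gfn += KVM_PAGES_PER_HPAGE(level);
            mapping->nr_pages -= min(mapping->nr_pages,
                                     (u64)KVM_PAGES_PER_HPAGE(level));
            return 0;
    }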
--
Isaku Yamahata <[email protected]>

2024-03-11 22:56:25

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault

On Mon, Mar 11, 2024 at 10:24:50AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Another function will initialize struct kvm_page_fault. Add initializer
> > macro to unify the big struct initialization.
> >
> > No functional change intended.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
> > 1 file changed, 26 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 0669a8a668ca..72ef09fc9322 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -279,27 +279,35 @@ enum {
> > RET_PF_SPURIOUS,
> > };
> >
> > +#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
> > + .addr = (_cr2_or_gpa), \
> > + .error_code = (_err), \
> > + .exec = (_err) & PFERR_FETCH_MASK, \
> > + .write = (_err) & PFERR_WRITE_MASK, \
> > + .present = (_err) & PFERR_PRESENT_MASK, \
> > + .rsvd = (_err) & PFERR_RSVD_MASK, \
> > + .user = (_err) & PFERR_USER_MASK, \
> > + .prefetch = (_prefetch), \
> > + .is_tdp = \
> > + likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
> > + .nx_huge_page_workaround_enabled = \
> > + is_nx_huge_page_enabled((_vcpu)->kvm), \
> > + \
> > + .max_level = (_max_level), \
> > + .req_level = PG_LEVEL_4K, \
> > + .goal_level = PG_LEVEL_4K, \
> > + .is_private = \
> > + kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
> > + \
> > + .pfn = KVM_PFN_ERR_FAULT, \
> > + .hva = KVM_HVA_ERR_BAD, }
> > +
>
> Oof, no. I would much rather refactor kvm_mmu_do_page_fault() as needed than
> have to maintain a macro like this.

Ok, I updated it as follows.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..e57cc3c56a6d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -279,8 +279,8 @@ enum {
RET_PF_SPURIOUS,
};

-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u32 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -307,6 +307,21 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}

+ if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
+ r = kvm_tdp_page_fault(vcpu, &fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+
+ if (fault.write_fault_to_shadow_pgtable && emulation_type)
+ *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ return r;
+}
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u32 err, bool prefetch, int *emulation_type)
+{
+ int r;
+
/*
* Async #PF "faults", a.k.a. prefetch faults, are not faults from the
* guest perspective and have already been counted at the time of the
@@ -315,13 +330,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;

- if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
- r = kvm_tdp_page_fault(vcpu, &fault);
- else
- r = vcpu->arch.mmu->page_fault(vcpu, &fault);
-
- if (fault.write_fault_to_shadow_pgtable && emulation_type)
- *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);

/*
* Similar to above, prefetch faults aren't truly spurious, and the
--
2.43.2

--
Isaku Yamahata <[email protected]>

2024-03-11 22:57:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Mon, Mar 11, 2024 at 10:29:01AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Introduce a helper function to call kvm fault handler. This allows
> > a new ioctl to invoke kvm fault handler to populate without seeing
> > RET_PF_* enums or other KVM MMU internal definitions.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
> > 2 files changed, 33 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 60f21bb4c27b..48870c5e08ec 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level);
> > +
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S bits of a
> > * page fault error code pfec) causes a permission fault with the given PTE
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e4cc7f764980..7d5e80d17977 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level)
> > +{
> > + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> > + false, max_level);
> > + int r;
> > +
> > + r = __kvm_mmu_do_page_fault(vcpu, &fault);
> > +
> > + if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
> > + return -EFAULT;
>
> This clobbers a non-zero 'r'. And KVM returns -EIO if the VM is bugged/dead, not
> -EFAULT. I also don't see why KVM needs to explicitly check is_error_noslot_pfn(),
> that should be funneled to RET_PF_EMULATE.

I'll drop this check.

> > +
> > + switch (r) {
> > + case RET_PF_RETRY:
> > + return -EAGAIN;
> > +
> > + case RET_PF_FIXED:
> > + case RET_PF_SPURIOUS:
> > + *goal_level = fault.goal_level;
> > + return 0;
> > +
> > + case RET_PF_CONTINUE:
> > + case RET_PF_EMULATE:
>
> -EINVAL would be more appropriate for RET_PF_EMULATE.
>
> > + case RET_PF_INVALID:
>
> CONTINUE and INVALID should be WARN conditions.

Will update them.
--
Isaku Yamahata <[email protected]>

2024-03-11 23:52:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Sun, Mar 10, 2024, Michael Roth wrote:
> On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> > struct kvm_sev_launch_update_data {
> > __u64 uaddr;
> > __u32 len;
> > };
> >
> > - TDX and measurement
> > The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> > extends its measurement by the page contents.
> > Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > TDH.MEM.EXTEND
> > Option 2. Don't handle extend. Let TDX vendor specific API
> > KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> > KVM_TDX_EXTEND_MEMORY.
>
> For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> migration support is added that flow will be a bit different. So
> personally I think it's better to leave it separate for now.

+1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
optimize setups that want the measurement to be extended with the contents of a
page immediately after the measurement is extended with the mapping metadata for
said page. And AFAIK, the only reason to prefer that approach is for backwards
compatibility, which is not a concern for KVM. I suppose maaaybe some memory
locality performance benefits, but that seems like a stretch.

<time passes>

And I think this whole conversation is moot, because I don't think there's a need
to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the SEPT.ADD
side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.

> But I'd be hesitant to bake more requirements into this pre-mapping
> interface, it feels like we're already overloading it as is.

Agreed. After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
should be as "pure" of a mapping operation as we can make it. It'd be a little
weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs. But
that's really just a reflection of S-EPT, so it's arguably not even a bad thing.

[*] https://lore.kernel.org/all/[email protected]

2024-03-12 01:33:48

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Mon, Mar 11, 2024 at 04:44:27PM -0700,
Sean Christopherson <[email protected]> wrote:

> On Sun, Mar 10, 2024, Michael Roth wrote:
> > On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> > > struct kvm_sev_launch_update_data {
> > > __u64 uaddr;
> > > __u32 len;
> > > };
> > >
> > > - TDX and measurement
> > > The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> > > extends its measurement by the page contents.
> > > Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > > TDH.MEM.EXTEND
> > > Option 2. Don't handle extend. Let TDX vendor specific API
> > > KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> > > KVM_TDX_EXTEND_MEMORY.
> >
> > For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> > additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> > migration support is added that flow will be a bit different. So
> > personally I think it's better to leave it separate for now.
>
> +1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
> optimize setups that want the measurement to be extended with the contents of a
> page immediately after the measurement is extended with the mapping metadata for
> said page. And AFAIK, the only reason to prefer that approach is for backwards
> compatibility, which is not a concern for KVM. I suppose maaaybe some memory
> locality performance benefits, but that seems like a stretch.
>
> <time passes>
>
> And I think this whole conversation is moot, because I don't think there's a need
> to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the SEPT.ADD
> side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.
>
> > But I'd be hesitant to bake more requirements into this pre-mapping
> > interface, it feels like we're already overloading it as is.
>
> Agreed. After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
> should be as "pure" of a mapping operation as we can make it. It'd be a little
> weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs. But
> that's really just a reflection of S-EPT, so it's arguably not even a bad thing.
>
> [*] https://lore.kernel.org/all/[email protected]

Let me try removing source from struct kvm_memory_mapping. With the unit in
bytes instead of pages, it becomes:
struct kvm_memory_mapping {
__u64 base_address;
__u64 size;
__u64 flags;
};

SNP won't need any changes. Should KVM_MAP_MEMORY always return an error for
SNP? (I'll leave that to Roth.)
TDX will have TDX_INIT_MEM_REGION with a new implementation.
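
If the interface does move to byte units, the common-code sanity check
would presumably still enforce page granularity for page-mapping users;
an illustrative sketch:

    if (!mapping->size ||
        !IS_ALIGNED(mapping->base_address, PAGE_SIZE) ||
        !IS_ALIGNED(mapping->size, PAGE_SIZE) ||
        mapping->base_address + mapping->size < mapping->base_address)
            return -EINVAL; /* empty, misaligned, or wrapping range */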
--
Isaku Yamahata <[email protected]>

2024-03-12 01:34:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Mon, Mar 11, 2024 at 02:08:01PM +1300,
"Huang, Kai" <[email protected]> wrote:

> Maybe just the "underlying physical pages may be
> encrypted" etc.
>
> Sorry, perhaps "encrypted" -> "crypto-protected"?

Thanks for clarifying. Based on the discussion[1], I'm going to remove the
'source' member. So I'll eliminate those sentences.

[1] https://lore.kernel.org/all/[email protected]/
--
Isaku Yamahata <[email protected]>

2024-03-12 14:20:36

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 12, 2024, Kai Huang wrote:
> > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > pre-built, but when they are built is irrelevant, so long as they are in place
> > before PAGE.ADD.
> >
> > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> >
> > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > the TDP MMU.
>
> One small question:
>
> What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> return error when it finds the region hasn't been pre-populated?

Return an error. I don't love the idea of bleeding so many TDX details into
userspace, but I'm pretty sure that ship sailed a long, long time ago.

2024-03-12 21:41:39

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kai Huang wrote:
> > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > before PAGE.ADD.
> > >
> > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > >
> > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > the TDP MMU.
> >
> > One small question:
> >
> > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > return error when it finds the region hasn't been pre-populated?
>
> Return an error. I don't love the idea of bleeding so many TDX details into
> userspace, but I'm pretty sure that ship sailed a long, long time ago.

In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
(presumably also SNP) guests, but _optional_ for other VMs. Not sure whether
this is ideal.

And just want to make sure I understand the background correctly:

The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?

But it is only supposed to be used by the VMs which use guest_memfd()? Because
IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.

Looking at [*], it doesn't say what kind of VM the sender was trying to use.

Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
use it seriously.

[*] https://lore.kernel.org/all/[email protected]/



2024-03-12 21:48:31

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, 2024-03-12 at 21:41 +0000, Huang, Kai wrote:
> On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> > On Tue, Mar 12, 2024, Kai Huang wrote:
> > > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > > before PAGE.ADD.
> > > >
> > > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > > >
> > > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > > the TDP MMU.
> > >
> > > One small question:
> > >
> > > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > > return error when it finds the region hasn't been pre-populated?
> >
> > Return an error. I don't love the idea of bleeding so many TDX details into
> > userspace, but I'm pretty sure that ship sailed a long, long time ago.
>
> In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
> (presumably also SNP) guests, but _optional_ for other VMs. Not sure whether
> this is ideal.
>
> And just want to make sure I understand the background correctly:
>
> The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
> be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?
>
> But it is only supposed to be used by the VMs which use guest_memfd()? Because
> IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.
>
> Looking at [*], it doesn't say what kind of VM the sender was trying to use.
>
> Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
> SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
> use it seriously.
>
> [*] https://lore.kernel.org/all/[email protected]/
>
>

Hmm.. Just after sending I realized that MAP_POPULATE only pre-populates the
page table on the host side, not the EPT...

So KVM_MAP_MEMORY can indeed be used by _ALL_ VMs. My bad :-(
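
To illustrate the distinction (userspace sketch; memfd and size are
assumptions):

    #include <sys/mman.h>

    /*
     * MAP_POPULATE pre-faults the *host* page tables for the VMA that
     * backs the memslot...
     */
    void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, memfd, 0);
    /*
     * ...but the stage-2 (EPT/NPT) entries are still created lazily on
     * first guest access.  KVM_MAP_MEMORY is what pre-builds those,
     * which is why it can help any VM type.
     */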

2024-03-12 23:03:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 12, 2024, Kai Huang wrote:
> On Tue, 2024-03-12 at 21:41 +0000, Huang, Kai wrote:
> > On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> > > On Tue, Mar 12, 2024, Kai Huang wrote:
> > > > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > > > before PAGE.ADD.
> > > > >
> > > > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > > > >
> > > > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > > > the TDP MMU.
> > > >
> > > > One small question:
> > > >
> > > > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > > > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > > > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > > > return error when it finds the region hasn't been pre-populated?
> > >
> > > Return an error. I don't love the idea of bleeding so many TDX details into
> > > userspace, but I'm pretty sure that ship sailed a long, long time ago.
> >
> > In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
> > (presumbly also SNP) guests, but _optional_ for other VMs. Not sure whether
> > this is ideal.

No, just TDX. SNP's RMP purely works with pfns, i.e. the enforcement layer comes
into play *after* the stage-2 page table walks. KVM can zap NPTs for SNP VMs at
will.

> > And just want to make sure I understand the background correctly:
> >
> > The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
> > be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?
> >
> > But it is only supposed to be used by the VMs which use guest_memfd()? Because
> > IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.
> >
> > Looking at [*], it doesn't say what kind of VM the sender was trying to use.
> >
> > Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
> > SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
> > use it seriously.
> >
> > [*] https://lore.kernel.org/all/[email protected]/
> >
> >
>
> Hmm.. Just after sending I realized that MAP_POPULATE only pre-populates the
> page table on the host side, not the EPT...

Yep, exactly.

> So KVM_MAP_MEMORY can indeed be used by _ALL_ VMs. My bad :-(

2024-03-12 12:38:33

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()


>
> Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> pre-built, but when they are built is irrelevant, so long as they are in place
> before PAGE.ADD.
>
> Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
>
> Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> simply need to verify that the pfn from guest_memfd() is the same as what's in
> the TDP MMU.

One small question:

What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
then we still need to do the real mapping. Or we can make KVM_TDX_INIT_MEM_REGION
return an error when it finds the region hasn't been pre-populated?

>
> Or if we want to make things more robust for userspace, set a software-available
> flag in the leaf TDP MMU SPTE to indicate that the page is awaiting PAGE.ADD.
> That way tdp_mmu_map_handle_target_level() wouldn't treat a fault as spurious
> (KVM will see the SPTE as PRESENT, but the S-EPT entry will be !PRESENT).
>
> Then KVM_MAP_MEMORY doesn't need to support @source, KVM_TDX_INIT_MEM_REGION
> doesn't need to fake a page fault and doesn't need to temporarily stash the
> source_pa in KVM, and KVM_MAP_MEMORY could be used to fully pre-map TDX memory.
>
> I believe the only missing piece is a way for the TDX code to communicate that
> hugepages are disallowed.
>

2024-03-11 23:26:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Fri, Mar 01, 2024, [email protected] wrote:
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + u8 max_level, goal_level = PG_LEVEL_4K;

goal_level is misleading. kvm_page_fault.goal_level is appropriate, because for
the majority of the structure's lifetime, it is the level KVM is trying to map,
not the level that KVM has mapped.

For this variable, maybe "mapped_level" or just "level"


> + u32 error_code;

u64 when this gets rebased...

> + int r;
> +
> + error_code = 0;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
> + error_code |= PFERR_WRITE_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
> + error_code |= PFERR_FETCH_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
> + error_code |= PFERR_USER_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
> +#ifdef PFERR_PRIVATE_ACCESS
> + error_code |= PFERR_PRIVATE_ACCESS;

..because PFERR_PRIVATE_ACCESS is in bits 63:32.

> +#else
> + return -OPNOTSUPP;

Broken code. And I don't see any reason for this to exist, PFERR_PRIVATE_ACCESS
will be unconditionally #defined.

> +#endif
> + }
> +
> + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> + max_level = PG_LEVEL_1G;
> + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> + max_level = PG_LEVEL_2M;
> + else
> + max_level = PG_LEVEL_4K;

Hrm, I would much prefer to allow KVM to map more than what is requested, i.e.
not artificially restrict the hugepage size. Restricting the mapping size is
just giving userspace another way to shoot themselves in the foot. E.g. if
userspace prefaults the wrong size, KVM will likely either run with degraded
performance or immediately zap and rebuild the too-small SPTE.

Ugh, and TDX's S-EPT *must* be populated with 4KiB entries to start. On one hand,
allowing KVM to map a larger page (but still bounded by the memslot(s)) won't affect
TDX's measurement, because KVM never uses a larger page. On the other hand, TDX
would need a way to restrict the mapping.

Another complication is that TDH.MEM.PAGE.ADD affects the measurement. We can
push the ordering requirements to userspace, but for all intents and purposes,
calls to this ioctl() would need to be serialized for doing PAGE.ADD, but could
run in parallel for every other use case, including PAGE.AUG and pre-mapping
shared memory for TDX.

Hrm.

Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
pre-built, but when they are built is irrelevant, so long as they are in place
before PAGE.ADD.

Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?

Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
simply need to verify that the pfn from guest_memfd() is the same as what's in
the TDP MMU.

Or if we want to make things more robust for userspace, set a software-available
flag in the leaf TDP MMU SPTE to indicate that the page is awaiting PAGE.ADD.
That way tdp_mmu_map_handle_target_level() wouldn't treat a fault as spurious
(KVM will see the SPTE as PRESENT, but the S-EPT entry will be !PRESENT).

Then KVM_MAP_MEMORY doesn't need to support @source, KVM_TDX_INIT_MEM_REGION
doesn't need to fake a page fault and doesn't need to temporarily stash the
source_pa in KVM, and KVM_MAP_MEMORY could be used to fully pre-map TDX memory.

I believe the only missing piece is a way for the TDX code to communicate that
hugepages are disallowed.
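A rough sketch of that idea, with hypothetical helpers (is_td_finalized() and
tdx_mem_page_aug() are stand-ins for whatever the TDX series ends up providing):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	/*
	 * Pre-finalization, KVM_MAP_MEMORY only pre-builds the S-EPT
	 * structures and KVM's mirror SPTE; the measurement-affecting
	 * PAGE.ADD is deferred to KVM_TDX_INIT_MEM_REGION.
	 */
	if (!is_td_finalized(kvm_tdx))
		return 0;

	/* Post-finalization: AUG the page as for any runtime fault. */
	return tdx_mem_page_aug(kvm, gfn, level, pfn);
}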

2024-03-19 15:54:06

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Wed, Mar 06, 2024 at 04:38:53PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e4cc7f764980..7d5e80d17977 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level)
> > +{
> > + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> > + false, max_level);
> > + int r;
> > +
> > + r = __kvm_mmu_do_page_fault(vcpu, &fault);
>
> If TDP is disabled, __kvm_mmu_do_page_fault() will interpret @gpa as a
> GVA, no? And if the vCPU is in guest-mode, __kvm_mmu_do_page_fault() will
> interpret @gpa as a nGPA, right?

Just to close the discussion.
As we discussed at [1], I'd like to restrict the API to the TDP MMU only
(with the next version). If the vCPU is in guest-mode or legacy MMU mode, it
will get an error.

[1] https://lore.kernel.org/kvm/[email protected]/
--
Isaku Yamahata <[email protected]>
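A minimal sketch of such a guard at the top of the arch prepopulation path
(the exact predicate is an assumption, not the posted code):

	/*
	 * Restrict prepopulation to the TDP MMU: reject legacy/shadow
	 * MMUs and nested guest-mode, where the address to populate
	 * would not be an L1 GPA.
	 */
	if (!tdp_mmu_enabled || !vcpu->arch.mmu->root_role.direct ||
	    is_guest_mode(vcpu))
		return -EOPNOTSUPP;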

2024-03-19 16:33:29

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Mar 06, 2024 at 06:09:54PM -0800,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> David Matlack <[email protected]> wrote:
>
> > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > Implementation:
> > > - x86 KVM MMU
> > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > version.
> >
> > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > how to reliably do that from a vCPU context. Checking for TDP being
> > enabled is easy, but what if the vCPU is in guest-mode?
>
> As you pointed out in the other mail, legacy KVM MMU support or guest-mode will
> be troublesome. The use case I assumed is pre-population before the guest runs,
> so guest-mode wouldn't matter. I didn't add an explicit check for it, though.
>
> Any use case while vCPUs are running?
>
>
> > Perhaps we can just return an error out to userspace if the vCPU is in
> > guest-mode or TDP is disabled, and make it userspace's problem to do
> > memory mapping before loading any vCPU state.
>
> If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.

Any input? If no further input, I assume the primary use case is pre-population
before the guest runs.
--
Isaku Yamahata <[email protected]>

2024-03-19 16:38:46

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 05:51:51PM -0800,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Mar 06, 2024 at 04:36:25PM -0800,
> David Matlack <[email protected]> wrote:
>
> > On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
> > >
> > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > >
> > > > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > > > + max_level = PG_LEVEL_1G;
> > > > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > > > + max_level = PG_LEVEL_2M;
> > > > + else
> > > > + max_level = PG_LEVEL_4K;
> > >
> > > Is there a requirement that KVM must not map memory outside of the
> > > requested region?
> >
> > And if so, what if the requested region is already mapped with a larger page?
>
> Yes. We'd like to map the exact GPA range for the SNP or TDX case. We don't
> want to map zeros in the surrounding range. For SNP or TDX, mapping a page to
> a GPA is a one-time operation, and it updates the measurement.
>
> Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
> and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
> to zeros as part of a 2M page, the subsequent mapping of GPA2 fails. Even if
> the mapping of GPA2 succeeds, the measurement may be updated when mapping GPA1.
>
> It's the userspace VMM's responsibility to map a GPA range at most once for
> SNP or TDX. Is this too strict a requirement for the default VM use case of
> mitigating KVM page faults at guest boot? If so, what about a flag like
> EXACT_MAPPING or something?

I'm thinking as follows. What do you think?

- Allow mapping larger than requested with a gmem_max_level hook:
This depends on the following patch. [1]
The gmem_max_level hook allows the vendor backend to determine the max level.
By default (for a default VM or sw-protected VM), it allows KVM_MAX_HUGEPAGE_LEVEL
mapping. TDX allows only 4KB mapping. (A sketch of such a hook follows this mail.)

[1] https://lore.kernel.org/kvm/[email protected]/
[PATCH v11 30/35] KVM: x86: Add gmem hook for determining max NPT mapping level

- Pure mapping without CoCo operations:
As Sean suggested at [2], make KVM_MAP_MEMORY a pure mapping operation without
CoCo operations. In the case of TDX, the API doesn't issue TDX-specific
operations like TDH.PAGE.ADD() and TDH.MR.EXTEND(). We need a TDX-specific API.

[2] https://lore.kernel.org/kvm/[email protected]/

- KVM_MAP_MEMORY on an already mapped area, potentially with a large page:
It succeeds; no error. It doesn't care whether the GPA is backed by a large
page or not. Because the use case is pre-population before the guest runs, it
doesn't matter whether the given GPA was already mapped, or at what large page
level it is backed.

Do you want an error like -EEXIST?

--
Isaku Yamahata <[email protected]>
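For the first bullet above, a sketch of how such a hook could clamp the mapping
level inside KVM_MAP_MEMORY (the hook name follows [1]; the signature is a
guess, and the kvm_x86_ops wiring is illustrative only):

	u8 max_level = KVM_MAX_HUGEPAGE_LEVEL;
	int r;

	/* Let the vendor backend cap the level, e.g. TDX forcing 4KB. */
	if (kvm_x86_ops.gmem_max_level) {
		r = kvm_x86_ops.gmem_max_level(kvm, pfn, gfn, is_private,
					       &max_level);
		if (r)
			return r;
	}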

2024-04-03 18:54:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> Isaku Yamahata <[email protected]> wrote:
>
> > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > David Matlack <[email protected]> wrote:
> >
> > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Implementation:
> > > > - x86 KVM MMU
> > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > version.
> > >
> > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > how to reliably do that from a vCPU context. Checking for TDP being
> > > enabled is easy, but what if the vCPU is in guest-mode?
> >
> > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > will be troublesome.

Why is shadow paging troublesome? I don't see any obvious issues with effectively
prefetching into a shadow MMU with read fault semantics. It might be pointless
and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

Testing is the biggest gap I see, as using the ioctl() for shadow paging will
essentially require a live guest, but that doesn't seem like it'd be too hard to
validate. And unless we lock down the ioctl() to only be allowed on vCPUs that
have never done KVM_RUN, we need that test coverage anyways.

And I don't think it makes sense to try and lock down the ioctl(), because for
the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
vCPU has run, and adding that code would likely add more complexity than it solves.

> > The use case I assumed is pre-population before the guest runs, so guest-mode
> > wouldn't matter. I didn't add an explicit check for it, though.

KVM shouldn't have an explicit is_guest_mode() check, the support should be a
property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
using legacy shadow paging, not TDP).

> > Any use case while vCPUs are running?
> >
> >
> > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > memory mapping before loading any vCPU state.
> >
> > If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> > faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
>
> Any input? If no further input, I assume the primary use case is pre-population
> before the guest runs.

Pre-populating is the primary use case, but that could happen if L2 is active,
e.g. after live migration.

I'm not necessarily opposed to initially adding support only for the TDP MMU, but
if the delta to also support the shadow MMU is relatively small, my preference
would be to add the support right away. E.g. to give us confidence that the uAPI
can work for multiple MMUs, and so that we don't have to write documentation for
x86 to explain exactly when it's legal to use the ioctl().

2024-04-03 22:00:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Apr 03, 2024 at 11:30:21AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > Isaku Yamahata <[email protected]> wrote:
> >
> > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > David Matlack <[email protected]> wrote:
> > >
> > > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > > From: Isaku Yamahata <[email protected]>
> > > > >
> > > > > Implementation:
> > > > > - x86 KVM MMU
> > > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > version.
> > > >
> > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > enabled is easy, but what if the vCPU is in guest-mode?
> > >
> > > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > > will be troublesome.
>
> Why is shadow paging troublesome? I don't see any obvious issues with effectively
> prefetching into a shadow MMU with read fault semantics. It might be pointless
> and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

The populating address for shadow paging is a GVA, not a GPA. I'm not sure
that's what userspace wants. If it's a userspace problem, I'm fine.


> Testing is the biggest gap I see, as using the ioctl() for shadow paging will
> essentially require a live guest, but that doesn't seem like it'd be too hard to
> validate. And unless we lock down the ioctl() to only be allowed on vCPUs that
> have never done KVM_RUN, we need that test coverage anyways.

So far I have tried only the TDP MMU case. I can try the other MMU types.


> And I don't think it makes sense to try and lock down the ioctl(), because for
> the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
> vCPU has run, and adding that code would likely add more complexity than it solves.
>
> > > The use case I assumed is pre-population before the guest runs, so guest-mode
> > > wouldn't matter. I didn't add an explicit check for it, though.
>
> KVM shouldn't have an explicit is_guest_mode() check, the support should be a
> property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
> using legacy shadow paging, not TDP).

I see. So the type of the populating address can vary depending on the vCPU
mode. It's userspace's problem which address (GVA, L1 GPA, L2 GPA) is used.


> > > Any use case while vCPUs are running?
> > >
> > >
> > > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > > memory mapping before loading any vCPU state.
> > >
> > > If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> > > faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
> >
> > Any input? If no further input, I assume the primary use case is pre-population
> > before the guest runs.
>
> Pre-populating is the primary use case, but that could happen if L2 is active,
> e.g. after live migration.
>
> I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> if the delta to also support the shadow MMU is relatively small, my preference
> would be to add the support right away. E.g. to give us confidence that the uAPI
> can work for multiple MMUs, and so that we don't have to write documentation for
> x86 to explain exactly when it's legal to use the ioctl().

If we call kvm_mmu.page_fault() without caring which address will be
populated, I don't see a big difference.
--
Isaku Yamahata <[email protected]>

2024-04-03 22:43:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Apr 03, 2024, Isaku Yamahata wrote:
> On Wed, Apr 03, 2024 at 11:30:21AM -0700,
> Sean Christopherson <[email protected]> wrote:
>
> > On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > > Isaku Yamahata <[email protected]> wrote:
> > >
> > > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > > David Matlack <[email protected]> wrote:
> > > >
> > > > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > > > From: Isaku Yamahata <[email protected]>
> > > > > >
> > > > > > Implementation:
> > > > > > - x86 KVM MMU
> > > > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > > version.
> > > > >
> > > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > > enabled is easy, but what if the vCPU is in guest-mode?
> > > >
> > > > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > > > will be troublesome.
> >
> > Why is shadow paging troublesome? I don't see any obvious issues with effectively
> > prefetching into a shadow MMU with read fault semantics. It might be pointless
> > and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.
>
> The populating address for shadow paging is a GVA, not a GPA. I'm not sure
> that's what userspace wants. If it's a userspace problem, I'm fine.

/facepalm

> > Pre-populating is the primary use case, but that could happen if L2 is active,
> > e.g. after live migration.
> >
> > I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> > if the delta to also support the shadow MMU is relatively small, my preference
> > would be to add the support right away. E.g. to give us confidence that the uAPI
> > can work for multiple MMUs, and so that we don't have to write documentation for
> > x86 to explain exactly when it's legal to use the ioctl().
>
> If we call kvm_mmu.page_fault() without caring which address will be
> populated, I don't see a big difference.

Ignore me, I completely spaced that shadow MMUs don't operate on an L1 GPA. I
100% agree that restricting this to TDP, at least for the initial merge, is the
way to go. A uAPI where the type of address varies based on the vCPU mode and
MMU type would be super ugly, and probably hard to use.

At that point, I don't have a strong preference as to whether or not direct
legacy/shadow MMUs are supported. That said, I think it can (probably should?)
be done in a way where it more or less Just Works, e.g. by having a function hook
in "struct kvm_mmu".

2024-04-03 23:15:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 05:51:51PM -0800,
> > Yes. We'd like to map the exact GPA range for the SNP or TDX case. We don't
> > want to map zeros in the surrounding range. For SNP or TDX, mapping a page to
> > a GPA is a one-time operation, and it updates the measurement.
> >
> > Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
> > and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
> > to zeros as part of a 2M page, the subsequent mapping of GPA2 fails. Even if
> > the mapping of GPA2 succeeds, the measurement may be updated when mapping GPA1.
> >
> > It's the userspace VMM's responsibility to map a GPA range at most once for
> > SNP or TDX. Is this too strict a requirement for the default VM use case of
> > mitigating KVM page faults at guest boot? If so, what about a flag like
> > EXACT_MAPPING or something?
>
> I'm thinking as follows. What do you think?
>
> - Allow mapping larger than requested with a gmem_max_level hook:

I don't see any reason to allow userspace to request a mapping level. If the
prefetch is defined to have read fault semantics, KVM has all the wiggle room it
needs to do the optimal/sane thing, without having to worry about reconciling
userspace's desired mapping level.

> This depends on the following patch. [1]
> The gmem_max_level hook allows the vendor backend to determine the max level.
> By default (for a default VM or sw-protected VM), it allows
> KVM_MAX_HUGEPAGE_LEVEL mapping. TDX allows only 4KB mapping.
>
> [1] https://lore.kernel.org/kvm/[email protected]/
> [PATCH v11 30/35] KVM: x86: Add gmem hook for determining max NPT mapping level
>
> - Pure mapping without CoCo operations:
> As Sean suggested at [2], make KVM_MAP_MEMORY a pure mapping operation without
> CoCo operations. In the case of TDX, the API doesn't issue TDX-specific
> operations like TDH.PAGE.ADD() and TDH.MR.EXTEND(). We need a TDX-specific API.
>
> [2] https://lore.kernel.org/kvm/[email protected]/
>
> - KVM_MAP_MEMORY on an already mapped area, potentially with a large page:
> It succeeds; no error. It doesn't care whether the GPA is backed by a large
> page or not. Because the use case is pre-population before the guest runs, it
> doesn't matter whether the given GPA was already mapped, or at what large page
> level it is backed.
>
> Do you want an error like -EEXIST?

No error. As above, I think the ioctl() should behave like a read fault, i.e.
be an expensive nop if there's nothing to be done.

For VMA-based memory, userspace can operate on the userspace address. E.g. if
userspace wants to break CoW, it can do that by writing from userspace. And if
userspace wants to "request" a certain mapping level, it can do that by MADV_*.

For guest_memfd, there are no protections (everything is RWX, for now), and when
hugepage support comes along, userspace can simply manipulate the guest_memfd
instance as needed.
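Concretely, the userspace-side knobs described above might look like this for
VMA-based memory (a sketch; hva and size denote the memslot's userspace
mapping):

#include <string.h>
#include <sys/mman.h>

	/* "Request" a 2M mapping level for the backing VMA. */
	madvise(hva, size, MADV_HUGEPAGE);

	/* Break CoW / allocate backing pages by writing from userspace. */
	memset(hva, 0, size);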